Lost in transit.
You're welcome.

a better mouse trap #1: persistence, document storage and couchdb

Ever considered building a better mouse trap? Well. I do, actually, each and every day anew. I will leave out most of the more in-depth considerations related to this, as for now they don’t matter – the only thing worth knowing, initially, is that the process of “migrating”, “updating”, “rebuilding”, … an existing, fairly complex application system is something that is painful beyond all imagination, and this even while “just” considering the mere technical effects and aspects of this procedure…

General considerations

Quite a while ago already I have been vaguely thinking about rebuilding the system storage infrastructure by completely dumping the RDBMS for storing productive data, keeping it only to keep track of indexing and search information for performance reasons, and store data in XML structures instead, all along with schema definitions to allow for data validation and XSLT “magic” to do data processing and transformation. Not much later, I learnt that I’m not the only one to be haunted by problems and thoughts like these, seeing the term NoSQL gaining speed, different storage paradigms quickly becoming more popular. Taking a closer look at things, document oriented databases eventually seem closer to our data model and understanding than an RDBMS is. There is a a good presentation to be found at slideshare, enumerating rather well some of the problems I am to see here each and every day (just go looking for the address / telephone number example).

All along the way, I’ve been into this for a couple of years already, by now, and, at the very least somewhat frustrated with the current state of how things move along from a technical, architectural, conceptual point of view, I have more or less decided to step back for a moment, literally relax (as good as somewhat possible), have a look at what we got, at what problems we’re repeatedly dealing with, and whether there might be “better” solutions, assuming there is some way to judge what “better” actually means. Following this path, I hope to gain and collect some insights on how to possibly do things more effectively, to eventually get work done more quickly, to simply do more in the time at hand. And, so, starting out with a rather simple yet fundamental issue: Persisting data.

Persistence: The good, the bad, the annoying

I hate to admire that, in 2011, we still have to consider persistence and persistency technologies something worth thinking about. But actually, it indeed is, for a bunch of reasons both technically and conceptually. To outline just some of them:

  • First and foremost, our data store is grown out of an existing, legacy document management application which makes use of an SQL database (which, in our case, happens to be SAP MaxDB for storing user, metadata, process as well as its own structural information, and a file system globally shared by all hosts for storing the actual document content, mainly binary files. This kind of setup, though it used to be reasonable back when the system was invented, these days mainly is a source of trouble at the very least because it always requires the two data sources to be sync’ed both talking about backup and recovery and about how to deal with application logic to modify data in either of these places. The document management system has a modestly sophisticated logic of doing so, but its functionality is limited and it offers no external interfaces of any kind in order to attach external applications, so to add functionality required yet not provided by the legacy system, it’s all about rebuilding access to these data sources manually in some external environment (we chose server-sided Java) and hope for the best, given there for sure are some caveats in this setup (just to mention the effects of one of these systems modifying data cached in the other system, without any ability to let the system holding the cache know that some of the cached objects have changed).
  • The database schema used in the RDBMS is fairly complex and complicated and, well, could be somewhat improved and cleaned up. There are vast loads of “large” tables including loads of columns and relations containing “NULL” or empty “” values for most of the columns. Reason for that is that, in the backend DMS, adding another column to an existing table is pretty easy, whereas adding a new kind of “class” with a table storage of its own is more work and can’t be done “online” on the fly. So, by now and in the end, whenever an entity that has to be stored comes with information not yet stored in the table, chances that another column will be added are pretty good. Surely not the way how to handle an RDBMS, and a cause of pain in some ways, but pragmatically it works…
  • In the current setup, with the RDBMS system given, accessing and working with the database technically is made more difficult because, unfortunately, in most of the use cases it means falling back to the “lowes level” possible, as most of the state-of-the-art tools and frameworks like hibernate or EclipseLink do not (yet/fully) support this database backend, which means falling back to plain JDBC in most cases, which is not really what you want. Outside the Java world (in example while trying to use Python for quick-and-dirty prototypes), things look same as bad – the Python modules for SAP MaxDB never really worked for us.
  • Adding to these problems, however, as far as the Java environment is concerned, there are as well some problems that arise from the “dynamic” nature of our table structure – having a column added to a table is something to surely require a little more work while, in example, using an O/R-mapper and entity classes mapping to a given database table. In most situations, so far, we fell back to working with somewhat simple Map result sets simply to pay respect to the “dynamic” nature of data in our tables…
  • Sure, the RDBMS is capable of doing quite some interesting things, in example handling column constraints (NOT NULL), of, this way, ensuring mandatory fields to actually be present in a data entity, and to also take care of things like references and relations between elements (after all, that’s what the “R” stands for…). Unfortunately, in some situations you end up with a structure in which, due to the very nature of how data is handled, most of these features go unused and most of the logic to “validate” whether mandatory information is there, whether constraints are satisfied and so on, ends up to be part of the application logic, woven into business logic, becoming hard to understand, maintain, use or even migrate, doubling effort by re-building something the core system (the RDBMS) would be capable of doing out of the box, after all.

So much for that. There are, overally, enough minor and different annoyances to consider a different software platform to do the job, or maybe even a completely different storage strategy. General things I’d like to see improved involves a better way of handling, dealing with “dynamic” data structures, an easier means of accessing data stored inside the database, and, best of all of course, an easier, more “unified” way of storing binary and metadata.

Document storage

All along dealing with these things, I sooner or later stumbled across CouchDB, an open-source software project implementing a “document oriented database that can be queried using JavaScript”. Agreed, started in 2005, CouchDB has been around for quite a while, and I can’t go on without admitting that, so far, I considered it, amidst some other similar technologies, to be “just one more buzzword” next to the whole Cloud Computing hype trend which still seems to be raging, waiting for the marketing dust to settle a little. Yet, taking a closer look, I found the CouchDB data model to actually be pretty close to our ideas and considerations in some ways:

  • At first, there is, of course, the “document based” approach: Everything to be stored in there is a “document”, being a collection of key/value pairs, with having both simple data types, lists, maps allowed as value data types.
  • The whole database is schema-less, meaning that the structure of data objects to be stored is not enforced to adhere to certain formal definitions made in advance – which happens to be the case in example in an RDBMS even while mis-interpreting it to be “just a vast container for large tables” – even having badly modeled tables, you will initially have to define which columns are there. In CouchDB, you easily can add fields, data sets to an existing document, or even add / mix documents completely different in its structure into one database. Sure, this does require some care to be taken because, this way, you can’t be sure the information needed in a document is actually there, but then again, given our current approach and having at the very least 95% of this kind of data integrity checks in custom business logic anyway, it doesn’t make things drastically worse.
  • Same as storing new kinds of document is easy (lacking the need to create new tables), it is easy to store new documents in custom “databases”, as creating (and using) new databases at runtime without too much ado also is just a breeze and easy as can be. And, yes, this is the case even compared to SAP MaxDB in which, given the excellent management tools, creating databases is way easier than in most other RDBMS’s I have worked with so far.
  • Talking about being forced to use certain specialized tools in order to enter or retrieve data, one surely can’t say this about CouchDB as, basically, it’s all HTTP and JSON. Plain and simple. Though there’s a bunch of convenience libraries and wrappers available for a set of different environments and languages, and though CouchDB itself comes with a rather straightforward, web-based user interface named “futon”, in the end it all boils down to issuing HTTP requests to a running CouchDB server, which allows for doing anything the database is capable of doing, from simple CRUD to doing database administration tasks. This way, linking a custom application to a CouchDB database is same as easy as using good old friends like curl.

It didn’t obviously take long to make me want to take a closer look, after reading some general papers on the Hows and Whats of CouchDB…

Wet feet…

First off, let it be known that CouchDB documentation is simply outstanding – same as the Wiki, the Getting-Started – resources and the online version of “CouchDB – The Definite Guide” should give you all needed to get started without too much ado. Actually, I haven’t seem much pieces of technology the last couple of years which were as incredibly easy to get started with as CouchDB. On my Ubuntu 10.10 installation, getting CouchDB up and running prove to be rather simple: Install via aptitude, configured it to not start automatically all along with the machine booting up, and started the server. Not much to be said about it, overally – pretty straightforward, and the installation manual has some information for those who get stuck and/or try installing CouchDB on top of Microsoft Windows which, these days, seems a little more tricky compared to other platforms.

A next good step is to follow through the Getting Started Tour chapter in the “Definite Guide”. Again, not much to be said about it. A very quick walk-through, just as an example to get the idea:

# Create an empty database, let's say, "bookmarks" to keep track of our web site bookmarks:
kr@n428:~$ curl -X PUT http://localhost:5984/bookmarks

#Add a "document" (consisting of fields "url", "category", "language") with an id "123456789" to the "bookmarks" database:
kr@n428:~$ curl -X PUT -d '{"url":"http://www.heise.de","category":"news", "language":"de"}'

# Load and dump the document again to see whether it actually has been stored in the database:
kr@n428:~$ curl -X GET
{"_id":"123456789","_rev":"1-8140250daa5e9e0fb28089273a40811b", "url":"http://www.heise.de","category":"news","language":"de"}

# Update the document (in its current revision) by adding a new field "visited":
kr@n428:~$ curl -X PUT -d '{"_id":"123456789","_rev":"1-8140250daa5e9e0fb28089273a40811b", "url":"http://www.heise.de","category":"news","language":"de", "visited":"false"}'

# Fetch the document again to see the changes have been stored in the database:
kr@n428:~$ curl -X GET

# Dump the database and all its content  (not that it actually did contain useful data):
kr@n428:~$ curl -X DELETE

All along the way, you can use your web browser to view the document you just created or stored in the futon web user interface:

What should be seen: (a) Except for the futon tool which is optional, interaction with CouchDB completely has happened using the Unix shell and the curl utility, issuing HTTP requests. (b) The communication does not just utilize HTTP GET but makes use of different HTTP verbs to get work done, a behaviour usually filed under RESTful these days. And finally (c): All data exchange, be that sending or recieving, happened using JSON syntax. This way, making an existing application work with CouchDB should be pretty clear once the application provides an HTTP client and means of (de-)serializing to/from JSON.

Code to this…

Playing with the database using curl and futon in some way is almost enjoyable, and, given some familiarity with HTTP, not too complicated, but in a real-life environment eventually it won’t help much – unless these are the tools your end users, customers want to make use of, which I don’t think. So, in order to attach CouchDB to an existing application (or use it in a newly built system), the Basics page in the CouchDB wiki comes with a bunch of tutorials introducing how to use CouchDB from within various programming languages, including Java and Python. As we mainly use Java, the first shot was to play through the jcouchdb tutorial, which got me working, writing documents to the CouchDB and reading again from there within just a couple of minutes. Though jcouchdb itself seems to lag a little behind the recent versions of CouchDB and is an open-source project yet not an official part of the CouchDB releases, it works rather well and so far allowed for doing anything I tried without issues.

Most of the clients out I had a look at were quite simple and easy to get started with, so, just as an easy starting point and a quick look on how a CouchDB client in an existing language does look and feel like, an example similar to the curl one above, using the python-couchdb client, according to the documented overview on how to use couchdb-python:

>>> import couchdb
>>> server=couchdb.Server()
>>> server
 server 'http://localhost:5984/'
>>> db=server.create("bookmarks")
>>> db
 database 'bookmarks' 
>>> doc={"url":"http://www.heise.de","title":"heise.de","cat":"news"}
>>> db.create(doc)
>>> db['92bca31a8094426daed37cce1f000db8']
 document '92bca31a8094426daed37cce1f000db8'@'1-3f2b0c4ade2686c8d91a7911d253c9e2' {'url': 'http://www.heise.de', 'title': 'heise.de', 'cat': 'news'} 
>>> del server["bookmarks"]

Same as the structure of the Python code is convenient and familiar if you’re in-depth with Python, the jcouchdb API works out well if you’re a Java developer so hurdles to get started with programmatically filling CouchDB instances with data should be not all that high.

And now…?

As pointed out earlier, it’s not just too much about (open/free) tools but mainly about how to fit them into an overall enterprise application environment in a sane and straightforward way. As far as CouchDB is concerned, it seems that this kind of integration is anything but difficult, given the fact it solely relies upon technologies which could be considered “old”/”mature (depending upon your point of view). Some thoughts, so far, on adopting this kind of technology:

  • Some energy needs to be spent on building a somewhat concise abstraction for the local business aplication to have data stored inside CouchDB in a sane, reliable way, especially as far as data validation is concerned, ensuring that “documents” stored in there provide all the information they need to have while including or eventually omitting those which are considered “optional”. The only challenge, about this, will be to have this validation layer just as dynamic as required to deal with document structures which might be subject to change dynamically.
  • The database will have to undergo a set of more extensive testings as far as infrastructure integration and data load is concerned. Whereas, as far as I learnt by now, CouchDB seems to be thought to run in a clustered environment while, in our setup, storage is centralized to a decently sized storage system, and the CouchDB of course needs to play along well with this setup. Likewise, in our case the database is likely to grow to a size up to > 2 TB if choosing a “unified” storage approach and including binary data (files) as document “attachments” (another neat feature CouchDB provides). I am not sure how “large” this is compared to other installations and facilities, but it is something that eventually needs to be checked to see how it works.
  • Compared to the most common SQL query statements, the CouchDB way of querying data using views is at the very least something to get used to. It eventually will take a couple of days trying to figure out how to map most of the current SQL statements (which are somewhat complex yet mostly include massively chained WHERE clauses) to CouchDB view / key combinations.

So, overally, a good approach, and questions left to be answered. Both being pretty good, eventually. Let’s see where it gets us…

27. Januar 2011

Filed under:

architecture , couchdb , development , english , persistence , tech , thoughts , tools