Monday, December 24, 2007

Database vs. scientist

One of the founding principles of Physion Consultants is to bring cutting-edge technology to scientists. When the computational tools cease to be a bottleneck in the scientific process, we feel that we've done our job. For many scientific applications, a relational database management system (RDBMS) is the appropriate data repository. After all, great database engineers have already solved many of the data storage and query problems typically faced by custom-written scientific software.

When I start a project, I often consider the data repository first. To a scientist, the data is everything, so it makes sense to start there. The project requirements often include searching, indexing, and retrieving data collected across many experiments or observations. Clients can often tell me pretty clearly what entities they expect to be measuring. "Well," I think, "an RDBMS sounds ideal." Inevitably, I run right into the RDBMS brick wall when I realize that the flexibility a scientist wants is directly at odds with the rigid certainty that an RDBMS needs to do its job.

As a researcher, I want to have unlimited flexibility in the parameters of my experiment. These parameters, of course, have to go into whatever data store I choose. We can't simply store arbitrary key=>value pairs in a table in the database, because the database engine needs to know the type (and size) of the key and value columns ahead of time. Sure, we could dump a huge list of key=>value pairs into a binary BLOB in the database, but then blamo, we've lost all ability to easily query those parameters in the database engine. And the database engine is the right place to do the query. Darn.
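To make the BLOB trade-off concrete, here is a minimal sketch using Python and SQLite. The table name, column names, and parameter keys are all made up for illustration. The parameters go in as an opaque pickled blob, so the only way to find experiments with a given parameter is to pull every row back into the application and filter there.

```python
import pickle
import sqlite3

# Hypothetical schema: each experiment carries its parameters as one opaque BLOB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, params BLOB)"
)


def add_experiment(name, params):
    """Serialize the whole parameter dictionary into a single BLOB column."""
    conn.execute(
        "INSERT INTO experiment (name, params) VALUES (?, ?)",
        (name, pickle.dumps(params)),
    )


def experiments_with(key, value):
    """Find experiments whose parameters contain key == value.

    The database engine cannot see inside the blob, so there is no WHERE
    clause to write: every row is fetched and decoded in application code.
    """
    matches = []
    for name, blob in conn.execute(
        "SELECT name, params FROM experiment ORDER BY id"
    ):
        if pickle.loads(blob).get(key) == value:
            matches.append(name)
    return matches


add_experiment("run-a", {"gain": 3, "stimulus": "flash"})
add_experiment("run-b", {"gain": 5, "stimulus": "grating"})
```

Calling experiments_with("gain", 3) works, but it does a full scan and a deserialization per row, with no help from indexes or the query planner.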

I spent the weekend working on a dictionary-like construct for Apple's Core Data framework. It defines a class/entity cluster headed by an entity called a KeyValuePair that, you guessed it, stores a key and a value. The idea is to provide an API for the app developer that allows creation of an NSDictionary (key=>value map) from a set of KeyValuePair entities and vice versa. In addition, the developer/user can query KeyValuePairs directly via a SUBQUERY expression in a Core Data predicate. The only hitch is that the query must specify the type of the value (like 'key == "myKey" AND intValue == 3'). It's a hack at this point, but I haven't found any better solution out there. If anyone knows other ways to solve this problem, I'm all ears. In the meantime, we'll get this code cleaned up and make it available in case it saves some folks a few days of hair pulling. Stay tuned.
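For readers who want the shape of the idea without Core Data, here is a rough analogue in Python and SQLite. This is not the Core Data code described above, just a sketch of the same pattern: one row per key, with a separate typed column for each kind of value. All table, column, and function names here are hypothetical.

```python
import sqlite3

# Hypothetical mapping from a Python type to the typed column that stores it.
VALUE_COLUMNS = {int: "int_value", float: "float_value", str: "text_value"}


def open_store():
    """Create an in-memory store with one typed column per value type."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE key_value_pair (
               owner_id    INTEGER NOT NULL,
               key         TEXT NOT NULL,
               int_value   INTEGER,
               float_value REAL,
               text_value  TEXT,
               PRIMARY KEY (owner_id, key))"""
    )
    return conn


def column_for(value):
    """Pick the typed column for a value; unsupported types are rejected."""
    try:
        return VALUE_COLUMNS[type(value)]
    except KeyError:
        raise TypeError(f"unsupported value type: {type(value).__name__}")


def store_params(conn, owner_id, params):
    """Dictionary -> rows: write each pair into its matching typed column."""
    for key, value in params.items():
        column = column_for(value)
        conn.execute(
            f"INSERT INTO key_value_pair (owner_id, key, {column}) "
            "VALUES (?, ?, ?)",
            (owner_id, key, value),
        )


def load_params(conn, owner_id):
    """Rows -> dictionary: take whichever typed column is populated."""
    rows = conn.execute(
        "SELECT key, int_value, float_value, text_value "
        "FROM key_value_pair WHERE owner_id = ?",
        (owner_id,),
    )
    return {
        key: next(v for v in values if v is not None)
        for key, *values in rows
    }


def find_owners(conn, key, value):
    """Query inside the engine; the value's type selects the column to test."""
    column = column_for(value)
    rows = conn.execute(
        f"SELECT owner_id FROM key_value_pair "
        f"WHERE key = ? AND {column} = ? ORDER BY owner_id",
        (key, value),
    )
    return [row[0] for row in rows]
```

The same hitch shows up here as in the predicate: find_owners has to know the value's type in order to choose which column to compare against. In exchange, the filtering happens in the database engine rather than in application code.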