The NoSQL movement has spawned a slew of alternative data stores, all of which attempt to fill voids left by traditional relational database implementations. But while it's easy to fit the various relational databases (MySQL, Oracle, DB2 and so on) under a single categorical umbrella, the NoSQL world is much more diverse, and the NoSQL label is too general.

NoSQL data stores such as MongoDB and Cassandra are so vastly different from each other that apples-to-apples comparisons are practically impossible. Thus, within the world of NoSQL, there are subcategories such as key-value stores, graph databases and document-oriented stores.

Document-oriented stores, or document stores for short, aren't new to the world of computing. Industry graybeards will quickly recognise Lotus Notes as one of the first successful NoSQL document stores from the late '80s. Document stores encapsulate data into loosely defined documents, rather than tables with columns and rows. Implementations of the underlying document vary by data store, with some representing a document as XML and others as JSON, for instance.

But in general, documents aren't rigidly defined, and in fact they offer a high degree of flexibility when it comes to defining data. This flexibility has costs. For example, these data stores do not support SQL, instead supporting custom query languages better suited to the underlying document structure (such as XPath-like query languages for XML data stores). But the lack of rigidity in data definition has many benefits as well.

In many cases, compared to traditional relational databases, the more flexible document stores enable faster iterative-style development where data requirements are evolving more rapidly than the pace of development.

Flexible, scalable NoSQL

In recent years, a number of document stores have emerged and garnered a high degree of developer mind share. One of the most popular of these is MongoDB, an open source, schema-free document store written in C++ that boasts support for a wide array of programming languages, a rich JSON-based query language (rather than SQL) and a number of intriguing features related to performance and scalability.

Out of the box, Mongo supports sharding, which permits horizontal scaling by divvying up a collection of documents across a cluster of nodes, thus making reads faster. What's more, Mongo offers replication in two modes: master-slave and replica sets. In a replica set, there is no permanently designated master node; instead, the members hold copies of the same data, elect a primary among themselves and automatically elect a new one if it fails, so there is no single point of failure. Replica sets therefore bring more fault tolerance to larger environments supporting massive amounts of data. These features and more don't require an army of DBAs to implement, nor do they need massive hardware expenditures. Mongo can run on commodity hardware platforms, provided there is a healthy amount of memory.
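To make the idea of divvying up a collection more concrete, here is a minimal Python sketch of hash-based sharding. It is a toy illustration only, not Mongo's implementation: the function names shard_for and distribute, the choice of MD5 and the three-node cluster are all assumptions made for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard-key value to a shard number using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(documents, shard_key, num_shards):
    """Group documents by the shard chosen for their shard-key value."""
    shards = {i: [] for i in range(num_shards)}
    for doc in documents:
        shards[shard_for(str(doc[shard_key]), num_shards)].append(doc)
    return shards


# Hypothetical data: 1,000 business cards spread over a three-node cluster.
cards = [{"name": f"Contact {i}"} for i in range(1000)]
shards = distribute(cards, "name", 3)

for shard_id, docs in shards.items():
    print(shard_id, len(docs))
```

Because the same key always hashes to the same shard, a lookup by name only has to consult one node rather than the whole cluster, which is the intuition behind the scaling claim above.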

Mongo is schema-less: it'll store any document you decide to put into it. There is no upfront document definition requirement. Ultimately, documents are grouped into collections, which are akin to tables in a relational database. Collections can be defined on the fly as well. Documents are stored in a binary JSON format, dubbed BSON, and encapsulate data represented as name-value pairs (a name playing roughly the role of a column, and a document that of a row).
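The schema-less model can be pictured with plain Python dictionaries. The sketch below is an in-memory stand-in, not a Mongo driver: the database variable, the second business card and the contact_log entry are hypothetical and exist only to show collections appearing on first use and documents of different shapes living side by side.

```python
import json
from collections import defaultdict

# A hypothetical in-memory "database": a collection springs into
# existence the first time something is appended to it.
database = defaultdict(list)

# Two documents in the same collection with different sets of fields.
database["business_cards"].append(
    {"name": "Andrew Glover", "cell": "703-555-5555"}
)
database["business_cards"].append(
    {"name": "Jane Doe", "email": "jane@example.com", "tags": ["sales"]}
)

# A second collection, created on the fly with no prior definition.
database["contact_log"].append({"note": "Called about renewal"})

for doc in database["business_cards"]:
    print(json.dumps(doc))
```

Nothing had to be declared before the second card introduced the email and tags fields, which is the kind of flexibility that suits evolving data requirements.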

JSON document store

JSON is an extremely understandable format. Humans can easily read it (as opposed to XML, for example) and machines can efficiently parse it. A document in Mongo representing a business card, for example, would look something like this:

{
  "_id" : ObjectId("4efb731168ee6a18692d86cd"),
  "name" : "Andrew Glover",
  "cell" : "703-555-5555",
  "fax" : "703-555-0555",
  "address" : "29210 Corporate Dr, Suite 100, Anywhere USA"
}

In this case, the _id attribute in the document above represents a primary key in Mongo. Like a relational database, Mongo can index data and enforce uniqueness on data attributes. By default, the _id attribute is indexed; moreover, you can index other individual fields of a collection or even a combination of them (for example, the name and address). Additionally, when defining an index, you can specify that its value be unique.
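What a unique compound index buys you can be sketched in a few lines of Python. The UniqueIndex class below is a hypothetical in-memory stand-in written for this example; in Mongo itself the index is declared on the server (with the PyMongo driver, for instance, through the collection's create_index method and its unique option) rather than in application code like this.

```python
class UniqueIndex:
    """Toy unique index over one or more fields of a collection."""

    def __init__(self, *fields):
        self.fields = fields
        self._entries = {}  # tuple of indexed values -> document

    def _key(self, doc):
        return tuple(doc.get(field) for field in self.fields)

    def insert(self, doc):
        """Add a document, rejecting it if its indexed values already exist."""
        key = self._key(doc)
        if key in self._entries:
            raise ValueError(f"duplicate key for {self.fields}: {key}")
        self._entries[key] = doc

    def find(self, *values):
        """Look a document up directly by its indexed values."""
        return self._entries.get(tuple(values))


# A unique index on the combination of name and address.
index = UniqueIndex("name", "address")
card = {
    "name": "Andrew Glover",
    "cell": "703-555-5555",
    "address": "29210 Corporate Dr, Suite 100, Anywhere USA",
}
index.insert(card)

try:
    index.insert(dict(card))  # same name and address: should be rejected
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected)
```

The same structure serves both purposes described above: the dictionary lookup makes finding a card by name and address fast, and the membership check is what enforces uniqueness.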

Mongo, however, doesn't provide for constraints or triggers. Documents in Mongo are free to refer to each other. For example, a document in a contact_log collection could refer back to the _id of the business card above, thus providing a foreign-key-like link. But there is currently no way to specify corresponding actions to be taken should a related document be removed, such as removing all referencing documents as well, which you can do in a typical RDBMS. You can, of course, add this sort of logic in application code, and triggers are planned for a future release.
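As a rough idea of what that application-level logic might look like, here is a Python sketch of a cascading delete. The collections are plain lists of dictionaries rather than real Mongo collections, and both the function name remove_card_and_references and the card_id reference field are assumptions invented for the example.

```python
def remove_card_and_references(card_id, business_cards, contact_log):
    """Delete a business card and every contact_log entry that refers to it.

    Returns the two collections with the matching documents removed.
    'card_id' is a hypothetical field holding the foreign-key-like link.
    """
    remaining_cards = [c for c in business_cards if c["_id"] != card_id]
    remaining_log = [e for e in contact_log if e.get("card_id") != card_id]
    return remaining_cards, remaining_log


business_cards = [
    {"_id": "4efb731168ee6a18692d86cd", "name": "Andrew Glover"},
]
contact_log = [
    {"_id": 1, "card_id": "4efb731168ee6a18692d86cd", "note": "Left voicemail"},
    {"_id": 2, "card_id": "some-other-card", "note": "Sent brochure"},
]

business_cards, contact_log = remove_card_and_references(
    "4efb731168ee6a18692d86cd", business_cards, contact_log
)

print(business_cards)
print(contact_log)
```

Unlike a cascade rule declared in an RDBMS, nothing stops other code from deleting a card without calling this function, so the discipline has to live in the application.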

JSON documents in Mongo do not force particular data types on attribute values. That is, there is no need to define upfront the format of a particular attribute. The data can be a string, an integer, or even an object type, provided it makes sense. By default, data types in Mongo documents include string, integer, boolean, double, array, date, object ID (which you can see in action in the business card example above), binary data (similar to a blob) and regular expression, although support for these latter types varies by driver.
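The following Python sketch illustrates the point about untyped attributes: the same field can hold a string in one document and an integer in another. The describe_type helper is hypothetical and simply labels native Python values with the type names listed above; it is not part of any Mongo driver, and object IDs are left out because they require one.

```python
import datetime
import re


def describe_type(value):
    """Label a Python value with the corresponding document type name."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, (list, tuple)):
        return "array"
    if isinstance(value, datetime.datetime):
        return "date"
    if isinstance(value, (bytes, bytearray)):
        return "binary data"
    if isinstance(value, re.Pattern):
        return "regular expression"
    if isinstance(value, dict):
        return "object"
    return "unknown"


# The "cell" attribute is a string in one document and an integer in another.
first = {"name": "Andrew Glover", "cell": "703-555-5555"}
second = {"name": "Andrew Glover", "cell": 7035555555, "active": True}

for doc in (first, second):
    print({name: describe_type(value) for name, value in doc.items()})
```

Whether mixing types like this makes sense is left entirely to the application, since nothing in the store will object.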