Choosing a NoSQL Database - technology comparison matrix

Analytics Oct 17, 2014

The big data technology landscape consists of such a large number of choices, that often the most critical step in successfully implementing a solution is choosing the right platform that will address the requirements of the problem at hand, and that is sustainable in the long term.

With such a large number of choices though, doing a feature-wise comparison of all individual platforms is just too complex and time consuming. However, it is possible to group these options based on the data models they support.

The following technology comparison matrix compares different predominant data models, their capabilities, typical applications as well as limitations.

Data Model, Examples	Capabilities	Applications	Limitations
Key-Value BerkleyDB, MemcacheDB, Redis, DynamoDB	- The simplest model where each object is retrieved with a unique key, with values having no inherent model - Utilize in-memory storage to provide fast access with optional persistence - Other data models built on top of this model to provide more complex objects	- Applications requiring fast access to a large number of objects, such as caches or queues - Applications that require fast-changing data environments like mobile, gaming, online ads	- Cannot update subset of a value - Does not provide querying - As number of objects becomes large, generating unique keys could become complex
Document-oriented MongoDB, CouchDB, Apache Solr, Elastic Search	- Extension of key-value model, where value is a structured document - Documents can be highly complex, hierarchical data structures without requiring pre-defined “schema” - Supports queries on structured documents - Search platforms are also document-oriented	- Applications that need to manage a large variety of objects that differ in structure - Large product catalogs in e-commerce, customer profiles, content management applications	- No standard query syntax - Query performance not linearly scalable - Join queries across collections not efficient
Column-Oriented Cassandra, BigTable, HBase, Apache Accumulo	- Extension of key-value model, where the value is a set of columns (column-family) - A column can have multiple time-stamped versions - Columns can be generated at run-time and not all rows need to have all columns	- Storing a large number of time-stamped data like event logs, sensor data - Analytics that involve querying entire columns of data such as trends or time series analytics	- No join queries or sub-queries - Limited support for aggregation - Ordering is done per partition, specified at table creation time
Graph-oriented Neo4J, OrientDB, Apache Giraph, AllegroGraph	- Models graphs consisting of nodes and edges with properties (meta-data) describing them - Implement very fast graph traversal operations - Also support indexing of meta data to enable graph traversal combined with search queries	- Applications that deal with objects with a large number of inter-relations - Applications like social networking friends-networks, hierarchical role based permissions, complex decision trees, maps, network topologies	- Difficult to scale for large data sets for generic graphs - [Giraph](http://giraph.apache.org) uses the [Bulk Synchronous Parallel ](http://en.wikipedia.org/wiki/Bulk_synchronous_parallel)model to overcome some of the scalability limitations
Relational MySql, PostgreSQL, MariaDB, Oracle, SQL Server	- Conventional RDBMS structure consisting of fixed schema with [ACID](http://en.wikipedia.org/wiki/ACID) properties - Provides well documented and widely supported SQL syntax - Capable of complex queries including subqueries and joins	- Transactional data applications like ERP, CRM, Banking etc. - Applications where data volume is limited and schema are by and large fixed	- Lacks horizontal scalability and hence limited in handling “big data” - Not efficient at handling complex multi-level nested data - Cannot handle “unstructured” data where structure is not known at design time

Choosing the right platform would involve mapping the requirements in terms of the data model, and the type of querying and data access patterns required. The table above includes a brief overview of common features supported by various engines and it can help shorten the list of options. However, it surely requires a much finer analysis of the individual engines to make a final choice for any application.

It may seem like there is a glaring omission of Hadoop in the above table. Hadoop is so widely known in the context of big data and NoSql, that often it is mistaken for a database.

Hadoop fundamentally consists of:

Hadoop Distributed File System (HDFS) – a distributed file storage system with built-in replication and fault tolerance,
Hadoop YARN – a framework for job scheduling and cluster resource management
Map Reduce – a distributed programming model to process a large number of objects

There are database engines that are built on top of Hadoop such as HBase, Hive, Giraph etc. that provide different database models, and these are included in the above table.

Please feel free to share your experience in selecting the right database engine for your application.