The big data technology landscape consists of such a large number of choices, that often the most critical step in successfully implementing a solution is choosing the right platform that will address the requirements of the problem at hand, and that is sustainable in the long term.
With such a large number of choices though, doing a feature-wise comparison of all individual platforms is just too complex and time consuming. However, it is possible to group these options based on the data models they support.
The following technology comparison matrix compares different predominant data models, their capabilities, typical applications as well as limitations.
|**Data Model, Examples**||**Capabilities**||**Applications**||**Limitations**|
|**Key-Value**||- The simplest model where each object is retrieved with a unique key, with values having no inherent model - Utilize in-memory storage to provide fast access with optional persistence - Other data models built on top of this model to provide more complex objects||- Applications requiring fast access to a large number of objects, such as caches or queues - Applications that require fast-changing data environments like mobile, gaming, online ads||- Cannot update subset of a value - Does not provide querying - As number of objects becomes large, generating unique keys could become complex|
|**Document-oriented**||- Extension of key-value model, where value is a structured document - Documents can be highly complex, hierarchical data structures without requiring pre-defined “schema” - Supports queries on structured documents - Search platforms are also document-oriented||- Applications that need to manage a large variety of objects that differ in structure - Large product catalogs in e-commerce, customer profiles, content management applications||- No standard query syntax - Query performance not linearly scalable - Join queries across collections not efficient|
|**Column-Oriented**||- Extension of key-value model, where the value is a set of columns (column-family) - A column can have multiple time-stamped versions - Columns can be generated at run-time and not all rows need to have all columns||- Storing a large number of time-stamped data like event logs, sensor data - Analytics that involve querying entire columns of data such as trends or time series analytics||- No join queries or sub-queries - Limited support for aggregation - Ordering is done per partition, specified at table creation time|
|**Graph-oriented**||- Models graphs consisting of nodes and edges with properties (meta-data) describing them - Implement very fast graph traversal operations - Also support indexing of meta data to enable graph traversal combined with search queries||- Applications that deal with objects with a large number of inter-relations - Applications like social networking friends-networks, hierarchical role based permissions, complex decision trees, maps, network topologies||- Difficult to scale for large data sets for generic graphs - [Giraph](http://giraph.apache.org) uses the [Bulk Synchronous Parallel ](http://en.wikipedia.org/wiki/Bulk_synchronous_parallel)model to overcome some of the scalability limitations|
|**Relational**||- Conventional RDBMS structure consisting of fixed schema with [ACID](http://en.wikipedia.org/wiki/ACID) properties - Provides well documented and widely supported SQL syntax - Capable of complex queries including subqueries and joins||- Transactional data applications like ERP, CRM, Banking etc. - Applications where data volume is limited and schema are by and large fixed||- Lacks horizontal scalability and hence limited in handling “big data” - Not efficient at handling complex multi-level nested data - Cannot handle “unstructured” data where structure is not known at design time|
It may seem like there is a glaring omission of Hadoop in the above table. Hadoop is so widely known in the context of big data and NoSql, that often it is mistaken for a database.
Hadoop fundamentally consists of:
- Hadoop Distributed File System (HDFS) – a distributed file storage system with built-in replication and fault tolerance,
- Hadoop YARN – a framework for job scheduling and cluster resource management
- Map Reduce – a distributed programming model to process a large number of objects
Please feel free to share your experience in selecting the right database engine for your application.