MongoDB Hadoop Connector

Big Data Apr 22, 2012

10gen, the creator of MongoDB, has announced the availability of version 1.0 of the MongoDB-Hadoop connector.

The core feature of the Connector is to provide the ability to read MongoDB data into Hadoop MapReduce jobs, as well as writing the results of MapReduce jobs out to MongoDB. Users may choose to use MongoDB reads and writes together or separately, as best fits each use case. Our goal is to continue to build support for the components in the Hadoop ecosystem which our users find useful, based on feedback and requests.

For this initial release, we have also provided support for:

writing to MongoDB from Pig (thanks to Russell Jurney for all of his patches and improvements to this feature)

writing to MongoDB from the Flume distributed logging system

using Python to MapReduce to and from MongoDB via Hadoop Streaming.

Though it is quite early in its evolution, the connector introduces an exciting possibility: writing MapReduce scripts for MongoDB using Hadoop Streaming. This is surely an important feature and, as the 10gen team explains, it was also one of the toughest parts of the connector to build:

Hadoop Streaming was one of the toughest features for the 10gen team to build. To that end, look for a more technical post on the MongoDB blog in the next week or two detailing the issues we encountered and how to utilize this feature effectively.
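To give a feel for what a streaming-style MapReduce script over documents looks like, here is a minimal sketch in plain Python. It does not use the connector's actual streaming API, which the announcement does not describe. The `mapper`, `reducer`, and `run_job` functions, the `time_zone` field, and the sample documents are all hypothetical, and the shuffle step that Hadoop would normally perform is simulated in-process.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    """Emit one (key, 1) pair per input document."""
    for doc in documents:
        yield doc["time_zone"], 1


def reducer(key, values):
    """Collapse all values for one key into a single output document."""
    return {"_id": key, "count": sum(values)}


def run_job(documents):
    """Simulate the map, shuffle/sort, and reduce phases in one process."""
    # Sorting by key stands in for Hadoop's shuffle between map and reduce.
    pairs = sorted(mapper(documents), key=itemgetter(0))
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical input documents, standing in for a MongoDB collection.
sample_docs = [
    {"user": "a", "time_zone": "UTC"},
    {"user": "b", "time_zone": "EST"},
    {"user": "c", "time_zone": "UTC"},
]

results = run_job(sample_docs)
```

In a real job, the input documents would come from MongoDB and the reducer's output documents would be written back to it, with Hadoop distributing the map and reduce phases across the cluster.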

We are really looking forward to using the connector. Using MongoDB as a document store and writing MapReduce scripts to process its documents sounds promising!