In this article, relational database experts David DeWitt and Michael Stonebraker compare MapReduce to traditional relational database management systems (RDBMSs) and find MapReduce wanting. They make some strong points in favor of relational databases, but the comparison is not appropriate. By the time I finished the article I was convinced that the authors don’t understand MapReduce, the idea of data in the cloud, or why programmers might be excited about non-RDBMS ways to manage data.
The article makes five points:
1. MapReduce is a step backwards in database access
They’re right about that, but MapReduce is not a database system. The Wikipedia article describing MapReduce that the authors link to never mentions the word “database,” and only one of its footnotes points to a paper about MapReduce applications in relational data processing. MapReduce is not a data storage or management system; it’s an algorithmic technique for the distributed processing of large amounts of data. Google’s web crawler is a real-life example. I’m not sure what MapReduce has to do with schemas or separating the data from the application, but apparently MapReduce is just Codasyl all over again. The authors know relational databases, but I wonder if they know how useful a hash map of key:value pairs can be.
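To make that concrete, here is a minimal single-process sketch of the technique in Python, using the canonical word-count example. The function names and the toy `run()` harness are mine, not any real framework’s API; a production system runs the same two functions across thousands of machines, but the algorithmic shape is just this:

```python
from collections import defaultdict

def map_words(doc_id, text):
    # Map: emit a key:value pair for every word in the input.
    for word in text.lower().split():
        yield (word, 1)

def reduce_words(word, counts):
    # Reduce: fold together every value that shares a key.
    return (word, sum(counts))

def run(documents):
    # Single-process stand-in for the shuffle step: group the
    # mapped pairs by key before handing each group to reduce.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_words(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_words(k, v) for k, v in grouped.items())

print(run({"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Notice there’s no storage, no schema, no query language anywhere in that sketch: just a function applied to data and a function applied to the grouped results.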
MapReduce has the same relationship to RDBMSs as my motorcycle has to a snowplow — it’s a step backwards in snowplow technology if you look at it that way.
2. MapReduce is a poor implementation
The argument is that MapReduce doesn’t have indexes, so it is inferior to RDBMSs. But MapReduce is a way to generate indexes from a large volume of data; it’s not a data storage and retrieval system. One of the applications mentioned in the Wikipedia article is inverted index construction, which is exactly what Google’s web crawler does. MapReduce is used to process a huge amount of unstructured data (like the world wide web) and generate clean, structured data, perhaps destined for a relational database system.
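Inverted index construction shows the relationship plainly: the map step emits (word, document) pairs from raw pages, and the reduce step turns each group into a postings list. A sketch along the same lines as before, again with a toy single-process harness of my own devising:

```python
from collections import defaultdict

def map_page(doc_id, text):
    # Map: emit (word, doc_id) for every distinct word on a page.
    for word in set(text.lower().split()):
        yield (word, doc_id)

def reduce_postings(word, doc_ids):
    # Reduce: the postings list for a word is the sorted set of
    # documents containing it. This is structured output, ready to
    # be bulk-loaded into a database or served by a search engine.
    return (word, sorted(doc_ids))

def build_index(pages):
    grouped = defaultdict(list)
    for doc_id, text in pages.items():
        for word, d in map_page(doc_id, text):
            grouped[word].append(d)
    return dict(reduce_postings(w, ids) for w, ids in grouped.items())

print(build_index({"p1": "cloud data", "p2": "cloud computing"}))
# e.g. {'cloud': ['p1', 'p2'], 'data': ['p1'], 'computing': ['p2']}
```

MapReduce produces the index; something else (a database, a search engine) stores and queries it. Criticizing it for lacking indexes gets the pipeline backwards.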
The other implementation criticisms concern performance: “When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to ‘pull’ each of its input files from the nodes on which the map instances were run.” If the authors think reduce must rely on FTP, I have to wonder how well they understand distributed processing and distributed file systems. The Wikipedia article says “each [reduce] node is expected to report back periodically with completed work and status updates,” pretty much the opposite of the “pull” model DeWitt and Stonebraker criticize. When they write “we have serious doubts about how well MapReduce applications can scale,” I wonder if they have any idea how Google works.
3. MapReduce is not novel
This just comes off as sour grapes. I don’t know who the “MapReduce community” is or what claims they might be making. I thought MapReduce was more in the realm of algorithms, like QuickSort. I didn’t know that it had a community or advocates or was trying to impose a paradigm shift. DeWitt and Stonebraker are right that hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what? The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers, pretty much the opposite of the ideal RDBMS ecosystem.
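They’re not wrong that the ingredients are old. The routing step, for instance, is plain hash partitioning, and a sketch of it fits in a few lines (the function name and reducer count here are invented for illustration). What’s new is running this across thousands of commodity boxes that can drop out mid-job:

```python
import zlib

def partition(key, num_reducers):
    # Route a mapped key to a reducer with a stable hash, so every
    # machine computes the same destination independently, with no
    # central coordinator. (Python's built-in hash() is randomized
    # between runs, so a stable function like CRC32 is used instead.)
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for key in ["the", "quick", "brown", "fox"]:
    print(key, "-> reducer", partition(key, 4))
```

The hashing is decades old; the cheap, failure-tolerant fleet it runs on is the part the RDBMS world never optimized for.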
4. MapReduce is missing features
5. MapReduce is incompatible with the DBMS tools
Two laundry lists of features missing from MapReduce that sum up the disconnect between what MapReduce actually is and what DeWitt and Stonebraker think it is. It’s as if they downloaded an old version of dBase and thought it was a MapReduce program.
RDBMSs are great tools for managing large sets of structured data, enforcing integrity, optimizing queries, and separating the data structure and schema from the application. I’ve written before that programmers should learn how to use RDBMSs. But RDBMSs aren’t the only way to process and manage data, and they aren’t the best tool for every data processing job. My understanding of MapReduce is that it fits between a pile of unstructured data and an RDBMS, not in place of an RDBMS.
For certain applications (web crawling and log analysis are two that come to mind), the ability to process a huge volume of data quickly matters more than guaranteeing 100% data integrity and completeness. Relational databases dominate data management because they can make those guarantees, but the guarantees have their own costs and limitations. If I need to churn through gigabytes of web server log files to generate some numbers for a business decision now, I don’t care if a few of the log entries are ignored; I just need to filter and categorize the data fast. There are relational solutions for that kind of problem (Stonebraker’s own StreamBase is one of them), but even those don’t fill the same need as MapReduce.
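That tolerance for dropped records is easy to express in the map step itself. A quick sketch of the filter-and-categorize job I have in mind, with invented log lines and my own helper names:

```python
import re
from collections import Counter

# Rough pattern for common-format web server log lines (the sample
# lines below are made up for illustration).
LOG_LINE = re.compile(r'"\w+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def map_log(line):
    # Map: parse one entry; malformed lines are silently dropped,
    # which is acceptable when an approximate answer now beats a
    # perfect answer later.
    m = LOG_LINE.search(line)
    if m:
        yield (m.group("status"), 1)

def analyze(lines):
    # Reduce, collapsed into a Counter: total hits per status code.
    counts = Counter()
    for line in lines:
        for status, n in map_log(line):
            counts[status] += n
    return counts

logs = [
    '1.2.3.4 - - [10/Jan/2008] "GET /index.html HTTP/1.1" 200 5123',
    '1.2.3.5 - - [10/Jan/2008] "GET /missing HTTP/1.1" 404 312',
    'a garbled line that will not parse',
]
print(analyze(logs))  # Counter({'200': 1, '404': 1})
```

No transaction log, no locking, no recovery protocol; the bad line just disappears, and the answer arrives while it’s still useful.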
The authors do make a good point that mixing the structure of the data with application code is a step backwards, but it’s not a very big step given how most applications are written around databases. Ideally schema changes wouldn’t disrupt applications, and multiple applications could have different views of, and allowed operations on, the same data. However, I don’t see that MapReduce does anything to make the situation worse.
What the authors really want to gripe about is distributed “cloud” data management systems like Amazon’s SimpleDB; in fact, if you change “MapReduce” to “SimpleDB,” the original article almost makes sense.