Relational Database Experts Jump The MapReduce Shark

In this article relational database experts David DeWitt and Michael Stonebraker compare MapReduce to traditional relational database systems (RDBMSs) and find MapReduce wanting. They make some strong points in favor of relational databases, but the comparison is not appropriate. By the time I finished the article I was convinced that the authors don't understand MapReduce, the idea of data in the cloud, or why programmers might be excited about non-RDBMS ways to manage data.

The article makes five points:

1. MapReduce is a step backwards in database access
They’re right about that, but MapReduce is not a database system. In the Wikipedia article describing MapReduce that the authors link to from their article, the word “database” isn’t mentioned, and only one of the footnotes points to a paper about MapReduce applications in relational data processing. MapReduce is not a data storage or management system — it’s an algorithmic technique for the distributed processing of large amounts of data. Google’s web crawler is a real-life example. I’m not sure what MapReduce has to do with schemas or separating the data from the application, but apparently MapReduce is just Codasyl all over again. The authors know relational databases, but I wonder if they know how useful a hash map of key:value pairs can be.

MapReduce has the same relationship to RDBMSs as my motorcycle has to a snowplow — it’s a step backwards in snowplow technology if you look at it that way.
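To make that concrete, here's a minimal sketch of the technique itself — a toy word-count job in plain Python, not any real framework's API. The function names and the single-process "shuffle" are mine; in a real deployment the map and reduce calls would run on different machines:

```python
from collections import defaultdict
from itertools import chain

# Map: turn one input record into a list of (key, value) pairs.
def map_words(document):
    return [(word, 1) for word in document.split()]

# Reduce: combine all the values that share a key into one result.
def reduce_counts(word, counts):
    return (word, sum(counts))

def mapreduce(documents, mapper, reducer):
    # Shuffle: group intermediate pairs by key — exactly the
    # hash map of key:value pairs mentioned above.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(d) for d in documents):
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

counts = mapreduce(["the map the reduce", "the shuffle"], map_words, reduce_counts)
# counts["the"] == 3
```

The point of the pattern is that `map_words` and `reduce_counts` are pure functions over key:value pairs, so the framework is free to run them on thousands of machines in parallel — no schema, no query language, no database in sight.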

2. MapReduce is a poor implementation
The argument is that MapReduce doesn’t have indexes, so it is inferior to RDBMSs. MapReduce is one way to generate indexes from a large volume of data, but it’s not a data storage and retrieval system. One of the applications mentioned in the Wikipedia article is inverted index construction — what Google’s web crawler is doing. MapReduce is used to process a huge amount of unstructured data (like the world wide web) and generate sanitary structured data, perhaps for a relational database system.
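The inverted-index case is easy to sketch with the same map/shuffle/reduce shape. This is an illustrative toy, not Google's implementation — the helper name and the `{doc_id: text}` input format are my assumptions:

```python
from collections import defaultdict

def build_inverted_index(pages):
    # pages: {doc_id: text}.
    # Map: emit a (word, doc_id) pair for every word on a page.
    pairs = [(word, doc_id)
             for doc_id, text in pages.items()
             for word in text.split()]
    # Reduce: for each word, collect the set of documents containing it.
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(docs) for word, docs in index.items()}

index = build_inverted_index({"a.html": "map reduce", "b.html": "reduce shark"})
# index["reduce"] == ["a.html", "b.html"]
```

The output — word to sorted list of documents — is exactly the kind of sanitary structured data you might then load into a real storage system for fast lookups.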

The other implementation criticisms concern performance: “When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to ‘pull’ each of its input files from the nodes on which the map instances were run.” I don’t think the authors understand distributed processing and distributed file systems when they think reduce must rely on FTP. The Wikipedia article says “each [reduce] node is expected to report back periodically with completed work and status updates,” pretty much the opposite of the “pull” DeWitt and Stonebraker criticize. When they write “we have serious doubts about how well MapReduce applications can scale” I wonder if they have any idea how Google works.

3. MapReduce is not novel
This just comes off as sour grapes. I don’t know who the “MapReduce community” is or what claims they might be making. I thought MapReduce was more in the realm of algorithms, like QuickSort. I didn’t know that it had a community or advocates or was trying to impose a paradigm shift. DeWitt and Stonebraker are right that hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what? The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers, pretty much the opposite of the ideal RDBMS ecosystem.

4. MapReduce is missing features
5. MapReduce is incompatible with the DBMS tools
Two laundry lists of features missing from MapReduce that sum up the disconnect between what MapReduce actually is and what DeWitt and Stonebraker think it is. It’s as if they downloaded an old version of dBase and thought it was a MapReduce program.

RDBMSs are great tools for managing large sets of structured data, enforcing integrity, optimizing queries, and separating the data structure and schema from the application. I’ve written before that programmers should learn how to use RDBMSs. But RDBMSs aren’t the only way to process and manage data, and they aren’t the best tool for every data processing job. My understanding of MapReduce is that it would fit between a pile of unstructured data and an RDBMS, not act as a replacement for an RDBMS.

For certain applications — web crawling and log analysis are two that come to mind — the ability to process a huge volume of data quickly is more important than guaranteeing 100% data integrity and completeness. Relational databases dominate data management because they can make those guarantees. But those guarantees have their own costs and limitations. If I need to churn through gigabytes of web server log files to generate some numbers to make a business decision now, I don’t care if a few of the log entries are ignored; I just need to filter and categorize the data fast. There are relational solutions for that kind of problem — Stonebraker’s own StreamBase is one of them — but even that doesn’t fill the same need as MapReduce.
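That tolerance for imperfect data is the whole trade-off, and it's easy to illustrate. A hypothetical log categorizer (the common-log-style format and field layout here are my assumptions, not from any real pipeline) simply skips lines that don't parse instead of aborting the job:

```python
import re
from collections import Counter

# Matches the HTTP status code that follows the quoted request string
# in a common-log-style entry, e.g. ... "GET / HTTP/1.1" 200 512
STATUS = re.compile(r'" (\d{3}) ')

def status_histogram(lines):
    hist = Counter()
    for line in lines:
        match = STATUS.search(line)
        if match:                  # malformed entries are ignored, not fatal
            hist[match.group(1)] += 1
    return hist

logs = [
    '1.2.3.4 - - [18/Jan/2008] "GET / HTTP/1.1" 200 512',
    '5.6.7.8 - - [18/Jan/2008] "GET /x HTTP/1.1" 404 0',
    'garbage line with no status code',
]
# status_histogram(logs) == Counter({"200": 1, "404": 1})
```

An RDBMS would reject the garbage line at load time and make you decide what to do about it; for a quick business answer over gigabytes of logs, silently dropping it is the right behavior, not a bug.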

The authors do make a good point that mixing the structure of the data with application code is a step backwards, but it’s not a very big step given how most applications are written around databases. Ideally schema changes wouldn’t disrupt applications, and multiple applications could have different views and allowed operations on the data. However, I don’t see that MapReduce does anything to make the situation worse.

What the authors really want to gripe about is distributed “cloud” data management systems like Amazon’s SimpleDB; in fact if you change “MapReduce” to “SimpleDB” the original article almost makes sense.

35 thoughts on “Relational Database Experts Jump The MapReduce Shark”

  1. jaaron

    Thank you for writing this. The original article was _horrible_. I’m glad someone took the time to clear up the obvious mistakes.

  2. Pingback: |** urls that purr **|

  3. Pingback: | The social sites' most interesting urls

  4. Pingback: My daily readings 01/18/2008 « Strange Kite

  5. Franchu

    When I was reading the original post I was wondering if the authors’ understanding was totally disconnected from what MapReduce is, or whether I was missing something.

    It was very gratifying to read your post and see that I was not the only one thinking what you brilliantly expressed.

  6. James Urquhart

    When reading the article a few minutes ago, a lot of it came across as utter hogwash. Especially when they mixed MapReduce up with being a database.

    Just another case of writing without thought, I guess.

  7. Pingback: » 18 Jan 2008

  8. planetmcd

    I couldn’t disagree with you more. If there is anything “For Your Eyes Only” has taught me, it’s that adding spikes to motorcycle wheels makes them deadly and, by improving their traction, obviates the need for snow plows.

  9. kbob

    What happened? I don’t know DeWitt, but Stonebraker is a smart guy. He’s done a lot of good work over the years and isn’t in the habit of writing articles that are so obviously flawed.

    I think the original article is a forgery. (-:

  10. Tom Ritchford

    Well, I mistakenly left this comment on their site instead of yours — but here it is for you, too…

    Great article! I use MapReduce every day and *for what it does* it’s the best.

    There are two advantages that you missed.

    1. If you set it up properly, the records get sorted and appear in the reducer in sorted order, for free!

    but even more important:

    2. MapReduce is extremely light.

    It doesn’t mean that a MapReduce won’t use a lot of machines — it means that you can run a MapReduce you already have on brand-new data in a few minutes, and you can write a brand-new one, run it and get good output in an afternoon — because you don’t have to load up a database.

  11. Sam

    Well, I think you’re missing the point of their argument – they claim Map-Reduce is being taught in schools (e.g. Berkeley, where Stonebraker teaches) as a valid database methodology. There’s nothing wrong with Map-Reduce, it just seems some folks are ‘Google-blind’ and think that since Google does it, there’s no reason to use a typical RDBMS.

    For me, the authors jump the shark by not showing the whole picture. Google’s Map-Reduce needs GFS and a custom scheduler to make things reliable (Consistent and Durable). Why not just focus on how much extra work is necessary to attempt ACID?

  12. Tom Ritchford

    “There’s nothing wrong with Map-Reduce, it just seems some folks are ‘Google-blind’ and think since Google does it, then there’s no reason to use typical RDBMS.”

    I don’t believe even one person has ever tried to do that — there’s such a mismatch between the technologies. In a MapReduce, there really isn’t a way to read and write one record! MapReduce is only a batch job.

    If you can come up with a plausible scenario where someone would mistakenly use MapReduce instead of a database, please lay it on us.

  13. Joe

    Other articles I’m inspired to write based on the referenced article:

    I tried using MapReduce to create a website, and it sucks compared to markup with CSS. It doesn’t have any concept of how to style a website. MapReduce is a serious step backward in terms of web design.

    I also tried to have MapReduce babysit my kids, and I came back half an hour later to find that it was just sitting there crunching data, and wasn’t watching them at all. This thing can’t do anything at all.

    Also, compared to a standard hammer, this MapReduce thing is really crappy at pounding nails into things.

  14. Nathan Fiedler

    Very pleased to see, based on the comments on the original article, and your blog, that _no one_ was fooled by these supposed experts. They seem to lack even a basic understanding of how MR is used. I feel they must have read the white paper and thought they knew more than anyone else who had read the same paper. Thank you for the rebuttal.

  15. jeff hammerbacher

    one may certainly have issues with the article conflating data management with query execution strategies but you make zero interesting claims while introducing actual inaccuracies yourself:

    I don’t think the authors understand distributed processing and distributed file systems when they think reduce must rely on FTP. The Wikipedia article says “each [reduce] node is expected to report back periodically with completed work and status updates,” pretty much the opposite of the “pull” DeWitt and Stonebraker criticize

    the wikipedia article is referring to the reduce worker reporting task metadata back to the master. separately, the reduce worker must obtain data to process from the map workers. because no ordering guarantees are made on the data within the input splits, keys for a specific reduce worker could come from any map worker. hence the process of copying the data from the map workers to the reduce worker is intensive for large datasets and could possibly leverage ftp, though it more likely leverages a custom data transmission protocol similar to UDT.

  16. William Pietri

    Very nicely put.

    I’m amazed that anybody could sincerely write an article built on the premise that the people at Google are fools who don’t understand scaling computation or managing large amounts of data, but that’s the overall feeling I took away from it.

    The part that really pains me is that some of Stonebraker’s recent papers make clear that the modern SQL database needs to be thrown out and a fresh start made. I’d think he’d look at Google and say, “Hey, maybe they’re doing something right.”

    Apparently they haven’t stepped back and realized that databases aren’t ends in themselves, but just one of many tools you might use to solve real-world problems. I hope they get there!

  17. Morris

    From referenced article: “and must use a protocol like FTP to “pull” each of its input files from the nodes on which the map instances were run.”

    Plonkers – sounds like they are just making stuff up… One suspects Google might have actually *thought* about the issue and either have a solution or a compromise to meet their needs… Perhaps this is part of what the Google File System helps manage?

  18. mypalmike

    After I read that bizarre article a couple of times, I finally realized what was going on. The authors believe MapReduce is a distributed query processor, with unstructured data as backend storage. They picture this database where, every time you run a query, MapReduce kicks in and spits out the results. Reread the article with that in mind, and it starts to make sense. So, they’re not completely insane, just writing about something they fundamentally misunderstand.

  19. Pingback: afongen » links for 2008-01-19

  20. c4c1e4c4

    This is just astounding. I’ve never seen a claim that MapReduce is a replacement for RDBMS. I can’t imagine why DeWitt is attacking the idea.

  21. Pingback: MapReduce Reading

  22. zhaolinjnu

    I don’t think MapReduce and RDBMSs are replacements for each other. And “MapReduce is a step backwards in database access” is ridiculous, because the areas where the two technologies apply are very different.

  23. Pingback: Mapreduce: a major disruption to database dogma

  24. Offbeatmammal

    I love how people take their “baby” and infer that everything else is inferior. Many years ago I used an inverted hierarchy based database… very powerful, and only now with SQL 2008 are we starting to see some of those capabilities appear in mainstream RDBMSs.
    MapReduce is a different animal from an RDBMS (and at least as fit for purpose; an RDBMS is also ideally suited to certain types of processing), and it’s a shame insecure academics feel the need to misrepresent the strengths and weaknesses and pass it off as research.

  25. Colyn

    Greg, I’m not sure how you can use the colloquialism of “jumping the shark” in this context. Please explain. 🙂

  26. Pingback: Web2NewYork (beta) | Blog

  27. Pingback: facts about sharks

  28. Pingback: The art of Information Engineering » MapReduce and Scale

  29. Pingback: MapReduce patented « Drawing Blanks

  30. Pingback: MapReduce Introduction | 采石工人的大教堂

Comments are closed.