Typical Programmer

A glittering performance of rare perception



Relational Database Experts Jump The MapReduce Shark

17 Jan 2008

In this article, relational database experts David DeWitt and Michael Stonebraker compare MapReduce to traditional relational database systems (RDBMSs) and find MapReduce wanting. They make some strong points in favor of relational databases, but the comparison is not appropriate. When I finished reading the article I was left thinking that the authors did not understand MapReduce, or the idea of data in the cloud, or why programmers might be excited about non-RDBMS ways to manage data.

The article makes five points:

1. MapReduce is a step backwards in database access

They’re right about that, but MapReduce is not a database system. In the Wikipedia article describing MapReduce that the authors link to, the word “database” isn’t mentioned, and only one of the footnotes points to a paper about MapReduce applications in relational data processing. MapReduce is not a data storage or management system — it’s an algorithmic technique for the distributed processing of large amounts of data. Google’s web crawler is a real-life example. I’m not sure what MapReduce has to do with schemas or separating the data from the application, but apparently MapReduce is just Codasyl all over again. The authors know relational databases, but I wonder if they know how useful a hash map of key:value pairs can be.
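To make the distinction concrete, here is a minimal, single-process sketch of the map/shuffle/reduce shape: just functions over key:value pairs, with no storage engine anywhere. The function names and the word-count example are my own illustration, not any particular framework's API.

    from collections import defaultdict

    # A toy, single-process sketch of the MapReduce idea: map emits
    # (key, value) pairs, the framework groups them by key, and reduce
    # folds each group. Names are illustrative, not a real API.

    def map_phase(documents):
        # Emit (word, 1) for every word in every document.
        for doc in documents:
            for word in doc.split():
                yield (word.lower(), 1)

    def shuffle(pairs):
        # Group values by key -- the hash map of key:value pairs.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Sum the counts for each word.
        return {word: sum(counts) for word, counts in groups.items()}

    docs = ["the map step emits pairs", "the reduce step folds pairs"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'the': 2, 'map': 1, 'step': 2, 'emits': 1, 'pairs': 2, 'reduce': 1, 'folds': 1}

In a real cluster the map and reduce calls run on many machines and the shuffle moves data between them, but nothing in this shape stores, indexes, or manages anything.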

MapReduce has the same relationship to RDBMSs as my motorcycle has to a snowplow — it’s a step backwards in snowplow technology if you look at it that way.

2. MapReduce is a poor implementation

The argument is that MapReduce doesn’t have indexes, so it is inferior to RDBMSs. MapReduce is one way to generate indexes from a large volume of data, but it’s not a data storage and retrieval system. One of the applications mentioned in the Wikipedia article is inverted index construction — what Google’s web crawler is doing. MapReduce is used to process a huge amount of unstructured data (like the world wide web) and generate sanitized, structured data, perhaps for a relational database system.
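Inverted index construction is a good illustration of that role, so here is a rough sketch (the shape and all names are mine, not Google's): map each document to (word, doc_id) pairs, group by word, and reduce each group into a posting list.

    from collections import defaultdict

    # Sketch of inverted-index construction in the map/group/reduce shape:
    # unstructured text goes in, structured (word -> doc ids) rows come out.
    # All names are illustrative; this is not any particular framework.

    def map_index(doc_id, text):
        # Emit (word, doc_id) for each distinct word in one document.
        for word in set(text.lower().split()):
            yield (word, doc_id)

    def build_inverted_index(corpus):
        postings = defaultdict(set)
        for doc_id, text in corpus.items():
            for word, d in map_index(doc_id, text):   # map
                postings[word].add(d)                 # group by key
        # reduce: a sorted posting list per word
        return {word: sorted(ids) for word, ids in postings.items()}

    corpus = {
        1: "relational databases enforce schemas",
        2: "mapreduce processes unstructured data",
        3: "mapreduce feeds structured data to databases",
    }
    index = build_inverted_index(corpus)
    print(index["mapreduce"])   # [2, 3]
    print(index["databases"])   # [1, 3]

The output rows are exactly the kind of structured data you might then hand to an RDBMS, which is the point: MapReduce feeds databases rather than replacing them.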

The other implementation criticisms concern performance: “When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to ‘pull’ each of its input files from the nodes on which the map instances were run.” I don’t think the authors understand distributed processing and distributed file systems when they think reduce must rely on FTP. The Wikipedia article says “each [reduce] node is expected to report back periodically with completed work and status updates,” pretty much the opposite of the “pull” DeWitt and Stonebraker criticize. When they write “we have serious doubts about how well MapReduce applications can scale” I wonder if they have any idea how Google works.

3. MapReduce is not novel

This just comes off as sour grapes. I don’t know who the “MapReduce community” is or what claims they might be making. I thought MapReduce was more in the realm of algorithms, like QuickSort. I didn’t know that it had a community or advocates or was trying to impose a paradigm shift. DeWitt and Stonebraker are right that hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what? The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers, pretty much the opposite of the ideal RDBMS ecosystem.

4. MapReduce is missing features

5. MapReduce is incompatible with the DBMS tools

Points 4 and 5 are two laundry lists of features missing from MapReduce that sum up the disconnect between what MapReduce actually is and what DeWitt and Stonebraker think it is. It’s as if they downloaded an old version of dBase and thought it was a MapReduce program.

RDBMSs are great tools for managing large sets of structured data, enforcing integrity, optimizing queries, and separating the data structure and schema from the application. I’ve written before that programmers should learn how to use RDBMSs. But RDBMSs aren’t the only way to process and manage data, and they aren’t the best tool for every data processing job. My understanding of MapReduce is that it would fit between a pile of unstructured data and an RDBMS, not serve as a replacement for an RDBMS.

For certain applications — web crawling and log analysis are two that come to mind — the ability to process a huge volume of data quickly is more important than guaranteeing 100% data integrity and completeness. Relational databases dominate data management because they can make those guarantees. But those guarantees have their own costs and limitations. If I need to churn through gigabytes of web server log files to generate some numbers to make a business decision now, I don’t care if a few of the log entries are ignored; I just need to filter and categorize the data fast. There are relational solutions for that kind of problem — Stonebraker’s own StreamBase is one of them — but even that doesn’t fill the same need as MapReduce.
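That log-crunching case is easy to sketch. The snippet below (the log format, field positions, and names are my assumptions for illustration) maps each line to a (status class, 1) pair, silently drops lines that don't parse, and reduces with a counter; no schema, no loading step, no transactions.

    from collections import Counter

    # Quick-and-dirty log categorization in the map/reduce shape.
    # The common-log-format parsing below is an assumption for illustration.

    def map_line(line):
        # Return ('2xx', 1), ('4xx', 1), etc., or None for unparseable lines.
        fields = line.split()
        if len(fields) < 9 or not fields[8].isdigit():
            return None                     # ignore malformed entries
        return (fields[8][0] + "xx", 1)

    def summarize(lines):
        counts = Counter()
        for line in lines:
            pair = map_line(line)           # map
            if pair is not None:
                counts[pair[0]] += pair[1]  # group and reduce in one pass
        return counts

    logs = [
        '1.2.3.4 - - [17/Jan/2008:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '1.2.3.5 - - [17/Jan/2008:10:00:01 +0000] "GET /x HTTP/1.1" 404 128',
        'garbled line that gets dropped',
    ]
    print(summarize(logs))   # Counter({'2xx': 1, '4xx': 1})

Split the input across a pile of cheap machines and sum the per-machine counters and you have the whole idea; losing a handful of lines along the way doesn't change the business decision.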

The authors do make a good point that mixing the structure of the data with application code is a step backwards, but it’s not a very big step given how most applications are written around databases. Ideally, schema changes wouldn’t disrupt applications, and multiple applications could have different views and allowed operations on the data. However, I don’t see that MapReduce does anything to make the situation worse.

What the authors really want to gripe about is distributed “cloud” data management systems like Amazon’s SimpleDB; in fact if you change “MapReduce” to “SimpleDB” the original article almost makes sense.

Comments

jaaron, 18 January 2008 at 2:15 am

Thank you for writing this. The original article was horrible. I’m glad someone took the time to clear up the obvious mistakes.

Ben, 18 January 2008 at 5:09 am

Excellent and well written article. I don’t know how the “experts” got it so wrong!

Franchu, 18 January 2008 at 5:27 am

When I was reading the original post I was wondering if the authors’ understanding was totally disconnected from what MapReduce is, or whether I was missing something.

Reading your post was very gratifying to see that I was not the only one thinking what you brilliantly expressed.

James Urquhart, 18 January 2008 at 6:00 am

When reading the article a few minutes ago, a lot of it came across as utter hogwash. Especially when they mixed MapReduce up with being a database.

Just another case of writing without thought, I guess.

planetmcd, 18 January 2008 at 8:50 am

I couldn’t disagree with you more. If there is anything “For Your Eyes Only” has taught me, it’s that adding spikes to motorcycle wheels makes them deadly and, by improving their traction, obviates the need for snowplows.

kbob, 18 January 2008 at 9:01 am

What happened? I don’t know DeWitt, but Stonebraker is a smart guy. He’s done a lot of good work over the years and isn’t in the habit of writing articles that are so obviously flawed.

I think the original article is a forgery. (-:

Tom Ritchford, 18 January 2008 at 9:16 am

Well, I mistakenly left this comment on their site instead of yours — but here it is for you, too…

Great article! I use MapReduce every day and for what it does it’s the best.

There are two advantages that you missed.

  1. If you set it up properly, the records get sorted and appear in the reducer in sorted order, for free!

but even more important:

  2. MapReduce is extremely light.

It doesn’t mean that a MapReduce won’t use a lot of machines — it means that you can run a MapReduce you already have on brand-new data in a few minutes, and you can write a brand-new one, run it and get good output in an afternoon — because you don’t have to load up a database.

Sam, 18 January 2008 at 9:57 am

Well, I think you’re missing the point of their argument – they claim Map-Reduce is being taught in schools (e.g. Berkeley, where Stonebraker teaches) as a valid database methodology. There’s nothing wrong with Map-Reduce, it just seems some folks are ‘Google-blind’ and think since Google does it, then there’s no reason to use a typical RDBMS.

For me, the authors jump the shark by not showing the whole picture. Google’s Map-Reduce needs GFS and a custom scheduler to make things reliable (Consistent and Durable). Why not just focus on how much extra work is necessary to attempt ACID?

Tom Ritchford, 18 January 2008 at 10:38 am

“There’s nothing wrong with Map-Reduce, it just seems some folks are ‘Google-blind’ and think since Google does it, then there’s no reason to use typical RDBMS.”

I don’t believe even one person has ever tried to do that — there’s such a mismatch between the technologies. In a MapReduce, there really isn’t a way to read and write one record! MapReduce is only a batch job.

If you can come up with a plausible scenario where someone would mistakenly use MapReduce instead of a database, please lay it on us.

Joe, 18 January 2008 at 10:48 am

Other articles I’m inspired to write based on the referenced article:

I tried using MapReduce to create a website, and it sucks compared to markup with CSS. It doesn’t have any concept of how to style a website. MapReduce is a serious step backward in terms of web design.

I also tried to have MapReduce babysit my kids, and I came back half an hour later to find that it was just sitting there crunching data, and wasn’t watching them at all. This thing can’t do anything at all.

Also, compared to a standard hammer, this MapReduce thing is really crappy at pounding nails into things.

Nathan Fiedler, 18 January 2008 at 3:08 pm

Very pleased to see, based on the comments on the original article, and your blog, that no one was fooled by these supposed experts. They seem to lack even a basic understanding of how MR is used. I feel they must have read the white paper and thought they knew more than anyone else who had read the same paper. Thank you for the rebuttal.

jeff hammerbacher, 18 January 2008 at 4:01 pm

one may certainly have issues with the article conflating data management with query execution strategies but you make zero interesting claims while introducing actual inaccuracies yourself:

I don’t think the authors understand distributed processing and distributed file systems when they think reduce must rely on FTP. The Wikipedia article says “each [reduce] node is expected to report back periodically with completed work and status updates,” pretty much the opposite of the “pull” DeWitt and Stonebraker criticize.

the wikipedia article is referring to the reduce worker reporting task metadata back to the master. separately, the reduce worker must obtain data to process from the map workers. because no ordering guarantees are made on the data within the input splits, keys for a specific reduce worker could come from any map worker. hence the process of copying the data from the map workers to the reduce worker is intensive for large datasets and could possibly leverage ftp, though it more likely leverages a custom data transmission protocol similar to UDT.

William Pietri, 18 January 2008 at 4:02 pm

Very nicely put.

I’m amazed that anybody could sincerely write an article built on the premise that the people at Google are fools who don’t understand scaling computation or managing large amounts of data, but that’s the overall feeling I took away from it.

The part that really pains me is that some of Stonebraker’s recent papers make clear that the modern SQL database needs to be thrown out and a fresh start made. I’d think he’d look at Google and say, “Hey, maybe they’re doing something right.”

Apparently they haven’t stepped back and realized that databases aren’t ends in themselves, but just one of many tools you might use to solve real-world problems. I hope they get there!

Morris, 18 January 2008 at 5:16 pm

From referenced article: “and must use a protocol like FTP to “pull” each of its input files from the nodes on which the map instances were run.”

Plonkers – sounds like they are just making stuff up… One suspects Google might have actually thought about the issue and either has a solution or a compromise to meet their needs… Perhaps this is part of what Google File System helps manage? http://en.wikipedia.org/wiki/Google_File_System or http://labs.google.com/papers/gfs.html

mypalmike, 18 January 2008 at 5:32 pm

After I read that bizarre article a couple of times, I finally realized what was going on. The authors believe MapReduce is a distributed query processor, with unstructured data as backend storage. They picture this database where, every time you run a query, MapReduce kicks in and spits out the results. Reread the article with that in mind, and it starts to make sense. So, they’re not completely insane, just writing about something they fundamentally misunderstand.

Jos Visser, 19 January 2008 at 12:29 am

Thanks for writing this. The original article s*cked.

matt, 19 January 2008 at 1:02 am

Many years ago, Prof. Stonebraker mounted a similar campaign against object-oriented DBMSs (http://dlweinreb.wordpress.com/2007/12/31/object-oriented-database-management-systems-succeeded/). OODBMSs back then were trying to provide solutions for niche markets (CAD and such) where using an RDBMS just wasn’t feasible. I’ve got the impression that the very idea is an offense to him.

c4c1e4c4, 19 January 2008 at 4:38 am

This is just astounding. I’ve never seen a claim that MapReduce is a replacement for RDBMS. I can’t imagine why DeWitt is attacking the idea.

zhaolinjnu, 22 January 2008 at 12:23 am

I don’t think MapReduce or RDBMSs are a replacement for each other. And “MapReduce is a step backwards in database access” is ridiculous, because the areas where the two technologies apply are very different.

Mike Schinkel, 23 January 2008 at 8:23 pm

The DeWitt and Stonebraker article was absolutely brilliant! They published an incredibly successful piece of linkbait… ‘-)

Offbeatmammal, 26 January 2008 at 3:30 pm

I love how people take their “baby” and infer that everything else is inferior. Many years ago I used an inverted hierarchy based database… very powerful, and only now with SQL 2008 are we starting to see some of the capabilities appear in mainstream RDBMSs.

MapReduce is a different animal (and at least as fit for purpose) to an RDBMS (also ideally suited to certain types of processing), and it’s a shame insecure academics feel the need to misrepresent the strengths and weaknesses and pass it off as research.

Colyn, 3 February 2008 at 11:50 am

Greg, I’m not sure how you can use the colloquialism of “jumping the shark” in this context. Please explain. 🙂

Larry, 19 February 2008 at 12:17 pm

I think Joe’s post (i.e. #15) expertly lightens up the entire issue. Good job, Joe.