Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Brian Bockelman Tue, 14 Apr 2009 10:27:23 -0700

Hey Guilherme,

It's good to see comparisons, especially as it helps folks understand better what tool is the best for their problem. As you show in your paper, a MapReduce system is hideously bad in performing tasks that column-store databases were designed for (selecting a single value along an index, joining tables).


Some comments:

1) For some of your graphs, you show Hadoop's numbers in half-grey, half-white. I can't figure out for the life of me what this signifies! What have I overlooked?

2) I see that one of your co-authors is the CEO/inventor of the Vertica DB. Out of curiosity, how did you interact with Vertica versus Hadoop versus DBMS-X? Did you get help tuning the systems from the experts? I.e., if you sat down with a Hadoop expert for a few days, I'm certain you could squeeze out more performance, just like whenever I sit down with an Oracle DBA for a few hours, my DB queries are much faster. You touch upon the sociological issues (having to program your own code versus having to only know SQL, as well as the comparative time it took to set up the DB) - I'd like to hear how much time you spent "tweaking" and learning the best practices for the three, very different approaches. If you added a 5th test, what's the marginal effort required?

3) It would be nice to see how some of your more DB-like tasks perform on something like HBase. That'd be a much more apples-to-apples comparison of column-store DBMS versus column-store data system, although the HBase work is just now revving up. I'm a bit uninformed in that area, so I don't have a good gut in how that'd do.

4) I think that the UDF aggregation task (calculating the inlink count for each document in a sample) is interesting - it's a more Map-Reduce oriented task, and it sounds like it was fairly miserable to hack around the limitations / bugs in the DBMS.

5) I really think you undervalue the benefits of replication and reliability, especially in terms of cost. As someone who helps with a small site (about 300 machines) that range from commodity workers to Sun Thumpers, if your site depends on all your storage nodes functioning, then your costs go way up. You can't make cheap hardware scale unless your software can account for it.
- Yes, I realize this is a different approach than you take. There are pros and cons to large expensive hardware versus lots of cheap hardware ... the argument has been going on since the dawn of time. However, it's a bit unfair to just outright dismiss one approach. I am a bit wary of the claims that your results can scale up to Google/Yahoo scale, but I do agree that there are truly few users that are that large!

I love your last paragraph, it's a very good conclusion. It kind of reminds me of the grid computing field which was (is?) completely shocked by the emergence of cloud computing. After you cut through the hype surrounding the new fads, you find (a) that there are some very good reasons that the fads are popular - they have definite strengths that the existing field was missing (or didn't want to hear) and (b) there's a lot of common ground and learning that has to be done, even to get a good common terminology :)


Enjoy your conference!

Brian

On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:

(Hadoop is used in the benchmarks)

http://database.cs.brown.edu/sigmod09/

There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and
development complexity. To this end, we define a benchmark
consisting of a collection of tasks that we have run on an open source
version of MR as well as on two parallel DBMSs. For each task,
we measure each system’s performance for various degrees of
parallelism on a cluster of 100 nodes. Our results reveal some
interesting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR
system, the observed performance of these DBMSs was strikingly
better. We speculate about the causes of the dramatic performance
difference and consider implementation concepts that future
systems should take from both kinds of architectures.


--
Guilherme

msn: [email protected]
homepage: http://germoglio.googlepages.com
