On 15/09/11 02:01, Bharath Ravi wrote:
> Thanks a lot, all!
> An end goal of mine was to make Hadoop as flexible as possible.
> Along the same lines, but unrelated to the above idea, was another I
> encountered, courtesy
> http://hadoopblog.blogspot.com/2010/11/hadoop-research-topics.html
> The blog mentions the ability to dynamically append Input.
> Specifically, can I append input to the Map and Reduce tasks after
> they've been started?
Dhruba is referring to something that they've actually implemented in
their version of Hive, which is the ability to gradually increase the
data input to a running Hive job.
This lets them do a query like "find 8 friends in california" without
searching the entire dataset; pick a subset, search that, and if there
are enough results, stop. If not: feed in some more data.
I have a paper on this which shows that for data with little or no
skew, this is much faster than a full scan; for skewed data, where all
the results sit in a small subset of blocks, it costs about the same
as a full scan -it depends on when the blocks holding the results get
scanned.
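As a driver-side sketch of that "scan a bit more each round" idea
-purely illustrative, with made-up class, path and counter names, and
not the Facebook code, which lives inside Hive:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalScan {
  private static final long ENOUGH = 8;  // stop once this many matches exist

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    long found = 0;
    // args[0..n-2] are input chunks in scan order; args[n-1] is an
    // output directory prefix
    for (int round = 0; round < args.length - 1 && found < ENOUGH; round++) {
      Job job = Job.getInstance(conf);
      job.setJobName("find-friends-round-" + round);
      job.setJarByClass(IncrementalScan.class);
      // ... setMapperClass()/setReducerClass() for the actual query ...
      FileInputFormat.addInputPath(job, new Path(args[round]));
      FileOutputFormat.setOutputPath(job,
          new Path(args[args.length - 1] + "-" + round));
      job.waitForCompletion(true);
      // the query's tasks are assumed to bump this counter once per match
      found += job.getCounters().findCounter("query", "MATCHES").getValue();
    }
  }
}

Done naively like this you pay job startup cost every round, which is
exactly why doing it inside a single job is more attractive -see below.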
> I haven't been able to find something like this at a cursory glance,
> but could someone advise me on this before I dig deeper?
> 1. Does such functionality exist, or is it being attempted?
It exists for Hive, though not in trunk; getting it in there would
mostly be a matter of taking the existing code and slotting it in.
> 2. I would assume most cases would simply require starting a second
> Job for the new input.
No, because that loses all existing work and requires rescheduling more
work. The goal of this is to execute one job that can bail out early.
The Facebook code runs with Hive; for classic MR jobs the first step
would be to allow Map tasks to finish early. I think there may be a
way to do that, and I plan to do some experiments to see if I'm right.
What would be more dramatic would be for the JT to be aware that jobs
may finish early, and to have it slowly ramp up the map operations if
they don't set some "finished" flag (which would presumably be a
shared counter), until the entire dataset gets processed if the early
finish doesn't happen. This slow start could be taken into account by
the scheduler, which would then know that the initial resource needs
of the Job are quite low but may increase.
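To make the task-side half of that concrete, here is a rough sketch of
a mapper that could finish early. Nothing like this exists today, and
the "finished" signal here is a hypothetical marker file in HDFS
standing in for the shared counter, since counters are only aggregated
at the JT and aren't pushed back to running tasks:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EarlyExitMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Hypothetical "finished" signal: a marker file any task can create
  // once enough results exist.
  private static final Path DONE_FLAG = new Path("/tmp/query-finished");
  private FileSystem fs;

  @Override
  protected void setup(Context ctx) throws IOException {
    fs = FileSystem.get(ctx.getConfiguration());
  }

  @Override
  public void run(Context ctx) throws IOException, InterruptedException {
    setup(ctx);
    long records = 0;
    while (ctx.nextKeyValue()) {
      map(ctx.getCurrentKey(), ctx.getCurrentValue(), ctx);
      // Poll every 10k records to keep NameNode load down.
      if (++records % 10000 == 0 && fs.exists(DONE_FLAG)) {
        break;  // finish early: abandon the rest of this split
      }
    }
    cleanup(ctx);
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // ... actual query logic: emit matching records, and create
    // DONE_FLAG once this task has seen enough matches ...
  }
}

The JT-side half would then be the slow ramp-up: schedule a few of
these maps, watch the flag, and only schedule the rest if it never
gets set.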
> However, are there practical use cases to such a feature?
See above
> 3. Are there any other ideas on such "flexibility" of the system
> that I could contribute to?
While it's great that you want to do big things in Hadoop, I'd
recommend you start using it and learn your way around the codebase
-especially SVN trunk and the unreleased 0.23 branch, as that is where
all major changes will go, and the MR engine there has been radically
reworked for better scheduling.
Start writing MR jobs that work under the new engine, using existing
public datasets, or look at the layers above, and then think about how
things could be improved.
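For instance, the classic word count against a public dataset, written
for the new org.apache.hadoop.mapreduce API, is the usual first
exercise; a minimal standalone version:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // Split each line into tokens and emit (token, 1).
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context ctx) throws IOException, InterruptedException {
      // Sum the counts for each token.
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    job.setJobName("word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}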