On 15/09/11 02:01, Bharath Ravi wrote:
> Thanks a lot, all!
> An end goal of mine was to make Hadoop as flexible as possible.
> Along the same lines, but unrelated to the above idea, was another I
> encountered, courtesy
> http://hadoopblog.blogspot.com/2010/11/hadoop-research-topics.html
> The blog mentions the ability to dynamically append Input.
> Specifically, can I append input to the Map and Reduce tasks after
> they've been started?
Dhruba is referring to something that they've actually implemented in
their version of Hive, which is the ability to gradually increase the
data input to a running Hive job.
This lets them do a query like "find 8 friends in california" without
searching the entire dataset; pick a subset, search that, and if there
are enough results, stop. If not: feed in some more data.
I have a paper on this which shows that for data with little or no
skew, this is much faster than a full scan; for skewed data, where all
the results sit in a small subset of blocks, it costs about the same
as a full scan -it depends on when the blocks holding the results get
scanned.
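As a driver-side sketch of that "scan a bit more each round" idea
-purely illustrative, with made-up class, path and counter names, and
not the Facebook code, which lives inside Hive:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalScan {
  private static final long ENOUGH = 8;  // stop once this many matches exist

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    long found = 0;
    // args[0..n-2] are input chunks in scan order; args[n-1] is an
    // output directory prefix
    for (int round = 0; round < args.length - 1 && found < ENOUGH; round++) {
      Job job = Job.getInstance(conf);
      job.setJobName("find-friends-round-" + round);
      job.setJarByClass(IncrementalScan.class);
      // ... setMapperClass()/setReducerClass() for the actual query ...
      FileInputFormat.addInputPath(job, new Path(args[round]));
      FileOutputFormat.setOutputPath(job,
          new Path(args[args.length - 1] + "-" + round));
      job.waitForCompletion(true);
      // the query's tasks are assumed to bump this counter once per match
      found += job.getCounters().findCounter("query", "MATCHES").getValue();
    }
  }
}

Done naively like this you pay job startup cost every round, which is
exactly why doing it inside a single job is more attractive -see below.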
> I haven't been able to find something like this at a cursory glance,
> but could someone advise me on this before I dig deeper?
> 1. Does such functionality exist, or is it being attempted?
It exists for Hive, though not in trunk; getting it in there would
mostly be a matter of taking the existing code and slotting it in.
> 2. I would assume most cases would simply require starting a second
> Job for the new input.
No, because that loses all existing work and requires rescheduling more
work. The goal of this is to execute one job that can bail out early.
The Facebook code runs with Hive; for classic MR jobs the first step
would be to allow Map tasks to finish early. I think there may be a
way to do that, and I plan to do some experiments to see if I'm right.
What would be more dramatic would be for the JT to be aware that jobs
may finish early, and to have it slowly ramp up the map operations if
they don't set some "finished" flag (which would presumably be a
shared counter), until the entire dataset gets processed if the early
finish doesn't happen. This slow start could be taken into account by
the scheduler, which would then know that the initial resource needs
of the Job are quite low but may increase.
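To make the task-side half of that concrete, here is a rough sketch of
a mapper that could finish early. Nothing like this exists today, and
the "finished" signal here is a hypothetical marker file in HDFS
standing in for the shared counter, since counters are only aggregated
at the JT and aren't pushed back to running tasks:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EarlyExitMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Hypothetical "finished" signal: a marker file any task can create
  // once enough results exist.
  private static final Path DONE_FLAG = new Path("/tmp/query-finished");
  private FileSystem fs;

  @Override
  protected void setup(Context ctx) throws IOException {
    fs = FileSystem.get(ctx.getConfiguration());
  }

  @Override
  public void run(Context ctx) throws IOException, InterruptedException {
    setup(ctx);
    long records = 0;
    while (ctx.nextKeyValue()) {
      map(ctx.getCurrentKey(), ctx.getCurrentValue(), ctx);
      // Poll every 10k records to keep NameNode load down.
      if (++records % 10000 == 0 && fs.exists(DONE_FLAG)) {
        break;  // finish early: abandon the rest of this split
      }
    }
    cleanup(ctx);
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // ... actual query logic: emit matching records, and create
    // DONE_FLAG once this task has seen enough matches ...
  }
}

The JT-side half would then be the slow ramp-up: schedule a few of
these maps, watch the flag, and only schedule the rest if it never
gets set.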
> However, are there practical use cases to such a feature?
See above
> 3. Are there any other ideas on such "flexibility" of the system
> that I could contribute to?
While it's great that you want to do big things in Hadoop, I'd
recommend you start using it and learn your way around the codebase
-especially SVN trunk and the unreleased 0.23 branch, as that is where
all major changes will go, and the MR engine there has been radically
reworked for better scheduling.
Start writing MR jobs that work under the new engine, using existing
public datasets, or look at the layers above, and then think about how
things could be improved.
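For instance, the classic word count against a public dataset, written
for the new org.apache.hadoop.mapreduce API, is the usual first
exercise; a minimal standalone version:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // Split each line into tokens and emit (token, 1).
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context ctx) throws IOException, InterruptedException {
      // Sum the counts for each token.
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    job.setJobName("word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}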