On 14/09/11 22:20, Ted Dunning wrote:
> This makes a bit of sense, but you have to worry about the inertia of
> the data. Adding compute resources is easy. Adding data resources, not
> so much.
I've done it. Like Ted says, pure compute nodes generate more network
traffic on both reads and writes; and if you bring up Datanodes, then you
have to leave them around. The strength is that the infrastructure can
sell them to you for a lower $/hour on the condition that they can
take them back when demand gets high; these compute-only nodes would be
infrastructure-pre-emptible. Look for the presentation "farming hadoop
in the cloud" for more details, though I don't discuss pre-emption or
infrastructure specifics there.
> if the computation is not near the data, then it is likely to be
> much less effective.
Which implies your infrastructure needs to be data-aware, and know to
bring up the new VMs on the same racks as the HDFS nodes.
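One way to make Hadoop itself see those VMs as rack-local is the
DNSToSwitchMapping extension point. Below is a minimal sketch against
the interface as it stood around Hadoop 1.x (newer versions add
cache-reload methods); the hostnames and rack table are made up, standing
in for whatever your cloud's placement API actually reports.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.net.DNSToSwitchMapping;

/**
 * Illustrative only: map VM hostnames to the rack of the HDFS nodes
 * they were provisioned next to, so the scheduler treats the new
 * compute-only VMs as rack-local to the data.
 */
public class CloudRackMapping implements DNSToSwitchMapping {

  // hostname -> rack, hard-coded stand-in for an infrastructure query
  private final Map<String, String> hostToRack = new HashMap<String, String>();

  public CloudRackMapping() {
    hostToRack.put("datanode-01", "/dc1/rack12");
    hostToRack.put("compute-vm-07", "/dc1/rack12"); // brought up beside the datanode
  }

  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String rack = hostToRack.get(name);
      // anything we don't know about falls back to the default rack
      racks.add(rack != null ? rack : "/default-rack");
    }
    return racks;
  }
}

You would wire something like this in via the topology mapping property
(topology.node.switch.mapping.impl in Hadoop 1.x,
net.topology.node.switch.mapping.impl later) instead of the default
script-based mapping.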
There are some infrastructure-aware runtimes (I am thinking of the
Technical University of Berlin's Stratosphere project) which take a job
at a higher level than MR commands (more like Pig or Hive) and come up
with an execution plan that can schedule for optimal acquisition and
re-use of VMs, knowing the cost of machines, the hysteresis of
VM setup/teardown, and the hourly billing. You can then impose policies
like "fast execution" or "lowest cost", and have different plans
created and executed.
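To make the trade-off concrete, here is a toy sketch (none of this is
Stratosphere's real API, and the numbers are made up): each candidate
plan uses a different number of VMs, and the chooser picks one according
to the policy, charging an hourly rate plus a setup/teardown overhead
per VM.

import java.util.ArrayList;
import java.util.List;

/** Illustrative plan selection under a "fast" or "cheap" policy. */
public class PlanChooser {

  /** A candidate execution plan: VM count and estimated runtime. */
  static class Plan {
    final int vms;
    final double runtimeHours;
    Plan(int vms, double runtimeHours) {
      this.vms = vms;
      this.runtimeHours = runtimeHours;
    }
    /** Cost = hourly rate * VM-hours, including setup/teardown time per VM. */
    double cost(double hourlyRate, double setupOverheadHours) {
      return vms * (runtimeHours + setupOverheadHours) * hourlyRate;
    }
  }

  enum Policy { FAST_EXECUTION, LOWEST_COST }

  static Plan choose(List<Plan> candidates, Policy policy,
                     double hourlyRate, double setupOverheadHours) {
    Plan best = null;
    for (Plan p : candidates) {
      if (best == null) { best = p; continue; }
      if (policy == Policy.FAST_EXECUTION) {
        if (p.runtimeHours < best.runtimeHours) best = p;
      } else if (p.cost(hourlyRate, setupOverheadHours)
                 < best.cost(hourlyRate, setupOverheadHours)) {
        best = p;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // made-up numbers: more VMs finish sooner but burn more VM-hours
    List<Plan> plans = new ArrayList<Plan>();
    plans.add(new Plan(4, 6.0));
    plans.add(new Plan(16, 2.0));
    plans.add(new Plan(64, 0.8));
    Plan fast = choose(plans, Policy.FAST_EXECUTION, 0.50, 0.25);
    Plan cheap = choose(plans, Policy.LOWEST_COST, 0.50, 0.25);
    System.out.println("fast: " + fast.vms + " VMs, cheap: " + cheap.vms + " VMs");
  }
}

The real systems do this over a much richer cost model, but the shape is
the same: enumerate plans, price them, pick per policy.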
This is all PhD-grade research by a group of highly skilled postgrads
led by a professor who has worked on VLDBs; I would not attempt to
retrofit this into Hadoop on my own. That said, if you want a PhD, you
could contact that team, or the UC Irvine people working on Algebricks
and Hyracks, and convince them that you should join their teams.
-steve