On 14/09/11 22:20, Ted Dunning wrote:
> This makes a bit of sense, but you have to worry about the inertia of
> the data. Adding compute resources is easy. Adding data resources, not
> so much.
I've done it. Like Ted says, pure compute nodes generate more network
traffic on both reads and writes; and if you bring up Datanodes, then you
have to leave them around. The strength is that the infrastructure can
sell them to you for a lower $/hour on the condition that they can
take them back when demand gets high; these compute-only nodes would be
infrastructure-pre-emptible. Look for the presentation "farming hadoop
in the cloud" for more details, though I don't discuss pre-emption or
infrastructure specifics there.
> if the computation is not near the data, then it is likely to be
> much less effective.
Which implies your infrastructure needs to be data-aware, and know to
bring up the new VMs on the same racks as the HDFS nodes.
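One way to make Hadoop itself see those VMs as rack-local is the
DNSToSwitchMapping extension point. Below is a minimal sketch against
the interface as it stood around Hadoop 1.x (newer versions add
cache-reload methods); the hostnames and rack table are made up, standing
in for whatever your cloud's placement API actually reports.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.net.DNSToSwitchMapping;

/**
 * Illustrative only: map VM hostnames to the rack of the HDFS nodes
 * they were provisioned next to, so the scheduler treats the new
 * compute-only VMs as rack-local to the data.
 */
public class CloudRackMapping implements DNSToSwitchMapping {

  // hostname -> rack, hard-coded stand-in for an infrastructure query
  private final Map<String, String> hostToRack = new HashMap<String, String>();

  public CloudRackMapping() {
    hostToRack.put("datanode-01", "/dc1/rack12");
    hostToRack.put("compute-vm-07", "/dc1/rack12"); // brought up beside the datanode
  }

  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String rack = hostToRack.get(name);
      // anything we don't know about falls back to the default rack
      racks.add(rack != null ? rack : "/default-rack");
    }
    return racks;
  }
}

You would wire something like this in via the topology mapping property
(topology.node.switch.mapping.impl in Hadoop 1.x,
net.topology.node.switch.mapping.impl later) instead of the default
script-based mapping.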
There are some infrastructure-aware runtimes (I am thinking of the
Technical University of Berlin's Stratosphere project) which take a job
at a higher level than MR commands (more like Pig or Hive) and come up
with an execution plan that can schedule for optimal acquisition and
re-use of VMs, knowing the cost of machines, the hysteresis of
VM setup/teardown, and the hourly billing. You can then impose policies
like "fast execution" or "lowest cost", and have different plans
created and executed.
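To make the trade-off concrete, here is a toy sketch (none of this is
Stratosphere's real API, and the numbers are made up): each candidate
plan uses a different number of VMs, and the chooser picks one according
to the policy, charging an hourly rate plus a setup/teardown overhead
per VM.

import java.util.ArrayList;
import java.util.List;

/** Illustrative plan selection under a "fast" or "cheap" policy. */
public class PlanChooser {

  /** A candidate execution plan: VM count and estimated runtime. */
  static class Plan {
    final int vms;
    final double runtimeHours;
    Plan(int vms, double runtimeHours) {
      this.vms = vms;
      this.runtimeHours = runtimeHours;
    }
    /** Cost = hourly rate * VM-hours, including setup/teardown time per VM. */
    double cost(double hourlyRate, double setupOverheadHours) {
      return vms * (runtimeHours + setupOverheadHours) * hourlyRate;
    }
  }

  enum Policy { FAST_EXECUTION, LOWEST_COST }

  static Plan choose(List<Plan> candidates, Policy policy,
                     double hourlyRate, double setupOverheadHours) {
    Plan best = null;
    for (Plan p : candidates) {
      if (best == null) { best = p; continue; }
      if (policy == Policy.FAST_EXECUTION) {
        if (p.runtimeHours < best.runtimeHours) best = p;
      } else if (p.cost(hourlyRate, setupOverheadHours)
                 < best.cost(hourlyRate, setupOverheadHours)) {
        best = p;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // made-up numbers: more VMs finish sooner but burn more VM-hours
    List<Plan> plans = new ArrayList<Plan>();
    plans.add(new Plan(4, 6.0));
    plans.add(new Plan(16, 2.0));
    plans.add(new Plan(64, 0.8));
    Plan fast = choose(plans, Policy.FAST_EXECUTION, 0.50, 0.25);
    Plan cheap = choose(plans, Policy.LOWEST_COST, 0.50, 0.25);
    System.out.println("fast: " + fast.vms + " VMs, cheap: " + cheap.vms + " VMs");
  }
}

The real systems do this over a much richer cost model, but the shape is
the same: enumerate plans, price them, pick per policy.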
This is all PhD-grade research by a group of highly skilled postgrads
led by a professor who has worked on VLDBs; I would not attempt to
retrofit this into Hadoop on my own. That said, if you want a PhD, you
could contact that team, or the UC Irvine people working on Algebricks
and Hyracks, and convince them that you should join their teams.
-steve