The slide deck discusses a possible bundling of various existing Apache 
distributed-systems technologies, as well as some Java APIs for accessing 
Amazon cloud services.

What hasn't been discussed is the difference between a "traditional 
distributed architecture" and "the cloud".  They are close, but not close 
enough to be treated the same.  In my opinion, some of the distributed 
technologies at Apache need to be enhanced in order to fit into the cloud 
more effectively.

Let me focus on some cloud characteristics that our existing Apache 
distributed technologies haven't been paying attention to: extreme 
elasticity, trust boundaries, and cost awareness.

Extreme elasticity
===================
Most distributed technologies treat machine shutdown/startup as a relatively 
infrequent operation and haven't tried hard to minimize the cost of handling 
these situations.  Take Hadoop as an example: although it handles machine 
crashes gracefully, it doesn't handle the cloud-bursting scenario well (i.e. 
when a lot of machines are added to the Hadoop cluster at once).  You need 
to run a data redistribution task in the background, which slows down your 
existing jobs.
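To make the redistribution cost concrete, here is a back-of-the-envelope 
sketch (illustrative Java, not Hadoop code; the class and method names are 
my own) of how much data has to move when empty nodes burst into an evenly 
loaded cluster:

```java
// Illustrative sketch (not actual Hadoop code): estimates the fraction of
// stored data that must be shuffled to rebalance an evenly loaded cluster
// after a burst of new, empty nodes joins.
public class RebalanceCost {
    // Returns the fraction of total data that must move so every node
    // (old and new) ends up holding an equal share.
    static double fractionToMove(int oldNodes, int newNodes) {
        int total = oldNodes + newNodes;
        // Each old node sheds the gap between its current share
        // (1/oldNodes) and the target share (1/total) of the data.
        double perOldNodeExcess = 1.0 / oldNodes - 1.0 / total;
        return oldNodes * perOldNodeExcess; // equals newNodes / total
    }

    public static void main(String[] args) {
        // Doubling a 10-node cluster forces half of all data to move.
        System.out.println(fractionToMove(10, 10));
        // Adding 10 nodes to 100 moves roughly 9% of the data.
        System.out.println(fractionToMove(100, 10));
    }
}
```

Doubling the cluster forces half of all stored data across the network, 
which is exactly why bursting hurts jobs that are already running.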

Another example: many scripts in Hadoop rely on config files that specify 
each cluster member's IP address.  In a cloud environment, IP addresses are 
unstable, so we need a discovery mechanism and a rework of those scripts.
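As a sketch of the direction I mean (hypothetical code, not an existing 
Hadoop API; a real deployment would back this with something like 
ZooKeeper), members would register a stable logical name at startup, and 
everything else would resolve the current address at use time instead of 
reading a static IP list:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: members register under a stable logical name when
// they boot, and scripts/clients resolve the current address at use time
// instead of reading hard-coded IPs from a config file.
public class DiscoveryRegistry {
    private final Map<String, String> members = new ConcurrentHashMap<>();

    // Called by a node at startup (its IP may differ on every boot).
    public void register(String logicalName, String currentAddress) {
        members.put(logicalName, currentAddress);
    }

    // Called instead of consulting a static slaves/masters file.
    public Optional<String> resolve(String logicalName) {
        return Optional.ofNullable(members.get(logicalName));
    }

    public static void main(String[] args) {
        DiscoveryRegistry registry = new DiscoveryRegistry();
        registry.register("datanode-1", "10.0.0.17:50010");
        System.out.println(registry.resolve("datanode-1").orElse("unknown"));
    }
}
```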


Trust boundary
===============
Most distributed technologies assume a homogeneous environment (every member 
enjoys the same degree of trust), which is not the case in the cloud.  
Additional processing (cryptographic operations for data transfer and 
storage) may be necessary when dealing with machines running in the cloud.
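For example, the extra cryptographic step might look like this sketch, which 
AES/GCM-encrypts a block before it leaves the trust boundary (illustrative 
names; key management and IV handling are deliberately simplified):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Illustrative sketch of the extra processing a trust boundary forces:
// a block is encrypted before transfer to (or storage on) an untrusted
// cloud machine, and decrypted on the way back inside the boundary.
public class TrustBoundary {
    static byte[] seal(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return cipher.doFinal(plaintext);
    }

    static byte[] open(SecretKey key, byte[] iv, byte[] ciphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return cipher.doFinal(ciphertext);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();
        byte[] iv = new byte[12];                 // GCM nonce, 96 bits
        new SecureRandom().nextBytes(iv);

        byte[] block = "hdfs block payload".getBytes(StandardCharsets.UTF_8);
        byte[] sealed = seal(key, iv, block);     // what the cloud machine sees
        byte[] opened = open(key, iv, sealed);    // recovered inside the boundary
        System.out.println(Arrays.equals(block, opened)); // true
    }
}
```

This is exactly the kind of per-block overhead that the homogeneous-trust 
assumption lets today's stacks skip entirely.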


Cost awareness
===============
For the same reason (the homogeneous-environment assumption), the scheduler 
is unaware of the cost involved when data moves across the cloud boundary 
(bandwidth cost in particular is relatively high).  The Hadoop MapReduce 
scheduler needs to be more sophisticated about where it starts each Mapper 
and Reducer.  Similarly, when making replica placement decisions, HDFS needs 
to be aware of which machine is located in which cloud.
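A toy sketch of what cloud-aware placement could mean (hypothetical code; 
the real HDFS policy also weighs racks, load, and pipeline order): tag each 
node with its cloud and prefer same-cloud candidates, since transfers to 
them avoid the boundary's bandwidth premium.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of cost-aware replica placement: candidates in the
// writer's own cloud sort first because cross-cloud transfers pay a
// bandwidth premium. Not the actual HDFS placement policy.
public class CostAwarePlacement {
    record Node(String name, String cloud) {}

    static List<Node> chooseReplicaTargets(List<Node> candidates,
                                           String writerCloud, int replicas) {
        List<Node> sorted = new ArrayList<>(candidates);
        // Same-cloud nodes first; the sort is stable, so ties keep order.
        sorted.sort(Comparator.comparingInt(
            (Node n) -> n.cloud().equals(writerCloud) ? 0 : 1));
        return sorted.subList(0, Math.min(replicas, sorted.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
            new Node("a", "on-prem"), new Node("b", "ec2"),
            new Node("c", "on-prem"), new Node("d", "ec2"));
        // A writer in ec2 gets the two ec2 nodes picked first.
        System.out.println(chooseReplicaTargets(cluster, "ec2", 2));
    }
}
```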


That said, I am not discounting the existing Apache technologies.  In fact, 
we have already taken a good first step.  We just need to go further.

Rgds,
Ricky

-----Original Message-----
From: Bradford Stephens [mailto:[email protected]] 
Sent: Tuesday, May 05, 2009 9:53 AM
To: [email protected]
Subject: Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

I read through the deck and sent it around the company. Good stuff!
It's going to be a big help for trying to get the .NET Enterprise
people wrapping their heads around web-scale data.

I must admit "Apache Cloud Computing Edition" is sort of unwieldy to
say verbally, and frankly "Java Enterprise Edition" is a taboo phrase
at a lot of projects I've had. Guilt by association. I think I'll call
it "Apache Cloud Stack", and reference "Apache Cloud Computing
Edition" in my deck. When I think "Stack", I think of a suite of
software that provides all the pieces I need to solve my problem :)

On Tue, May 5, 2009 at 7:00 AM, Steve Loughran <[email protected]> wrote:
> Bradford Stephens wrote:
>>
>> Hey all,
>>
>> I'm going to be speaking at OSCON about my company's experiences with
>> Hadoop and Friends, but I'm having a hard time coming up with a name
>> for the entire software ecosystem. I'm thinking of calling it the
>> "Apache CloudStack". Does this sound legit to you all? :) Is there
>> something more 'official'?
>
> We've been using "Apache Cloud Computing Edition" for this, to emphasise
> this is the successor to Java Enterprise Edition, and that it is cross
> language and being built at apache. If you use the same term, even if you
> put a different stack outline than us, it gives the idea more legitimacy.
>
> The slides that Andrew linked to are all in SVN under
> http://svn.apache.org/repos/asf/labs/clouds/
>
> we have a space in the apache labs for "apache clouds", where we want to do
> more work integrating things, and bringing the idea of deploy and test on
> someone else's infrastructure mainstream across all the apache products. We
> would welcome your involvement -and if you send a draft of your slides out,
> will happily review them
>
> -steve
>
