The slide deck talks about possibly bundling various existing Apache distributed-systems technologies, as well as Java APIs to access Amazon cloud services.
What hasn't been discussed is the difference between a "traditional distributed architecture" and "the cloud". They are close, but not close enough to be treated the same. In my opinion, some of the distributed technologies in Apache need to be enhanced to fit into the cloud more effectively. Let me focus on a few cloud characteristics that our existing Apache distributed technologies haven't paid enough attention to: extreme elasticity, trust boundaries, and cost awareness.

Extreme elasticity
==================

Most distributed technologies treat machine shutdown/startup as a relatively infrequent operation and haven't tried hard to minimize the cost of handling these situations. Take Hadoop as an example: although it can handle machine crashes gracefully, it doesn't handle the cloud-bursting scenario well (i.e., when a lot of machines are added to the Hadoop cluster at once). You need to run a data-redistribution task in the background, which slows down your existing jobs. Another example: many scripts in Hadoop rely on a config file that specifies each cluster member's IP address. In a cloud environment, IP addresses are unstable, so we need a discovery mechanism and reworked scripts.

Trust boundary
==============

Most distributed technologies assume a homogeneous environment (every member has the same degree of trust), which is not the case in the cloud environment. Additional processing (cryptographic operations for data transfer and storage) may be necessary when dealing with machines running in the cloud.

Cost awareness
==============

For the same reason (the assumption of a homogeneous environment), the scheduler is not aware of the cost involved when it moves data across the cloud boundary (bandwidth cost, in particular, is relatively high). The Hadoop MapReduce scheduler needs to be more sophisticated when deciding where to start the Mappers and Reducers.
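To illustrate the kind of decision a cost-aware scheduler would make, here is a minimal sketch in Java. All the names (Node, transferCost, the cost figures) are hypothetical, not real Hadoop APIs: it just picks the candidate machine that minimizes cross-boundary bandwidth cost for a given input block.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of cost-aware task placement. Assumes each node is
// tagged with the cloud it runs in, and that moving data across a cloud
// boundary costs money while intra-cloud transfer is free.
public class CostAwarePlacement {
    static class Node {
        final String name;
        final String cloud; // which cloud/zone the node runs in
        Node(String name, String cloud) { this.name = name; this.cloud = cloud; }
    }

    // Bandwidth cost (in $) of running the task on `candidate` when the
    // input data lives in `dataCloud`.
    static double transferCost(String dataCloud, Node candidate,
                               double gigabytes, double costPerGb) {
        return dataCloud.equals(candidate.cloud) ? 0.0 : gigabytes * costPerGb;
    }

    // Pick the candidate with the lowest transfer cost, which naturally
    // prefers nodes in the same cloud as the data.
    static Node cheapest(String dataCloud, List<Node> candidates,
                         double gigabytes, double costPerGb) {
        return Collections.min(candidates, Comparator.comparingDouble(
                n -> transferCost(dataCloud, n, gigabytes, costPerGb)));
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("node-a", "on-prem"),
                new Node("node-b", "ec2"));
        // Input block lives in EC2, so the EC2 node avoids the boundary cost.
        Node pick = cheapest("ec2", nodes, 10.0, 0.09);
        System.out.println(pick.name);
    }
}
```

A real scheduler would of course weigh this against data locality, load, and failure domains, but the point stands: placement decisions need a cost term the current homogeneous-cluster schedulers don't have.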
Similarly, when making replica placement decisions, HDFS needs to be aware of which machine is located in which cloud.

That said, I am not discounting the existing Apache technologies. In fact, we have already made a good start; we just need to go further.

Rgds,
Ricky

-----Original Message-----
From: Bradford Stephens [mailto:[email protected]]
Sent: Tuesday, May 05, 2009 9:53 AM
To: [email protected]
Subject: Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

I read through the deck and sent it around the company. Good stuff!
It's going to be a big help for trying to get the .NET Enterprise
people wrapping their heads around web-scale data.

I must admit "Apache Cloud Computing Edition" is sort of unwieldy to
say verbally, and frankly "Java Enterprise Edition" is a taboo phrase
at a lot of projects I've had. Guilt by association.

I think I'll call it "Apache Cloud Stack", and reference "Apache Cloud
Computing Edition" in my deck. When I think "Stack", I think of a
suite of software that provides all the pieces I need to solve my
problem :)

On Tue, May 5, 2009 at 7:00 AM, Steve Loughran <[email protected]> wrote:
> Bradford Stephens wrote:
>>
>> Hey all,
>>
>> I'm going to be speaking at OSCON about my company's experiences with
>> Hadoop and Friends, but I'm having a hard time coming up with a name
>> for the entire software ecosystem. I'm thinking of calling it the
>> "Apache CloudStack". Does this sound legit to you all? :) Is there
>> something more 'official'?
>
> We've been using "Apache Cloud Computing Edition" for this, to emphasise
> this is the successor to Java Enterprise Edition, and that it is cross
> language and being built at apache. If you use the same term, even if you
> put a different stack outline than us, it gives the idea more legitimacy.
>
> The slides that Andrew linked to are all in SVN under
> http://svn.apache.org/repos/asf/labs/clouds/
>
> we have a space in the apache labs for "apache clouds", where we want to do
> more work integrating things, and bringing the idea of deploy and test on
> someone else's infrastructure mainstream across all the apache products. We
> would welcome your involvement -and if you send a draft of your slides out,
> will happily review them
>
> -steve
>
