Re: [VOTE] accept Tashi into the Incubator

Doug Cutting Thu, 07 Aug 2008 15:02:04 -0700

Niclas Hedhman wrote:

On Tuesday 05 August 2008 01:48:13 Doug Cutting wrote:


-1. You get my +1 vote when the proposal text is part of the [VOTE] thread.
;-)


See below.

The wiki page has not been changed since the vote was called.

Doug

-----------------------------

= Tashi Proposal =

A proposal to the Apache Software Foundation Incubator PMC by

David O'Hallaron^*+^, Michael Kozuch^*^, Michael Ryan^*^, Steven Schlosser^*^, Jim Cipar^+^, Greg Ganger^+^, Garth Gibson^+^, Julio Lopez^+^, Michael Stroucken^+^, Wittawat Tantisiriroj^+^, Doug Cutting^#^, Jay Kistler^#^, Thomas Kwan^#^


^*^Intel Research Pittsburgh, ^+^Carnegie Mellon University, ^#^Yahoo!


July 10, 2008


== 1. Abstract ==


Tashi is a cluster management system for cloud computing on Big Data.

== 2. Proposal ==

The Tashi project aims to build a software infrastructure for cloud computing on massive internet-scale datasets (what we call ''Big Data''). The idea is to build a cluster management system that enables the Big Data that are stored in a cluster/data center to be accessed, shared, manipulated, and computed on by remote users in a convenient, efficient, and safe manner. The system aims to provide the following basic capabilities:

(a) ''On-demand provisioning of storage and compute resources.'' Users request a number of compute nodes, which can be either virtual or physical machines, and a set of disk images to boot up on the nodes. In response they receive their own persistent logical cluster of compute and storage nodes, which they can then manage and use.

(b) ''Extensible end-to-end system management.'' Tashi will define open non-proprietary interfaces for management tasks such as observation, inference, planning, and actuation. This will keep the system vendor-neutral and allow different research and development groups to plug in different implementations of different management modules.

(c) ''Cooperative storage and compute management.'' The system will define new non-proprietary interfaces and methods that will allow compute and storage management to work together in concert.

(d) ''Flexible storage models.'' The system will support a range of different storage models, such as network-attached storage, per-node storage, and hybrids, to allow developers, researchers, and large scale cluster/data center operators to experiment with different kinds of file systems.

(e) ''Flexible machine models.'' The system will support different machine models. In particular, it will be VMM-agnostic, able to run different virtual machine monitors such as KVM and Xen. Also, in order to address the cluster squatting problem (when clusters are balkanized by users who reserve and hold nodes for their exclusive use) the system will support a novel bi-model booting capability, in which virtual machine and physical machine instances can boot from the same disk image.
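
As a rough illustration of capability (a), the request/response exchange might be sketched as follows. This is not Tashi code and not a proposed interface: the names `ClusterRequest`, `LogicalCluster`, `provision`, and the host and image names are all hypothetical, and the round-robin assignment of images to nodes is an assumption made only to keep the example small.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class MachineKind(Enum):
    """Hypothetical machine models a user may ask for."""
    VIRTUAL = "virtual"
    PHYSICAL = "physical"


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical request: node count, machine kind, and disk images to boot."""
    owner: str
    node_count: int
    kind: MachineKind
    disk_images: List[str]


@dataclass(frozen=True)
class LogicalCluster:
    """Hypothetical response: the hosts granted and the image booted on each."""
    owner: str
    kind: MachineKind
    nodes: Dict[str, str]  # host name -> disk image


def provision(request: ClusterRequest, free_hosts: List[str]) -> LogicalCluster:
    """Grant hosts from the free pool and pair each with a requested image."""
    if request.node_count < 1:
        raise ValueError("node_count must be at least 1")
    if not request.disk_images:
        raise ValueError("at least one disk image is required")
    if request.node_count > len(free_hosts):
        raise RuntimeError("not enough free hosts")
    granted = free_hosts[:request.node_count]
    # Remove the granted hosts from the shared pool so they are not handed out twice.
    del free_hosts[:request.node_count]
    images = request.disk_images
    # Assumption: images are assigned to nodes round-robin.
    nodes = {host: images[i % len(images)] for i, host in enumerate(granted)}
    return LogicalCluster(request.owner, request.kind, nodes)


pool = ["host-01", "host-02", "host-03"]
cluster = provision(
    ClusterRequest("alice", 2, MachineKind.VIRTUAL, ["base.img"]), pool
)
```

After the call, `cluster.nodes` maps the two granted hosts to the requested image, and `pool` holds only the host that was not handed out.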


== 3. Rationale and Approach ==

Digital media, pervasive sensing, web authoring, mobile computing, scientific and medical instruments, physical simulations, and virtual worlds are all delivering vast new datasets relating to every aspect of our lives. A growing fraction of this Big Data is going unused or being underexploited due to the overwhelming scale of the data involved. Effective sharing, understanding, and use of this new wealth of raw information poses one of the great challenges for the new century.

In order to compute on this emerging Big Data, many research and development groups are purchasing their own racks of compute and storage servers. The goal of the Tashi project is to develop a layer of utility software that turns these raw racks of servers into easily managed cloud computers that will allow remote users to share and explore their Big Data.

To our knowledge there are no open source projects addressing cluster management for Big Data applications. We need a project such as Tashi for a number of reasons: (1) No cloud computing cluster management systems have tackled the problem of having both compute and storage management working together in concert, which we believe will be necessary to support Big Data. (2) We need non-proprietary interfaces for cloud computing, and open source is the way to develop these. For example, Google's new App Engine and Amazon's web services require people to build to proprietary APIs, so that their applications are no longer vendor neutral, but are tied to a particular service provider. (3) We need an extensible system that can serve as a platform to stimulate research in cluster management for cloud computing.


The Tashi system is targeted at two (not always distinct) communities:

(1) As a production system for organizations who want to offer medium to large scale clusters to their users. For example, many companies and university departments are purchasing such clusters, and a system like Tashi would help them provide their users with access to the cycles and storage in the clusters.


(2) As an extensible research platform for distributed systems researchers.

The approach for the project is to build on existing cluster management work pioneered by projects such as Usher (UCSD), Cluster on Demand (Duke), and EC2/S3 (Amazon), and then develop the new capabilities that will be required to support Big Data cloud computing.


== 4. Need for a Community Effort ==

A number of events at Yahoo, Carnegie Mellon, and Intel Research Pittsburgh motivated the development of Tashi and convinced us to work together in the context of an open-source community:

(a) In 2006 the Parallel Data Lab (PDL) at Carnegie Mellon built a cluster of 400 nodes from industry donations, with a goal of creating a "Data Center Observatory" that would allow systems researchers to study and monitor applications running on the cluster. This dream has been slow to materialize because of the cost and complexity of supporting and managing multiple applications and systems groups.

(b) In Fall 2007, Yahoo began offering access to their M45 research cluster to researchers at Carnegie Mellon, and in order to support M45 as well as their own internal production clusters, began to develop some cloud computing infrastructure on their own.

(c) In Fall 2007, Intel Research Pittsburgh purchased a moderate-sized 100-node cluster and made it available to applications groups at Carnegie Mellon working on various Big Data applications such as computational photography, machine translation, automatic speech recognition, and event detection in spatio-temporal video streams. Provisioning and scheduling the cluster in the face of so many different application demands has proven to be difficult.

The difficulties of managing and provisioning these different clusters convinced us that the problem was too big for any one of us to solve completely on our own, and that we needed to band together to create an open-source community effort focused on developing a single software system.

Another important reason to develop an open-source community around Tashi is that we need non-proprietary vendor-neutral APIs for the emerging area of cloud computing, and open source is the best way to achieve that.


== 5. Known Risks ==

''Commitment to future development.'' The risk of the developers abandoning the project is small, mainly because they all own and manage moderate to large scale clusters, and desperately need something like Tashi to provision and manage those clusters. We also need a system like Tashi to serve as an extensible platform for our research.

''Experience with open source.'' Yahoo has had a significant and positive experience with the Apache Software Foundation (ASF) and Hadoop. While Intel and Carnegie Mellon have developed some non-ASF style open source projects in the past (e.g., Internet Suspend/Resume, OpenDHT, and Open``Diamond), they have no experience with ASF-style open source communities. However, they hope to benefit from Yahoo's considerable experience in this area.

''Diversity of developer community.'' The initial code base for Tashi was developed by a single research programmer, Michael Ryan, at Intel Research Pittsburgh. An important reason for putting Tashi in the incubator is to expand the set of developers to include programmers from Carnegie Mellon and Yahoo, initially, and later, hopefully, from other groups such as Usher at UCSD, Eucalyptus from UCSB, Cluster-on-Demand from Duke University, and the RAD Lab at University of California, Berkeley.

''Relationship to other Apache projects.'' There are no Apache projects such as Tashi that focus on systems support for cloud computing. However, the Tashi project is closely related to Hadoop/HDFS. The VM-based provisioning of Tashi will subsume the now deprecated sub-clustering functionality of Hadoop-on-demand. The Tashi prototype uses HDFS to host the cluster boot images. Also, we expect that many Tashi logical clusters will run Hadoop jobs.

''Reasons that Tashi is an ASF project.'' There are three main reasons for developing Tashi through Apache rather than, say, Source``Forge. (1) Our Yahoo partner has had a very positive experience with the Hadoop project. (2) We recognize the need to build a strong developer community, and Apache is centered around building such communities. (3) The ASF also offers substantial legal oversight that makes it attractive for cross-organizational collaborative efforts such as Tashi. With Source``Forge, for example, you have few guarantees about the title of the code. Thus, people can easily post code they don't own, and/or change the license terms of other open source code that they include in their projects. So users of code from Source``Forge must be wary. On the other hand, Apache vets all contributions, keeping signed documents from every committer on file, etc.


== 6. Related Work ==

A small sampling of some closely related work:

[1] M. Mc``Nett, D. Gupta, A. Vahdat, G. Voelker, "Usher: An Extensible Framework for Managing Clusters of Virtual Machines", Proceedings of the 21st Large Installation System Administration Conference (LISA 07), 2007.

[2] D. Irwin, J. Chase, L. Grit, A. Yumerefendi, D. Becker, "Sharing Networked Resources with Brokered Leases", Usenix, 2006.

[3] J. Chase, D. Irwin, L. Grit, J. Moore, S. Sprenkle, "Dynamic Virtual Clusters in a Grid Site Manager", HPDC, 2003.

[4] S. Garfinkel, "An Evaluation of Amazon's Grid Computing Services: EC2, S3, and SQS", Tech Report TR-08-07, School for Engineering and Applied Sciences, Harvard University, 2007.


[5] Red``Hat oVirt System, http://ovirt.org, 2008

[6] Eucalyptus, Rich Wolski, http://eucalyptus.cs.ucsb.edu

== 7. Source ==

We have working code, a pre-alpha proof-of-concept prototype that was developed by Michael Ryan at Intel Research Pittsburgh. The prototype is currently running on the 100-node cluster there. We will enter the incubator with clean code, developed entirely by Michael Ryan, that is unencumbered by any licensing issues.


== 8. Required Resources  ==

(a) Proposed Mailing lists:

 * tashi-private (with moderated subscriptions)
 * tashi-dev
 * tashi-commits
 * tashi-user

(b) Subversion directory

 * http://svn.apache.org/repos/asf/incubator/tashi

(c) Issue tracking:

 * Tashi will use JIRA for bug tracking.

== 9. Initial Committers ==

Initially, there will be one committer each from Carnegie Mellon and Intel Research:


 * Michael Stroucken ([EMAIL PROTECTED])
 * Michael Ryan ([EMAIL PROTECTED])


== 10. Sponsors ==

 * ''Champion:'' Doug Cutting ([EMAIL PROTECTED])
 * ''Nominated mentors:'' Matthieu Riou <[EMAIL PROTECTED]>
 * ''Sponsoring entity:'' Apache Incubator PMC