Non-binding +1. Regarding Owen's concern over licenses, if I recall
correctly, those concerns would block graduation from the incubator,
but not acceptance to it.

I am also interested in being added as a committer to this proposal.
As an HBase committer (but not speaking for the project as a whole) I
think having cross-pollination between the codebases will be
beneficial to everyone, so I'd like to be involved.

Thanks
-Todd

On Fri, Sep 2, 2011 at 8:45 AM, Billie J Rinaldi
<billie.j.rina...@ugov.gov> wrote:
> Greetings,
>
> I would like to propose Accumulo to be an Apache Incubator project.  Accumulo 
> is a distributed key/value store that provides expressive cell-level access 
> labels and a server-side programming mechanism that can modify key/value 
> pairs at various points in the data management process.  It is based on 
> Google's BigTable design and runs over Apache Hadoop and Zookeeper.
>
> Here is a link to the proposal in the Incubator wiki:
> http://wiki.apache.org/incubator/AccumuloProposal
>
> I've also pasted the initial contents below.
>
> Thanks,
> Billie Rinaldi
>
>
> = Accumulo Proposal =
>
> == Abstract ==
> Accumulo is a distributed key/value store that provides expressive, 
> cell-level access labels.
>
> == Proposal ==
> Accumulo is a sorted, distributed key/value store based on Google's BigTable 
> design.  It is built on top of Apache Hadoop, Zookeeper, and Thrift.  It 
> features a few novel improvements on the BigTable design in the form of 
> cell-level access labels and a server-side programming mechanism that can 
> modify key/value pairs at various points in the data management process.
>
> == Background ==
> Google published the design of BigTable in 2006.  Several other open source 
> projects have implemented aspects of this design including HBase, CloudStore, 
> and Cassandra.  Accumulo began its development in 2008.
>
> == Rationale ==
> There is a need for a flexible, high performance distributed key/value store 
> that provides expressive, fine-grained access labels.  The communities we 
> expect to be most interested in such a project are government, health care, 
> and other industries where privacy is a concern.  We have made much progress 
> in developing this project over the past 3 years and believe both the project 
> and the interested communities would benefit from this work being openly 
> available and having open development.
>
> == Current Status ==
>
> === Meritocracy ===
> We intend to strongly encourage the community to help with and contribute to 
> the code.  We will actively seek potential committers and help them become 
> familiar with the codebase.
>
> === Community ===
> A strong government community has developed around Accumulo and training 
> classes have been ongoing for about a year.  Hundreds of developers use 
> Accumulo.
>
> === Core Developers ===
> The developers are mainly employed by the National Security Agency, but we 
> anticipate interest developing among other companies.
>
> === Alignment ===
> Accumulo is built on top of Hadoop, Zookeeper, and Thrift.  It builds with 
> Maven.  Due to the strong relationship with these Apache projects, the 
> incubator is a good match for Accumulo.
>
> == Known Risks ==
> === Orphaned Products ===
> There is only a small risk of being orphaned.  The community is committed to 
> improving the codebase of the project due to its fulfilling needs not 
> addressed by any other software.
>
> === Inexperience with Open Source ===
> The codebase has been treated internally as an open source project since its 
> beginning, and the initial Apache committers have been involved with the code 
> for multiple years.  While our experience with public open source is limited, 
> we do not anticipate difficulty in operating under Apache's development 
> process.
>
> === Homogeneous Developers ===
> The committers have multiple employers and it is expected that committers 
> from different companies will be recruited.
>
> === Reliance on Salaried Developers ===
> The initial committers are all paid by their employers to work on Accumulo 
> and we expect such employment to continue.  Some of the initial committers 
> would continue as volunteers even if no longer employed to do so.
>
> === Relationships with Other Apache Products ===
> Accumulo uses Hadoop, Zookeeper, Thrift, Maven, log4j, commons-lang, -net, 
> -io, -jci, -collections, -configuration, -logging, and -codec.
>
> === Relationship to HBase ===
> Accumulo and HBase are both based on the design of Google's BigTable, so 
> there is a danger that potential users will have difficulty distinguishing 
> the two or that they will not see an incentive in adopting Accumulo.  There 
> are a few key areas in which Accumulo differs from HBase.  Some of the 
> desired features of Accumulo could be incorporated into HBase; however, the 
> most important of these may be unlikely to be adopted (see cell-level access 
> labels and iterators below).  It is a possibility that the codebases will 
> ultimately converge, but the number of differences at the current time 
> warrants a separate project for Accumulo.
>
> ==== Access Labels ====
> Accumulo has an additional portion of its key that sorts after the column 
> qualifier and before the timestamp.  It is called column visibility and 
> enables expressive cell-level access control.  Authorizations are passed with 
> each query to control what data is returned to the user.  The column 
> visibilities are boolean AND and OR combinations of arbitrary strings (such 
> as "(A&B)|C") and authorizations are sets of strings (such as {C,D}).
>
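As an illustration of the access labels described above, here is a minimal Python sketch that checks a column visibility such as "(A&B)|C" against a set of authorizations such as {C,D}. The function names (tokenize, is_visible) are hypothetical, and this is not Accumulo's actual API or parser. The sketch assumes labels are alphanumeric strings and that & binds tighter than | when parentheses are omitted; the proposal specifies neither.

```python
import re

# Hypothetical helpers for illustration only; not Accumulo's real API.
# Assumption: labels consist of letters, digits, and underscores.
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[()&|])")


def tokenize(expression):
    """Split a visibility expression into labels and the symbols ( ) & |."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()  # always parse the operand, then combine
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_term()
            result = result and right
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in {")", "&", "|"}:
            raise ValueError(f"unexpected token {token!r}")
        # A bare label is satisfied when the caller holds that authorization.
        return token in authorizations

    visible = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return visible
```

With the examples from the text, is_visible("(A&B)|C", {"C", "D"}) returns True because C alone satisfies the expression, while the authorizations {A, D} would not.
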
> ==== Iterators ====
> Accumulo has a novel server-side programming mechanism that can modify the 
> data written to disk or returned to the user.  This mechanism can be 
> configured for any of the scopes where data is read from or written to disk.  
> It can be used to perform joins on data within a single tablet.
>
> ==== Flexibility ====
> HBase requires the user to specify the set of column families to be used up 
> front.  Accumulo places no restrictions on the column families.  Also, each 
> column family in HBase is stored separately on disk.  Accumulo allows column 
> families to be grouped together on disk, as does BigTable.  This enables 
> users to configure how their data is stored, potentially providing 
> improvements in compression and lookup speeds.  It gives Accumulo a 
> row/column hybrid nature, while HBase is currently column-oriented.
>
> ==== Testing ====
> Accumulo has testing frameworks that have resulted in its achieving a high 
> level of correctness and performance.  We have observed that under some 
> configurations and conditions Accumulo will outperform HBase and provide 
> greater data integrity.
>
> ==== Logging ====
> HBase uses a write-ahead log on the Hadoop Distributed File System.  Accumulo 
> has its own logging service that does not depend on communication with the 
> HDFS NameNode.
>
> ==== Storage ====
> Accumulo has a relative key file format that improves compression.
>
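The proposal does not describe the relative key format itself. One plausible reading is that each key in a sorted file is stored relative to the key before it, so repeated fields need not be written again. The Python sketch below illustrates that general idea only; the function names and the tuple layout (row, column family, column qualifier, column visibility, timestamp) are assumptions made for the example and do not represent Accumulo's on-disk format.

```python
# Generic illustration of relative (delta) key encoding; an assumption of
# this example, not Accumulo's actual file format.
# None marks "same as the previous key", so real field values must not be None.


def encode_relative(keys):
    """Replace each field that repeats the previous key's field with None."""
    encoded, previous = [], None
    for key in keys:
        if previous is None:
            # The first key has nothing to be relative to; store it in full.
            encoded.append(tuple(key))
        else:
            encoded.append(tuple(
                None if field == prev_field else field
                for field, prev_field in zip(key, previous)
            ))
        previous = key
    return encoded


def decode_relative(encoded):
    """Rebuild full keys by copying None fields from the previous key."""
    keys, previous = [], None
    for entry in encoded:
        if previous is None:
            key = tuple(entry)
        else:
            key = tuple(
                prev_field if field is None else field
                for field, prev_field in zip(entry, previous)
            )
        keys.append(key)
        previous = key
    return keys
```

Because the keys are sorted, neighbouring entries tend to share a row and column family, so many fields collapse to the "same as previous" marker before any general-purpose compression is applied.
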
> ==== Areas in which HBase features improvements over Accumulo ====
> In-memory tables, upserts, coprocessors, and connections to other projects 
> such as Cascading and Pig.
>
> === Expectations ===
> There is a risk that Accumulo will be criticized for not providing adequate 
> security.  The access labels in Accumulo do not in themselves provide a 
> complete security solution, but are a mechanism for labeling each piece of 
> data with the authorizations that are necessary to see it.
>
> === Apache Brand ===
> Our interest in releasing this code as an Apache incubator project is due to 
> its strong relationship with other Apache projects, i.e. Hadoop, Zookeeper, 
> and HBase.
>
> == Documentation ==
> There is not currently documentation about Accumulo on the web, but a fair 
> amount of documentation and training materials exists and will be provided on 
> the Accumulo wiki at apache.org.  Also, a paper discussing YCSB results for 
> Accumulo will be presented at the 2011 Symposium on Cloud Computing.
>
> == Initial Source ==
> Accumulo has been in development since spring 2008.  There are hundreds of 
> developers using it and tens of developers have contributed to it.  The core 
> codebase consists of 200,000 lines of code (mainly Java) and 100s of pages of 
> documentation.  There are also a few projects built on top of Accumulo that 
> may be added to its contrib in the future.  These include support for Hive, 
> Matlab, YCSB, and graph processing.
>
> == Source and Intellectual Property Submission Plan ==
> Accumulo core code, examples, documentation, and training materials will be 
> submitted by the National Security Agency.
>
> We will also be soliciting contributions of further plugins from MIT Lincoln 
> Labs, Carnegie Mellon University, and others.
>
> Accumulo has been developed by a mix of government employees and private 
> companies under government contract.  Material developed by government 
> employees is in the public domain and no U.S. copyright exists in works of 
> the federal government.  For the contractor developed material in the initial 
> submission, the U.S. Government has sufficient authority per the ICLA from 
> the copyright owner to contribute the Accumulo code to the incubator.
>
> There has been some discussion regarding accepting contributions from US 
> Government sources on [https://issues.apache.org/jira/browse/LEGAL-93 
> LEGAL-93]. We propose that the NSA will sign an ICLA/CCLA if that document 
> could be slightly modified to explicitly address copyright in works of 
> government employees. Specifically, we propose that the definition of “You” 
> be modified to include “the copyright owner, the owner of a Contribution not 
> subject to copyright, or legal entity authorized by the copyright owner that 
> is making this Agreement.” In addition, we propose that section 2, the 
> copyright license grant, be modified so that the text after “You hereby 
> grant” includes either “to the extent authorized by law” or “to the extent 
> copyright exists in the Contribution.”  
> These changes will permit US Government employee developed work to be 
> included.
>
> One proposed solution is to form a Collaborative Research and Development 
> Agreement (CRADA) between the Apache Software Foundation and the US 
> Government, but this will not solve the underlying problem that U.S. law does 
> not grant copyright to works of government employees.  At this time a CRADA 
> is not necessary, but should one be determined to be necessary, we would 
> like to work through that process during the incubation phase of Accumulo 
> rather than before acceptance, as entering into such an agreement may take 
> time.
>
> == External Dependencies ==
> jetty (Apache and EPL), jline (BSD), jfreechart (LGPL), jcommon (LGPL), slf4j 
> (MIT), junit (CPL)
>
> == Cryptography ==
> none
>
> == Required Resources ==
>  * Mailing Lists
>   * accumulo-private
>   * accumulo-dev
>   * accumulo-commits
>   * accumulo-user
>
>  * Subversion Directory
>   * https://svn.apache.org/repos/asf/incubator/accumulo
>
>  * Issue Tracking
>   * JIRA Accumulo (ACCUMULO)
>
>  * Continuous Integration
>   * Jenkins builds on https://builds.apache.org/
>
>  * Web
>   * http://incubator.apache.org/accumulo/
>   * wiki at http://wiki.apache.org or http://cwiki.apache.org
>
> == Initial Committers ==
>  * Aaron Cordova (aaron at cordovas dot org)
>  * Adam Fuchs (adam.p.fuchs at ugov dot gov)
>  * Eric Newton (ecn at swcomplete dot com)
>  * Billie Rinaldi (billie.j.rinaldi at ugov dot gov)
>  * Keith Turner (keith.turner at ptech-llc dot com)
>  * John Vines (john.w.vines at ugov dot gov)
>  * Chris Waring (christopher.a.waring at ugov dot gov)
>
> == Affiliations ==
>  * Aaron Cordova, The Interllective
>  * Adam Fuchs, National Security Agency
>  * Eric Newton, SW Complete Incorporated
>  * Billie Rinaldi, National Security Agency
>  * Keith Turner, Peterson Technology LLC
>  * John Vines, National Security Agency
>  * Chris Waring, National Security Agency
>
> == Sponsors ==
>  * Champion: Doug Cutting
>  * Nominated Mentors: Benson Margulies, ?, ?
>  * Sponsoring Entity: Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
