Greetings,

I would like to propose Accumulo to be an Apache Incubator project.  Accumulo 
is a distributed key/value store that provides expressive cell-level access 
labels and a server-side programming mechanism that can modify key/value pairs 
at various points in the data management process.  It is based on Google's 
BigTable design and runs over Apache Hadoop and Zookeeper.

Here is a link to the proposal in the Incubator wiki:
http://wiki.apache.org/incubator/AccumuloProposal

I've also pasted the initial contents below.

Thanks,
Billie Rinaldi


= Accumulo Proposal =

== Abstract ==
Accumulo is a distributed key/value store that provides expressive, cell-level 
access labels.

== Proposal ==
Accumulo is a sorted, distributed key/value store based on Google's BigTable 
design.  It is built on top of Apache Hadoop, Zookeeper, and Thrift.  It 
features a few novel improvements on the BigTable design in the form of 
cell-level access labels and a server-side programming mechanism that can 
modify key/value pairs at various points in the data management process.

== Background ==
Google published the design of BigTable in 2006.  Several other open source 
projects have implemented aspects of this design including HBase, CloudStore, 
and Cassandra.  Accumulo began its development in 2008.

== Rationale ==
There is a need for a flexible, high performance distributed key/value store 
that provides expressive, fine-grained access labels.  The communities we 
expect to be most interested in such a project are government, health care, and 
other industries where privacy is a concern.  We have made much progress in 
developing this project over the past 3 years and believe both the project and 
the interested communities would benefit from this work being openly available 
and having open development.

== Current Status ==

=== Meritocracy ===
We intend to strongly encourage the community to help with and contribute to 
the code.  We will actively seek potential committers and help them become 
familiar with the codebase.

=== Community ===
A strong government community has developed around Accumulo and training 
classes have been ongoing for about a year.  Hundreds of developers use 
Accumulo.

=== Core Developers ===
The developers are mainly employed by the National Security Agency, but we 
anticipate interest developing among other companies.

=== Alignment ===
Accumulo is built on top of Hadoop, Zookeeper, and Thrift.  It builds with 
Maven.  Due to the strong relationship with these Apache projects, the 
incubator is a good match for Accumulo.

== Known Risks ==
=== Orphaned Products ===
There is only a small risk of being orphaned.  The community is committed to 
improving the codebase because the project fulfills needs not addressed by any 
other software.

=== Inexperience with Open Source ===
The codebase has been treated internally as an open source project since its 
beginning, and the initial Apache committers have been involved with the code 
for multiple years.  While our experience with public open source is limited, 
we do not anticipate difficulty in operating under Apache's development process.

=== Homogeneous Developers ===
The committers have multiple employers and it is expected that committers from 
different companies will be recruited.

=== Reliance on Salaried Developers ===
The initial committers are all paid by their employers to work on Accumulo and 
we expect such employment to continue.  Some of the initial committers would 
continue as volunteers even if no longer employed to do so.

=== Relationships with Other Apache Products ===
Accumulo uses Hadoop, Zookeeper, Thrift, Maven, log4j, commons-lang, -net, -io, 
-jci, -collections, -configuration, -logging, and -codec.

=== Relationship to HBase ===
Accumulo and HBase are both based on the design of Google's BigTable, so there 
is a danger that potential users will have difficulty distinguishing the two or 
that they will not see an incentive in adopting Accumulo.  There are a few key 
areas in which Accumulo differs from HBase.  Some of the desired features of 
Accumulo could be incorporated into HBase; however, the most important of these 
may be unlikely to be adopted (see cell-level access labels and iterators 
below).  It is a possibility that the codebases will ultimately converge, but 
the number of differences at the current time warrants a separate project for 
Accumulo.

==== Access Labels ====
Accumulo has an additional portion of its key that sorts after the column 
qualifier and before the timestamp.  It is called column visibility and enables 
expressive cell-level access control.  Authorizations are passed with each 
query to control what data is returned to the user.  The column visibilities 
are boolean AND and OR combinations of arbitrary strings (such as "(A&B)|C") 
and authorizations are sets of strings (such as {C,D}).
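
As an illustration of how a column visibility might be checked against a set of 
authorizations, here is a minimal Java sketch.  The class and method names 
(VisibilitySketch, isVisible) are hypothetical and are not Accumulo's API.  The 
sketch assumes labels are alphanumeric and, as its own simplification, lets & 
bind more tightly than | when parentheses are omitted.

```java
import java.util.Set;

/** Illustrative evaluator for visibility expressions such as "(A&B)|C". */
final class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr;
        this.auths = auths;
    }

    /** Returns true if the authorization set satisfies the expression. */
    static boolean isVisible(String expr, Set<String> auths) {
        VisibilitySketch parser = new VisibilitySketch(expr, auths);
        boolean result = parser.parseOr();
        if (parser.pos != expr.length()) {
            throw new IllegalArgumentException("Unexpected character at " + parser.pos);
        }
        return result;
    }

    // or := and ('|' and)*
    private boolean parseOr() {
        boolean value = parseAnd();
        while (pos < expr.length() && expr.charAt(pos) == '|') {
            pos++;
            boolean rhs = parseAnd(); // always parse so the position advances
            value = value || rhs;
        }
        return value;
    }

    // and := atom ('&' atom)*
    private boolean parseAnd() {
        boolean value = parseAtom();
        while (pos < expr.length() && expr.charAt(pos) == '&') {
            pos++;
            boolean rhs = parseAtom();
            value = value && rhs;
        }
        return value;
    }

    // atom := '(' or ')' | label
    private boolean parseAtom() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean value = parseOr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("Missing ')' at " + pos);
            }
            pos++;
            return value;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected label at " + pos);
        }
        // A label is satisfied only if the caller holds that authorization.
        return auths.contains(expr.substring(start, pos));
    }
}
```

With the authorizations {C,D}, isVisible("(A&B)|C", auths) returns true because 
C is held, while isVisible("A&B", auths) returns false.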

==== Iterators ====
Accumulo has a novel server-side programming mechanism that can modify the data 
written to disk or returned to the user.  This mechanism can be configured for 
any of the scopes where data is read from or written to disk.  It can be used 
to perform joins on data within a single tablet.
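
The general idea of such a mechanism can be sketched in Java as a wrapper 
around a sorted stream of key/value pairs that rewrites the stream as it passes 
through.  This is only an assumption about the shape of the idea: the name 
SummingIteratorSketch is hypothetical, and plain java.util.Iterator stands in 
for whatever interface Accumulo actually defines.  The example sums the values 
of consecutive entries that share a key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map.Entry;
import java.util.NoSuchElementException;

/** Hypothetical iterator that sums the values of consecutive entries sharing a key. */
final class SummingIteratorSketch implements Iterator<Entry<String, Long>> {
    private final Iterator<Entry<String, Long>> source; // must yield keys in sorted order
    private Entry<String, Long> pending;                // first entry of the next key group

    SummingIteratorSketch(Iterator<Entry<String, Long>> source) {
        this.source = source;
        this.pending = source.hasNext() ? source.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public Entry<String, Long> next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        String key = pending.getKey();
        long sum = pending.getValue();
        pending = null;
        // Consume every following entry with the same key, then stop at the next key.
        while (source.hasNext()) {
            Entry<String, Long> entry = source.next();
            if (entry.getKey().equals(key)) {
                sum += entry.getValue();
            } else {
                pending = entry;
                break;
            }
        }
        return new SimpleEntry<String, Long>(key, sum);
    }

    /** Convenience helper: drains an iterator into a list. */
    static List<Entry<String, Long>> drain(Iterator<Entry<String, Long>> it) {
        List<Entry<String, Long>> out = new ArrayList<Entry<String, Long>>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Because the wrapper consumes and produces the same kind of sorted stream, the 
same logic could in principle be applied wherever data is read from or written 
to disk.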

==== Flexibility ====
HBase requires the user to specify the set of column families to be used up 
front.  Accumulo places no restrictions on the column families.  Also, each 
column family in HBase is stored separately on disk.  Accumulo allows column 
families to be grouped together on disk, as does BigTable.  This enables users 
to configure how their data is stored, potentially providing improvements in 
compression and lookup speeds.  It gives Accumulo a row/column hybrid nature, 
while HBase is currently column-oriented.

==== Testing ====
Accumulo has testing frameworks that have resulted in its achieving a high 
level of correctness and performance.  We have observed that under some 
configurations and conditions Accumulo will outperform HBase and provide 
greater data integrity.

==== Logging ====
HBase uses a write-ahead log on the Hadoop Distributed File System.  Accumulo 
has its own logging service that does not depend on communication with the HDFS 
NameNode.

==== Storage ====
Accumulo has a relative key file format that improves compression.
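
To illustrate why storing keys relative to their predecessor can help 
compression, here is a minimal Java sketch.  It assumes, for illustration only, 
that a key is an array of string fields (for example row, column family, 
column qualifier) and that a field equal to the same field in the previous key 
is replaced by a "same as previous" marker (null here).  The class name 
RelativeKeySketch and this layout are hypothetical and do not describe 
Accumulo's actual file format.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical relative encoding of sorted keys: repeat only the fields that changed. */
final class RelativeKeySketch {

    /** Encodes each key relative to the previous one; null means "same as previous". */
    static List<String[]> encode(List<String[]> sortedKeys) {
        List<String[]> out = new ArrayList<String[]>();
        String[] prev = null;
        for (String[] key : sortedKeys) {
            String[] rel = new String[key.length];
            for (int i = 0; i < key.length; i++) {
                boolean same = prev != null && prev[i].equals(key[i]);
                rel[i] = same ? null : key[i];
            }
            out.add(rel);
            prev = key;
        }
        return out;
    }

    /** Reverses encode() by filling each null field from the previously decoded key. */
    static List<String[]> decode(List<String[]> relativeKeys) {
        List<String[]> out = new ArrayList<String[]>();
        String[] prev = null;
        for (String[] rel : relativeKeys) {
            String[] key = new String[rel.length];
            for (int i = 0; i < rel.length; i++) {
                key[i] = rel[i] != null ? rel[i] : prev[i];
            }
            out.add(key);
            prev = key;
        }
        return out;
    }
}
```

In sorted data, neighboring keys often share a row and column family, so most 
fields collapse to the marker and fewer bytes need to be stored.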

==== Areas in which HBase features improvements over Accumulo ====
HBase currently offers in-memory tables, upserts, coprocessors, and connections 
to other projects such as Cascading and Pig.

=== Expectations ===
There is a risk that Accumulo will be criticized for not providing adequate 
security.  The access labels in Accumulo do not in themselves provide a 
complete security solution, but are a mechanism for labeling each piece of data 
with the authorizations that are necessary to see it.

=== Apache Brand ===
Our interest in releasing this code as an Apache incubator project is due to 
its strong relationship with other Apache projects, i.e. Hadoop, Zookeeper, and 
HBase.

== Documentation ==
There is currently no documentation about Accumulo on the web, but a fair 
amount of documentation and training material exists and will be provided on 
the Accumulo wiki at apache.org.  Also, a paper discussing YCSB results for 
Accumulo will be presented at the 2011 Symposium on Cloud Computing.

== Initial Source ==
Accumulo has been in development since spring 2008.  There are hundreds of 
developers using it and tens of developers have contributed to it.  The core 
codebase consists of 200,000 lines of code (mainly Java) and hundreds of pages of 
documentation.  There are also a few projects built on top of Accumulo that may 
be added to its contrib in the future.  These include support for Hive, Matlab, 
YCSB, and graph processing.

== Source and Intellectual Property Submission Plan ==
Accumulo core code, examples, documentation, and training materials will be 
submitted by the National Security Agency.

We will also be soliciting contributions of further plugins from MIT Lincoln 
Labs, Carnegie Mellon University, and others.

Accumulo has been developed by a mix of government employees and private 
companies under government contract.  Material developed by government 
employees is in the public domain and no U.S. copyright exists in works of the 
federal government.  For the contractor developed material in the initial 
submission, the U.S. Government has sufficient authority per the ICLA from the 
copyright owner to contribute the Accumulo code to the incubator.

There has been some discussion regarding accepting contributions from US 
Government sources on [https://issues.apache.org/jira/browse/LEGAL-93 
LEGAL-93]. We propose that the NSA sign an ICLA/CCLA, provided that document 
can be slightly modified to explicitly address copyright in works of 
government employees. Specifically, we propose that the definition of “You” be 
modified to include “the copyright owner, the owner of a Contribution not 
subject to copyright, or legal entity authorized by the copyright owner that is 
making this Agreement.” In addition, we propose that section 2, the copyright 
license grant, be modified after “You hereby grant” to state either “to the 
extent authorized by law” or “to the extent copyright exists in the 
Contribution.”  These changes will permit work developed by US Government 
employees to be included.

One proposed solution is to form a Cooperative Research and Development 
Agreement (CRADA) between the Apache Software Foundation and the US Government, 
but this will not solve the underlying problem that U.S. law does not grant 
copyright to works of government employees.  At this time a CRADA is not 
necessary.  Should it be determined that one is necessary, we would like to 
work through that process during the incubation phase of Accumulo rather than 
before acceptance, as entering into such an agreement may take time.

== External Dependencies ==
jetty (Apache and EPL), jline (BSD), jfreechart (LGPL), jcommon (LGPL), slf4j 
(MIT), junit (CPL)

== Cryptography ==
none

== Required Resources ==
 * Mailing Lists
   * accumulo-private
   * accumulo-dev
   * accumulo-commits
   * accumulo-user

 * Subversion Directory
   * https://svn.apache.org/repos/asf/incubator/accumulo

 * Issue Tracking
   * JIRA Accumulo (ACCUMULO)

 * Continuous Integration
   * Jenkins builds on https://builds.apache.org/

 * Web
   * http://incubator.apache.org/accumulo/
   * wiki at http://wiki.apache.org or http://cwiki.apache.org

== Initial Committers ==
 * Aaron Cordova (aaron at cordovas dot org)
 * Adam Fuchs (adam.p.fuchs at ugov dot gov)
 * Eric Newton (ecn at swcomplete dot com)
 * Billie Rinaldi (billie.j.rinaldi at ugov dot gov)
 * Keith Turner (keith.turner at ptech-llc dot com)
 * John Vines (john.w.vines at ugov dot gov)
 * Chris Waring (christopher.a.waring at ugov dot gov)

== Affiliations ==
 * Aaron Cordova, The Interllective
 * Adam Fuchs, National Security Agency
 * Eric Newton, SW Complete Incorporated
 * Billie Rinaldi, National Security Agency
 * Keith Turner, Peterson Technology LLC
 * John Vines, National Security Agency
 * Chris Waring, National Security Agency

== Sponsors ==
 * Champion: Doug Cutting
 * Nominated Mentors: Benson Margulies, ?, ?
 * Sponsoring Entity: Apache Incubator
