Hi, if you read the proposal all the way to the end you will see that - while we do have some community and code - we don’t have mentors. So if you like the proposal, please volunteer.
Cheers,
Till

> On Jan 14, 2015, at 6:21 PM, Mattmann, Chris A (3980)
> <chris.a.mattm...@jpl.nasa.gov> wrote:
>
> Hi Folks,
>
> I am pleased to bring forth the Apache AsterixDB proposal to the
> Apache Incubator as Champion, working in collaboration with the
> team. Please find the wiki proposal here:
>
> https://wiki.apache.org/incubator/AsterixDBProposal
>
> Full text of the proposal is below. Please discuss and enjoy. I’ll
> leave the discussion open for a week, and then look to call a VOTE
> hopefully end of next week if all is well.
>
> Cheers!
> Chris Mattmann
>
> =============================================================
> Apache AsterixDB Proposal
>
> Abstract
>
> Apache AsterixDB is a scalable big data management system (BDMS) that
> provides storage, management, and query capabilities for large
> collections of semi-structured data.
>
> Proposal
>
> AsterixDB is a big data management system (BDMS) that is
> well-suited to needs such as web data warehousing and social data
> storage and analysis. Feature-wise, AsterixDB has:
>
> * A NoSQL style data model (ADM) based on extending JSON with object
>   database concepts.
> * An expressive and declarative query language (AQL) for querying
>   semi-structured data.
> * A runtime query execution engine, Hyracks, for partitioned-parallel
>   execution of query plans.
> * Partitioned LSM-based data storage and indexing for efficient
>   ingestion of newly arriving data.
> * Support for querying and indexing external data (e.g., in HDFS) as
>   well as data stored within AsterixDB.
> * A rich set of primitive data types, including support for spatial,
>   temporal, and textual data.
> * Indexing options that include B+ trees, R trees, and inverted
>   keyword index support.
> * Basic transactional (concurrency and recovery) capabilities akin to
>   those of a NoSQL store.
>
> Background and Rationale
>
> In the world of relational databases, the need to tackle data volumes
> that exceed the capabilities of a single server led to the
> development of “shared-nothing” parallel database systems several
> decades ago. These systems spread data over a cluster based on a
> partitioning strategy, such as hash partitioning, and queries are
> processed by employing partitioned-parallel divide-and-conquer
> techniques. Since these systems are fronted by a high-level,
> declarative language (SQL), their users are shielded from the
> complexities of parallel programming. Parallel database systems have
> been an extremely successful application of parallel computing, and
> quite a number of commercial products exist today.
>
> In the distributed systems world, the Web brought a need to index and
> query its huge content. SQL and relational databases were not the
> answer, though shared-nothing clusters again emerged as the hardware
> platform of choice. Google developed the Google File System (GFS) and
> MapReduce programming model to allow programmers to store and process
> Big Data by writing a few user-defined functions. The MapReduce
> framework applies these functions in parallel to data instances in
> distributed files (map) and to sorted groups of instances sharing a
> common key (reduce) -- not unlike the partitioned parallelism in
> parallel database systems. Apache's Hadoop MapReduce platform is the
> most prominent implementation of this paradigm for the rest of the
> Big Data community. On top of Hadoop and HDFS sit declarative
> languages like Pig and Hive that each compile down to Hadoop
> MapReduce jobs.
>
> The big Web companies were also challenged by extreme user bases
> (100s of millions of users) and needed fast simple lookups and
> updates to very large keyed data sets like user profiles. SQL
> databases were deemed either too expensive or not scalable, so the
> “NoSQL movement” was born.
> The ASF now has HBase and Cassandra, two
> popular key-value stores, in this space. MongoDB and Couchbase are
> other open source alternatives (document stores).
>
> It is evident from the rapidly growing popularity of "NoSQL" stores,
> as well as the strong demand for Big Data analytics engines today,
> that there is a strong (and growing!) need to store, process, *and*
> query large volumes of semi-structured data in many application
> areas. Until very recently, developers have had to "choose" between
> using big data analytics engines like Apache Hive or Apache Spark,
> which can do complex query processing and analysis over HDFS-resident
> files, and flexible but low-function data stores like MongoDB or
> Apache HBase. (The Apache Phoenix project,
> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> aims to bridge between these choices.)
>
> AsterixDB is a highly scalable data management system that can store,
> index, and manage semi-structured data, e.g., much like MongoDB, but
> it also supports a full-power query language with the expressiveness
> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> stores and manages data, so AsterixDB can exploit its knowledge of
> data partitioning and the availability of indexes to avoid always
> scanning data set(s) to process queries. Somewhat surprisingly, there
> is no open source parallel database system (relational or otherwise)
> available to developers today -- AsterixDB aims to fill this need.
> Since Apache is where the majority of today's most important Big
> Data technologies live, the ASF seems like the obvious home for a
> system like AsterixDB.
>
> Current Status
>
> The current version of AsterixDB was co-developed by a team of
> faculty, staff, and students at UC Irvine and UC Riverside.
> The project was initiated as a large NSF-sponsored project in 2009,
> the goal of which was to combine the best ideas from the parallel
> database world, the then new Hadoop world, and the semi-structured
> (e.g., XML/JSON) data world in order to create a next-generation
> BDMS. A first informal open source release was made four years later,
> in June of 2013, under the Apache Software License 2.0.
>
> Meritocracy
>
> The current developers are familiar with meritocratic open source
> development at Apache. Apache was chosen specifically because we want
> to encourage this style of development for the project.
>
> Community
>
> While AsterixDB started as a university project, it has developed into
> a community. A number of the initial committers started contributing
> in academia and continue to actively participate and contribute after
> graduation. We seek to further develop the developer and user
> communities. One ongoing way to broaden the community is
> through academic collaborations (currently with IIT Mumbai in India
> and TU Berlin in Germany). During incubation we will also explicitly
> seek increased industrial participation.
>
> Some indicators of the effort's development community and history can
> be found at:
> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>
> Core Developers
>
> The core developers of the project are diverse, although initially UC
> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> other 50% are from other academic institutions (UC Riverside and the
> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>
> Alignment
>
> Apache is, by far, the most natural home for taking the AsterixDB
> project forward.
> A large fraction of today's top Big Data
> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> significant gap -- the parallel data management system gap -- that
> exists in the Big Data open source world. It is well-aligned with a
> number of the Apache projects, e.g., it has strong support for
> accessing and indexing external data in HDFS, and it uses YARN as an
> answer to basic cluster resource management. AsterixDB also seeks to
> achieve an Apache-style development model; it is seeking a broader
> community of contributors and users in order to achieve its full
> potential and value to the Big Data community.
>
> There are also a number of related Apache projects and dependencies
> that will be mentioned below in the Relationships with Other Apache
> products section.
>
> Known Risks
>
> Orphaned products
>
> Given the current level of intellectual investment in AsterixDB, the
> risk of the project being abandoned is very small. The UCI/UCR
> faculty team leads are highly incentivized to continue development
> since the database groups at UC Irvine and UC Riverside are both
> reliant on AsterixDB as a platform for long-term graduate research
> projects. UC San Diego is also beginning to contribute to the code
> base, and a collaboration involving public health applications is
> forming with UCLA. The work on AsterixDB is managed via
> mailing list discussions supplemented by weekly project status
> meetings which are summarized on the mailing list. Typical (local
> plus Skype-in) attendance at the weekly status meetings runs at about
> 20 active contributors.
>
> Inexperience with Open Source
>
> AsterixDB and Hyracks were completely developed in Open Source under
> the ASL 2.0.
> The source code repositories, issue tracker, and mailing
> lists are available on Google Code and discussions and decisions
> happen on the mailing lists (which is necessary due to the geographic
> distribution of the current developers).
>
> Also, a few of the initial committers have contributed to Apache
> projects. Vinayak Borkar is a committer on the Apache Helix and
> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> and an IPMC member. Preston Carman and Steven Jacobs are committers
> on the Apache VXQuery project.
>
> Relationships with Other Apache Products
>
> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> is also included in the AsterixDB code base.
>
> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> is support for accessing external data in HDFS (and Hive formats),
> and resource management and system administration features are in the
> process of being migrated to YARN.
>
> AsterixDB's AQL query facilities offer comparable query power to
> Apache's Pig and Hive systems for big data analytics. AsterixDB
> differs in storing and indexing data and thus being able to quickly
> answer small and medium queries without large HDFS data scans -
> thereby targeting a different class of use cases.
>
> AsterixDB's data storage and indexing facilities are similar to those
> of HBase, but AsterixDB differs in being a much more complete and
> queryable BDMS (not just a key-value style store).
>
> AsterixDB's target use cases are not in-memory processing or
> iterative algorithm support, making AsterixDB complementary to the
> Apache Spark platform. (Spark interoperability is on our longer-term
> to-do wishlist.)
>
> Homogeneous Developers
>
> As mentioned before, the current community is already organizationally
> and geographically distributed - and we would like to increase the
> heterogeneity.
>
> Reliance on Salaried Developers
>
> Of the initial committers only 3 are full-time UCI staff.
> The other committers are a mix of students, alumni who continue to
> contribute to the effort, and individuals working with permission
> part-time (or in spare time) on this project.
>
> An Excessive Fascination with the Apache Brand
>
> We believe in the processes, systems, and framework Apache has put in
> place. Apache is also known to foster a great community around their
> projects and provide exposure. While brand is important, our
> fascination with it is not excessive. We believe that the ASF is the
> right home for AsterixDB and that having AsterixDB inside of the ASF
> will lead to a better long-term outcome for the Big Data community.
>
> Documentation
>
> Documentation and publications related to AsterixDB can be found at
> http://asterixdb.ics.uci.edu/.
>
> Initial Source
>
> Current source resides in Google code:
> https://code.google.com/p/asterixdb/ (query language and upper system
> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> system and storage management libraries).
>
> External Dependencies
>
> AsterixDB depends on a number of Apache projects:
>
> - Ant
> - Avro
> - ApacheDB JDO
> - Commons
> - Derby
> - Hadoop
> - Hive
> - HTTPComponents
> - Jakarta ORO
> - Maven
> - Tomcat
> - Thrift
> - Velocity
> - Wicket
> - Xerces
>
> and other open source projects (organized by license):
>
> -- ASL 2.0:
> - Jackson
> - Google Guava
> - Google Guice
> - JSON-simple
> - BoneCP
> - Microsoft Azure SDK
> - Netty
> - Rome
> - JetS3t
> - Groovy
> - Jettison
> - Plexus
> - Datanucleus (JDO)
> - Jetty
> - Twitter4J
> - Snappy-java
>
> -- BSD:
> - Antlr
> - ObjectWeb ASM
> - Protobuf
> - JSCH
> - JavaCC
> - Paranamer
> - JLine
> - Stax
> - StringTemplate
> - xmlEnc
>
> -- MIT
> - AppAssembler
> - SimpleLog4J
>
> -- CDDL 1.0
> - Java Activation Framework
> - Java Transactions
> - Java Servlet API
> - Grizzly
> - gmbal
> - Glassfish
>
> -- CDDL 1.1
> - Jersey
> - JAXB Reference Implementation
>
> -- JSON License
> - JSON
>
> -- EPL 1.0
> - JUnit
>
> -- JDOM License
> - JDOM
>
> -- Public Domain
> - xz
> - AOPAlliance
>
> As all dependencies are managed using Apache Maven, none of the
> external libraries need to be packaged in a source distribution.
>
> Required Resources
>
> Developer and user mailing lists
>
> priv...@asterixdb.incubator.apache.org (with moderated subscriptions)
> comm...@asterixdb.incubator.apache.org
> d...@asterixdb.incubator.apache.org
> us...@asterixdb.incubator.apache.org
>
> A git repository
>
> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>
> A JIRA issue tracker
>
> https://issues.apache.org/jira/browse/ASTERIXDB
>
> Initial Committers
>
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository at Google
> code).
>
> Abdullah Alamoudi (bamou...@gmail.com)
> Cameron Samak (euf...@gmail.com)
> Chen Li (che...@gmail.com)
> Ian Maxon (ima...@uci.edu)
> Ildar Absalyamov (ildar.absalya...@gmail.com)
> Jianfeng Jia (jianfeng....@gmail.com)
> Keren Ouaknine (ker...@gmail.com)
> Markus Dreseler (apa...@dreseler.de)
> Mike Carey (dtab...@apache.org)
> Murtadha Hubail (hubail...@gmail.com)
> Pouria Pirzadeh (pouria.pirza...@gmail.com)
> Preston Carman (prest...@apache.org)
> Raman Grover (ramangrove...@gmail.com)
> Sattam Alsubaiee (salsuba...@gmail.com)
> Steven Jacobs (sjaco...@apache.org)
> Taewoo Kim (wangs...@gmail.com)
> Till Westmann (ti...@apache.org)
> Vinayak Borkar (vinay...@apache.org)
> Yingyi Bu (buyin...@gmail.com)
> Young-Seok Kim (kiss...@gmail.com)
> Zach Heilbron (zheilb...@gmail.com)
>
> Affiliations
>
> UC Irvine
> - Mike Carey
> - Chen Li
> - Ian Maxon
> - Yingyi Bu
> - Raman Grover
> - Pouria Pirzadeh
> - Young-Seok Kim
> - Cameron Samak
> - Taewoo Kim
> - Jianfeng Jia
> - Murtadha Hubail
> - Markus Dreseler
>
> UC Riverside
> - Ildar Absalyamov
> - Preston Carman
> - Steven Jacobs
>
> Hebrew University
> - Keren Ouaknine
>
> Oracle
> - Till Westmann
>
> X15 Software
> - Vinayak Borkar
> - Zach Heilbron
>
> KACST Saudi Arabia
> - Sattam Alsubaiee
>
> Saudi Aramco
> - Abdullah Alamoudi
>
> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> non-UC committers are a mix of alumni who continue to contribute to
> the effort and individuals working with permission part-time (or in
> spare time) on this project.
>
> Sponsors
>
> Champion
>
> Chris Mattmann (NASA/JPL)
>
> Nominated Mentors
>
> TBD
>
> Sponsoring Entity
>
> The Apache Incubator
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++