Hi Henry, thanks! It’s great that you’ve seen (and liked) AsterixDB before.
Even if your time is very limited, we would be very happy to have you
on board as a mentor. I’ll add you to the proposal.

Cheers,
Till

> On Jan 19, 2015, at 10:26 AM, Henry Saputra <henry.sapu...@gmail.com> wrote:
>
> +1 This is GREAT News!
>
> Was watching and trying AsterixDB last year and looked in awesome shape.
>
> I have my plate full but would love to help mentor this project to get
> it going to ASF if needed!
>
> - Henry
>
> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
> <chris.a.mattm...@jpl.nasa.gov> wrote:
>> Hi Folks,
>>
>> I am pleased to bring forth the Apache AsterixDB proposal to the
>> Apache Incubator as Champion, working in collaboration with the
>> team. Please find the wiki proposal here:
>>
>> https://wiki.apache.org/incubator/AsterixDBProposal
>>
>>
>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>> leave the discussion open for a week, and then look to call a VOTE
>> hopefully end of next week if all is well.
>>
>> Cheers!
>> Chris Mattmann
>>
>> =============================================================
>> Apache AsterixDB Proposal
>>
>> Abstract
>>
>> Apache AsterixDB is a scalable big data management system (BDMS) that
>> provides storage, management, and query capabilities for large
>> collections of semi-structured data.
>>
>> Proposal
>>
>> AsterixDB is a big data management system (BDMS) that is
>> well-suited to needs such as web data warehousing and social data
>> storage and analysis. Feature-wise, AsterixDB has:
>>
>> * A NoSQL style data model (ADM) based on extending JSON with object
>> database concepts.
>> * An expressive and declarative query language (AQL) for querying
>> semi-structured data.
>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>> execution of query plans.
>> * Partitioned LSM-based data storage and indexing for efficient
>> ingestion of newly arriving data.
>> * Support for querying and indexing external data (e.g., in HDFS) as
>> well as data stored within AsterixDB.
>> * A rich set of primitive data types, including support for spatial,
>> temporal, and textual data.
>> * Indexing options that include B+ trees, R trees, and inverted
>> keyword index support.
>> * Basic transactional (concurrency and recovery) capabilities akin to
>> those of a NoSQL store.
>>
>>
>> Background and Rationale
>>
>> In the world of relational databases, the need to tackle data volumes
>> that exceed the capabilities of a single server led to the
>> development of “shared-nothing” parallel database systems several
>> decades ago. These systems spread data over a cluster based on a
>> partitioning strategy, such as hash partitioning, and queries are
>> processed by employing partitioned-parallel divide-and-conquer
>> techniques. Since these systems are fronted by a high-level,
>> declarative language (SQL), their users are shielded from the
>> complexities of parallel programming. Parallel database systems have
>> been an extremely successful application of parallel computing, and
>> quite a number of commercial products exist today.
>>
>> In the distributed systems world, the Web brought a need to index and
>> query its huge content. SQL and relational databases were not the
>> answer, though shared-nothing clusters again emerged as the hardware
>> platform of choice. Google developed the Google File System (GFS) and
>> MapReduce programming model to allow programmers to store and process
>> Big Data by writing a few user-defined functions. The MapReduce
>> framework applies these functions in parallel to data instances in
>> distributed files (map) and to sorted groups of instances sharing a
>> common key (reduce) -- not unlike the partitioned parallelism in
>> parallel database systems. Apache's Hadoop MapReduce platform is the
>> most prominent implementation of this paradigm for the rest of the
>> Big Data community.
>> On top of Hadoop and HDFS sit declarative languages like Pig and Hive
>> that each compile down to Hadoop MapReduce jobs.
>>
>> The big Web companies were also challenged by extreme user bases
>> (100s of millions of users) and needed fast simple lookups and
>> updates to very large keyed data sets like user profiles. SQL
>> databases were deemed either too expensive or not scalable, so the
>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>> popular key-value stores, in this space. MongoDB and Couchbase are
>> other open source alternatives (document stores).
>>
>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>> as well as the strong demand for Big Data analytics engines today,
>> that there is a strong (and growing!) need to store, process, *and*
>> query large volumes of semi-structured data in many application
>> areas. Until very recently, developers have had to "choose" between
>> using big data analytics engines like Apache Hive or Apache Spark,
>> which can do complex query processing and analysis over HDFS-resident
>> files, and flexible but low-function data stores like MongoDB or
>> Apache HBase. (The Apache Phoenix project,
>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>> aims to bridge between these choices.)
>>
>> AsterixDB is a highly scalable data management system that can store,
>> index, and manage semi-structured data, e.g., much like MongoDB, but
>> it also supports a full-power query language with the expressiveness
>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>> stores and manages data, so AsterixDB can exploit its knowledge of
>> data partitioning and the availability of indexes to avoid always
>> scanning data set(s) to process queries. Somewhat surprisingly, there
>> is no open source parallel database system (relational or otherwise)
>> available to developers today -- AsterixDB aims to fill this need.
>> Since Apache is where the majority of today's most important Big
>> Data technologies live, the ASF seems like the obvious home for a
>> system like AsterixDB.
>>
>> Current Status
>>
>> The current version of AsterixDB was co-developed by a team of
>> faculty, staff, and students at UC Irvine and UC Riverside. The
>> project was initiated as a large NSF-sponsored project in 2009, the
>> goal of which was to combine the best ideas from the parallel
>> database world, the then new Hadoop world, and the semi-structured
>> (e.g., XML/JSON) data world in order to create a next-generation
>> BDMS. A first informal open source release was made four years later,
>> in June of 2013, under the Apache Software License 2.0.
>>
>>
>> Meritocracy
>>
>> The current developers are familiar with meritocratic open source
>> development at Apache. Apache was chosen specifically because we want
>> to encourage this style of development for the project.
>>
>>
>> Community
>>
>> While AsterixDB started as a university project it has developed into
>> a community. A number of the initial committers started contributing
>> in academia and continue to actively participate and contribute after
>> graduation. And we seek to further develop developer and user
>> communities. One way to broaden the community that is ongoing is
>> through academic collaborations (currently with IIT Mumbai in India
>> and TU Berlin in Germany). During incubation we will also explicitly
>> seek increased industrial participation.
>>
>> Some indicators of the effort's development community and history can
>> be found at:
>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>
>>
>> Core Developers
>>
>> The core developers of the project are diverse, although initially UC
>> Irvine heavy (roughly 50%) due to the project's origins at UCI.
>> The other 50% are from other academic institutions (UC Riverside and the
>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>
>>
>> Alignment
>>
>> Apache is, by far, the most natural home for taking the AsterixDB
>> project forward. A large fraction of today's top Big Data
>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>> significant gap -- the parallel data management system gap -- that
>> exists in the Big Data open source world. It is well-aligned with a
>> number of the Apache projects, e.g., it has strong support for
>> accessing and indexing external data in HDFS, and it uses YARN as an
>> answer to basic cluster resource management. AsterixDB also seeks to
>> achieve an Apache-style development model; it is seeking a broader
>> community of contributors and users in order to achieve its full
>> potential and value to the Big Data community.
>>
>> There are also a number of related Apache projects and dependencies
>> that will be mentioned below in the Relationships with Other Apache
>> products section.
>>
>>
>> Known Risks
>>
>> Orphaned products
>>
>> Given the current level of intellectual investment in AsterixDB, the
>> risk of the project being abandoned is very small. The UCI/UCR
>> faculty team leads are highly incentivized to continue development
>> since the database groups at UC Irvine and UC Riverside are both
>> reliant on AsterixDB as a platform for long-term graduate research
>> projects. UC San Diego is also beginning to contribute to the code
>> base, and a collaboration involving public health applications is
>> forming with UCLA. The work on AsterixDB is managed via a mix of
>> mailing list discussions supplemented by weekly project status
>> meetings which are summarized on the mailing list.
>> Typical (local plus Skype-in) attendance to the weekly status meetings
>> runs at about 20 active contributors.
>>
>>
>> Inexperience with Open Source
>>
>> AsterixDB and Hyracks were completely developed in Open Source under
>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>> lists are available on Google Code and discussions and decisions
>> happen on the mailing lists (which is necessary due to the geographic
>> distribution of the current developers).
>>
>> Also a few of the initial committers have contributed to Apache
>> projects. Vinayak Borkar is a committer on the Apache Helix and
>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>> on the Apache VXQuery project.
>>
>>
>> Relationships with Other Apache Products
>>
>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>> is also included in the AsterixDB code base.
>>
>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>> is support for accessing external data in HDFS (and Hive formats),
>> and resource management and system administration features are in the
>> process of being migrated to YARN.
>>
>> AsterixDB's AQL query facilities offer comparable query power to
>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>> differs in storing and indexing data and thus being able to quickly
>> answer small and medium queries without large HDFS data scans -
>> thereby targeting a different class of use cases.
>>
>> AsterixDB's data storage and indexing facilities are similar to those
>> of HBase, but AsterixDB differs in being a much more complete and
>> queryable BDMS (not just a key-value style store).
>>
>> AsterixDB's target use cases are not in-memory processing or
>> iterative algorithm support, making AsterixDB complementary to the
>> Apache Spark platform. (Spark interoperability is on our longer-term
>> to-do wishlist.)
>>
>>
>> Homogeneous Developers
>>
>> As mentioned before the current community is already organizationally
>> and geographically distributed - and we would like to increase the
>> heterogeneity.
>>
>>
>> Reliance on Salaried Developers
>>
>> Of the initial committers only 3 are full-time UCI staff. The other
>> committers are a mix of students, alumni who continue to contribute
>> to the effort, and individuals working with permission part-time (or
>> in spare time) on this project.
>>
>>
>> An Excessive Fascination with the Apache Brand
>>
>> We believe in the processes, systems, and framework Apache has put in
>> place. Apache is also known to foster a great community around their
>> projects and provide exposure. While brand is important, our
>> fascination with it is not excessive. We believe that the ASF is the
>> right home for AsterixDB and that having AsterixDB inside of the ASF
>> will lead to a better long-term outcome for the Big Data community.
>>
>>
>> Documentation
>>
>> Documentation and publications related to AsterixDB can be found at
>> http://asterixdb.ics.uci.edu/.
>>
>>
>> Initial Source
>>
>> Current source resides in Google code:
>> https://code.google.com/p/asterixdb/ (query language and upper system
>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>> system and storage management libraries).
>>
>>
>> External Dependencies
>>
>> AsterixDB depends on a number of Apache projects:
>>
>> - Ant
>> - Avro
>> - ApacheDB JDO
>> - Commons
>> - Derby
>> - Hadoop
>> - Hive
>> - HTTPComponents
>> - Jakarta ORO
>> - Maven
>> - Tomcat
>> - Thrift
>> - Velocity
>> - Wicket
>> - Xerces
>>
>> and other open source projects (organized by license):
>>
>> -- ASL 2.0:
>> - Jackson
>> - Google Guava
>> - Google Guice
>> - JSON-simple
>> - BoneCP
>> - Microsoft Azure SDK
>> - Netty
>> - Rome
>> - JetS3t
>> - Groovy
>> - Jettison
>> - Plexus
>> - Datanucleus (JDO)
>> - Jetty
>> - Twitter4J
>> - Snappy-java
>>
>> -- BSD:
>> - Antlr
>> - ObjectWeb ASM
>> - Protobuf
>> - JSCH
>> - JavaCC
>> - Paranamer
>> - JLine
>> - Stax
>> - StringTemplate
>> - xmlEnc
>>
>> -- MIT
>> - AppAssembler
>> - SimpleLog4J
>>
>> -- CDDL 1.0
>> - Java Activation Framework
>> - Java Transactions
>> - Java Servlet API
>> - Grizzly
>> - gmbal
>> - Glassfish
>>
>> -- CDDL 1.1
>> - Jersey
>> - JAXB Reference Implementation
>>
>> -- JSON License
>> - JSON
>>
>> -- EPL 1.0
>> - JUnit
>>
>> -- JDOM License
>> - JDOM
>>
>> -- Public Domain
>> - xz
>> - AOPAlliance
>>
>> As all dependencies are managed using Apache Maven, none of the
>> external libraries need to be packaged in a source distribution.
>>
>>
>> Required Resources
>>
>> Developer and user mailing lists
>>
>> priv...@asterixdb.incubator.apache.org (with moderated subscriptions)
>> comm...@asterixdb.incubator.apache.org
>> d...@asterixdb.incubator.apache.org
>> us...@asterixdb.incubator.apache.org
>>
>>
>> A git repository
>>
>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>
>>
>> A JIRA issue tracker
>>
>> https://issues.apache.org/jira/browse/ASTERIXDB
>>
>>
>> Initial Committers
>>
>> The following is a list of the planned initial Apache committers (the
>> active subset of the committers for the current repository at Google
>> code).
>>
>> Abdullah Alamoudi (bamou...@gmail.com)
>> Cameron Samak (euf...@gmail.com)
>> Chen Li (che...@gmail.com)
>> Ian Maxon (ima...@uci.edu)
>> Ildar Absalyamov (ildar.absalya...@gmail.com)
>> Jianfeng Jia (jianfeng....@gmail.com)
>> Keren Ouaknine (ker...@gmail.com)
>> Markus Dreseler (apa...@dreseler.de)
>> Mike Carey (dtab...@apache.org)
>> Murtadha Hubail (hubail...@gmail.com)
>> Pouria Pirzadeh (pouria.pirza...@gmail.com)
>> Preston Carman (prest...@apache.org)
>> Raman Grover (ramangrove...@gmail.com)
>> Sattam Alsubaiee (salsuba...@gmail.com)
>> Steven Jacobs (sjaco...@apache.org)
>> Taewoo Kim (wangs...@gmail.com)
>> Till Westmann (ti...@apache.org)
>> Vinayak Borkar (vinay...@apache.org)
>> Yingyi Bu (buyin...@gmail.com)
>> Young-Seok Kim (kiss...@gmail.com)
>> Zach Heilbron (zheilb...@gmail.com)
>>
>>
>> Affiliations
>>
>> UC Irvine
>> - Mike Carey
>> - Chen Li
>> - Ian Maxon
>> - Yingyi Bu
>> - Raman Grover
>> - Pouria Pirzadeh
>> - Young-Seok Kim
>> - Cameron Samak
>> - Taewoo Kim
>> - Jianfeng Jia
>> - Murtadha Hubail
>> - Markus Dreseler
>>
>> UC Riverside
>> - Ildar Absalyamov
>> - Preston Carman
>> - Steven Jacobs
>>
>> Hebrew University
>> - Keren Ouaknine
>>
>> Oracle
>> - Till Westmann
>>
>> X15 Software
>> - Vinayak Borkar
>> - Zach Heilbron
>>
>> KACST Saudi Arabia
>> - Sattam Alsubaiee
>>
>> Saudi Aramco
>> - Abdullah Alamoudi
>>
>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>> non-UC committers are a mix of alumni who continue to contribute to
>> the effort and individuals working with permission part-time (or in
>> spare time) on this project.
>>
>>
>> Sponsors
>>
>> Champion
>>
>> Chris Mattmann (NASA/JPL)
>>
>> Nominated Mentors
>>
>> TBD
>>
>> Sponsoring Entity
>>
>> The Apache Incubator
>>
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++