Chris just asked me under separate cover. I am happy to help out as mentor.
On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> Thanks Till,
>
> Will try to solicit more mentors to help.
> Especially since the initial committers have mostly not been exposed to
> contributing the Apache way.
>
> - Henry
>
> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <t...@westmann.org> wrote:
> > Hi Henry,
> >
> > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
> >
> > Even if your time is very limited we would be very happy to have you on
> > board as a mentor.
> > I’ll add you to the proposal.
> >
> > Cheers,
> > Till
> >
> >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> >>
> >> +1 This is GREAT News!
> >>
> >> Was watching and trying AsterixDB last year and it looked in awesome shape.
> >>
> >> I have my plate full but would love to help mentor this project to get
> >> it going to ASF if needed!
> >>
> >> - Henry
> >>
> >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
> >> <chris.a.mattm...@jpl.nasa.gov> wrote:
> >>> Hi Folks,
> >>>
> >>> I am pleased to bring forth the Apache AsterixDB proposal to the
> >>> Apache Incubator as Champion, working in collaboration with the
> >>> team. Please find the wiki proposal here:
> >>>
> >>> https://wiki.apache.org/incubator/AsterixDBProposal
> >>>
> >>>
> >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
> >>> leave the discussion open for a week, and then look to call a VOTE
> >>> hopefully end of next week if all is well.
> >>>
> >>> Cheers!
> >>> Chris Mattmann
> >>>
> >>> =============================================================
> >>> Apache AsterixDB Proposal
> >>>
> >>> Abstract
> >>>
> >>> Apache AsterixDB is a scalable big data management system (BDMS) that
> >>> provides storage, management, and query capabilities for large
> >>> collections of semi-structured data.
> >>>
> >>> Proposal
> >>>
> >>> AsterixDB is a big data management system (BDMS) whose feature set
> >>> makes it well-suited to needs such as web data warehousing and social
> >>> data storage and analysis. Feature-wise, AsterixDB has:
> >>>
> >>> * A NoSQL style data model (ADM) based on extending JSON with object
> >>>   database concepts.
> >>> * An expressive and declarative query language (AQL) for querying
> >>>   semi-structured data.
> >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
> >>>   execution of query plans.
> >>> * Partitioned LSM-based data storage and indexing for efficient
> >>>   ingestion of newly arriving data.
> >>> * Support for querying and indexing external data (e.g., in HDFS) as
> >>>   well as data stored within AsterixDB.
> >>> * A rich set of primitive data types, including support for spatial,
> >>>   temporal, and textual data.
> >>> * Indexing options that include B+ trees, R trees, and inverted
> >>>   keyword index support.
> >>> * Basic transactional (concurrency and recovery) capabilities akin to
> >>>   those of a NoSQL store.
> >>>
> >>>
> >>> Background and Rationale
> >>>
> >>> In the world of relational databases, the need to tackle data volumes
> >>> that exceed the capabilities of a single server led to the
> >>> development of “shared-nothing” parallel database systems several
> >>> decades ago. These systems spread data over a cluster based on a
> >>> partitioning strategy, such as hash partitioning, and queries are
> >>> processed by employing partitioned-parallel divide-and-conquer
> >>> techniques. Since these systems are fronted by a high-level,
> >>> declarative language (SQL), their users are shielded from the
> >>> complexities of parallel programming. Parallel database systems have
> >>> been an extremely successful application of parallel computing, and
> >>> quite a number of commercial products exist today.
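
For readers less familiar with the terms in the paragraph above, here is a minimal Python sketch of hash partitioning followed by a partitioned-parallel aggregation. It is only an illustration under stated assumptions: the function names, the use of crc32 as the hash, and the sample records are all hypothetical and are not taken from AsterixDB, Hyracks, or any particular parallel database.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (crc32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    """Spread records over num_partitions lists by hashing the key field."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition_of(record[key_field], num_partitions)].append(record)
    return partitions


def parallel_count_by_key(partitions, key_field):
    """Count records per key: each partition is counted on its own, then merged.

    In a real cluster the per-partition step would run on separate nodes;
    here it is a plain loop so the sketch stays self-contained.
    """
    local_counts = [Counter(r[key_field] for r in part) for part in partitions]
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return total


# Hypothetical sample data.
records = [
    {"user": "alice", "msg": "hi"},
    {"user": "bob", "msg": "hello"},
    {"user": "alice", "msg": "bye"},
    {"user": "carol", "msg": "hey"},
]
partitions = hash_partition(records, "user", 3)
counts = parallel_count_by_key(partitions, "user")
print(sorted(counts.items()))  # [('alice', 2), ('bob', 1), ('carol', 1)]
```

Because every record with the same key hashes to the same partition, the per-partition counts never overlap on a key, so the merge step is a simple union. That property is what makes the divide-and-conquer step cheap.
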
> >>>
> >>> In the distributed systems world, the Web brought a need to index and
> >>> query its huge content. SQL and relational databases were not the
> >>> answer, though shared-nothing clusters again emerged as the hardware
> >>> platform of choice. Google developed the Google File System (GFS) and
> >>> MapReduce programming model to allow programmers to store and process
> >>> Big Data by writing a few user-defined functions. The MapReduce
> >>> framework applies these functions in parallel to data instances in
> >>> distributed files (map) and to sorted groups of instances sharing a
> >>> common key (reduce) -- not unlike the partitioned parallelism in
> >>> parallel database systems. Apache's Hadoop MapReduce platform is the
> >>> most prominent implementation of this paradigm for the rest of the
> >>> Big Data community. On top of Hadoop and HDFS sit declarative
> >>> languages like Pig and Hive that each compile down to Hadoop
> >>> MapReduce jobs.
> >>>
> >>> The big Web companies were also challenged by extreme user bases
> >>> (100s of millions of users) and needed fast simple lookups and
> >>> updates to very large keyed data sets like user profiles. SQL
> >>> databases were deemed either too expensive or not scalable, so the
> >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> >>> popular key-value stores, in this space. MongoDB and Couchbase are
> >>> other open source alternatives (document stores).
> >>>
> >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
> >>> as well as the strong demand for Big Data analytics engines today,
> >>> that there is a strong (and growing!) need to store, process, *and*
> >>> query large volumes of semi-structured data in many application
> >>> areas.
> >>> Until very recently, developers have had to "choose" between
> >>> using big data analytics engines like Apache Hive or Apache Spark,
> >>> which can do complex query processing and analysis over HDFS-resident
> >>> files, and flexible but low-function data stores like MongoDB or
> >>> Apache HBase. (The Apache Phoenix project,
> >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> >>> aims to bridge between these choices.)
> >>>
> >>> AsterixDB is a highly scalable data management system that can store,
> >>> index, and manage semi-structured data, e.g., much like MongoDB, but
> >>> it also supports a full-power query language with the expressiveness
> >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> >>> stores and manages data, so AsterixDB can exploit its knowledge of
> >>> data partitioning and the availability of indexes to avoid always
> >>> scanning data set(s) to process queries. Somewhat surprisingly, there
> >>> is no open source parallel database system (relational or otherwise)
> >>> available to developers today -- AsterixDB aims to fill this need.
> >>> Since Apache is where the majority of today's most important Big
> >>> Data technologies live, the ASF seems like the obvious home for a
> >>> system like AsterixDB.
> >>>
> >>> Current Status
> >>>
> >>> The current version of AsterixDB was co-developed by a team of
> >>> faculty, staff, and students at UC Irvine and UC Riverside. The
> >>> project was initiated as a large NSF-sponsored project in 2009, the
> >>> goal of which was to combine the best ideas from the parallel
> >>> database world, the then new Hadoop world, and the semi-structured
> >>> (e.g., XML/JSON) data world in order to create a next-generation
> >>> BDMS. A first informal open source release was made four years later,
> >>> in June of 2013, under the Apache Software License 2.0.
> >>>
> >>>
> >>> Meritocracy
> >>>
> >>> The current developers are familiar with meritocratic open source
> >>> development at Apache. Apache was chosen specifically because we want
> >>> to encourage this style of development for the project.
> >>>
> >>>
> >>> Community
> >>>
> >>> While AsterixDB started as a university project it has developed into
> >>> a community. A number of the initial committers started contributing
> >>> in academia and continue to actively participate and contribute after
> >>> graduation. We also seek to further develop the developer and user
> >>> communities. One ongoing way to broaden the community is
> >>> through academic collaborations (currently with IIT Mumbai in India
> >>> and TU Berlin in Germany). During incubation we will also explicitly
> >>> seek increased industrial participation.
> >>>
> >>> Some indicators of the effort's development community and history can
> >>> be found at:
> >>>
> >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
> >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
> >>>
> >>>
> >>> Core Developers
> >>>
> >>> The core developers of the project are diverse, although initially UC
> >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> >>> other 50% are from other academic institutions (UC Riverside and the
> >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
> >>>
> >>>
> >>> Alignment
> >>>
> >>> Apache is, by far, the most natural home for taking the AsterixDB
> >>> project forward. A large fraction of today's top Big Data
> >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> >>> significant gap -- the parallel data management system gap -- that
> >>> exists in the Big Data open source world.
> >>> It is well-aligned with a
> >>> number of the Apache projects, e.g., it has strong support for
> >>> accessing and indexing external data in HDFS, and it uses YARN as an
> >>> answer to basic cluster resource management. AsterixDB also seeks to
> >>> achieve an Apache-style development model; it is seeking a broader
> >>> community of contributors and users in order to achieve its full
> >>> potential and value to the Big Data community.
> >>>
> >>> There are also a number of related Apache projects and dependencies
> >>> that will be mentioned below in the Relationships with Other Apache
> >>> Products section.
> >>>
> >>>
> >>> Known Risks
> >>>
> >>> Orphaned products
> >>>
> >>> Given the current level of intellectual investment in AsterixDB, the
> >>> risk of the project being abandoned is very small. The UCI/UCR
> >>> faculty team leads are highly incentivized to continue development
> >>> since the database groups at UC Irvine and UC Riverside are both
> >>> reliant on AsterixDB as a platform for long-term graduate research
> >>> projects. UC San Diego is also beginning to contribute to the code
> >>> base, and a collaboration involving public health applications is
> >>> forming with UCLA. The work on AsterixDB is managed via
> >>> mailing list discussions supplemented by weekly project status
> >>> meetings which are summarized on the mailing list. Typical (local
> >>> plus Skype-in) attendance at the weekly status meetings runs at about
> >>> 20 active contributors.
> >>>
> >>>
> >>> Inexperience with Open Source
> >>>
> >>> AsterixDB and Hyracks were completely developed in Open Source under
> >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
> >>> lists are available on Google Code and discussions and decisions
> >>> happen on the mailing lists (which is necessary due to the geographic
> >>> distribution of the current developers).
> >>>
> >>> Also, a few of the initial committers have contributed to Apache
> >>> projects. Vinayak Borkar is a committer on the Apache Helix and
> >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
> >>> on the Apache VXQuery project.
> >>>
> >>>
> >>> Relationships with Other Apache Products
> >>>
> >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> >>> is also included in the AsterixDB code base.
> >>>
> >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> >>> is support for accessing external data in HDFS (and Hive formats),
> >>> and resource management and system administration features are in the
> >>> process of being migrated to YARN.
> >>>
> >>> AsterixDB's AQL query facilities offer comparable query power to
> >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
> >>> differs in storing and indexing data and thus being able to quickly
> >>> answer small and medium queries without large HDFS data scans -
> >>> thereby targeting a different class of use cases.
> >>>
> >>> AsterixDB's data storage and indexing facilities are similar to those
> >>> of HBase, but AsterixDB differs in being a much more complete and
> >>> queryable BDMS (not just a key-value style store).
> >>>
> >>> AsterixDB's target use cases are not in-memory processing or
> >>> iterative algorithm support, making AsterixDB complementary to the
> >>> Apache Spark platform. (Spark interoperability is on our longer-term
> >>> to-do wishlist.)
> >>>
> >>>
> >>> Homogeneous Developers
> >>>
> >>> As mentioned before, the current community is already organizationally
> >>> and geographically distributed - and we would like to increase the
> >>> heterogeneity.
> >>>
> >>>
> >>> Reliance on Salaried Developers
> >>>
> >>> Of the initial committers only 3 are full-time UCI staff.
> >>> The other
> >>> committers are a mix of students, alumni who continue to contribute
> >>> to the effort, and individuals working with permission part-time (or
> >>> in spare time) on this project.
> >>>
> >>>
> >>> An Excessive Fascination with the Apache Brand
> >>>
> >>> We believe in the processes, systems, and framework Apache has put in
> >>> place. Apache is also known to foster a great community around their
> >>> projects and provide exposure. While brand is important, our
> >>> fascination with it is not excessive. We believe that the ASF is the
> >>> right home for AsterixDB and that having AsterixDB inside of the ASF
> >>> will lead to a better long-term outcome for the Big Data community.
> >>>
> >>>
> >>> Documentation
> >>>
> >>> Documentation and publications related to AsterixDB can be found at
> >>> http://asterixdb.ics.uci.edu/.
> >>>
> >>>
> >>> Initial Source
> >>>
> >>> Current source resides in Google Code:
> >>> https://code.google.com/p/asterixdb/ (query language and upper system
> >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> >>> system and storage management libraries).
> >>>
> >>>
> >>> External Dependencies
> >>>
> >>> AsterixDB depends on a number of Apache projects:
> >>>
> >>> - Ant
> >>> - Avro
> >>> - ApacheDB JDO
> >>> - Commons
> >>> - Derby
> >>> - Hadoop
> >>> - Hive
> >>> - HTTPComponents
> >>> - Jakarta ORO
> >>> - Maven
> >>> - Tomcat
> >>> - Thrift
> >>> - Velocity
> >>> - Wicket
> >>> - Xerces
> >>>
> >>> and other open source projects (organized by license):
> >>>
> >>> -- ASL 2.0:
> >>> - Jackson
> >>> - Google Guava
> >>> - Google Guice
> >>> - JSON-simple
> >>> - BoneCP
> >>> - Microsoft Azure SDK
> >>> - Netty
> >>> - Rome
> >>> - JetS3t
> >>> - Groovy
> >>> - Jettison
> >>> - Plexus
> >>> - Datanucleus (JDO)
> >>> - Jetty
> >>> - Twitter4J
> >>> - Snappy-java
> >>>
> >>> -- BSD:
> >>> - Antlr
> >>> - ObjectWeb ASM
> >>> - Protobuf
> >>> - JSCH
> >>> - JavaCC
> >>> - Paranamer
> >>> - JLine
> >>> - Stax
> >>> - StringTemplate
> >>> - xmlEnc
> >>>
> >>> -- MIT
> >>> - AppAssembler
> >>> - SimpleLog4J
> >>>
> >>> -- CDDL 1.0
> >>> - Java Activation Framework
> >>> - Java Transactions
> >>> - Java Servlet API
> >>> - Grizzly
> >>> - gmbal
> >>> - Glassfish
> >>>
> >>> -- CDDL 1.1
> >>> - Jersey
> >>> - JAXB Reference Implementation
> >>>
> >>> -- JSON License
> >>> - JSON
> >>>
> >>> -- EPL 1.0
> >>> - JUnit
> >>>
> >>> -- JDOM License
> >>> - JDOM
> >>>
> >>> -- Public Domain
> >>> - xz
> >>> - AOPAlliance
> >>>
> >>> As all dependencies are managed using Apache Maven, none of the
> >>> external libraries need to be packaged in a source distribution.
> >>>
> >>>
> >>> Required Resources
> >>>
> >>> Developer and user mailing lists
> >>>
> >>> priv...@asterixdb.incubator.apache.org (with moderated subscriptions)
> >>> comm...@asterixdb.incubator.apache.org
> >>> d...@asterixdb.incubator.apache.org
> >>> us...@asterixdb.incubator.apache.org
> >>>
> >>>
> >>> A git repository
> >>>
> >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
> >>>
> >>>
> >>> A JIRA issue tracker
> >>>
> >>> https://issues.apache.org/jira/browse/ASTERIXDB
> >>>
> >>>
> >>> Initial Committers
> >>>
> >>> The following is a list of the planned initial Apache committers (the
> >>> active subset of the committers for the current repository at Google
> >>> Code).
> >>>
> >>> Abdullah Alamoudi (bamou...@gmail.com)
> >>> Cameron Samak (euf...@gmail.com)
> >>> Chen Li (che...@gmail.com)
> >>> Ian Maxon (ima...@uci.edu)
> >>> Ildar Absalyamov (ildar.absalya...@gmail.com)
> >>> Jianfeng Jia (jianfeng....@gmail.com)
> >>> Keren Ouaknine (ker...@gmail.com)
> >>> Markus Dreseler (apa...@dreseler.de)
> >>> Mike Carey (dtab...@apache.org)
> >>> Murtadha Hubail (hubail...@gmail.com)
> >>> Pouria Pirzadeh (pouria.pirza...@gmail.com)
> >>> Preston Carman (prest...@apache.org)
> >>> Raman Grover (ramangrove...@gmail.com)
> >>> Sattam Alsubaiee (salsuba...@gmail.com)
> >>> Steven Jacobs (sjaco...@apache.org)
> >>> Taewoo Kim (wangs...@gmail.com)
> >>> Till Westmann (ti...@apache.org)
> >>> Vinayak Borkar (vinay...@apache.org)
> >>> Yingyi Bu (buyin...@gmail.com)
> >>> Young-Seok Kim (kiss...@gmail.com)
> >>> Zach Heilbron (zheilb...@gmail.com)
> >>>
> >>>
> >>> Affiliations
> >>>
> >>> UC Irvine
> >>> - Mike Carey
> >>> - Chen Li
> >>> - Ian Maxon
> >>> - Yingyi Bu
> >>> - Raman Grover
> >>> - Pouria Pirzadeh
> >>> - Young-Seok Kim
> >>> - Cameron Samak
> >>> - Taewoo Kim
> >>> - Jianfeng Jia
> >>> - Murtadha Hubail
> >>> - Markus Dreseler
> >>>
> >>> UC Riverside
> >>> - Ildar Absalyamov
> >>> - Preston Carman
> >>> - Steven Jacobs
> >>>
> >>> Hebrew University
> >>> - Keren Ouaknine
> >>>
> >>> Oracle
> >>> - Till Westmann
> >>>
> >>> X15 Software
> >>> - Vinayak Borkar
> >>> - Zach Heilbron
> >>>
> >>> KACST Saudi Arabia
> >>> - Sattam Alsubaiee
> >>>
> >>> Saudi Aramco
> >>> - Abdullah Alamoudi
> >>>
> >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> >>> non-UC committers are a mix of alumni who continue to contribute to
> >>> the effort and individuals working with permission part-time (or in
> >>> spare time) on this project.
> >>>
> >>>
> >>> Sponsors
> >>>
> >>> Champion
> >>>
> >>> Chris Mattmann (NASA/JPL)
> >>>
> >>> Nominated Mentors
> >>>
> >>> TBD
> >>>
> >>> Sponsoring Entity
> >>>
> >>> The Apache Incubator
> >>>
> >>>
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Chris Mattmann, Ph.D.
> >>> Chief Architect
> >>> Instrument Software and Science Data Systems Section (398)
> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> Office: 168-519, Mailstop: 168-527
> >>> Email: chris.a.mattm...@nasa.gov
> >>> WWW: http://sunset.usc.edu/~mattmann/
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Adjunct Associate Professor, Computer Science Department
> >>> University of Southern California, Los Angeles, CA 90089 USA
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org