> On Jan 19, 2015, at 11:34 AM, jan i <j...@apache.org> wrote:
>
> Looks like a real challenging project, and the proposal looks as if it has
> already been through a couple of refinement rounds.
>
> Count on my +1, when it comes to voting.
Will do!

Thanks,
Till

>
> rgds
> jan i
>
> On 19 January 2015 at 19:26, Henry Saputra <henry.sapu...@gmail.com> wrote:
> +1 This is GREAT News!
>
> Was watching and trying AsterixDB last year and looked in awesome shape.
>
> I have my plate full but would love to help mentor this project to get
> it going to ASF if needed!
>
> - Henry
>
> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> > Hi Folks,
> >
> > I am pleased to bring forth the Apache AsterixDB proposal to the
> > Apache Incubator as Champion, working in collaboration with the
> > team. Please find the wiki proposal here:
> >
> > https://wiki.apache.org/incubator/AsterixDBProposal
> >
> > Full text of the proposal is below. Please discuss and enjoy. I’ll
> > leave the discussion open for a week, and then look to call a VOTE
> > hopefully end of next week if all is well.
> >
> > Cheers!
> > Chris Mattmann
> >
> > =============================================================
> > Apache AsterixDB Proposal
> >
> > Abstract
> >
> > Apache AsterixDB is a scalable big data management system (BDMS) that
> > provides storage, management, and query capabilities for large
> > collections of semi-structured data.
> >
> > Proposal
> >
> > AsterixDB is a big data management system (BDMS) with a feature set
> > that makes it well-suited to needs such as web data warehousing and
> > social data storage and analysis. Feature-wise, AsterixDB has:
> >
> > * A NoSQL style data model (ADM) based on extending JSON with object
> >   database concepts.
> > * An expressive and declarative query language (AQL) for querying
> >   semi-structured data.
> > * A runtime query execution engine, Hyracks, for partitioned-parallel
> >   execution of query plans.
> > * Partitioned LSM-based data storage and indexing for efficient
> >   ingestion of newly arriving data.
> > * Support for querying and indexing external data (e.g., in HDFS) as
> >   well as data stored within AsterixDB.
> > * A rich set of primitive data types, including support for spatial,
> >   temporal, and textual data.
> > * Indexing options that include B+ trees, R trees, and inverted
> >   keyword index support.
> > * Basic transactional (concurrency and recovery) capabilities akin to
> >   those of a NoSQL store.
> >
> > Background and Rationale
> >
> > In the world of relational databases, the need to tackle data volumes
> > that exceed the capabilities of a single server led to the
> > development of “shared-nothing” parallel database systems several
> > decades ago. These systems spread data over a cluster based on a
> > partitioning strategy, such as hash partitioning, and queries are
> > processed by employing partitioned-parallel divide-and-conquer
> > techniques. Since these systems are fronted by a high-level,
> > declarative language (SQL), their users are shielded from the
> > complexities of parallel programming. Parallel database systems have
> > been an extremely successful application of parallel computing, and
> > quite a number of commercial products exist today.
> >
> > In the distributed systems world, the Web brought a need to index and
> > query its huge content. SQL and relational databases were not the
> > answer, though shared-nothing clusters again emerged as the hardware
> > platform of choice. Google developed the Google File System (GFS) and
> > MapReduce programming model to allow programmers to store and process
> > Big Data by writing a few user-defined functions. The MapReduce
> > framework applies these functions in parallel to data instances in
> > distributed files (map) and to sorted groups of instances sharing a
> > common key (reduce) -- not unlike the partitioned parallelism in
> > parallel database systems.
> > Apache's Hadoop MapReduce platform is the most prominent
> > implementation of this paradigm for the rest of the Big Data
> > community. On top of Hadoop and HDFS sit declarative languages like
> > Pig and Hive that each compile down to Hadoop MapReduce jobs.
> >
> > The big Web companies were also challenged by extreme user bases
> > (100s of millions of users) and needed fast simple lookups and
> > updates to very large keyed data sets like user profiles. SQL
> > databases were deemed either too expensive or not scalable, so the
> > “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> > popular key-value stores, in this space. MongoDB and Couchbase are
> > other open source alternatives (document stores).
> >
> > It is evident from the rapidly growing popularity of "NoSQL" stores,
> > as well as the strong demand for Big Data analytics engines today,
> > that there is a strong (and growing!) need to store, process, *and*
> > query large volumes of semi-structured data in many application
> > areas. Until very recently, developers have had to "choose" between
> > using big data analytics engines like Apache Hive or Apache Spark,
> > which can do complex query processing and analysis over HDFS-resident
> > files, and flexible but low-function data stores like MongoDB or
> > Apache HBase. (The Apache Phoenix project,
> > http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> > aims to bridge between these choices.)
> >
> > AsterixDB is a highly scalable data management system that can store,
> > index, and manage semi-structured data, e.g., much like MongoDB, but
> > it also supports a full-power query language with the expressiveness
> > of SQL (and more). Unlike analytics engines like Hive or Spark, it
> > stores and manages data, so AsterixDB can exploit its knowledge of
> > data partitioning and the availability of indexes to avoid always
> > scanning data set(s) to process queries.
> > Somewhat surprisingly, there is no open source parallel database
> > system (relational or otherwise) available to developers today --
> > AsterixDB aims to fill this need. Since Apache is where the majority
> > of today's most important Big Data technologies live, the ASF seems
> > like the obvious home for a system like AsterixDB.
> >
> > Current Status
> >
> > The current version of AsterixDB was co-developed by a team of
> > faculty, staff, and students at UC Irvine and UC Riverside. The
> > project was initiated as a large NSF-sponsored project in 2009, the
> > goal of which was to combine the best ideas from the parallel
> > database world, the then new Hadoop world, and the semi-structured
> > (e.g., XML/JSON) data world in order to create a next-generation
> > BDMS. A first informal open source release was made four years later,
> > in June of 2013, under the Apache Software License 2.0.
> >
> > Meritocracy
> >
> > The current developers are familiar with meritocratic open source
> > development at Apache. Apache was chosen specifically because we want
> > to encourage this style of development for the project.
> >
> > Community
> >
> > While AsterixDB started as a university project it has developed into
> > a community. A number of the initial committers started contributing
> > in academia and continue to actively participate and contribute after
> > graduation. And we seek to further develop developer and user
> > communities. One way to broaden the community that is ongoing is
> > through academic collaborations (currently with IIT Mumbai in India
> > and TU Berlin in Germany). During incubation we will also explicitly
> > seek increased industrial participation.
> >
> > Some indicators of the effort's development community and history can
> > be found at:
> > https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
> > https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
> >
> > Core Developers
> >
> > The core developers of the project are diverse, although initially UC
> > Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> > other 50% are from other academic institutions (UC Riverside and the
> > Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> > IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
> >
> > Alignment
> >
> > Apache is, by far, the most natural home for taking the AsterixDB
> > project forward. A large fraction of today's top Big Data
> > technologies have their homes in Apache, including Hadoop, YARN, Pig,
> > Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> > significant gap -- the parallel data management system gap -- that
> > exists in the Big Data open source world. It is well-aligned with a
> > number of the Apache projects, e.g., it has strong support for
> > accessing and indexing external data in HDFS, and it uses YARN as an
> > answer to basic cluster resource management. AsterixDB also seeks to
> > achieve an Apache-style development model; it is seeking a broader
> > community of contributors and users in order to achieve its full
> > potential and value to the Big Data community.
> >
> > There are also a number of related Apache projects and dependencies
> > that will be mentioned below in the Relationships with Other Apache
> > products section.
> >
> > Known Risks
> >
> > Orphaned products
> >
> > Given the current level of intellectual investment in AsterixDB, the
> > risk of the project being abandoned is very small. The UCI/UCR
> > faculty team leads are highly incentivized to continue development
> > since the database groups at UC Irvine and UC Riverside are both
> > reliant on AsterixDB as a platform for long-term graduate research
> > projects. UC San Diego is also beginning to contribute to the code
> > base, and a collaboration involving public health applications is
> > forming with UCLA. The work on AsterixDB is managed via a mix of
> > mailing list discussions supplemented by weekly project status
> > meetings which are summarized on the mailing list. Typical (local
> > plus Skype-in) attendance at the weekly status meetings runs at about
> > 20 active contributors.
> >
> > Inexperience with Open Source
> >
> > AsterixDB and Hyracks were completely developed in Open Source under
> > the ASL 2.0. The source code repositories, issue tracker, and mailing
> > lists are available on Google Code and discussions and decisions
> > happen on the mailing lists (which is necessary due to the geographic
> > distribution of the current developers).
> >
> > Also a few of the initial committers have contributed to Apache
> > projects. Vinayak Borkar is a committer on the Apache Helix and
> > Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> > and an IPMC member. Preston Carman and Steven Jacobs are committers
> > on the Apache VXQuery project.
> >
> > Relationships with Other Apache Products
> >
> > Apache VXQuery is based on the Hyracks data-parallel runtime, which
> > is also included in the AsterixDB code base.
> >
> > AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> > is support for accessing external data in HDFS (and Hive formats),
> > and resource management and system administration features are in the
> > process of being migrated to YARN.
> >
> > AsterixDB's AQL query facilities offer comparable query power to
> > Apache's Pig and Hive systems for big data analytics. AsterixDB
> > differs in storing and indexing data and thus being able to quickly
> > answer small and medium queries without large HDFS data scans -
> > thereby targeting a different class of use cases.
> >
> > AsterixDB's data storage and indexing facilities are similar to those
> > of HBase, but AsterixDB differs in being a much more complete and
> > queryable BDMS (not just a key-value style store).
> >
> > AsterixDB's target use cases are not in-memory processing or
> > iterative algorithm support, making AsterixDB complementary to the
> > Apache Spark platform. (Spark interoperability is on our longer-term
> > to-do wishlist.)
> >
> > Homogeneous Developers
> >
> > As mentioned before the current community is already organizationally
> > and geographically distributed - and we would like to increase the
> > heterogeneity.
> >
> > Reliance on Salaried Developers
> >
> > Of the initial committers only 3 are full-time UCI staff. The other
> > committers are a mix of students, alumni who continue to contribute
> > to the effort, and individuals working with permission part-time (or
> > in spare time) on this project.
> >
> > An Excessive Fascination with the Apache Brand
> >
> > We believe in the processes, systems, and framework Apache has put in
> > place. Apache is also known to foster a great community around their
> > projects and provide exposure. While brand is important, our
> > fascination with it is not excessive. We believe that the ASF is the
> > right home for AsterixDB and that having AsterixDB inside of the ASF
> > will lead to a better long-term outcome for the Big Data community.
> >
> > Documentation
> >
> > Documentation and publications related to AsterixDB can be found at
> > http://asterixdb.ics.uci.edu/.
> >
> > Initial Source
> >
> > Current source resides in Google code:
> > https://code.google.com/p/asterixdb/ (query language and upper system
> > layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> > system and storage management libraries).
> >
> > External Dependencies
> >
> > AsterixDB depends on a number of Apache projects:
> >
> > - Ant
> > - Avro
> > - ApacheDB JDO
> > - Commons
> > - Derby
> > - Hadoop
> > - Hive
> > - HTTPComponents
> > - Jakarta ORO
> > - Maven
> > - Tomcat
> > - Thrift
> > - Velocity
> > - Wicket
> > - Xerces
> >
> > and other open source projects (organized by license):
> >
> > -- ASL 2.0:
> > - Jackson
> > - Google Guava
> > - Google Guice
> > - JSON-simple
> > - BoneCP
> > - Microsoft Azure SDK
> > - Netty
> > - Rome
> > - JetS3t
> > - Groovy
> > - Jettison
> > - Plexus
> > - Datanucleus (JDO)
> > - Jetty
> > - Twitter4J
> > - Snappy-java
> >
> > -- BSD:
> > - Antlr
> > - ObjectWeb ASM
> > - Protobuf
> > - JSCH
> > - JavaCC
> > - Paranamer
> > - JLine
> > - Stax
> > - StringTemplate
> > - xmlEnc
> >
> > -- MIT:
> > - AppAssembler
> > - SimpleLog4J
> >
> > -- CDDL 1.0:
> > - Java Activation Framework
> > - Java Transactions
> > - Java Servlet API
> > - Grizzly
> > - gmbal
> > - Glassfish
> >
> > -- CDDL 1.1:
> > - Jersey
> > - JAXB Reference Implementation
> >
> > -- JSON License:
> > - JSON
> >
> > -- EPL 1.0:
> > - JUnit
> >
> > -- JDOM License:
> > - JDOM
> >
> > -- Public Domain:
> > - xz
> > - AOPAlliance
> >
> > As all dependencies are managed using Apache Maven, none of the
> > external libraries need to be packaged in a source distribution.
> >
> > Required Resources
> >
> > Developer and user mailing lists
> >
> > priv...@asterixdb.incubator.apache.org (with moderated subscriptions)
> > comm...@asterixdb.incubator.apache.org
> > d...@asterixdb.incubator.apache.org
> > us...@asterixdb.incubator.apache.org
> >
> > A git repository
> >
> > https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
> >
> > A JIRA issue tracker
> >
> > https://issues.apache.org/jira/browse/ASTERIXDB
> >
> > Initial Committers
> >
> > The following is a list of the planned initial Apache committers (the
> > active subset of the committers for the current repository at Google
> > code).
> >
> > Abdullah Alamoudi (bamou...@gmail.com)
> > Cameron Samak (euf...@gmail.com)
> > Chen Li (che...@gmail.com)
> > Ian Maxon (ima...@uci.edu)
> > Ildar Absalyamov (ildar.absalya...@gmail.com)
> > Jianfeng Jia (jianfeng....@gmail.com)
> > Keren Ouaknine (ker...@gmail.com)
> > Markus Dreseler (apa...@dreseler.de)
> > Mike Carey (dtab...@apache.org)
> > Murtadha Hubail (hubail...@gmail.com)
> > Pouria Pirzadeh (pouria.pirza...@gmail.com)
> > Preston Carman (prest...@apache.org)
> > Raman Grover (ramangrove...@gmail.com)
> > Sattam Alsubaiee (salsuba...@gmail.com)
> > Steven Jacobs
> >   (sjaco...@apache.org)
> > Taewoo Kim (wangs...@gmail.com)
> > Till Westmann (ti...@apache.org)
> > Vinayak Borkar (vinay...@apache.org)
> > Yingyi Bu (buyin...@gmail.com)
> > Young-Seok Kim (kiss...@gmail.com)
> > Zach Heilbron (zheilb...@gmail.com)
> >
> > Affiliations
> >
> > UC Irvine
> > - Mike Carey
> > - Chen Li
> > - Ian Maxon
> > - Yingyi Bu
> > - Raman Grover
> > - Pouria Pirzadeh
> > - Young-Seok Kim
> > - Cameron Samak
> > - Taewoo Kim
> > - Jianfeng Jia
> > - Murtadha Hubail
> > - Markus Dreseler
> >
> > UC Riverside
> > - Ildar Absalyamov
> > - Preston Carman
> > - Steven Jacobs
> >
> > Hebrew University
> > - Keren Ouaknine
> >
> > Oracle
> > - Till Westmann
> >
> > X15 Software
> > - Vinayak Borkar
> > - Zach Heilbron
> >
> > KACST Saudi Arabia
> > - Sattam Alsubaiee
> >
> > Saudi Aramco
> > - Abdullah Alamoudi
> >
> > Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> > (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> > non-UC committers are a mix of alumni who continue to contribute to
> > the effort and individuals working with permission part-time (or in
> > spare time) on this project.
> >
> > Sponsors
> >
> > Champion
> >
> > Chris Mattmann (NASA/JPL)
> >
> > Nominated Mentors
> >
> > TBD
> >
> > Sponsoring Entity
> >
> > The Apache Incubator
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++