Re: [PROPOSAL] Apache AsterixDB Incubator

Ted Dunning Tue, 20 Jan 2015 11:08:26 -0800

Added my name to the mentor list.



On Tue, Jan 20, 2015 at 8:37 AM, Mike Carey <dtab...@gmail.com> wrote:

>  Wonderful; thanks, Ted!!
> Cheers,
> Mike
>
>  On 1/19/15 11:29 PM, Ted Dunning wrote:
>
>
> Chris just asked me under separate cover.
>
>  I am happy to help out as mentor.
>
>
>
> On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <henry.sapu...@gmail.com>
> wrote:
>
>> Thanks Till,
>>
>> Will try to solicit more mentors to help.
>> Especially since most of the initial committers have not been exposed to
>> contributing the Apache way.
>>
>> - Henry
>>
>> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <t...@westmann.org> wrote:
>> > Hi Henry,
>> >
>> > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>> >
>> > Even if your time is very limited we would be very happy to have you on
>> > board as a mentor.
>> > I’ll add you to the proposal.
>> >
>> > Cheers,
>> > Till
>> >
>> >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <henry.sapu...@gmail.com>
>> >> wrote:
>> >>
>> >> +1 This is GREAT News!
>> >>
>> >> Was watching and trying AsterixDB last year and it looked in awesome
>> >> shape.
>> >>
>> >> I have my plate full but would love to help mentor this project to get
>> >> it going to ASF if needed!
>> >>
>> >> - Henry
>> >>
>> >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>> >> <chris.a.mattm...@jpl.nasa.gov> wrote:
>> >>> Hi Folks,
>> >>>
>> >>> I am pleased to bring forth the Apache AsterixDB proposal to the
>> >>> Apache Incubator as Champion, working in collaboration with the
>> >>> team. Please find the wiki proposal here:
>> >>>
>> >>> https://wiki.apache.org/incubator/AsterixDBProposal
>> >>>
>> >>>
>> >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>> >>> leave the discussion open for a week, and then look to call a VOTE
>> >>> hopefully end of next week if all is well.
>> >>>
>> >>> Cheers!
>> >>> Chris Mattmann
>> >>>
>> >>> =============================================================
>> >>> Apache AsterixDB Proposal
>> >>>
>> >>> Abstract
>> >>>
>> >>> Apache AsterixDB is a scalable big data management system (BDMS) that
>> >>> provides storage, management, and query capabilities for large
>> >>> collections of semi-structured data.
>> >>>
>> >>> Proposal
>> >>>
>> >>> AsterixDB is a big data management system (BDMS) whose feature set makes
>> >>> it well-suited to needs such as web data warehousing and social data
>> >>> storage and analysis. Feature-wise, AsterixDB has:
>> >>>
>> >>> * A NoSQL style data model (ADM) based on extending JSON with object
>> >>>  database concepts.
>> >>> * An expressive and declarative query language (AQL) for querying
>> >>>  semi-structured data.
>> >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>> >>>  execution of query plans.
>> >>> * Partitioned LSM-based data storage and indexing for efficient
>> >>>  ingestion of newly arriving data.
>> >>> * Support for querying and indexing external data (e.g., in HDFS) as
>> >>>  well as data stored within AsterixDB.
>> >>> * A rich set of primitive data types, including support for spatial,
>> >>>  temporal, and textual data.
>> >>> * Indexing options that include B+ trees, R trees, and inverted
>> >>>  keyword index support.
>> >>> * Basic transactional (concurrency and recovery) capabilities akin to
>> >>>  those of a NoSQL store.
>> >>>
>> >>>
>> >>> Background and Rationale
>> >>>
>> >>> In the world of relational databases, the need to tackle data volumes
>> >>> that exceed the capabilities of a single server led to the
>> >>> development of “shared-nothing” parallel database systems several
>> >>> decades ago. These systems spread data over a cluster based on a
>> >>> partitioning strategy, such as hash partitioning, and queries are
>> >>> processed by employing partitioned-parallel divide-and-conquer
>> >>> techniques. Since these systems are fronted by a high-level,
>> >>> declarative language (SQL), their users are shielded from the
>> >>> complexities of parallel programming. Parallel database systems have
>> >>> been an extremely successful application of parallel computing, and
>> >>> quite a number of commercial products exist today.
>> >>>
>> >>> In the distributed systems world, the Web brought a need to index and
>> >>> query its huge content. SQL and relational databases were not the
>> >>> answer, though shared-nothing clusters again emerged as the hardware
>> >>> platform of choice. Google developed the Google File System (GFS) and
>> >>> MapReduce programming model to allow programmers to store and process
>> >>> Big Data by writing a few user-defined functions. The MapReduce
>> >>> framework applies these functions in parallel to data instances in
>> >>> distributed files (map) and to sorted groups of instances sharing a
>> >>> common key (reduce) -- not unlike the partitioned parallelism in
>> >>> parallel database systems. Apache's Hadoop MapReduce platform is the
>> >>> most prominent implementation of this paradigm for the rest of the
>> >>> Big Data community. On top of Hadoop and HDFS sit declarative
>> >>> languages like Pig and Hive that each compile down to Hadoop
>> >>> MapReduce jobs.
>> >>>
>> >>> The big Web companies were also challenged by extreme user bases
>> >>> (100s of millions of users) and needed fast simple lookups and
>> >>> updates to very large keyed data sets like user profiles. SQL
>> >>> databases were deemed either too expensive or not scalable, so the
>> >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>> >>> popular key-value stores, in this space. MongoDB and Couchbase are
>> >>> other open source alternatives (document stores).
>> >>>
>> >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>> >>> as well as the strong demand for Big Data analytics engines today,
>> >>> that there is a strong (and growing!) need to store, process, *and*
>> >>> query large volumes of semi-structured data in many application
>> >>> areas. Until very recently, developers have had to "choose" between
>> >>> using big data analytics engines like Apache Hive or Apache Spark,
>> >>> which can do complex query processing and analysis over HDFS-resident
>> >>> files, and flexible but low-function data stores like MongoDB or
>> >>> Apache HBase. (The Apache Phoenix project,
>> >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>> >>> aims to bridge between these choices.)
>> >>>
>> >>> AsterixDB is a highly scalable data management system that can store,
>> >>> index, and manage semi-structured data, e.g., much like MongoDB, but
>> >>> it also supports a full-power query language with the expressiveness
>> >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>> >>> stores and manages data, so AsterixDB can exploit its knowledge of
>> >>> data partitioning and the availability of indexes to avoid always
>> >>> scanning data set(s) to process queries. Somewhat surprisingly, there
>> >>> is no open source parallel database system (relational or otherwise)
>> >>> available to developers today -- AsterixDB aims to fill this need.
>> >>> Since Apache is where the majority of today's most important Big
>> >>> Data technologies live, the ASF seems like the obvious home for a
>> >>> system like AsterixDB.
>> >>>
>> >>> Current Status
>> >>>
>> >>> The current version of AsterixDB was co-developed by a team of
>> >>> faculty, staff, and students at UC Irvine and UC Riverside. The
>> >>> project was initiated as a large NSF-sponsored project in 2009, the
>> >>> goal of which was to combine the best ideas from the parallel
>> >>> database world, the then new Hadoop world, and the semi-structured
>> >>> (e.g., XML/JSON) data world in order to create a next-generation
>> >>> BDMS. A first informal open source release was made four years later,
>> >>> in June of 2013, under the Apache Software License 2.0.
>> >>>
>> >>>
>> >>> Meritocracy
>> >>>
>> >>> The current developers are familiar with meritocratic open source
>> >>> development at Apache. Apache was chosen specifically because we want
>> >>> to encourage this style of development for the project.
>> >>>
>> >>>
>> >>> Community
>> >>>
>> >>> While AsterixDB started as a university project it has developed into
>> >>> a community. A number of the initial committers started contributing
>> >>> in academia and continue to actively participate and contribute after
>> >>> graduation. We also seek to further develop the developer and user
>> >>> communities. One ongoing way of broadening the community is through
>> >>> academic collaborations (currently with IIT Mumbai in India and TU
>> >>> Berlin in Germany). During incubation we will also explicitly seek
>> >>> increased industrial participation.
>> >>>
>> >>> Some indicators of the effort's development community and history can
>> >>> be found at:
>> >>>
>> >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
>> >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>> >>>
>> >>>
>> >>> Core Developers
>> >>>
>> >>> The core developers of the project are diverse, although initially UC
>> >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>> >>> other 50% are from other academic institutions (UC Riverside and the
>> >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>> >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>> >>>
>> >>>
>> >>> Alignment
>> >>>
>> >>> Apache is, by far, the most natural home for taking the AsterixDB
>> >>> project forward. A large fraction of today's top Big Data
>> >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>> >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>> >>> significant gap -- the parallel data management system gap -- that
>> >>> exists in the Big Data open source world. It is well-aligned with a
>> >>> number of the Apache projects, e.g., it has strong support for
>> >>> accessing and indexing external data in HDFS, and it uses YARN as an
>> >>> answer to basic cluster resource management. AsterixDB also seeks to
>> >>> achieve an Apache-style development model; it is seeking a broader
>> >>> community of contributors and users in order to achieve its full
>> >>> potential and value to the Big Data community.
>> >>>
>> >>> There are also a number of related Apache projects and dependencies
>> >>> that will be mentioned below in the Relationships with Other Apache
>> >>> Products section.
>> >>>
>> >>>
>> >>> Known Risks
>> >>>
>> >>> Orphaned products
>> >>>
>> >>> Given the current level of intellectual investment in AsterixDB, the
>> >>> risk of the project being abandoned is very small. The UCI/UCR
>> >>> faculty team leads are highly incentivized to continue development
>> >>> since the database groups at UC Irvine and UC Riverside are both
>> >>> reliant on AsterixDB as a platform for long-term graduate research
>> >>> projects. UC San Diego is also beginning to contribute to the code
>> >>> base, and a collaboration involving public health applications is
>> >>> forming with UCLA. The work on AsterixDB is managed via a mix of
>> >>> mailing list discussions supplemented by weekly project status
>> >>> meetings which are summarized on the mailing list. Typical (local
>> >>> plus Skype-in) attendance at the weekly status meetings runs at about
>> >>> 20 active contributors.
>> >>>
>> >>>
>> >>> Inexperience with Open Source
>> >>>
>> >>> AsterixDB and Hyracks were completely developed in Open Source under
>> >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>> >>> lists are available on Google Code and discussions and decisions
>> >>> happen on the mailing lists (which is necessary due to the geographic
>> >>> distribution of the current developers).
>> >>>
>> >>> Also a few of the initial committers have contributed to Apache
>> >>> projects. Vinayak Borkar is a committer on the Apache Helix and
>> >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>> >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>> >>> on the Apache VXQuery project.
>> >>>
>> >>>
>> >>> Relationships with Other Apache Products
>> >>>
>> >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>> >>> is also included in the AsterixDB code base.
>> >>>
>> >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>> >>> is support for accessing external data in HDFS (and Hive formats),
>> >>> and resource management and system administration features are in the
>> >>> process of being migrated to YARN.
>> >>>
>> >>> AsterixDB's AQL query facilities offer comparable query power to
>> >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>> >>> differs in storing and indexing data and thus being able to quickly
>> >>> answer small and medium queries without large HDFS data scans -
>> >>> thereby targeting a different class of use cases.
>> >>>
>> >>> AsterixDB's data storage and indexing facilities are similar to those
>> >>> of HBase, but AsterixDB differs in being a much more complete and
>> >>> queryable BDMS (not just a key-value style store).
>> >>>
>> >>> AsterixDB's target use cases are not in-memory processing or
>> >>> iterative algorithm support, making AsterixDB complementary to the
>> >>> Apache Spark platform. (Spark interoperability is on our longer-term
>> >>> to-do wishlist.)
>> >>>
>> >>>
>> >>> Homogeneous Developers
>> >>>
>> >>> As mentioned before the current community is already organizationally
>> >>> and geographically distributed - and we would like to increase the
>> >>> heterogeneity.
>> >>>
>> >>>
>> >>> Reliance on Salaried Developers
>> >>>
>> >>> Of the initial committers only 3 are full-time UCI staff. The other
>> >>> committers are a mix of students, alumni who continue to contribute
>> >>> to the effort, and individuals working with permission part-time (or
>> >>> in spare time) on this project.
>> >>>
>> >>>
>> >>> An Excessive Fascination with the Apache Brand
>> >>>
>> >>> We believe in the processes, systems, and framework Apache has put in
>> >>> place. Apache is also known to foster a great community around their
>> >>> projects and provide exposure. While brand is important, our
>> >>> fascination with it is not excessive. We believe that the ASF is the
>> >>> right home for AsterixDB and that having AsterixDB inside of the ASF
>> >>> will lead to a better long-term outcome for the Big Data community.
>> >>>
>> >>>
>> >>> Documentation
>> >>>
>> >>> Documentation and publications related to AsterixDB can be found at
>> >>> http://asterixdb.ics.uci.edu/.
>> >>>
>> >>>
>> >>> Initial Source
>> >>>
>> >>> Current source resides in Google Code:
>> >>> https://code.google.com/p/asterixdb/ (query language and upper system
>> >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>> >>> system and storage management libraries).
>> >>>
>> >>>
>> >>> External Dependencies
>> >>>
>> >>> AsterixDB depends on a number of Apache projects:
>> >>>
>> >>> - Ant
>> >>> - Avro
>> >>> - ApacheDB JDO
>> >>> - Commons
>> >>> - Derby
>> >>> - Hadoop
>> >>> - Hive
>> >>> - HTTPComponents
>> >>> - Jakarta ORO
>> >>> - Maven
>> >>> - Tomcat
>> >>> - Thrift
>> >>> - Velocity
>> >>> - Wicket
>> >>> - Xerces
>> >>>
>> >>> and other open source projects (organized by license):
>> >>>
>> >>> -- ASL 2.0:
>> >>> - Jackson
>> >>> - Google Guava
>> >>> - Google Guice
>> >>> - JSON-simple
>> >>> - BoneCP
>> >>> - Microsoft Azure SDK
>> >>> - Netty
>> >>> - Rome
>> >>> - JetS3t
>> >>> - Groovy
>> >>> - Jettison
>> >>> - Plexus
>> >>> - Datanucleus (JDO)
>> >>> - Jetty
>> >>> - Twitter4J
>> >>> - Snappy-java
>> >>>
>> >>> -- BSD:
>> >>> - Antlr
>> >>> - ObjectWeb ASM
>> >>> - Protobuf
>> >>> - JSCH
>> >>> - JavaCC
>> >>> - Paranamer
>> >>> - JLine
>> >>> - Stax
>> >>> - StringTemplate
>> >>> - xmlEnc
>> >>>
>> >>> -- MIT
>> >>> - AppAssembler
>> >>> - SimpleLog4J
>> >>>
>> >>> -- CDDL 1.0
>> >>> - Java Activation Framework
>> >>> - Java Transactions
>> >>> - Java Servlet API
>> >>> - Grizzly
>> >>> - gmbal
>> >>> - Glassfish
>> >>>
>> >>> -- CDDL 1.1
>> >>> - Jersey
>> >>> - JAXB Reference Implementation
>> >>>
>> >>> -- JSON License
>> >>> - JSON
>> >>>
>> >>> -- EPL 1.0
>> >>> - JUnit
>> >>>
>> >>> -- JDOM License
>> >>> - JDOM
>> >>>
>> >>> -- Public Domain
>> >>> - xz
>> >>> - AOPAlliance
>> >>>
>> >>> As all dependencies are managed using Apache Maven, none of the
>> >>> external libraries need to be packaged in a source distribution.
>> >>>
>> >>>
>> >>> Required Resources
>> >>>
>> >>> Developer and user mailing lists
>> >>>
>> >>> priv...@asterixdb.incubator.apache.org (with moderated subscriptions)
>> >>> comm...@asterixdb.incubator.apache.org
>> >>> d...@asterixdb.incubator.apache.org
>> >>> us...@asterixdb.incubator.apache.org
>> >>>
>> >>>
>> >>> A git repository
>> >>>
>> >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>> >>>
>> >>>
>> >>> A JIRA issue tracker
>> >>>
>> >>> https://issues.apache.org/jira/browse/ASTERIXDB
>> >>>
>> >>>
>> >>> Initial Committers
>> >>>
>> >>> The following is a list of the planned initial Apache committers (the
>> >>> active subset of the committers for the current repository at Google
>> >>> Code).
>> >>>
>> >>> Abdullah Alamoudi (bamou...@gmail.com)
>> >>> Cameron Samak (euf...@gmail.com)
>> >>> Chen Li (che...@gmail.com)
>> >>> Ian Maxon (ima...@uci.edu)
>> >>> Ildar Absalyamov (ildar.absalya...@gmail.com)
>> >>> Jianfeng Jia (jianfeng....@gmail.com)
>> >>> Keren Ouaknine (ker...@gmail.com)
>> >>> Markus Dreseler (apa...@dreseler.de)
>> >>> Mike Carey (dtab...@apache.org)
>> >>> Murtadha Hubail (hubail...@gmail.com)
>> >>> Pouria Pirzadeh (pouria.pirza...@gmail.com)
>> >>> Preston Carman (prest...@apache.org)
>> >>> Raman Grover (ramangrove...@gmail.com)
>> >>> Sattam Alsubaiee (salsuba...@gmail.com)
>> >>> Steven Jacobs (sjaco...@apache.org)
>> >>> Taewoo Kim (wangs...@gmail.com)
>> >>> Till Westmann (ti...@apache.org)
>> >>> Vinayak Borkar (vinay...@apache.org)
>> >>> Yingyi Bu (buyin...@gmail.com)
>> >>> Young-Seok Kim (kiss...@gmail.com)
>> >>> Zach Heilbron (zheilb...@gmail.com)
>> >>>
>> >>>
>> >>> Affiliations
>> >>>
>> >>> UC Irvine
>> >>> - Mike Carey
>> >>> - Chen Li
>> >>> - Ian Maxon
>> >>> - Yingyi Bu
>> >>> - Raman Grover
>> >>> - Pouria Pirzadeh
>> >>> - Young-Seok Kim
>> >>> - Cameron Samak
>> >>> - Taewoo Kim
>> >>> - Jianfeng Jia
>> >>> - Murtadha Hubail
>> >>> - Markus Dreseler
>> >>>
>> >>> UC Riverside
>> >>> - Ildar Absalyamov
>> >>> - Preston Carman
>> >>> - Steven Jacobs
>> >>>
>> >>> Hebrew University
>> >>> - Keren Ouaknine
>> >>>
>> >>> Oracle
>> >>> - Till Westmann
>> >>>
>> >>> X15 Software
>> >>> - Vinayak Borkar
>> >>> - Zach Heilbron
>> >>>
>> >>> KACST Saudi Arabia
>> >>> - Sattam Alsubaiee
>> >>>
>> >>> Saudi Aramco
>> >>> - Abdullah Alamoudi
>> >>>
>> >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>> >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>> >>> non-UC committers are a mix of alumni who continue to contribute to
>> >>> the effort and individuals working with permission part-time (or in
>> >>> spare time) on this project.
>> >>>
>> >>>
>> >>> Sponsors
>> >>>
>> >>> Champion
>> >>>
>> >>> Chris Mattmann (NASA/JPL)
>> >>>
>> >>> Nominated Mentors
>> >>>
>> >>> TBD
>> >>>
>> >>> Sponsoring Entity
>> >>>
>> >>> The Apache Incubator
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Chris Mattmann, Ph.D.
>> >>> Chief Architect
>> >>> Instrument Software and Science Data Systems Section (398)
>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >>> Office: 168-519, Mailstop: 168-527
>> >>> Email: chris.a.mattm...@nasa.gov
>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Adjunct Associate Professor, Computer Science Department
>> >>> University of Southern California, Los Angeles, CA 90089 USA
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>>
>> >>>
>> >>>
>> >>>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>>
>>
>
>
