+1 (binding)

On Thu, Aug 9, 2012 at 1:05 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> +1
>
> Tommaso
>
> 2012/8/8 Ted Dunning <ted.dunn...@gmail.com>
>
>> I would like to call a vote for accepting Drill for incubation in the
>> Apache Incubator. The full proposal is available below. Discussion
>> over the last few days has been quite positive.
>>
>> Please cast your vote:
>>
>> [ ] +1, bring Drill into Incubator
>> [ ] +0, I don't care either way,
>> [ ] -1, do not bring Drill into Incubator, because...
>>
>> This vote will be open for 72 hours and only votes from the Incubator
>> PMC are binding. The start of the vote is just before 3AM UTC on 8
>> August so the closing time will be 3AM UTC on 11 August.
>>
>> Thank you for your consideration!
>>
>> Ted
>>
>> http://wiki.apache.org/incubator/DrillProposal
>>
>> = Drill =
>>
>> == Abstract ==
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets, inspired by
>> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>>
>> == Proposal ==
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets. Drill is similar to Google's Dremel, with the additional
>> flexibility needed to support a broader range of query languages, data
>> formats and data sources. It is designed to efficiently process nested
>> data. It is a design goal to scale to 10,000 servers or more and to be
>> able to process petabytes of data and trillions of records in seconds.
>>
>> == Background ==
>> Many organizations have the need to run data-intensive applications,
>> including batch processing, stream processing and interactive
>> analysis. In recent years open source systems have emerged to address
>> the need for scalable batch processing (Apache Hadoop) and stream
>> processing (Storm, Apache S4).
>> In 2010 Google published a paper called
>> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
>> scalable system used internally for interactive analysis of nested
>> data. No open source project has successfully replicated the
>> capabilities of Dremel.
>>
>> == Rationale ==
>> There is a strong need in the market for low-latency interactive
>> analysis of large-scale datasets, including nested data (eg, JSON,
>> Avro, Protocol Buffers). This need was identified by Google and
>> addressed internally with a system called Dremel.
>>
>> In recent years open source systems have emerged to address the need
>> for scalable batch processing (Apache Hadoop) and stream processing
>> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
>> internal MapReduce system, is used by thousands of organizations
>> processing large-scale datasets. Apache Hadoop is designed to achieve
>> very high throughput, but is not designed to achieve the sub-second
>> latency needed for interactive data analysis and exploration. Drill,
>> inspired by Google's internal Dremel system, is intended to address
>> this need.
>>
>> It is worth noting that, as explained by Google in the original paper,
>> Dremel complements MapReduce-based computing. Dremel is not intended
>> as a replacement for MapReduce and is often used in conjunction with
>> it to analyze outputs of MapReduce pipelines or rapidly prototype
>> larger computations. Indeed, Dremel and MapReduce are both used by
>> thousands of Google employees.
>>
>> Like Dremel, Drill supports a nested data model with data encoded in a
>> number of formats such as JSON, Avro or Protocol Buffers. In many
>> organizations nested data is the standard, so supporting a nested data
>> model eliminates the need to normalize the data. With that said, flat
>> data formats, such as CSV files, are naturally supported as a special
>> case of nested data.
>>
>> The Drill architecture consists of four key components/layers:
>> * Query languages: This layer is responsible for parsing the user's
>> query and constructing an execution plan. The initial goal is to
>> support the SQL-like language used by Dremel and
>> [[https://developers.google.com/bigquery/docs/query-reference|Google
>> BigQuery]], which we call DrQL. However, Drill is designed to support
>> other languages and programming models, such as the
>> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
>> Language]], [[http://www.cascading.org/|Cascading]] or
>> [[https://github.com/tdunning/Plume|Plume]].
>> * Low-latency distributed execution engine: This layer is responsible
>> for executing the physical plan. It provides the scalability and fault
>> tolerance needed to efficiently query petabytes of data on 10,000
>> servers. Drill's execution engine is based on research in distributed
>> execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
>> columnar storage, and can be extended with additional operators and
>> connectors.
>> * Nested data formats: This layer is responsible for supporting
>> various data formats. The initial goal is to support the column-based
>> format used by Dremel. Drill is designed to support schema-based
>> formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
>> and schema-less formats such as JSON, BSON or YAML. In addition, it is
>> designed to support column-based formats such as Dremel,
>> AVRO-806/Trevni and RCFile, and row-based formats such as Protocol
>> Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill
>> is that the execution engine is flexible enough to support
>> column-based processing as well as row-based processing.
>> This is important because column-based processing can be much more
>> efficient when the data is stored in a column-based format, but many
>> large data assets are stored in a row-based format that would require
>> conversion before use.
>> * Scalable data sources: This layer is responsible for supporting
>> various data sources. The initial focus is to leverage Hadoop as a
>> data source.
>>
>> It is worth noting that no open source project has successfully
>> replicated the capabilities of Dremel, nor have any taken on the
>> broader goals of flexibility (eg, pluggable query languages, data
>> formats, data sources and execution engine operators/connectors) that
>> are part of Drill.
>>
>> == Initial Goals ==
>> The initial goals for this project are to specify the detailed
>> requirements and architecture, and then develop the initial
>> implementation including the execution engine and DrQL.
>> Like Apache Hadoop, which was built to support multiple storage
>> systems (through the FileSystem API) and file formats (through the
>> InputFormat/OutputFormat APIs), Drill will be built to support
>> multiple query languages, data formats and data sources. The initial
>> implementation of Drill will support the DrQL and a column-based
>> format similar to Dremel.
>>
>> == Current Status ==
>> Significant work has been completed to identify the initial
>> requirements and define the overall system architecture. The next step
>> is to implement the four components described in the Rationale
>> section, and we intend to do that development as an Apache project.
>>
>> === Meritocracy ===
>> We plan to invest in supporting a meritocracy. We will discuss the
>> requirements in an open forum. Several companies have already
>> expressed interest in this project, and we intend to invite additional
>> developers to participate. We will encourage and monitor community
>> participation so that privileges can be extended to those that
>> contribute.
>> Also, Drill has an extensible/pluggable architecture that
>> encourages developers to contribute various extensions, such as query
>> languages, data formats, data sources and execution engine operators
>> and connectors. While some companies will surely develop commercial
>> extensions, we also anticipate that some companies and individuals
>> will want to contribute such extensions back to the project, and we
>> look forward to fostering a rich ecosystem of extensions.
>>
>> === Community ===
>> The need for a system for interactive analysis of large datasets in
>> the open source is tremendous, so there is a potential for a very
>> large community. We believe that Drill's extensible architecture will
>> further encourage community participation. Also, related Apache
>> projects (eg, Hadoop) have very large and active communities, and we
>> expect that over time Drill will also attract a large community.
>>
>> === Core Developers ===
>> The developers on the initial committers list include experienced
>> distributed systems engineers:
>> * Tomer Shiran has experience developing distributed execution
>> engines. He developed Parallel DataSeries, a data-parallel version of
>> the open source [[http://tesla.hpl.hp.com/opensource/|DataSeries]]
>> system. He is also the author of Applying Idealized Lower-bound
>> Runtime Models to Understand Inefficiencies in Data-intensive
>> Computing (SIGMETRICS 2011). Tomer worked as a software developer and
>> researcher at IBM Research, Microsoft and HP Labs, and is now at MapR
>> Technologies. He has been active in the Hadoop community since 2009.
>> * Jason Frantz was at Clustrix, where he designed and developed the
>> first scale-out SQL database based on MySQL. Jason developed the
>> distributed query optimizer that powered Clustrix. He is now a
>> software engineer and architect at MapR Technologies.
>> * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
>> and has a history of over 30 years of contributions to open source. He
>> is now at MapR Technologies. Ted has been very active in the Hadoop
>> community since the project's early days.
>> * MC Srivas is the co-founder and CTO of MapR Technologies. While at
>> Google he worked on Google's scalable search infrastructure. MC Srivas
>> has been active in the Hadoop community since 2009.
>> * Chris Wensel is the founder and CEO of Concurrent. Prior to
>> founding Concurrent, he developed Cascading, an Apache-licensed open
>> source application framework enabling Java developers to quickly and
>> easily develop robust Data Analytics and Data Management applications
>> on Apache Hadoop. Chris has been involved in the Hadoop community
>> since the project's early days.
>> * Keys Botzum was at IBM, where he worked on security and distributed
>> systems, and is currently at MapR Technologies.
>> * Gera Shegalov was at Oracle, where he worked on networking, storage
>> and database kernels, and is currently at MapR Technologies.
>> * Ryan Rawson is the VP Engineering of Drawn to Scale where he
>> developed Spire, a real-time operational database for Hadoop. He is
>> also a committer and PMC member for Apache HBase, and has a long
>> history of contributions to open source. Ryan has been involved in the
>> Hadoop community since the project's early days.
>>
>> We realize that additional employer diversity is needed, and we will
>> work aggressively to recruit developers from additional companies.
>>
>> === Alignment ===
>> The initial committers strongly believe that a system for interactive
>> analysis of large-scale datasets will gain broader adoption as an open
>> source, community driven project, where the community can contribute
>> not only to the core components, but also to a growing collection of
>> query languages and optimizers, data formats, data sources, and
>> execution engine operators and connectors. Drill will integrate
>> closely with Apache Hadoop. First, the data will live in Hadoop. That
>> is, Drill will support Hadoop FileSystem implementations and HBase.
>> Second, Hadoop-related data formats will be supported (eg, Apache
>> Avro, RCFile). Third, MapReduce-based tools will be provided to
>> produce column-based formats. Fourth, Drill tables can be registered
>> in HCatalog. Finally, Hive is being considered as the basis of the
>> DrQL implementation.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>> The contributors are leading vendors in this space, with significant
>> open source experience, so the risk of being orphaned is relatively
>> low. The project could be at risk if vendors decided to change their
>> strategies in the market. In such an event, the current committers
>> plan to continue working on the project on their own time, though the
>> progress will likely be slower. We plan to mitigate this risk by
>> recruiting additional committers.
>>
>> === Inexperience with Open Source ===
>> The initial committers include veteran Apache members (committers and
>> PMC members) and other developers who have varying degrees of
>> experience with open source projects. All have been involved with
>> source code that has been released under an open source license, and
>> several also have experience developing code with an open source
>> development process.
>>
>> === Homogenous Developers ===
>> The initial committers are employed by a number of companies,
>> including MapR Technologies, Concurrent and Drawn to Scale.
>> We are committed to recruiting additional committers from other
>> companies.
>>
>> === Reliance on Salaried Developers ===
>> It is expected that Drill development will occur on both salaried time
>> and on volunteer time, after hours. The majority of initial committers
>> are paid by their employer to contribute to this project. However,
>> they are all passionate about the project, and we are confident that
>> the project will continue even if no salaried developers contribute to
>> the project. We are committed to recruiting additional committers
>> including non-salaried developers.
>>
>> === Relationships with Other Apache Products ===
>> As mentioned in the Alignment section, Drill is closely integrated
>> with Hadoop, Avro, Hive and HBase in numerous ways. For example,
>> Drill data lives inside a Hadoop environment (Drill operates on in
>> situ data). We look forward to collaborating with those communities,
>> as well as other Apache communities.
>>
>> === An Excessive Fascination with the Apache Brand ===
>> Drill solves a real problem that many organizations struggle with, and
>> has been proven within Google to be of significant value. The
>> architecture is based on academic and industry research. Our rationale
>> for developing Drill as an Apache project is detailed in the Rationale
>> section. We believe that the Apache brand and community process will
>> help us attract more contributors to this project, and help establish
>> ubiquitous APIs. In addition, establishing consensus among users and
>> developers of a Dremel-like tool is a key requirement for success of
>> the project.
>>
>> == Documentation ==
>> Drill is inspired by Google's Dremel. Google has published a
>> [[http://research.google.com/pubs/pub36632.html|paper]] highlighting
>> Dremel's innovative nested column-based data format and execution
>> engine.
>>
>> == Initial Source ==
>> The requirement and design documents are currently stored in MapR
>> Technologies' source code repository. They will be checked in as part
>> of the initial code dump.
>>
>> == Cryptography ==
>> Drill will eventually support encryption on the wire. This is not one
>> of the initial goals, and we do not expect Drill to be a controlled
>> export item due to the use of encryption.
>>
>> == Required Resources ==
>>
>> === Mailing List ===
>> * drill-private
>> * drill-dev
>> * drill-user
>>
>> === Subversion Directory ===
>> Git is the preferred source control system: git://git.apache.org/drill
>>
>> === Issue Tracking ===
>> JIRA Drill (DRILL)
>>
>> == Initial Committers ==
>> * Tomer Shiran <tshiran at maprtech dot com>
>> * Ted Dunning <tdunning at apache dot org>
>> * Jason Frantz <jfrantz at maprtech dot com>
>> * MC Srivas <mcsrivas at maprtech dot com>
>> * Chris Wensel <chris at concurrentinc dot com>
>> * Keys Botzum <kbotzum at maprtech dot com>
>> * Gera Shegalov <gshegalov at maprtech dot com>
>> * Ryan Rawson <ryan at drawntoscale dot com>
>>
>> == Affiliations ==
>> The initial committers are employees of MapR Technologies, Drawn to
>> Scale and Concurrent. The nominated mentors are employees of MapR
>> Technologies, Lucid Imagination and Nokia.
>>
>> == Sponsors ==
>>
>> === Champion ===
>> Ted Dunning (tdunning at apache dot org)
>>
>> === Nominated Mentors ===
>> * Ted Dunning <tdunning at apache dot org> – Chief Application
>> Architect at MapR Technologies, Committer for Lucene, Mahout and
>> ZooKeeper.
>> * Grant Ingersoll <grant at lucidimagination dot com> – Chief
>> Scientist at Lucid Imagination, Committer for Lucene, Mahout and other
>> projects.
>> * Isabel Drost <isabel at apache dot org> – Software Developer at
>> Nokia Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>>
>> === Sponsoring Entity ===
>> Incubator
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>>
>>