Guys, any updates on this?

On Sun, Aug 12, 2012 at 12:31 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Yes. Saw that. Responded to him privately at the time.
>
> Good humor or good typo.
>
> On Sat, Aug 11, 2012 at 7:29 AM, Doug Cutting <cutt...@gmail.com> wrote:
>
> > Otis said his vote was 'blinding', not 'binding'.
> >
> > Doug
> > On Aug 11, 2012 12:28 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> >
> > > This vote is now closed.
> > >
> > > In the responses to this thread, I count 15 binding positive votes and
> > > 4 non-binding votes. The number of positive votes increases to 17 if
> > > you count myself (the champion) and Isabel (a mentor) but neither of
> > > us actually sent the key email to record a vote (oops).
> > >
> > > One of the non-binding votes was by Otis Gospadnetic who said that his
> > > vote was binding, but I didn't find his name on the list of incubator
> > > PMC members, so I counted it as non-binding. The list I used is at
> > > http://people.apache.org/committers-by-project.html#incubator-pmc
> > >
> > > By any count, this vote to admit Drill to incubator therefore passes.
> > >
> > > This proposal includes mentors so this vote also constitutes
> > > acceptance of the mentors by the Incubator PMC. All three of the
> > > mentors (Grant, myself, and Isabel) are Apache members.
> > >
> > > This proposal as approved also includes an initial list of committers,
> > > all of whom have ICLA's on file.
> > >
> > > I will coordinate with the other mentors and the committers to commit
> > > the status file and perform other establishment activities necessary
> > > to establish Drill as a project under incubation. I expect that this
> > > will take several days. I will announce progress on this mailing list
> > > to allow people to subscribe to the mailing lists.
> > >
> > >
> > > On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell <apurt...@apache.org> wrote:
> > > > +1 (non-binding)
> > > >
> > > > On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > >> I would like to call a vote for accepting Drill for incubation in the
> > > >> Apache Incubator. The full proposal is available below. Discussion
> > > >> over the last few days has been quite positive.
> > > >>
> > > >> Please cast your vote:
> > > >>
> > > >> [ ] +1, bring Drill into Incubator
> > > >> [ ] +0, I don't care either way,
> > > >> [ ] -1, do not bring Drill into Incubator, because...
> > > >>
> > > >> This vote will be open for 72 hours and only votes from the Incubator
> > > >> PMC are binding. The start of the vote is just before 3AM UTC on 8
> > > >> August so the closing time will be 3AM UTC on 11 August.
> > > >>
> > > >> Thank you for your consideration!
> > > >>
> > > >> Ted
> > > >>
> > > >> http://wiki.apache.org/incubator/DrillProposal
> > > >>
> > > >> = Drill =
> > > >>
> > > >> == Abstract ==
> > > >> Drill is a distributed system for interactive analysis of large-scale
> > > >> datasets, inspired by
> > > >> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> > > >>
> > > >> == Proposal ==
> > > >> Drill is a distributed system for interactive analysis of large-scale
> > > >> datasets. Drill is similar to Google's Dremel, with the additional
> > > >> flexibility needed to support a broader range of query languages, data
> > > >> formats and data sources. It is designed to efficiently process nested
> > > >> data. It is a design goal to scale to 10,000 servers or more and to be
> > > >> able to process petabytes of data and trillions of records in seconds.
> > > >>
> > > >> == Background ==
> > > >> Many organizations have the need to run data-intensive applications,
> > > >> including batch processing, stream processing and interactive
> > > >> analysis.
> > > >> In recent years open source systems have emerged to address
> > > >> the need for scalable batch processing (Apache Hadoop) and stream
> > > >> processing (Storm, Apache S4). In 2010 Google published a paper called
> > > >> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> > > >> scalable system used internally for interactive analysis of nested
> > > >> data. No open source project has successfully replicated the
> > > >> capabilities of Dremel.
> > > >>
> > > >> == Rationale ==
> > > >> There is a strong need in the market for low-latency interactive
> > > >> analysis of large-scale datasets, including nested data (eg, JSON,
> > > >> Avro, Protocol Buffers). This need was identified by Google and
> > > >> addressed internally with a system called Dremel.
> > > >>
> > > >> In recent years open source systems have emerged to address the need
> > > >> for scalable batch processing (Apache Hadoop) and stream processing
> > > >> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> > > >> internal MapReduce system, is used by thousands of organizations
> > > >> processing large-scale datasets. Apache Hadoop is designed to achieve
> > > >> very high throughput, but is not designed to achieve the sub-second
> > > >> latency needed for interactive data analysis and exploration. Drill,
> > > >> inspired by Google's internal Dremel system, is intended to address
> > > >> this need.
> > > >>
> > > >> It is worth noting that, as explained by Google in the original paper,
> > > >> Dremel complements MapReduce-based computing. Dremel is not intended
> > > >> as a replacement for MapReduce and is often used in conjunction with
> > > >> it to analyze outputs of MapReduce pipelines or rapidly prototype
> > > >> larger computations. Indeed, Dremel and MapReduce are both used by
> > > >> thousands of Google employees.
> > > >>
> > > >> Like Dremel, Drill supports a nested data model with data encoded in a
> > > >> number of formats such as JSON, Avro or Protocol Buffers. In many
> > > >> organizations nested data is the standard, so supporting a nested data
> > > >> model eliminates the need to normalize the data. With that said, flat
> > > >> data formats, such as CSV files, are naturally supported as a special
> > > >> case of nested data.
> > > >>
> > > >> The Drill architecture consists of four key components/layers:
> > > >> * Query languages: This layer is responsible for parsing the user's
> > > >> query and constructing an execution plan. The initial goal is to
> > > >> support the SQL-like language used by Dremel and
> > > >> [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]],
> > > >> which we call DrQL. However, Drill is designed to support
> > > >> other languages and programming models, such as the
> > > >> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]],
> > > >> [[http://www.cascading.org/|Cascading]] or
> > > >> [[https://github.com/tdunning/Plume|Plume]].
> > > >> * Low-latency distributed execution engine: This layer is responsible
> > > >> for executing the physical plan. It provides the scalability and fault
> > > >> tolerance needed to efficiently query petabytes of data on 10,000
> > > >> servers. Drill's execution engine is based on research in distributed
> > > >> execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> > > >> columnar storage, and can be extended with additional operators and
> > > >> connectors.
> > > >> * Nested data formats: This layer is responsible for supporting
> > > >> various data formats. The initial goal is to support the column-based
> > > >> format used by Dremel.
> > > >> Drill is designed to support schema-based
> > > >> formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
> > > >> and schema-less formats such as JSON, BSON or YAML. In addition, it is
> > > >> designed to support column-based formats such as Dremel,
> > > >> AVRO-806/Trevni and RCFile, and row-based formats such as Protocol
> > > >> Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill
> > > >> is that the execution engine is flexible enough to support
> > > >> column-based processing as well as row-based processing. This is
> > > >> important because column-based processing can be much more efficient
> > > >> when the data is stored in a column-based format, but many large data
> > > >> assets are stored in a row-based format that would require conversion
> > > >> before use.
> > > >> * Scalable data sources: This layer is responsible for supporting
> > > >> various data sources. The initial focus is to leverage Hadoop as a
> > > >> data source.
> > > >>
> > > >> It is worth noting that no open source project has successfully
> > > >> replicated the capabilities of Dremel, nor have any taken on the
> > > >> broader goals of flexibility (eg, pluggable query languages, data
> > > >> formats, data sources and execution engine operators/connectors) that
> > > >> are part of Drill.
> > > >>
> > > >> == Initial Goals ==
> > > >> The initial goals for this project are to specify the detailed
> > > >> requirements and architecture, and then develop the initial
> > > >> implementation including the execution engine and DrQL.
> > > >> Like Apache Hadoop, which was built to support multiple storage
> > > >> systems (through the FileSystem API) and file formats (through the
> > > >> InputFormat/OutputFormat APIs), Drill will be built to support
> > > >> multiple query languages, data formats and data sources. The initial
> > > >> implementation of Drill will support the DrQL and a column-based
> > > >> format similar to Dremel.
> > > >>
> > > >> == Current Status ==
> > > >> Significant work has been completed to identify the initial
> > > >> requirements and define the overall system architecture. The next step
> > > >> is to implement the four components described in the Rationale
> > > >> section, and we intend to do that development as an Apache project.
> > > >>
> > > >> === Meritocracy ===
> > > >> We plan to invest in supporting a meritocracy. We will discuss the
> > > >> requirements in an open forum. Several companies have already
> > > >> expressed interest in this project, and we intend to invite additional
> > > >> developers to participate. We will encourage and monitor community
> > > >> participation so that privileges can be extended to those that
> > > >> contribute. Also, Drill has an extensible/pluggable architecture that
> > > >> encourages developers to contribute various extensions, such as query
> > > >> languages, data formats, data sources and execution engine operators
> > > >> and connectors. While some companies will surely develop commercial
> > > >> extensions, we also anticipate that some companies and individuals
> > > >> will want to contribute such extensions back to the project, and we
> > > >> look forward to fostering a rich ecosystem of extensions.
> > > >>
> > > >> === Community ===
> > > >> The need for a system for interactive analysis of large datasets in
> > > >> the open source is tremendous, so there is a potential for a very
> > > >> large community. We believe that Drill's extensible architecture will
> > > >> further encourage community participation. Also, related Apache
> > > >> projects (eg, Hadoop) have very large and active communities, and we
> > > >> expect that over time Drill will also attract a large community.
> > > >>
> > > >> === Core Developers ===
> > > >> The developers on the initial committers list include experienced
> > > >> distributed systems engineers:
> > > >> * Tomer Shiran has experience developing distributed execution
> > > >> engines. He developed Parallel DataSeries, a data-parallel version of
> > > >> the open source [[http://tesla.hpl.hp.com/opensource/|DataSeries]]
> > > >> system. He is also the author of Applying Idealized Lower-bound
> > > >> Runtime Models to Understand Inefficiencies in Data-intensive
> > > >> Computing (SIGMETRICS 2011). Tomer worked as a software developer and
> > > >> researcher at IBM Research, Microsoft and HP Labs, and is now at MapR
> > > >> Technologies. He has been active in the Hadoop community since 2009.
> > > >> * Jason Frantz was at Clustrix, where he designed and developed the
> > > >> first scale-out SQL database based on MySQL. Jason developed the
> > > >> distributed query optimizer that powered Clustrix. He is now a
> > > >> software engineer and architect at MapR Technologies.
> > > >> * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
> > > >> and has a history of over 30 years of contributions to open source. He
> > > >> is now at MapR Technologies. Ted has been very active in the Hadoop
> > > >> community since the project's early days.
> > > >> * MC Srivas is the co-founder and CTO of MapR Technologies. While at
> > > >> Google he worked on Google's scalable search infrastructure. MC Srivas
> > > >> has been active in the Hadoop community since 2009.
> > > >> * Chris Wensel is the founder and CEO of Concurrent. Prior to
> > > >> founding Concurrent, he developed Cascading, an Apache-licensed open
> > > >> source application framework enabling Java developers to quickly and
> > > >> easily develop robust Data Analytics and Data Management applications
> > > >> on Apache Hadoop. Chris has been involved in the Hadoop community
> > > >> since the project's early days.
> > > >> * Keys Botzum was at IBM, where he worked on security and distributed
> > > >> systems, and is currently at MapR Technologies.
> > > >> * Gera Shegalov was at Oracle, where he worked on networking, storage
> > > >> and database kernels, and is currently at MapR Technologies.
> > > >> * Ryan Rawson is the VP Engineering of Drawn to Scale where he
> > > >> developed Spire, a real-time operational database for Hadoop. He is
> > > >> also a committer and PMC member for Apache HBase, and has a long
> > > >> history of contributions to open source. Ryan has been involved in the
> > > >> Hadoop community since the project's early days.
> > > >>
> > > >> We realize that additional employer diversity is needed, and we will
> > > >> work aggressively to recruit developers from additional companies.
> > > >>
> > > >> === Alignment ===
> > > >> The initial committers strongly believe that a system for interactive
> > > >> analysis of large-scale datasets will gain broader adoption as an open
> > > >> source, community driven project, where the community can contribute
> > > >> not only to the core components, but also to a growing collection of
> > > >> query languages and optimizers, data formats, data sources, and
> > > >> execution engine operators and connectors. Drill will integrate
> > > >> closely with Apache Hadoop. First, the data will live in Hadoop. That
> > > >> is, Drill will support Hadoop FileSystem implementations and HBase.
> > > >> Second, Hadoop-related data formats will be supported (eg, Apache
> > > >> Avro, RCFile). Third, MapReduce-based tools will be provided to
> > > >> produce column-based formats. Fourth, Drill tables can be registered
> > > >> in HCatalog. Finally, Hive is being considered as the basis of the
> > > >> DrQL implementation.
> > > >>
> > > >> == Known Risks ==
> > > >>
> > > >> === Orphaned Products ===
> > > >> The contributors are leading vendors in this space, with significant
> > > >> open source experience, so the risk of being orphaned is relatively
> > > >> low. The project could be at risk if vendors decided to change their
> > > >> strategies in the market. In such an event, the current committers
> > > >> plan to continue working on the project on their own time, though the
> > > >> progress will likely be slower. We plan to mitigate this risk by
> > > >> recruiting additional committers.
> > > >>
> > > >> === Inexperience with Open Source ===
> > > >> The initial committers include veteran Apache members (committers and
> > > >> PMC members) and other developers who have varying degrees of
> > > >> experience with open source projects. All have been involved with
> > > >> source code that has been released under an open source license, and
> > > >> several also have experience developing code with an open source
> > > >> development process.
> > > >>
> > > >> === Homogenous Developers ===
> > > >> The initial committers are employed by a number of companies,
> > > >> including MapR Technologies, Concurrent and Drawn to Scale. We are
> > > >> committed to recruiting additional committers from other companies.
> > > >>
> > > >> === Reliance on Salaried Developers ===
> > > >> It is expected that Drill development will occur on both salaried time
> > > >> and on volunteer time, after hours. The majority of initial committers
> > > >> are paid by their employer to contribute to this project. However,
> > > >> they are all passionate about the project, and we are confident that
> > > >> the project will continue even if no salaried developers contribute to
> > > >> the project. We are committed to recruiting additional committers
> > > >> including non-salaried developers.
> > > >>
> > > >> === Relationships with Other Apache Products ===
> > > >> As mentioned in the Alignment section, Drill is closely integrated
> > > >> with Hadoop, Avro, Hive and HBase in numerous ways. For example,
> > > >> Drill data lives inside a Hadoop environment (Drill operates on in
> > > >> situ data). We look forward to collaborating with those communities,
> > > >> as well as other Apache communities.
> > > >>
> > > >> === An Excessive Fascination with the Apache Brand ===
> > > >> Drill solves a real problem that many organizations struggle with, and
> > > >> has been proven within Google to be of significant value. The
> > > >> architecture is based on academic and industry research. Our rationale
> > > >> for developing Drill as an Apache project is detailed in the Rationale
> > > >> section. We believe that the Apache brand and community process will
> > > >> help us attract more contributors to this project, and help establish
> > > >> ubiquitous APIs. In addition, establishing consensus among users and
> > > >> developers of a Dremel-like tool is a key requirement for success of
> > > >> the project.
> > > >>
> > > >> == Documentation ==
> > > >> Drill is inspired by Google's Dremel. Google has published a
> > > >> [[http://research.google.com/pubs/pub36632.html|paper]] highlighting
> > > >> Dremel's innovative nested column-based data format and execution
> > > >> engine.
> > > >>
> > > >> == Initial Source ==
> > > >> The requirement and design documents are currently stored in MapR
> > > >> Technologies' source code repository. They will be checked in as part
> > > >> of the initial code dump.
> > > >>
> > > >> == Cryptography ==
> > > >> Drill will eventually support encryption on the wire. This is not one
> > > >> of the initial goals, and we do not expect Drill to be a controlled
> > > >> export item due to the use of encryption.
> > > >>
> > > >> == Required Resources ==
> > > >>
> > > >> === Mailing List ===
> > > >> * drill-private
> > > >> * drill-dev
> > > >> * drill-user
> > > >>
> > > >> === Subversion Directory ===
> > > >> Git is the preferred source control system: git://git.apache.org/drill
> > > >>
> > > >> === Issue Tracking ===
> > > >> JIRA Drill (DRILL)
> > > >>
> > > >> == Initial Committers ==
> > > >> * Tomer Shiran <tshiran at maprtech dot com>
> > > >> * Ted Dunning <tdunning at apache dot org>
> > > >> * Jason Frantz <jfrantz at maprtech dot com>
> > > >> * MC Srivas <mcsrivas at maprtech dot com>
> > > >> * Chris Wensel <chris at concurrentinc dot com>
> > > >> * Keys Botzum <kbotzum at maprtech dot com>
> > > >> * Gera Shegalov <gshegalov at maprtech dot com>
> > > >> * Ryan Rawson <ryan at drawntoscale dot com>
> > > >>
> > > >> == Affiliations ==
> > > >> The initial committers are employees of MapR Technologies, Drawn to
> > > >> Scale and Concurrent. The nominated mentors are employees of MapR
> > > >> Technologies, Lucid Imagination and Nokia.
> > > >>
> > > >> == Sponsors ==
> > > >>
> > > >> === Champion ===
> > > >> Ted Dunning (tdunning at apache dot org)
> > > >>
> > > >> === Nominated Mentors ===
> > > >> * Ted Dunning <tdunning at apache dot org> – Chief Application
> > > >> Architect at MapR Technologies, Committer for Lucene, Mahout and
> > > >> ZooKeeper.
> > > >> * Grant Ingersoll <grant at lucidimagination dot com> – Chief
> > > >> Scientist at Lucid Imagination, Committer for Lucene, Mahout and other
> > > >> projects.
> > > >> * Isabel Drost <isabel at apache dot org> – Software Developer at
> > > >> Nokia Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> > > >>
> > > >> === Sponsoring Entity ===
> > > >> Incubator
> > > >>
> > > >> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > >> For additional commands, e-mail: general-h...@incubator.apache.org
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > > Hein (via Tom White)