Hi,

This proposal looks very interesting to me. What exactly is the scope of Tez? Does it aim to be a general data flow system such as Stratosphere[1] or Hyracks[2]? Or will it still execute Map and Reduce tasks that are composable in a more flexible manner?

Best,
Sebastian

[1] http://dl.acm.org/citation.cfm?id=1807148
    https://www.stratosphere.eu/sites/default/files/papers/NephelePACTs_10.pdf
[2] http://dl.acm.org/citation.cfm?id=2005632
    http://asterix.ics.uci.edu/pub/Hyracks.pdf

On 19.02.2013 09:53, Avik Dey wrote:
> The Tez incubator proposal seems to have a lot in common with the work on
> https://issues.apache.org/jira/browse/OOZIE-1178
>
>> It is useful to have a workflow application master, which will be capable
>> of running a DAG of jobs. The workflow client submits a DAG request to the
>> AM and then the AM will manage the life cycle of this application in terms
>> of requesting the needed resources from the RM, and starting, monitoring
>> and retrying the application's individual tasks.
>>
>> Compared to running Oozie with the current MapReduce Application Master,
>> these are some of the advantages:
>>
>> - Less number of consumed resources, since only one application master
>>   will be spawned for the whole workflow.
>> - Reuse of resources, since the same resources can be used by multiple
>>   consecutive jobs in the workflow (no need to request/wait for resources
>>   for every individual job from the central RM).
>> - More optimization opportunities in terms of collective resource
>>   requests.
>> - Optimization opportunities in terms of rewriting and composing jobs
>>   in the workflow (e.g. pushing down Mappers).
>> - This Application Master can be reused/extended by higher systems
>>   like Pig and hive to provide an optimized way of running their workflows.
>>
>> So, is this the 'yapp' proposal that was discussed on that thread?
>
> ~avik
>
>
> On Mon, Feb 18, 2013 at 9:40 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> This seems like a reasonable project (basically it is the long fabled
>> map-reduce-reduce or MCR* in google terminology).
>>
>> But it is *very* heavy with Hortonworks developers.
>> By my count, the proportion is over half from HW with only token
>> representation from other companies:
>>
>> 13 Hortonworks
>> 4 Yahoo
>> 3 Facebook
>> 2 Microsoft
>> 1 Cloudera
>>
>> Shouldn't this be a bit broader to start with? Or is that an incubation
>> task?
>>
>> On Mon, Feb 18, 2013 at 9:29 PM, Arun C Murthy <a...@hortonworks.com>
>> wrote:
>>
>>> Folks,
>>>
>>> I'd like to propose adding Tez to the Apache Incubator:
>>> http://wiki.apache.org/incubator/TezProposal
>>>
>>> Essentially, it's the next step to improve projects in the Apache Hadoop
>>> ecosystem such as Apache Hive, Apache Pig, Cascading (ASL2, but not ASF
>>> project) by providing a more complex DAG of 'tasks' in a single
>>> application to process data, thereby providing significant advantages
>>> for them.
>>>
>>> During the time I've spent working on MapReduce, I've forever heard
>>> complaints from Pig/Hive folks about the fact that MapReduce provides a
>>> very constrained task graph which results in excessive number of
>>> MapReduce jobs... *smile*. It's very exciting to take this next step,
>>> and I would be thrilled to have it happen in the ASF - as you can see in
>>> the proposal this effort has broad support from members of MapReduce,
>>> Hive & Pig communities, many of whom are eager to participate and have
>>> already contributed their efforts during the initial prototype.
>>>
>>> I welcome your feedback/discussion and look forward to it!
>>>
>>> thanks,
>>> Arun
>>> (proposed Champion)
>>>
>>> ----
>>>
>>> = Tez =
>>>
>>> == Abstract ==
>>> Tez is an effort to develop a generic application framework which can be
>>> used to process arbitrarily complex data-processing tasks and also a
>>> re-usable set of data-processing primitives which can be used by other
>>> projects.
>>>
>>> == Proposal ==
>>> Tez is a proposal to develop a generic application which can be used to
>>> process complex data-processing task DAGs and runs natively on Apache
>>> Hadoop YARN. YARN is a generic resource-management system on which
>>> currently applications like MapReduce already exist. MapReduce is a
>>> specific, and constrained, DAG - which is not optimal for several
>>> frameworks like Apache Hive and Apache Pig. Furthermore, we propose to
>>> develop a re-usable set of libraries of data-processing primitives such
>>> as sorting, merging, data-shuffling, intermediate data management etc.
>>> which are necessary for Tez which we envision can be used directly by
>>> other projects.
>>>
>>> == Background ==
>>> Apache Hadoop MapReduce has emerged as the assembly-language on which
>>> other frameworks like Apache Pig and Apache Hive have been built.
>>> However, it has been well accepted that MapReduce produces very
>>> constrained task DAGs for each job which results in Apache Pig and
>>> Apache Hive requiring multiple MapReduce jobs for several queries. By
>>> providing a more expressive DAG of tasks for a job, Tez attempts to
>>> provide significantly enhanced data-processing capabilities for projects
>>> like Apache Pig, Apache Hive, Cascading etc.
>>>
>>> == Rationale ==
>>> There is an important gap that Tez fulfills in the Apache Hadoop
>>> ecosystem of allowing for more expressive task DAGs for data-processing
>>> applications such as Apache Pig, Apache Hive, Cascading etc.
>>>
>>> With emergence of Apache Hadoop YARN, there is a strong need for a
>>> common DAG application which can then be shared by Apache Pig, Apache
>>> Hive, Cascading etc.
>>>
>>> == Initial Goals ==
>>> The initial goals for this project are to specify the detailed
>>> requirements and architecture, and then develop the initial
>>> implementation including the DAG ApplicationMaster to run natively
>>> inside Apache Hadoop YARN.
>>>
>>> == Current Status ==
>>> Significant work has been completed to identify the initial requirements
>>> and define the overall system architecture. There is a patch available
>>> in the internal Hortonworks git repository which can act as the initial
>>> seed.
>>>
>>> === Meritocracy ===
>>> We plan to invest in supporting a meritocracy. We will discuss the
>>> requirements in an open forum. Several companies have already expressed
>>> interest in this project, and we intend to invite additional developers
>>> to participate. We will encourage and monitor community participation so
>>> that privileges can be extended to those that contribute.
>>>
>>> === Community ===
>>> The need for a generic DAG application for data processing in the open
>>> source is tremendous, so there is a potential for a very large
>>> community. We believe that Tez's extensible architecture will further
>>> encourage community participation. Also, related Apache projects (eg,
>>> Pig, Hive) have very large and active communities, and we expect that
>>> over time Tez will also attract a large community.
>>>
>>> === Core Developers ===
>>> The developers on the initial committers list include people very
>>> experienced in the Apache Hadoop ecosystem:
>>>
>>> * Alan Gates <gates at apache dot org>
>>> * Arun C Murthy <acmurthy at apache dot org>
>>> * Ashutosh Chauhan <hashutosh at apache dot org>
>>> * Bikas Saha <bikas at apache dot org>
>>> * Chris Douglas <cdouglas at apache dot org>
>>> * Daryn Sharp <daryn at apache dot org>
>>> * Devaraj Das <ddas at apache dot org>
>>> * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>>> * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>>> * Hitesh Shah <hitesh at apache dot org>
>>> * Jason Lowe <jlowe at apache dot org>
>>> * Jean Xu <jeanxu at facebook dot com>
>>> * Jitendra Pandey <jitendra at apache dot org>
>>> * Kevin Wilfong <kevinwilfong at apache dot org>
>>> * Mike Liddell <mike dot lidell at microsoft dot com>
>>> * Namit Jain <namit at apache dot org>
>>> * Owen O'Malley <omalley at apache dot org>
>>> * Robert Evans <bobby at apache dot org>
>>> * Siddharth Seth <sseth at apache dot org>
>>> * Tom White <tomwhite at apache dot org>
>>> * Thomas Graves <tgraves at apache dot org>
>>> * Vikram Dixit <vikram at apache dot org>
>>> * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>>>
>>> We realize that though we have significant employer diversity already,
>>> additional diversity is always better, and we will work
>>> aggressively to recruit developers from additional companies.
>>>
>>> === Alignment ===
>>> The initial committers strongly believe that a standard task DAG
>>> application on Apache Hadoop YARN will gain broader adoption as an open
>>> source, community driven project, where the community can contribute not
>>> only to the core components, but also to a growing collection of
>>> applications which will be based on top of Tez.
>>> Our hope is that the Apache Hive, Apache Pig, Cascading and other
>>> communities will find tremendous value in Tez and will adopt it en
>>> masse.
>>>
>>> == Known Risks ==
>>>
>>> === Orphaned Products ===
>>> The contributors are leading users and vendors in the Apache Hadoop
>>> ecosystem, with significant open source experience, so the risk of being
>>> orphaned is relatively low. The project could be at risk if vendors
>>> decided to change their strategies in the market. In such an event, the
>>> current committers plan to continue working on the project on their own
>>> time, though the progress will likely be slower. We plan to mitigate
>>> this risk by recruiting additional committers.
>>>
>>> === Inexperience with Open Source ===
>>> The initial committers include veteran Apache members (Committers, PMC
>>> members and Apache Members) and other developers who have varying
>>> degrees of experience with open source projects. All have been involved
>>> with source code that has been released under an open source license,
>>> and several also have experience developing code with an open source
>>> development process.
>>>
>>> === Homogenous Developers ===
>>> The initial committers are employed by a number of companies, including
>>> Cloudera, Facebook, Hortonworks, Microsoft and Yahoo. We are committed
>>> to recruiting additional committers from other companies based on their
>>> contributions to the project even though we do have significant
>>> diversity already.
>>>
>>> === Reliance on Salaried Developers ===
>>> It is expected that Tez development will occur on both salaried time and
>>> on volunteer time, after hours. The majority of initial committers are
>>> paid by their employer to contribute to this project. However, they are
>>> all passionate about the project, and we are confident that the project
>>> will continue even if no salaried developers contribute to the project.
>>> We are committed to recruiting additional committers including
>>> non-salaried developers.
>>>
>>> === Relationships with Other Apache Products ===
>>> As mentioned in the Alignment section, Tez is closely integrated with
>>> Hadoop, Hive and Pig in numerous ways. We look forward to collaborating
>>> with those communities, as well as other Apache communities.
>>>
>>> === An Excessive Fascination with the Apache Brand ===
>>> Tez solves a real need for generic task DAG management in the Apache
>>> Hadoop ecosystem, something which has been addressed in a very ad hoc
>>> manner so far by multiple Apache projects. Our rationale for developing
>>> Tez as an Apache project is detailed in the Rationale section. We
>>> believe that the Apache brand and community process will help us attract
>>> more contributors to this project, and help establish ubiquitous APIs.
>>>
>>> == Documentation ==
>>> http://wiki.apache.org/incubator/TezProposal
>>>
>>> == Initial Source ==
>>> Available as a patch.
>>>
>>> == Cryptography ==
>>> Tez will eventually support encryption on the wire. This is not one of
>>> the initial goals, and we do not expect Tez to be a controlled export
>>> item due to the use of encryption.
>>>
>>> == Required Resources ==
>>>
>>> === Mailing List ===
>>> * tez-private
>>> * tez-dev
>>> * tez-user
>>>
>>> === Subversion Directory ===
>>> Git is the preferred source control system: git://git.apache.org/tez
>>>
>>> === Issue Tracking ===
>>>
>>> JIRA Tez (TEZ)
>>>
>>> == Initial Committers ==
>>> * Alan Gates <gates at apache dot org>
>>> * Arun C Murthy <acmurthy at apache dot org>
>>> * Ashutosh Chauhan <hashutosh at apache dot org>
>>> * Bikas Saha <bikas at apache dot org>
>>> * Chris Douglas <cdouglas at apache dot org>
>>> * Daryn Sharp <daryn at apache dot org>
>>> * Devaraj Das <ddas at apache dot org>
>>> * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>>> * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>>> * Hitesh Shah <hitesh at apache dot org>
>>> * Jason Lowe <jlowe at apache dot org>
>>> * Jean Xu <jeanxu at facebook dot com>
>>> * Jitendra Pandey <jitendra at apache dot org>
>>> * Kevin Wilfong <kevinwilfong at apache dot org>
>>> * Mike Liddell <mike dot lidell at microsoft dot com>
>>> * Namit Jain <namit at apache dot org>
>>> * Owen O'Malley <omalley at apache dot org>
>>> * Robert Evans <bobby at apache dot org>
>>> * Siddharth Seth <sseth at apache dot org>
>>> * Tom White <tomwhite at apache dot org>
>>> * Thomas Graves <tgraves at apache dot org>
>>> * Vikram Dixit <vikram at apache dot org>
>>> * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>>>
>>> == Affiliations ==
>>> The initial committers are employees of Cloudera, Facebook, Hortonworks,
>>> Microsoft and Yahoo Inc.
>>>
>>> * Alan Gates - Hortonworks
>>> * Arun C Murthy - Hortonworks
>>> * Ashutosh Chauhan - Hortonworks
>>> * Bikas Saha - Hortonworks
>>> * Chris Douglas - Microsoft
>>> * Daryn Sharp - Yahoo
>>> * Devaraj Das - Hortonworks
>>> * Gopal Vijayaraghavan - Hortonworks
>>> * Gunther Hagleitner - Hortonworks
>>> * Hitesh Shah - Hortonworks
>>> * Jason Lowe - Yahoo
>>> * Jean Xu - Facebook
>>> * Jitendra Pandey - Hortonworks
>>> * Kevin Wilfong - Facebook
>>> * Mike Liddell - Microsoft
>>> * Namit Jain - Facebook
>>> * Owen O'Malley - Hortonworks
>>> * Robert Evans - Yahoo
>>> * Siddharth Seth - Hortonworks
>>> * Tom White - Cloudera
>>> * Thomas Graves - Yahoo
>>> * Vikram Dixit - Hortonworks
>>> * Vinod Kumar Vavilapalli - Hortonworks
>>>
>>> The nominated mentors are employees of Hortonworks,
>>> NASA JPL and Microsoft.
>>>
>>> * Alan Gates - Hortonworks
>>> * Arun C Murthy - Hortonworks
>>> * Chris Douglas - Microsoft
>>> * Chris Mattmann - NASA JPL
>>> * Owen O'Malley - Hortonworks
>>>
>>> == Sponsors ==
>>>
>>> === Champion ===
>>> Arun C Murthy <acmurthy at apache dot org>
>>>
>>> === Nominated Mentors ===
>>> * Alan Gates <gates at apache dot org> - Architect at Hortonworks.
>>>   Committer for Pig.
>>> * Arun C Murthy <acmurthy at apache dot org> - Architect at
>>>   Hortonworks. Committer for Hadoop.
>>> * Chris Douglas <cdouglas at apache dot org> - Sr. Research Engineer at
>>>   Microsoft. Committer for Hadoop.
>>> * Chris Mattmann <mattmann at apache dot org> - Sr. Computer Scientist,
>>>   NASA JPL. Committer for Nutch, OODT and Tika.
>>> * Owen O'Malley <omalley at apache dot org> - Architect at Hortonworks.
>>>   Committer for Hadoop, Ambari.
>>>
>>> === Sponsoring Entity ===
>>> Incubator
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org