At ~25:00 "There is a working prototype of hive which is using tez as the targeted runtime"
Can I get a look at that code? Is it on github?

Edward

On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Answers to some of your questions inlined.
>
> Alan.
>
> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
>
> > There are some points I want to bring up. First, I am on the PMC. Here is
> > something I find relevant:
> >
> > http://www.apache.org/foundation/how-it-works.html
> >
> > ------------------------------
> >
> > The role of the PMC from a Foundation perspective is oversight. The main
> > role of the PMC is not code and not coding - but to ensure that all legal
> > issues are addressed, that procedure is followed, and that each and every
> > release is the product of the community as a whole. That is key to our
> > litigation protection mechanisms.
> >
> > Secondly the role of the PMC is to further the long term development and
> > health of the community as a whole, and to ensure that balanced and wide
> > scale peer review and collaboration does happen. Within the ASF we worry
> > about any community which centers around a few individuals who are
> > working virtually uncontested. We believe that this is detrimental to
> > quality, stability, and robustness of both code and long term social
> > structures.
> >
> > --------------------------------
> >
> > https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> >
> > -------------------------------------
> >
> > All other decisions happen on the dev list, discussions on the private
> > list are kept to a minimum.
> >
> > "If it didn't happen on the dev list, it didn't happen" - which leads to:
> >
> > a) Elections of committers and PMC members are published on the dev list
> > once finalized.
> >
> > b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
> > soon as they have impact on the project, code or community.
> > ---------------------------------
> >
> > https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let
> > their be Tez" has not been +1'ed by any committer. It was never discussed
> > on the dev or the user list (as far as I can tell).
>
> As all JIRA creations and updates are sent to dev@hive, creating a JIRA
> is de facto posting to the list.
>
> > As a PMC member I feel we need more discussion on Tez on the dev list
> > along with a wiki-fied design document. Topics of discussion should
> > include:
>
> I talked with Gunther and he's working on posting a design doc on the
> wiki. He has a PDF on the JIRA but he doesn't have write permissions yet
> on the wiki.
>
> > 1) What is tez?
>
> In Hadoop 2.0, YARN opens up the ability to have multiple execution
> frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as the
> only execution option. Tez is an effort to build an execution engine that
> is optimized for relational data processing, such as Hive and Pig.
>
> The biggest change here is to move away from only Map and Reduce as
> processing options and to allow alternate combinations of processing, such
> as map -> reduce -> reduce or tasks that take multiple inputs or shuffles
> that avoid sorting when it isn't needed.
>
> For a good intro to Tez, see Arun's presentation on it at the recent
> Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
>
> > 2) How is tez different from oozie, http://code.google.com/p/hop/,
> > http://cs.brown.edu/~backman/cmr.html , and other DAG and/or streaming
> > map reduce tools/frameworks? Why should we use this and not those?
>
> Oozie is a completely different thing. Oozie is a workflow engine and a
> scheduler. Its core competencies are the ability to coordinate workflows
> of disparate job types (MR, Pig, Hive, etc.) and to schedule them.
> It is not intended as an execution engine for apps such as Pig and Hive.
>
> I am not familiar with these other engines, but the short answer is that
> Tez is built to work on YARN, which works well for Hive since it is tied to
> Hadoop.
>
> > 3) When can we expect the first tez release?
>
> I don't know, but I hope sometime this fall.
>
> > 4) How much effort is involved in integrating hive and tez?
>
> Covered in the design doc.
>
> > 5) Who is ready to commit to this effort?
>
> I'll let people speak for themselves on that one.
>
> > 6) can we expect this work to be done in one hive release?
>
> Unlikely. Initial integration will be done in one release, but as Tez is
> a new project I expect it will be adding features in the future that Hive
> will want to take advantage of.
>
> > In my opinion we should not start any work on this tez-hive until these
> > questions are answered to the satisfaction of the hive developers.
>
> Can we change this to "not commit patches"? We can't tell willing people
> not to work on it.
>
> > On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >
> >> >> The Hive bylaws,
> >> >> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what
> >> >> votes are needed for what. I don't see anything there about needing 3
> >> >> +1s for a branch. Branching would seem to fall under code change, which
> >> >> requires one vote and a minimum length of 1 day.
> >>
> >> You could argue that all you need is one +1 to create a branch, but this
> >> is more than a branch. If you are talking about something that is:
> >> 1) going to cause major re-factoring of critical pieces of hive like
> >> ExecDriver and MapRedTask
> >> 2) going to be very disruptive to the efforts of other committers
> >> 3) something that may be a major architectural change
> >>
> >> Getting the project on board with the idea is a good idea.
> >>
> >> Now I want to point something out.
> >> Here are some recent initiatives in hive:
> >>
> >> 1) At one point there was a big initiative to "support oracle". After the
> >> initial work, there are patches in Jira, but no one seems to care about
> >> oracle support.
> >> 2) Another such decision was this "support windows" one; there are
> >> probably 4 windows patches awaiting review.
> >> 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
> >> support perspective is, but every couple of weeks we get another jira
> >> about something not working/testing on one of those versions; seems like
> >> several builds are broken.
> >> 4) Hive-storage handler: after the initial implementation no one cares to
> >> review any other storage handler implementation, 3 patches there or more.
> >> I could not even find anyone willing to review the cassandra storage
> >> handler I spent months on.
> >> 5) ORC, Vectorization
> >> 6) Windowing: committed, numerous check-style violations.
> >>
> >> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
> >> are spread very thin, and embarking on another side project not involved
> >> with core hive seems like the wrong direction at the moment.
> >>
> >> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <ga...@hortonworks.com> wrote:
> >>
> >>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> >>>
> >>>> I have started to see several refactoring patches around tez.
> >>>> https://issues.apache.org/jira/browse/HIVE-4843
> >>>>
> >>>> This is the only mention on the hive list I can find with tez:
> >>>> "Makes sense. I will create the branch soon.
> >>>>
> >>>> Thanks,
> >>>> Ashutosh
> >>>>
> >>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <ghagleit...@hortonworks.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660,
> >>>>> design doc has already been uploaded - any feedback will be much
> >>>>> appreciated). This will be a fair amount of work that will take time
> >>>>> to stabilize/test. I'd like to propose creating a branch in order to
> >>>>> be able to do this incrementally and collaboratively. In order to
> >>>>> progress rapidly with this, I would also like to go
> >>>>> "commit-then-review".
> >>>>>
> >>>>> Thanks,
> >>>>> Gunther.
> >>>>> "
> >>>>
> >>>> These refactorings are largely destructive to a number of bugs and
> >>>> language improvements in hive. These are language improvements and bug
> >>>> fixes that have been sitting in Jira for quite some time now, marked
> >>>> patch-available and waiting for review.
> >>>>
> >>>> There are a few things I want to point out:
> >>>> 1) Normally we create design docs in our wiki (which it is not)
> >>>> 2) Normally when the change is significantly complex we get multiple
> >>>> committers to comment on it (which we did not)
> >>>> On point 2 no one -1'ed the branch, but this is really something that
> >>>> should have required a +1 from 3 committers.
> >>>
> >>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws ,
> >>> lay out what votes are needed for what. I don't see anything there about
> >>> needing 3 +1s for a branch. Branching would seem to fall under code
> >>> change, which requires one vote and a minimum length of 1 day.
> >>>
> >>>> I for one am not completely sold on Tez.
> >>>> http://incubator.apache.org/projects/tez.html.
> >>>> "directed-acyclic-graph of tasks for processing data" this description
> >>>> sounds like many things which have never become popular.
> >>>> One to think of is oozie "Oozie Workflow jobs are Directed Acyclical
> >>>> Graphs (DAGs) of actions.". I am sure I can find a number of
> >>>> libraries/frameworks that make this same claim. In general I do not
> >>>> feel like we have done our homework and pre-requisites to justify all
> >>>> this work. If we have done the homework, I am sure that it has not been
> >>>> communicated and accepted by hive developers at large.
> >>>
> >>> A request for better documentation on Tez and a project road map seems
> >>> totally reasonable.
> >>>
> >>>> If we have a branch, why are we also committing on trunk? Scanning
> >>>> through the tez doc, I keep finding language like "minimal changes to
> >>>> the planner", yet there are ALREADY lots of large changes going on!
> >>>>
> >>>> Really none of the above would bother me except for the fact that these
> >>>> "minimal changes" are causing many "patch available" ready-for-review
> >>>> bugs and core hive features to need to be rebased.
> >>>>
> >>>> I am sure I have mentioned this before, but I have to spend 12+ hours
> >>>> to test a single patch on my laptop. A few days ago I was testing a new
> >>>> core hive feature. After all the tests passed and before I was able to
> >>>> commit, someone unleashed a tez patch on trunk which caused the thing I
> >>>> was testing for 12 hours to need to be rebased.
> >>>>
> >>>> I'm not cool with this. Next time that happens to me I will seriously
> >>>> consider reverting the patch. Bug fixes and new hive features are more
> >>>> important to me than integrating with incubator projects.
> >>>
> >>> (With my Apache member hat on) Reverting patches that aren't breaking
> >>> the build is considered very bad form in Apache.
> >>> It does make sense to request that when people are going to commit a
> >>> patch that will break many other patches they first give a few hours of
> >>> notice so people can say something if they're about to commit another
> >>> patch and avoid your fate of needing to rerun the tests. The other thing
> >>> is we need to get the automated build of patches working on Hive so
> >>> committers are not forced to run all of the tests themselves. We are
> >>> working on it, but we're not there yet.
> >>>
> >>> Alan.