I agree we are getting into a grey area with the term "disruptive". For reference (I have not been doing this all the time, bad on me), we are supposed to +1 and wait a day.

>> I am not familiar with these other engines, but the short answer is that
>> Tez is built to work on YARN, which works well for Hive since it is tied
>> to Hadoop

I understand what you are saying here; YARN support is a plus. However, the
rest of the answer is relevant to the discussion. There are already
frameworks like Spark that are semi-popular
(http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data).
There are also other frameworks like S4 (http://incubator.apache.org/s4/) or
Storm.

A big part of making a design decision is doing a competitive analysis,
usually by asking yourself "What else for this is already out there?" or
"Can this be done other ways?"

I do want to be convinced that we are not locking into Tez too early with
tunnel vision. Possibly we should be thinking about how to build Hive in
such a way that many different frameworks could plug in. In other words, I
need convincing that Tez is the best choice, since many people are claiming
an MRR-type solution.

I will watch the video you posted and study the material myself as well.

On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan <hashut...@apache.org> wrote:

> On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> > "In my opinion we should limit the amount of tez related optimizations to
> > and trunk" Refactoring that cleans up code is good, but as you have
> > pointed out there won't be a tez release until sometime this fall, and
> > this branch will be open for an extended period of time. Thus code
> > cleanups and other tez related refactoring does not need to be disruptive
> > to trunk.
>
> I agree Tez specific changes need not go in trunk. But general
> refactoring and code cleanup needs to happen on trunk as and when someone
> is willing to work on those. We have to continually improve our code
> quality. Code maintainability and readability is a priority.
> Without that, code quality suffers and discourages new contributors from
> contributing because code is unnecessarily complicated. SemanticAnalyzer
> is an 11K-line class. We need to simplify it. A patch like HIVE-4811 is a
> welcome change which tackled it. The exec package is convoluted and mixes
> up runtime operators and drivers for runtime. That's a welcome patch
> because it makes it much easier to read and reason about that piece of
> code. HIVE-4825 is another example which improves modularity of code. For
> contributors who are exposed to Hive for the first time it will be easier
> to follow the code.
>
> Rather than disruptive to trunk, they are constructive for trunk and I am
> glad people are choosing to work on that. Tez or no Tez, Hive is better
> off with these patches.
>
> Thanks,
> Ashutosh
>
> > On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <ga...@hortonworks.com> wrote:
> >
> > > Answers to some of your questions inlined.
> > >
> > > Alan.
> > >
> > > On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
> > >
> > > > There are some points I want to bring up. First, I am on the PMC.
> > > > Here is something I find relevant:
> > > >
> > > > http://www.apache.org/foundation/how-it-works.html
> > > >
> > > > ------------------------------
> > > >
> > > > The role of the PMC from a Foundation perspective is oversight. The
> > > > main role of the PMC is not code and not coding - but to ensure that
> > > > all legal issues are addressed, that procedure is followed, and that
> > > > each and every release is the product of the community as a whole.
> > > > That is key to our litigation protection mechanisms.
> > > >
> > > > Secondly the role of the PMC is to further the long term development
> > > > and health of the community as a whole, and to ensure that balanced
> > > > and wide scale peer review and collaboration does happen.
> > > > Within the ASF we worry about any community which centers around a
> > > > few individuals who are working virtually uncontested. We believe
> > > > that this is detrimental to quality, stability, and robustness of
> > > > both code and long term social structures.
> > > >
> > > > --------------------------------
> > > >
> > > > https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> > > >
> > > > -------------------------------------
> > > >
> > > > All other decisions happen on the dev list, discussions on the
> > > > private list are kept to a minimum.
> > > >
> > > > "If it didn't happen on the dev list, it didn't happen" - which
> > > > leads to:
> > > >
> > > > a) Elections of committers and PMC members are published on the dev
> > > > list once finalized.
> > > >
> > > > b) Out-of-band discussions (IRC etc.) are summarized on the dev list
> > > > as soon as they have impact on the project, code or community.
> > > >
> > > > ---------------------------------
> > > >
> > > > https://issues.apache.org/jira/browse/HIVE-4660 ironically titled
> > > > "Let there be Tez" has not been +1'ed by any committer. It was never
> > > > discussed on the dev or the user list (as far as I can tell).
> > >
> > > As all JIRA creations and updates are sent to dev@hive, creating a
> > > JIRA is de facto posting to the list.
> > >
> > > > As a PMC member I feel we need more discussion on Tez on the dev
> > > > list along with a wiki-fied design document. Topics of discussion
> > > > should include:
> > >
> > > I talked with Gunther and he's working on posting a design doc on the
> > > wiki. He has a PDF on the JIRA but he doesn't have write permissions
> > > yet on the wiki.
> > >
> > > > 1) What is tez?
> > >
> > > In Hadoop 2.0, YARN opens up the ability to have multiple execution
> > > frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as
> > > the only execution option.
> > > Tez is an effort to build an execution engine that is optimized for
> > > relational data processing, such as Hive and Pig.
> > >
> > > The biggest change here is to move away from only Map and Reduce as
> > > processing options and to allow alternate combinations of processing,
> > > such as map -> reduce -> reduce or tasks that take multiple inputs or
> > > shuffles that avoid sorting when it isn't needed.
> > >
> > > For a good intro to Tez, see Arun's presentation on it at the recent
> > > Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> > > http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212 )
> > >
> > > > 2) How is tez different from oozie, http://code.google.com/p/hop/,
> > > > http://cs.brown.edu/~backman/cmr.html , and other DAG and/or
> > > > streaming map reduce tools/frameworks? Why should we use this and
> > > > not those?
> > >
> > > Oozie is a completely different thing. Oozie is a workflow engine and
> > > a scheduler. Its core competencies are the ability to coordinate
> > > workflows of disparate job types (MR, Pig, Hive, etc.) and to schedule
> > > them. It is not intended as an execution engine for apps such as Pig
> > > and Hive.
> > >
> > > I am not familiar with these other engines, but the short answer is
> > > that Tez is built to work on YARN, which works well for Hive since it
> > > is tied to Hadoop.
> > >
> > > > 3) When can we expect the first tez release?
> > >
> > > I don't know, but I hope sometime this fall.
> > >
> > > > 4) How much effort is involved in integrating hive and tez?
> > >
> > > Covered in the design doc.
> > >
> > > > 5) Who is ready to commit to this effort?
> > >
> > > I'll let people speak for themselves on that one.
> > >
> > > > 6) can we expect this work to be done in one hive release?
> > >
> > > Unlikely.
> > > Initial integration will be done in one release, but as Tez is a new
> > > project I expect it will be adding features in the future that Hive
> > > will want to take advantage of.
> > >
> > > > In my opinion we should not start any work on this tez-hive until
> > > > these questions are answered to the satisfaction of the hive
> > > > developers.
> > >
> > > Can we change this to "not commit patches"? We can't tell willing
> > > people not to work on it.
> > >
> > > > On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> > > >
> > > >>>> The Hive bylaws,
> > > >>>> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out
> > > >>>> what votes are needed for what. I don't see anything there about
> > > >>>> needing 3 +1s for a branch. Branching would seem to fall under
> > > >>>> code change, which requires one vote and a minimum length of 1 day.
> > > >>
> > > >> You could argue that all you need is one +1 to create a branch, but
> > > >> this is more than a branch. If you are talking about something that
> > > >> is:
> > > >> 1) going to cause major re-factoring of critical pieces of hive like
> > > >> ExecDriver and MapRedTask
> > > >> 2) going to be very disruptive to the efforts of other committers
> > > >> 3) something that may be a major architectural change
> > > >>
> > > >> Getting the project on board with the idea is a good idea.
> > > >>
> > > >> Now I want to point something out. Here are some recent initiatives
> > > >> in hive:
> > > >>
> > > >> 1) At one point there was a big initiative to "support oracle".
> > > >> After the initial work, there are patches in Jira and no one seems
> > > >> to care about oracle support.
> > > >> 2) Another such decision was this "support windows" one; there are
> > > >> probably 4 windows patches waiting for review.
> > > >> 3) I still have no clue what the official hadoop1, hadoop2, hadoop
> > > >> 0.23 support perspective is, but every couple of weeks we get
> > > >> another jira about something not working/testing on one of those
> > > >> versions; it seems like several builds are broken.
> > > >> 4) Hive-storage handler: after the initial implementation no one
> > > >> cares to review any other storage handler implementation. There are
> > > >> 3 patches there or more; I could not even find anyone willing to
> > > >> review the cassandra storage handler I spent months on.
> > > >> 5) OCR, Vectorization
> > > >> 6) Windowing: committed, numerous check-style violations.
> > > >>
> > > >> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active
> > > >> committers. We are spread very thin, and embarking on another side
> > > >> project not involved with core hive seems like the wrong direction
> > > >> at the moment.
> > > >>
> > > >> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <ga...@hortonworks.com> wrote:
> > > >>
> > > >>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> > > >>>
> > > >>>> I have started to see several refactoring patches around tez.
> > > >>>> https://issues.apache.org/jira/browse/HIVE-4843
> > > >>>>
> > > >>>> This is the only mention on the hive list I can find with tez:
> > > >>>> "Makes sense. I will create the branch soon.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Ashutosh
> > > >>>>
> > > >>>>
> > > >>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <ghagleit...@hortonworks.com> wrote:
> > > >>>>
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> I am starting to work on integrating Tez into Hive (see
> > > >>>>> HIVE-4660, design doc has already been uploaded - any feedback
> > > >>>>> will be much appreciated).
> > > >>>>> This will be a fair amount of work that will take time to
> > > >>>>> stabilize/test. I'd like to propose creating a branch in order
> > > >>>>> to be able to do this incrementally and collaboratively. In
> > > >>>>> order to progress rapidly with this, I would also like to go
> > > >>>>> "commit-then-review".
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Gunther.
> > > >>>>> "
> > > >>>>
> > > >>>> These refactorings are largely destructive to a number of bug
> > > >>>> fixes and language improvements in hive: improvements and fixes
> > > >>>> that have been sitting in Jira for quite some time, marked
> > > >>>> patch-available and waiting for review.
> > > >>>>
> > > >>>> There are a few things I want to point out:
> > > >>>> 1) Normally we create design docs in our wiki (which it is not)
> > > >>>> 2) Normally when the change is significantly complex we get
> > > >>>> multiple committers to comment on it (which we did not)
> > > >>>> On point 2, no one -1'ed the branch, but this is really something
> > > >>>> that should have required a +1 from 3 committers.
> > > >>>
> > > >>> The Hive bylaws,
> > > >>> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out
> > > >>> what votes are needed for what. I don't see anything there about
> > > >>> needing 3 +1s for a branch. Branching would seem to fall under
> > > >>> code change, which requires one vote and a minimum length of 1 day.
> > > >>>
> > > >>>> I for one am not completely sold on Tez.
> > > >>>> http://incubator.apache.org/projects/tez.html
> > > >>>> "directed-acyclic-graph of tasks for processing data": this
> > > >>>> description sounds like many things which have never become
> > > >>>> popular. One to think of is oozie: "Oozie Workflow jobs are
> > > >>>> Directed Acyclical Graphs (DAGs) of actions."
> > > >>>> I am sure I can find a number of libraries/frameworks that make
> > > >>>> this same claim. In general I do not feel like we have done our
> > > >>>> homework and pre-requisites to justify all this work. If we have
> > > >>>> done the homework, I am sure that it has not been communicated
> > > >>>> and accepted by hive developers at large.
> > > >>>
> > > >>> A request for better documentation on Tez and a project road map
> > > >>> seems totally reasonable.
> > > >>>
> > > >>>> If we have a branch, why are we also committing on trunk?
> > > >>>> Scanning through the tez doc, I keep finding language like
> > > >>>> "minimal changes to the planner", yet there are ALREADY lots of
> > > >>>> large changes going on!
> > > >>>>
> > > >>>> Really none of the above would bother me except for the fact that
> > > >>>> these "minimal changes" are causing many "patch available"
> > > >>>> ready-for-review bugs and core hive features to need to be
> > > >>>> rebased.
> > > >>>>
> > > >>>> I am sure I have mentioned this before, but I have to spend 12+
> > > >>>> hours to test a single patch on my laptop. A few days ago I was
> > > >>>> testing a new core hive feature. After all the tests passed and
> > > >>>> before I was able to commit, someone unleashed a tez patch on
> > > >>>> trunk which caused the thing I was testing for 12 hours to need
> > > >>>> to be rebased.
> > > >>>>
> > > >>>> I'm not cool with this. Next time that happens to me I will
> > > >>>> seriously consider reverting the patch. Bug fixes and new hive
> > > >>>> features are more important to me than integrating with incubator
> > > >>>> projects.
> > > >>>
> > > >>> (With my Apache member hat on) Reverting patches that aren't
> > > >>> breaking the build is considered very bad form in Apache.
> > > >>> It does make sense to request that when people are going to commit
> > > >>> a patch that will break many other patches they first give a few
> > > >>> hours of notice so people can say something if they're about to
> > > >>> commit another patch and avoid your fate of needing to rerun the
> > > >>> tests. The other thing is we need to get the automated build of
> > > >>> patches working on Hive so committers are forced to run all of the
> > > >>> tests themselves. We are working on it, but we're not there yet.
> > > >>>
> > > >>> Alan.