Re: Tez branch and tez based patches

Edward Capriolo Sat, 20 Jul 2013 08:11:56 -0700

I agree we are getting into a grey area with the term "disruptive". For
reference (I have not been doing this all the time, my bad), we are
supposed to +1 and wait a day.


>> I am not familiar with these other engines, but the short answer is that
>> Tez is built to work on YARN, which works well for Hive since it is tied
>> to Hadoop

I understand what you are saying here; YARN support is a plus. However, the
rest of the answer is relevant to the discussion.

There are already frameworks like Spark that are semi-popular:
http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data
There are also other frameworks like S4 (http://incubator.apache.org/s4/) or
Storm.

A big part of making a design decision is doing a competitive analysis,
usually by asking yourself "What else is already out there for this?" or "Can
this be done other ways?"
I do want to be convinced that we are not locking into Tez too early with tunnel
vision. Possibly we should be thinking about how to build Hive in such a way
that many different frameworks could plug in. In other words, I need convincing
that Tez is the best choice, since many people are claiming an MRR-type
solution.

I will watch the video you posted and study the material myself as well.


On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan <[email protected]> wrote:

> On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo <[email protected]
> >wrote:
>
> >
> > "In my opinion we should limit the amount of tez related optimizations to
> > and trunk" Refactoring that cleans up code is good, but as you have
> pointed
> > out there wont be a tez release until sometime this fall, and this branch
> > will be open for an extended period of time. Thus code cleanups and other
> > tez related refactoring does not need to be disruptive to trunk.
>
>
> I agree Tez-specific changes need not go into trunk. But general
> refactoring and code cleanup need to happen on trunk as and when someone
> is willing to work on them. We have to continually improve our code
> quality. Code maintainability and readability are a priority. Without that,
> code quality suffers and new contributors are discouraged because the
> code is unnecessarily complicated. SemanticAnalyzer is an 11K-line class. We
> need to simplify it. A patch like HIVE-4811, which tackled it, is a welcome
> change. The exec package is convoluted and mixes up runtime operators and
> drivers for the runtime. That's a welcome patch because it makes it much
> easier to read and reason about that piece of code. HIVE-4825 is another
> example which improves modularity of the code. For contributors who are exposed
> to Hive for the first time, it will be easier to follow the code.
>
> Rather than being disruptive to trunk, they are constructive for trunk, and I
> am glad people are choosing to work on that. Tez or no Tez, Hive is better off
> with these patches.
>
> Thanks,
> Ashutosh
>
>
>
> >  On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <[email protected]>
> > wrote:
> >
> > > Answers to some of your questions inlined.
> > >
> > > Alan.
> > >
> > > On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
> > >
> > > > There are some points I want to bring up. First, I am on the PMC.
> Here
> > is
> > > > something I find relevant:
> > > >
> > > > http://www.apache.org/foundation/how-it-works.html
> > > >
> > > > ------------------------------
> > > >
> > > > The role of the PMC from a Foundation perspective is oversight. The
> > main
> > > > role of the PMC is not code and not coding - but to ensure that all
> > legal
> > > > issues are addressed, that procedure is followed, and that each and
> > every
> > > > release is the product of the community as a whole. That is key to
> our
> > > > litigation protection mechanisms.
> > > >
> > > > Secondly the role of the PMC is to further the long term development
> > and
> > > > health of the community as a whole, and to ensure that balanced and
> > wide
> > > > scale peer review and collaboration does happen. Within the ASF we
> > worry
> > > > about any community which centers around a few individuals who are
> > > working
> > > > virtually uncontested. We believe that this is detrimental to
> quality,
> > > > stability, and robustness of both code and long term social
> structures.
> > > >
> > > > --------------------------------
> > > >
> > > >
> > >
> >
> https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> > > >
> > > > -------------------------------------
> > > >
> > > > All other decisions happen on the dev list, discussions on the
> private
> > > list
> > > > are kept to a minimum.
> > > >
> > > > "If it didn't happen on the dev list, it didn't happen" - which leads
> > to:
> > > >
> > > > a) Elections of committers and PMC members are published on the dev
> > list
> > > > once finalized.
> > > >
> > > > b) Out-of-band discussions (IRC etc.) are summarized on the dev list
> as
> > > > soon as they have impact on the project, code or community.
> > > > ---------------------------------
> > > >
> > > > https://issues.apache.org/jira/browse/HIVE-4660 ironically titled
> "Let
> > > > > their be Tez" has not been +1'ed by any committer. It was never
> discussed
> > > on
> > > > the dev or the user list (as far as I can tell).
> > >
> > > As all JIRA creations and updates are sent to dev@hive, creating a
> JIRA
> > > is de facto posting to the list.
> > >
> > > >
> > > > As a PMC member I feel we need more discussion on Tez on the dev list
> > > along
> > > > with a wiki-fied design document. Topics of discussion should
> include:
> > >
> > > I talked with Gunther and he's working on posting a design doc on the
> > > wiki.  He has a PDF on the JIRA but he doesn't have write permissions
> yet
> > > on the wiki.
> > >
> > > >
> > > > 1) What is tez?
> > > In Hadoop 2.0, YARN opens up the ability to have multiple execution
> > > frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as
> the
> > > only execution option.  Tez is an effort to build an execution engine
> > that
> > > is optimized for relational data processing, such as Hive and Pig.
> > >
> > > The biggest change here is to move away from only Map and Reduce as
> > > processing options and to allow alternate combinations of processing,
> > such
> > > as map -> reduce -> reduce or tasks that take multiple inputs or
> shuffles
> > > that avoid sorting when it isn't needed.
> > >
> > > For a good intro to Tez, see Arun's presentation on it at the recent
> > > Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> > > http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212
> )
> > > >
> > > > 2) How is tez different from oozie, http://code.google.com/p/hop/,
> > > > http://cs.brown.edu/~backman/cmr.html , and other DAG and or
> streaming
> > > map
> > > > reduce tools/frameworks? Why should we use this and not those?
> > >
> > > Oozie is a completely different thing.  Oozie is a workflow engine and
> a
> > > scheduler.  Its core competencies are the ability to coordinate
> > workflows
> > > of disparate job types (MR, Pig, Hive, etc.) and to schedule them.  It
> is
> > > not intended as an execution engine for apps such as Pig and Hive.
> > >
> > > I am not familiar with these other engines, but the short answer is
> that
> > > Tez is built to work on YARN, which works well for Hive since it is
> tied
> > to
> > > Hadoop.
> > > >
> > > > 3) When can we expect the first tez release?
> > > I don't know, but I hope sometime this fall.
> > >
> > > >
> > > > 4) How much effort is involved in integrating hive and tez?
> > > Covered in the design doc.
> > >
> > > >
> > > > 5) Who is ready to commit to this effort?
> > > I'll let people speak for themselves on that one.
> > >
> > > >
> > > > 6) can we expect this work to be done in one hive release?
> > > Unlikely.  Initial integration will be done in one release, but as Tez
> is
> > > a new project I expect it will be adding features in the future that
> Hive
> > > will want to take advantage of.
> > >
> > > >
> > > > In my opinion we should not start any work on this tez-hive until
> these
> > > > questions are answered to the satisfaction of the hive developers.
> > >
> > > Can we change this to "not commit patches"?  We can't tell willing
> people
> > > not to work on it.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <
> > [email protected]
> > > >wrote:
> > > >
> > > >>
> > > >>>> The Hive bylaws,
> > > >> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out
> > what
> > > >> votes are needed for what.  I don't see anything there about
> needing 3
> > > +1s
> > > >> for a branch.  Branching >>would seem to fall under code change,
> which
> > > >> requires one vote and a minimum length of 1 day.
> > > >>
> > > >> You could argue that all you need is one +1 to create a branch, but
> > this
> > > >> is more than a branch. If you are talking about something that is:
> > > >> 1) going to cause major re-factoring of critical pieces of hive like
> > > >> ExecDriver and MapRedTask
> > > >> 2) going to be very disruptive to the efforts of other committers
> > > >> 3) something that may be a major architectural change
> > > >>
> > > >> Getting the project on board with the idea is a good idea.
> > > >>
> > > >> Now I want to point something out. Here are some recent initiatives
> in
> > > >> hive:
> > > >>
> > > >> 1) At one point there was a big initiative to "support oracle" after
> > the
> > > >> initial work, there are patches in Jira no one seems to care about
> > > oracle
> > > >> support.
> > > >> 2) Another such decision was this "support windows" one; there are
> > > >> probably 4 windows patches waiting reviews.
> > > >> 3) I still have no clue what the official hadoop1 hadoop2, hadoop
> 0.23
> > > >> support prospective is, but every couple weeks we get another jira
> > about
> > > >> something not working/testing on one of those versions, seems like
> > > several
> > > >> builds are broken.
> > > >> 4) Hive-storage handler, after the initial implementation no one
> cares
> > > to
> > > >> review any other storage handler implementation, 3 patches there or
> > > more,
> > > >> could not even find anyone willing to review the cassandra storage
> > > handler
> > > >> I spent months on.
> > > >> 5) ORC, Vectorization
> > > >> 6) Windowing: committed, numerous check-style violations.
> > > >>
> > > >> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active
> committers.
> > > We
> > > >> are spread very thin, and embarking on another side project not
> > involved
> > > >> with core hive seems like the wrong direction at the moment.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <[email protected]>
> > > wrote:
> > > >>
> > > >>>
> > > >>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> > > >>>
> > > >>>> I have started to see several refactoring patches around tez.
> > > >>>> https://issues.apache.org/jira/browse/HIVE-4843
> > > >>>>
> > > >>>> This is the only mention on the hive list I can find with tez:
> > > >>>> "Makes sense. I will create the branch soon.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Ashutosh
> > > >>>>
> > > >>>>
> > > >>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
> > > >>>> [email protected]> wrote:
> > > >>>>
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> I am starting to work on integrating Tez into Hive (see
> HIVE-4660,
> > > >>> design
> > > >>>>> doc has already been uploaded - any feedback will be much
> > > appreciated).
> > > >>>>> This will be a fair amount of work that will take time to
> > > >>> stabilize/test.
> > > >>>>> I'd like to propose creating a branch in order to be able to do
> > this
> > > >>>>> incrementally and collaboratively. In order to progress rapidly
> > with
> > > >>> this,
> > > >>>>> I would also like to go "commit-then-review".
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Gunther.
> > > >>>>> "
> > > >>>>
> > > >>>> These refactor-ings are largely destructive to a number of bugs
> and
> > > >>>> language improvements in hive. The language improvements and bug
> > fixes
> > > >>> that
> > > >>>> have been sitting in Jira for quite some time now marked
> > > patch-available
> > > >>>> and are waiting for review.
> > > >>>>
> > > >>>> There are a few things I want to point out:
> > > >>>> 1) Normally we create design docs in our wiki (which it is not)
> > > >>>> 2) Normally when the change is significantly complex we get
> multiple
> > > >>>> committers to comment on it (which we did not)
> > > >>>> On point 2 no one -1  the branch, but this is really something
> that
> > > >>> should
> > > >>>> have required a +1 from 3 committers.
> > > >>>
> > > >>> The Hive bylaws,
> > > https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what
> > > votes are needed for what.  I don't see anything there about
> > > >>> needing 3 +1s for a branch.  Branching would seem to fall under
> code
> > > >>> change, which requires one vote and a minimum length of 1 day.
> > > >>>
> > > >>>>
> > > >>>> I for one am not completely sold on Tez.
> > > >>>> http://incubator.apache.org/projects/tez.html.
> > > >>>> "directed-acyclic-graph of tasks for processing data" this
> > description
> > > >>>> sounds like many things which have never become popular. One to
> > think
> > > >>> of is
> > > >>>> oozie "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
> > > >>>> actions.". I am sure I can find a number of libraries/frameworks
> > that
> > > >>> make
> > > >>>> this same claim. In general I do not feel like we have done our
> > > homework
> > > >>>> and pre-requisites to justify all this work. If we have done the
> > > >>> homework,
> > > >>>> I am sure that it has not been communicated and accepted by hive
> > > >>> developers
> > > >>>> at large.
> > > >>>
> > > >>> A request for better documentation on Tez and a project road map
> > seems
> > > >>> totally reasonable.
> > > >>>
> > > >>>>
> > > >>>> If we have a branch, why are we also committing on trunk? Scanning
> > > >>> through
> > > >>>> the tez doc, I keep finding language like
> "minimal
> > > >>> changes
> > > >>>> to the planner" yet, there is ALREADY lots of large changes going
> > on!
> > > >>>>
> > > >>>> Really none of the above would bother me except for the fact that
> > > these
> > > >>>> "minimal changes" are causing many "patch available"
> > ready-for-review
> > > >>> bugs
> > > >>>> and core hive features to need to be re based.
> > > >>>>
> > > >>>> I am sure I have mentioned this before, but I have to spend 12+
> > hours
> > > to
> > > >>>> test a single patch on my laptop. A few days ago I was testing a
> new
> > > >>> core
> > > >>>> hive feature. After all the tests passed and before I was able to
> > > >>> commit,
> > > >>>> someone unleashed a tez patch on trunk which caused the thing I
> was
> > > >>> testing
> > > >>>> for 12 hours to need to be rebased.
> > > >>>>
> > > >>>>
> > > >>>> I'm not cool with this. Next time that happens to me I will
> seriously
> > > >>>> consider reverting the patch. Bug fixes and new hive features are
> > more
> > > >>>> important to me than integrating with incubator projects.
> > > >>>
> > > >>> (With my Apache member hat on)  Reverting patches that aren't
> > breaking
> > > >>> the build is considered very bad form in Apache.  It does make
> sense
> > to
> > > >>> request that when people are going to commit a patch that will
> break
> > > many
> > > >>> other patches they first give a few hours of notice so people can
> say
> > > >>> something if they're about to commit another patch and avoid your
> > fate
> > > of
> > > >>> needing to rerun the tests.  The other thing is we need to get
> > the
> > > >>> automated build of patches working on Hive so committers are forced
> > to
> > > run
> > > >>> all of the tests themselves.  We are working on it, but we're not
> > > there yet.
> > > >>>
> > > >>> Alan.
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
>
