Re: Tez branch and tez based patches

Gunther Hagleitner Mon, 22 Jul 2013 17:09:14 -0700

I have finally gotten access to the wiki and added the design doc:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez


I've also added links to it from the jira and in general overhauled the
design. Please let me know if you feel there's still stuff missing from the
document.

>> Possibly we should be thinking on how to build hive in such a way
>> that many different frameworks could plug in.

I believe that the proposed design and refactoring put you on that path.
I'm not introducing layer upon layer of abstraction without a specific use
case in mind, but at a high level you would go through similar steps:

Exec layer:
- Define your own Task classes
- If you can reuse the operator pipeline, define your own replacement for
ExecMapper/ExecReducer (glue code to drive records through the pipeline)
- Operators: You might have to add specific operators for your framework

Planning:
- Define your own work classes (or reuse existing ones). These abstractly
encapsulate all input/meta info necessary to execute.
- Define your own *Compiler to translate either the logical plan or
physical plan to a graph of Tasks. This might include specific additional
optimizations.
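
A minimal sketch of how those steps could fit together, in Java. Every
type name below (SketchWork, SketchOperator, SketchRecordProcessor,
SketchTask, compile) is hypothetical and made up for illustration; none
of them are Hive's actual classes or APIs, and a real plan would be a
graph rather than the simple chain assumed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch only: these are NOT Hive classes.
final class PluggableEngineSketch {

    /** Planning: abstractly encapsulates the input/meta info needed to execute. */
    static final class SketchWork {
        final String name;
        final List<String> input;
        SketchWork(String name, List<String> input) {
            this.name = name;
            this.input = input;
        }
    }

    /** Stand-in for a reusable operator: one record in, one record out. */
    interface SketchOperator extends Function<String, String> {}

    /** Glue code (the role ExecMapper/ExecReducer play): drives records through the pipeline. */
    static final class SketchRecordProcessor {
        private final List<SketchOperator> pipeline;
        SketchRecordProcessor(List<SketchOperator> pipeline) {
            this.pipeline = pipeline;
        }
        String process(String record) {
            String current = record;
            for (SketchOperator op : pipeline) {
                current = op.apply(current);
            }
            return current;
        }
    }

    /** Exec layer: a framework-specific task that executes one work item. */
    static final class SketchTask {
        final SketchWork work;
        final SketchRecordProcessor processor;
        final List<SketchTask> children = new ArrayList<>();
        SketchTask(SketchWork work, SketchRecordProcessor processor) {
            this.work = work;
            this.processor = processor;
        }
        List<String> execute() {
            List<String> out = new ArrayList<>();
            for (String record : work.input) {
                out.add(processor.process(record));
            }
            return out;
        }
    }

    /** Planning: translates a plan (assumed here to be an ordered list) into a chain of tasks. */
    static SketchTask compile(List<SketchWork> plan, SketchRecordProcessor processor) {
        SketchTask root = null;
        SketchTask previous = null;
        for (SketchWork work : plan) {
            SketchTask task = new SketchTask(work, processor);
            if (root == null) {
                root = task;
            } else {
                previous.children.add(task);
            }
            previous = task;
        }
        return root;
    }

    /** Builds a two-stage plan and runs the root task over two sample records. */
    static List<String> runDemo() {
        List<SketchOperator> ops = new ArrayList<>();
        ops.add(String::trim);
        ops.add(String::toUpperCase);
        SketchRecordProcessor processor = new SketchRecordProcessor(ops);

        List<String> records = new ArrayList<>();
        records.add(" a ");
        records.add("b ");
        List<SketchWork> plan = new ArrayList<>();
        plan.add(new SketchWork("stage-1", records));
        plan.add(new SketchWork("stage-2", new ArrayList<>()));

        SketchTask root = compile(plan, processor);
        return root.execute();
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

The point of the split is that the operator pipeline stays untouched:
only the work, task and compiler pieces would differ per framework.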

Devil's in the details no doubt.

Thanks,
Gunther.






On Sat, Jul 20, 2013 at 8:10 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> I agree we are getting into a grey area with the term disruptive. For
> reference (I have not been doing this all the time, bad on me) we are
> supposed to +1 and wait a day.
>
> >> I am not familiar with these other engines, but the short answer is that
> >> Tez is built to work on YARN, which works well for Hive since it is tied
> >> to Hadoop
>
> I understand what you are saying here: YARN support is a plus. However, the
> rest of the answer is something relevant to the discussion.
>
> There are already frameworks like spark that are semi popular.
>
> http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data
> .
> There are also other frameworks like s4 http://incubator.apache.org/s4/, or
> storm.
>
> A big part of making a design decision is doing a competitive analysis.
> Usually asking yourself "What else for this is already out there?" or "Can
> this be done other ways?"
> I do want to be convinced we do not lock into tez too early with tunnel
> vision. Possibly we should be thinking on how to build hive in such a way
> that many different frameworks could plug in. In other words convincing
> that tez is the best choice, since many people are claiming an mrr type
> solution.
>
> I will watch the video you posted and study the material myself as well.
>
>
> On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan <hashut...@apache.org
> >wrote:
>
> > On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo <edlinuxg...@gmail.com
> > >wrote:
> >
> > >
> > > "In my opinion we should limit the amount of tez related optimizations
> to
> > > and trunk" Refactoring that cleans up code is good, but as you have
> > pointed
> > > out there wont be a tez release until sometime this fall, and this
> branch
> > > will be open for an extended period of time. Thus code cleanups and
> other
> > > tez related refactoring does not need to be disruptive to trunk.
> >
> >
> > I agree Tez specific changes need not to go in trunk. But general
> > refactoring and code cleanup needs to happen on trunk as and when someone
> > is willing to work on those. We have to continually improve our code
> > quality. Code maintainability and readability is a priority. Without that
> > code quality suffers and discourages new contributors to contribute
> because
> > code is unnecessarily complicated. SemanticAnalyzer is 11K line class. We
> > need to simplify it. Patch like HIVE-4811 is a welcome change which
> tackled
> > it. Exec package is all convoluted which mixes up runtime operators and
> > drivers for runtime. Thats a welcome patch because it makes it much more
> > easy to read and reason about that piece of code. HIVE-4825 is another
> > example which improves modularity of code. For contributors who are
> exposed
> > to Hive first time it will be easier for them to follow the code.
> >
> > Rather than disruptive to trunk, they are constructive for trunk and I am
> > glad people are choosing to work on that. Tez or no Tez Hive is better
> off
> > with these patches.
> >
> > Thanks,
> > Ashutosh
> >
> >
> >
> > >  On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <ga...@hortonworks.com>
> > > wrote:
> > >
> > > > Answers to some of your questions inlined.
> > > >
> > > > Alan.
> > > >
> > > > On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
> > > >
> > > > > There are some points I want to bring up. First, I am on the PMC.
> > Here
> > > is
> > > > > something I find relevant:
> > > > >
> > > > > http://www.apache.org/foundation/how-it-works.html
> > > > >
> > > > > ------------------------------
> > > > >
> > > > > The role of the PMC from a Foundation perspective is oversight. The
> > > main
> > > > > role of the PMC is not code and not coding - but to ensure that all
> > > legal
> > > > > issues are addressed, that procedure is followed, and that each and
> > > every
> > > > > release is the product of the community as a whole. That is key to
> > our
> > > > > litigation protection mechanisms.
> > > > >
> > > > > Secondly the role of the PMC is to further the long term
> development
> > > and
> > > > > health of the community as a whole, and to ensure that balanced and
> > > wide
> > > > > scale peer review and collaboration does happen. Within the ASF we
> > > worry
> > > > > about any community which centers around a few individuals who are
> > > > working
> > > > > virtually uncontested. We believe that this is detrimental to
> > quality,
> > > > > stability, and robustness of both code and long term social
> > structures.
> > > > >
> > > > > --------------------------------
> > > > >
> > > > >
> > > >
> > >
> >
> https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> > > > >
> > > > > -------------------------------------
> > > > >
> > > > > All other decisions happen on the dev list, discussions on the
> > private
> > > > list
> > > > > are kept to a minimum.
> > > > >
> > > > > "If it didn't happen on the dev list, it didn't happen" - which
> leads
> > > to:
> > > > >
> > > > > a) Elections of committers and PMC members are published on the dev
> > > list
> > > > > once finalized.
> > > > >
> > > > > b) Out-of-band discussions (IRC etc.) are summarized on the dev
> list
> > as
> > > > > soon as they have impact on the project, code or community.
> > > > > ---------------------------------
> > > > >
> > > > > https://issues.apache.org/jira/browse/HIVE-4660 ironically titled
> > "Let
> > > > > their be Tez" has not be +1 ed by any committer. It was never
> > discussed
> > > > on
> > > > > the dev or the user list (as far as I can tell).
> > > >
> > > > As all JIRA creations and updates are sent to dev@hive, creating a
> > JIRA
> > > > is de facto posting to the list.
> > > >
> > > > >
> > > > > As a PMC member I feel we need more discussion on Tez on the dev
> list
> > > > along
> > > > > with a wiki-fied design document. Topics of discussion should
> > include:
> > > >
> > > > I talked with Gunther and he's working on posting a design doc on the
> > > > wiki.  He has a PDF on the JIRA but he doesn't have write permissions
> > yet
> > > > on the wiki.
> > > >
> > > > >
> > > > > 1) What is tez?
> > > > In Hadoop 2.0, YARN opens up the ability to have multiple execution
> > > > frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as
> > the
> > > > only execution option.  Tez is an effort to build an execution engine
> > > that
> > > > is optimized for relational data processing, such as Hive and Pig.
> > > >
> > > > The biggest change here is to move away from only Map and Reduce as
> > > > processing options and to allow alternate combinations of processing,
> > > such
> > > > as map -> reduce -> reduce or tasks that take multiple inputs or
> > shuffles
> > > > that avoid sorting when it isn't needed.
> > > >
> > > > For a good intro to Tez, see Arun's presentation on it at the recent
> > > > > Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> > > >
> http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212
> > )
> > > > >
> > > > > 2) How is tez different from oozie, http://code.google.com/p/hop/,
> > > > > http://cs.brown.edu/~backman/cmr.html , and other DAG and or
> > streaming
> > > > map
> > > > > reduce tools/frameworks? Why should we use this and not those?
> > > >
> > > > Oozie is a completely different thing.  Oozie is a workflow engine
> and
> > a
> > > > > scheduler.  Its core competencies are the ability to coordinate
> > > workflows
> > > > of disparate job types (MR, Pig, Hive, etc.) and to schedule them.
>  It
> > is
> > > > not intended as an execution engine for apps such as Pig and Hive.
> > > >
> > > > I am not familiar with these other engines, but the short answer is
> > that
> > > > Tez is built to work on YARN, which works well for Hive since it is
> > tied
> > > to
> > > > Hadoop.
> > > > >
> > > > > 3) When can we expect the first tez release?
> > > > I don't know, but I hope sometime this fall.
> > > >
> > > > >
> > > > > 4) How much effort is involved in integrating hive and tez?
> > > > Covered in the design doc.
> > > >
> > > > >
> > > > > 5) Who is ready to commit to this effort?
> > > > I'll let people speak for themselves on that one.
> > > >
> > > > >
> > > > > 6) can we expect this work to be done in one hive release?
> > > > Unlikely.  Initial integration will be done in one release, but as
> Tez
> > is
> > > > a new project I expect it will be adding features in the future that
> > Hive
> > > > will want to take advantage of.
> > > >
> > > > >
> > > > > In my opinion we should not start any work on this tez-hive until
> > these
> > > > > questions are answered to the satisfaction of the hive developers.
> > > >
> > > > Can we change this to "not commit patches"?  We can't tell willing
> > people
> > > > not to work on it.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <
> > > edlinuxg...@gmail.com
> > > > >wrote:
> > > > >
> > > > >>
> > > > >>>> The Hive bylaws,
> > > > >> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out
> > > what
> > > > >> votes are needed for what.  I don't see anything there about
> > needing 3
> > > > +1s
> > > > >> for a branch.  Branching >>would seem to fall under code change,
> > which
> > > > >> requires one vote and a minimum length of 1 day.
> > > > >>
> > > > >> You could argue that all you need is one +1 to create a branch,
> but
> > > this
> > > > >> is more then a branch. If you are talking about something that is:
> > > > >> 1) going to cause major re-factoring of critical pieces of hive
> like
> > > > >> ExecDriver and MapRedTask
> > > > >> 2) going to be very disruptive to the efforts of other committers
> > > > >> 3) something that may be a major architectural change
> > > > >>
> > > > >> Getting the project on board with the idea is a good idea.
> > > > >>
> > > > >> Now I want to point something out. Here are some recent
> initiatives
> > in
> > > > >> hive:
> > > > >>
> > > > >> 1) At one point there was a big initiative to "support oracle"
> after
> > > the
> > > > >> initial work, there are patches in Jira no one seems to care about
> > > > oracle
> > > > >> support.
> > > > >> 2) Another such decisions was this "support windows" one, there
> are
> > > > >> probably 4 windows patches waiting reviews.
> > > > >> 3) I still have no clue what the official hadoop1 hadoop2, hadoop
> > 0.23
> > > > >> support prospective is, but every couple weeks we get another jira
> > > about
> > > > >> something not working/testing on one of those versions, seems like
> > > > several
> > > > >> builds are broken.
> > > > >> 4) Hive-storage handler, after the initial implementation no one
> > cares
> > > > to
> > > > >> review any other storage handler implementation, 3 patches there
> or
> > > > more,
> > > > >> could not even find anyone willing to review the cassandra storage
> > > > handler
> > > > >> I spent months on.
> > > > >> 5) OCR, Vectorization
> > > > >> 6) Windowing: committed, numerous check-style violations.
> > > > >>
> > > > >> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active
> > committers.
> > > > We
> > > > >> are spread very thin, and embarking on another side project not
> > > involved
> > > > >> with core hive seems like the wrong direction at the moment.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <
> ga...@hortonworks.com>
> > > > wrote:
> > > > >>
> > > > >>>
> > > > >>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> > > > >>>
> > > > >>>> I have started to see several re factoring patches around tez.
> > > > >>>> https://issues.apache.org/jira/browse/HIVE-4843
> > > > >>>>
> > > > >>>> This is the only mention on the hive list I can find with tez:
> > > > >>>> "Makes sense. I will create the branch soon.
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Ashutosh
> > > > >>>>
> > > > >>>>
> > > > >>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
> > > > >>>> ghagleit...@hortonworks.com> wrote:
> > > > >>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> I am starting to work on integrating Tez into Hive (see
> > HIVE-4660,
> > > > >>> design
> > > > >>>>> doc has already been uploaded - any feedback will be much
> > > > appreciated).
> > > > >>>>> This will be a fair amount of work that will take time to
> > > > >>> stabilize/test.
> > > > >>>>> I'd like to propose creating a branch in order to be able to do
> > > this
> > > > >>>>> incrementally and collaboratively. In order to progress rapidly
> > > with
> > > > >>> this,
> > > > >>>>> I would also like to go "commit-then-review".
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Gunther.
> > > > >>>>> "
> > > > >>>>
> > > > >>>> These refactor-ings are largely destructive to a number of bugs
> > and
> > > > >>>> language improvements in hive.The language improvements and bug
> > > fixes
> > > > >>> that
> > > > >>>> have been sitting in Jira for quite some time now marked
> > > > patch-available
> > > > >>>> and are waiting for review.
> > > > >>>>
> > > > >>>> There are a few things I want to point out:
> > > > >>>> 1) Normally we create design docs in out wiki (which it is not)
> > > > >>>> 2) Normally when the change is significantly complex we get
> > multiple
> > > > >>>> committers to comment on it (which we did not)
> > > > >>>> On point 2 no one -1  the branch, but this is really something
> > that
> > > > >>> should
> > > > >>>> have required a +1 from 3 committers.
> > > > >>>
> > > > >>> The Hive bylaws,
> > > > https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out
> what
> > > > votes are needed for what.  I don't see anything there about
> > > > >>> needing 3 +1s for a branch.  Branching would seem to fall under
> > code
> > > > >>> change, which requires one vote and a minimum length of 1 day.
> > > > >>>
> > > > >>>>
> > > > >>>> I for one am not completely sold on Tez.
> > > > >>>> http://incubator.apache.org/projects/tez.html.
> > > > >>>> "directed-acyclic-graph of tasks for processing data" this
> > > description
> > > > >>>> sounds like many things which have never become popular. One to
> > > think
> > > > >>> of is
> > > > >>>> oozie "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs)
> of
> > > > >>>> actions.". I am sure I can find a number of libraries/frameworks
> > > that
> > > > >>> make
> > > > >>>> this same claim. In general I do not feel like we have done our
> > > > homework
> > > > >>>> and pre-requisites to justify all this work. If we have done the
> > > > >>> homework,
> > > > >>>> I am sure that it has not been communicated and accepted by hive
> > > > >>> developers
> > > > >>>> at large.
> > > > >>>
> > > > >>> A request for better documentation on Tez and a project road map
> > > seems
> > > > >>> totally reasonable.
> > > > >>>
> > > > >>>>
> > > > >>>> If we have a branch, why are we also committing on trunk?
> Scanning
> > > > >>> through
> > > > >>>> the tez doc the only language I keep finding language like
> > "minimal
> > > > >>> changes
> > > > >>>> to the planner" yet, there is ALREADY lots of large changes
> going
> > > on!
> > > > >>>>
> > > > >>>> Really none of the above would bother me accept for the fact
> that
> > > > these
> > > > >>>> "minimal changes" are causing many "patch available"
> > > ready-for-review
> > > > >>> bugs
> > > > >>>> and core hive features to need to be re based.
> > > > >>>>
> > > > >>>> I am sure I have mentioned this before, but I have to spend 12+
> > > hours
> > > > to
> > > > >>>> test a single patch on my laptop. A few days ago I was testing a
> > new
> > > > >>> core
> > > > >>>> hive feature. After all the tests passed and before I was able
> to
> > > > >>> commit,
> > > > >>>> someone unleashed a tez patch on trunk which caused the thing I
> > was
> > > > >>> testing
> > > > >>>> for 12 hours to need to be rebased.
> > > > >>>>
> > > > >>>>
> > > > >>>> I'm not cool with this.Next time that happens to me I will
> > seriously
> > > > >>>> consider reverting the patch. Bug fixes and new hive features
> are
> > > more
> > > > >>>> important to me then integrating with incubator projects.
> > > > >>>
> > > > >>> (With my Apache member hat on)  Reverting patches that aren't
> > > breaking
> > > > >>> the build is considered very bad form in Apache.  It does make
> > sense
> > > to
> > > > >>> request that when people are going to commit a patch that will
> > break
> > > > many
> > > > >>> other patches they first give a few hours of notice so people can
> > say
> > > > >>> something if they're about to commit another patch and avoid your
> > > fate
> > > > of
> > > > > >>> needing to rerun the tests.  The other thing is we need to get the
> > > > >>> automated build of patches working on Hive so committers are
> forced
> > > to
> > > > run
> > > > >>> all of the tests themselves.  We are working on it, but we're not
> > > > there yet.
> > > > >>>
> > > > >>> Alan.
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>
