Also watched http://www.ustream.tv/recorded/36323173
I definitely see the win in being able to stream inter-stage output. I can
also see cases where small intermediate results could be kept "in memory".
But I was under the impression that the map reduce spill settings already
keep data in memory; isn't that what the spill settings are for?

There are a few bullet points that came up repeatedly that I do not follow:

Something was said to the effect of "Container reuse makes X faster".
Hadoop has jvm reuse. I am not following what the difference is here. Not
everyone has a 10K node cluster.

"Joins in map reduce are hard"
Really? I mean, some of them are, I guess, but the typical join is very
easy: just shuffle by the join key. There were not really enough low-level
details here explaining why joins are better in tez.

"Choosing the number of maps and reduces is hard"
Really? I do not find it that hard. There are times when the choice is not
perfect, but I do not find it hard. The talk did not really offer anything
technical on how tez makes this better, other than that it could.

The presentations mentioned streaming data. How do two nodes stream data
between tasks, and how is it reliable? If the sender or receiver dies, does
the entire process have to start again?

Again, one of the talks implied there is a prototype out there that launches
hive jobs into tez. I would like to see that; it might answer more questions
than a PowerPoint, and I could profile some common queries.

Random late night thoughts over,
Ed

On Tue, Jul 30, 2013 at 12:02 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> At ~25:00
>
> "There is a working prototype of hive which is using tez as the targeted
> runtime"
>
> Can I get a look at that code? Is it on github?
>
> Edward
>
>
> On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
>> Answers to some of your questions inlined.
>>
>> Alan.
>>
>> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
>>
>> > There are some points I want to bring up. First, I am on the PMC.
>> > Here is something I find relevant:
>> >
>> > http://www.apache.org/foundation/how-it-works.html
>> >
>> > ------------------------------
>> >
>> > The role of the PMC from a Foundation perspective is oversight. The main
>> > role of the PMC is not code and not coding - but to ensure that all legal
>> > issues are addressed, that procedure is followed, and that each and every
>> > release is the product of the community as a whole. That is key to our
>> > litigation protection mechanisms.
>> >
>> > Secondly the role of the PMC is to further the long term development and
>> > health of the community as a whole, and to ensure that balanced and wide
>> > scale peer review and collaboration does happen. Within the ASF we worry
>> > about any community which centers around a few individuals who are working
>> > virtually uncontested. We believe that this is detrimental to quality,
>> > stability, and robustness of both code and long term social structures.
>> >
>> > --------------------------------
>> >
>> > https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
>> >
>> > -------------------------------------
>> >
>> > All other decisions happen on the dev list, discussions on the private list
>> > are kept to a minimum.
>> >
>> > "If it didn't happen on the dev list, it didn't happen" - which leads to:
>> >
>> > a) Elections of committers and PMC members are published on the dev list
>> > once finalized.
>> >
>> > b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
>> > soon as they have impact on the project, code or community.
>> > ---------------------------------
>> >
>> > https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let
>> > their be Tez" has not been +1'ed by any committer. It was never discussed on
>> > the dev or the user list (as far as I can tell).
>>
>> As all JIRA creations and updates are sent to dev@hive, creating a JIRA
>> is de facto posting to the list.
>>
>> >
>> > As a PMC member I feel we need more discussion on Tez on the dev list along
>> > with a wiki-fied design document. Topics of discussion should include:
>>
>> I talked with Gunther and he's working on posting a design doc on the
>> wiki. He has a PDF on the JIRA but he doesn't have write permissions yet
>> on the wiki.
>>
>> >
>> > 1) What is tez?
>> In Hadoop 2.0, YARN opens up the ability to have multiple execution
>> frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as the
>> only execution option. Tez is an effort to build an execution engine that
>> is optimized for relational data processing, such as Hive and Pig.
>>
>> The biggest change here is to move away from only Map and Reduce as
>> processing options and to allow alternate combinations of processing, such
>> as map -> reduce -> reduce or tasks that take multiple inputs or shuffles
>> that avoid sorting when it isn't needed.
>>
>> For a good intro to Tez, see Arun's presentation on it at the recent
>> Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
>> http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
>> >
>> > 2) How is tez different from oozie, http://code.google.com/p/hop/,
>> > http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming map
>> > reduce tools/frameworks? Why should we use this and not those?
>>
>> Oozie is a completely different thing. Oozie is a workflow engine and a
>> scheduler. Its core competencies are the ability to coordinate workflows
>> of disparate job types (MR, Pig, Hive, etc.) and to schedule them. It is
>> not intended as an execution engine for apps such as Pig and Hive.
>>
>> I am not familiar with these other engines, but the short answer is that
>> Tez is built to work on YARN, which works well for Hive since it is tied to
>> Hadoop.
>> >
>> > 3) When can we expect the first tez release?
>> I don't know, but I hope sometime this fall.
>>
>> >
>> > 4) How much effort is involved in integrating hive and tez?
>> Covered in the design doc.
>>
>> >
>> > 5) Who is ready to commit to this effort?
>> I'll let people speak for themselves on that one.
>>
>> >
>> > 6) can we expect this work to be done in one hive release?
>> Unlikely. Initial integration will be done in one release, but as Tez is
>> a new project I expect it will be adding features in the future that Hive
>> will want to take advantage of.
>>
>> >
>> > In my opinion we should not start any work on this tez-hive until these
>> > questions are answered to the satisfaction of the hive developers.
>>
>> Can we change this to "not commit patches"? We can't tell willing people
>> not to work on it.
>> >
>> >
>> > On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> >
>> >> >>>> The Hive bylaws,
>> >> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what
>> >> votes are needed for what. I don't see anything there about needing 3 +1s
>> >> for a branch. Branching would seem to fall under code change, which
>> >> requires one vote and a minimum length of 1 day.
>> >>
>> >> You could argue that all you need is one +1 to create a branch, but this
>> >> is more than a branch. If you are talking about something that is:
>> >> 1) going to cause major re-factoring of critical pieces of hive like
>> >> ExecDriver and MapRedTask
>> >> 2) going to be very disruptive to the efforts of other committers
>> >> 3) something that may be a major architectural change
>> >>
>> >> Getting the project on board with the idea is a good idea.
>> >>
>> >> Now I want to point something out. Here are some recent initiatives in
>> >> hive:
>> >>
>> >> 1) At one point there was a big initiative to "support oracle"; after the
>> >> initial work, there are patches in Jira and no one seems to care about
>> >> oracle support.
>> >> 2) Another such decision was the "support windows" one; there are
>> >> probably 4 windows patches waiting for review.
>> >> 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
>> >> support perspective is, but every couple of weeks we get another jira about
>> >> something not working/testing on one of those versions; seems like several
>> >> builds are broken.
>> >> 4) Hive-storage handler: after the initial implementation no one cares to
>> >> review any other storage handler implementation, 3 patches there or more;
>> >> could not even find anyone willing to review the cassandra storage handler
>> >> I spent months on.
>> >> 5) OCR, Vectorization
>> >> 6) Windowing: committed, numerous check-style violations.
>> >>
>> >> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
>> >> are spread very thin, and embarking on another side project not involved
>> >> with core hive seems like the wrong direction at the moment.
>> >>
>> >>
>> >> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <ga...@hortonworks.com> wrote:
>> >>
>> >>>
>> >>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
>> >>>
>> >>>> I have started to see several refactoring patches around tez.
>> >>>> https://issues.apache.org/jira/browse/HIVE-4843
>> >>>>
>> >>>> This is the only mention on the hive list I can find with tez:
>> >>>> "Makes sense. I will create the branch soon.
>> >>>>
>> >>>> Thanks,
>> >>>> Ashutosh
>> >>>>
>> >>>>
>> >>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
>> >>>> ghagleit...@hortonworks.com> wrote:
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660, design
>> >>>>> doc has already been uploaded - any feedback will be much appreciated).
>> >>>>> This will be a fair amount of work that will take time to stabilize/test.
>> >>>>> I'd like to propose creating a branch in order to be able to do this
>> >>>>> incrementally and collaboratively. In order to progress rapidly with this,
>> >>>>> I would also like to go "commit-then-review".
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Gunther.
>> >>>>> "
>> >>>>
>> >>>> These refactorings are largely destructive to a number of bug fixes and
>> >>>> language improvements in hive. The language improvements and bug fixes
>> >>>> have been sitting in Jira for quite some time now, marked patch-available
>> >>>> and waiting for review.
>> >>>>
>> >>>> There are a few things I want to point out:
>> >>>> 1) Normally we create design docs in our wiki (which it is not)
>> >>>> 2) Normally when the change is significantly complex we get multiple
>> >>>> committers to comment on it (which we did not)
>> >>>> On point 2 no one -1'ed the branch, but this is really something that should
>> >>>> have required a +1 from 3 committers.
>> >>>
>> >>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws,
>> >>> lay out what votes are needed for what. I don't see anything there about
>> >>> needing 3 +1s for a branch. Branching would seem to fall under code
>> >>> change, which requires one vote and a minimum length of 1 day.
>> >>>
>> >>>>
>> >>>> I for one am not completely sold on Tez.
>> >>>> http://incubator.apache.org/projects/tez.html.
>> >>>> "directed-acyclic-graph of tasks for processing data" this description
>> >>>> sounds like many things which have never become popular. One to think of is
>> >>>> oozie "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
>> >>>> actions.". I am sure I can find a number of libraries/frameworks that make
>> >>>> this same claim. In general I do not feel like we have done our homework
>> >>>> and pre-requisites to justify all this work.
>> >>>> If we have done the homework,
>> >>>> I am sure that it has not been communicated and accepted by hive developers
>> >>>> at large.
>> >>>
>> >>> A request for better documentation on Tez and a project road map seems
>> >>> totally reasonable.
>> >>>
>> >>>>
>> >>>> If we have a branch, why are we also committing on trunk? Scanning through
>> >>>> the tez doc I keep finding language like "minimal changes
>> >>>> to the planner" yet, there are ALREADY lots of large changes going on!
>> >>>>
>> >>>> Really none of the above would bother me except for the fact that these
>> >>>> "minimal changes" are causing many "patch available" ready-for-review bugs
>> >>>> and core hive features to need to be rebased.
>> >>>>
>> >>>> I am sure I have mentioned this before, but I have to spend 12+ hours to
>> >>>> test a single patch on my laptop. A few days ago I was testing a new core
>> >>>> hive feature. After all the tests passed and before I was able to commit,
>> >>>> someone unleashed a tez patch on trunk which caused the thing I was testing
>> >>>> for 12 hours to need to be rebased.
>> >>>>
>> >>>>
>> >>>> I'm not cool with this. Next time that happens to me I will seriously
>> >>>> consider reverting the patch. Bug fixes and new hive features are more
>> >>>> important to me than integrating with incubator projects.
>> >>>
>> >>> (With my Apache member hat on) Reverting patches that aren't breaking
>> >>> the build is considered very bad form in Apache. It does make sense to
>> >>> request that when people are going to commit a patch that will break many
>> >>> other patches they first give a few hours of notice so people can say
>> >>> something if they're about to commit another patch and avoid your fate of
>> >>> needing to rerun the tests.
>> >>> The other thing is we need to get the
>> >>> automated build of patches working on Hive so committers are forced to run
>> >>> all of the tests themselves. We are working on it, but we're not there yet.
>> >>>
>> >>> Alan.
>> >>>