Thanks for the explanation and good suggestions.

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.

On Sat, May 3, 2014 at 6:56 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> These days Pig and Hive are designed to play together more; it is not a
> "versus" thing. I barely read benchmarks any more. They are typically done
> by a few types of entities:
>
> 1) Vendors that clearly have something to gain by presenting one side of
>    the issue.
> 2) People who are not familiar with the intricacies of either project and
>    usually do not figure out how to use one or both of them efficiently.
> 3) In-depth analyses that prove temporary results (Pig is faster than Hive
>    at X), where a different data set gives the opposite result, or where
>    after 3 months both code bases have changed significantly and the
>    analysis would need to be redone (but others continue using the result
>    for years as if it were some permanent fact).
>
> I would think an approach like this is more interesting.
>
> The design: Hive is SQL-like, a declarative language. Pig, while still
> being declarative, is more imperative. The user has to deal with flow. For
> example: is the where clause/filter done before the group or after? What
> are the benefits of one vs. the other? If the same transformation, like
> group and count with a where clause, is an 8-line Pig script vs. a 1-line
> Hive query, are there cases where that is better and worse?
>
> Third party: How does the system support plugins, i.e., is there support
> to get data from Mongo or, who knows, Access/Excel? What about user
> functions to trim data or reshape XML? etc. What are the pluggable points
> of both systems?
>
> On Sat, May 3, 2014 at 1:12 PM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>
>> Thanks for the suggestion. Can you please explain a little on "focusing
>> on the design, the implementation with third party tools"? Do you mean
>> comparing them? And by script you mean scripts of UDFs, SerDes and
>> Loaders?
>>
>> Regards,
>> Sarfraz Rasheed Ramay (DIT)
>> Dublin, Ireland.
>>
>> On Sat, May 3, 2014 at 4:23 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>>> IMHO, comparing the "performance" is boring and has been done umpteen
>>> times before. The world won't get much out of another performance
>>> benchmark, other than a bunch of fan boys saying "Look ours is faster
>>> hahahahah" and then the other side saying "but in this case ours is
>>> faster and that is the more important case". Benchmarks are easy to bias
>>> and manipulate, and comparing two like but not exact systems is hard.
>>> For example, you will see Impala "winning" benchmarks HPC by re-writing
>>> queries, and then someone in Tez re-writes it another way, tunes a
>>> setting, and then they are "winning" the benchmark.
>>>
>>> You would be better off focusing on the design, the implementation with
>>> third party tools (UDFs, SerDes, loaders), and the nuances of a more
>>> procedural language than a declarative one. Look in the world for
>>> scripts and see who is deploying them effectively.
>>>
>>> On Sat, May 3, 2014 at 4:46 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>>>
>>>> Thanks Thejas for your input! These are interesting and very specific,
>>>> which is exactly what is required for a master's thesis.
>>>>
>>>> Are there any publications on Hive and the evaluation of its
>>>> performance that I can use to compare?
>>>>
>>>> Regards,
>>>> Sarfraz Rasheed Ramay (DIT)
>>>> Dublin, Ireland.
>>>>
>>>> On Sat, May 3, 2014 at 3:07 AM, Thejas Nair <the...@hortonworks.com> wrote:
>>>>
>>>>> The primary difference between Hive and Pig is the language. There are
>>>>> implementation differences that will result in performance
>>>>> differences, but it will be hard to figure out which aspect of the
>>>>> implementation is responsible for which improvement.
>>>>>
>>>>> I think a more interesting project would be to compare the impact of
>>>>> various performance improvements in Hive. There are many features that
>>>>> you can turn on and off.
>>>>>
>>>>> Examples:
>>>>> - Hive vectorization
>>>>> - file format: text vs. RCFile vs. ORC
>>>>> - compressed vs. uncompressed
>>>>> - MapReduce vs. Tez execution engine
>>>>> - stats-optimized queries
>>>>>
>>>>> On Thu, May 1, 2014 at 5:47 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>>>>>
>>>>> >> Hi,
>>>>> >>
>>>>> >> It seems that both Hive and Pig are used for managing large data
>>>>> >> sets. Hive is more SQL oriented whereas Pig is more for the data
>>>>> >> flows. I am doing a master's thesis on the performance evaluation
>>>>> >> of both. Can someone please provide a list of tasks that would make
>>>>> >> for an interesting comparison?
>>>>> >>
>>>>> >> What is Hive good at?
>>>>> >>
>>>>> >> What is Pig good at?
>>>>> >>
>>>>> >> Ideally, I would like to take what Hive is good at and test it in
>>>>> >> Pig and vice versa. The competitive characteristics would make for
>>>>> >> an interesting comparison.
>>>>> >>
>>>>> >> Regards,
>>>>> >> Sarfraz Rasheed Ramay (DIT)
>>>>> >> Dublin, Ireland.
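
A rough sketch of how the Hive features listed in the thread (vectorization, file format, compression, execution engine, stats-optimized queries) could be laid out as a test matrix, so that each benchmark run changes a known set of switches. The property names below are Hive configuration properties as I understand them; check them against the Hive version in use. The helper name `build_runs` and the dictionary layout are made up for illustration, and the compression toggle is only an approximation, since ORC and RCFile compression is usually controlled through table or codec settings instead.

```python
from itertools import product

# Session-level switches, one per feature from the list in the thread.
# Assumption: these property names and values match your Hive version.
TOGGLES = {
    "hive.vectorized.execution.enabled": ["false", "true"],
    "hive.execution.engine": ["mr", "tez"],
    "hive.compute.query.using.stats": ["false", "true"],
    "hive.exec.compress.output": ["false", "true"],
}

# Storage formats to test, as used in a "STORED AS <format>" clause.
FILE_FORMATS = ["TEXTFILE", "RCFILE", "ORC"]


def build_runs(toggles=TOGGLES, file_formats=FILE_FORMATS):
    """Return one dict per run: a storage format plus its SET statements.

    Hypothetical helper: it only enumerates configurations, it does not
    talk to Hive or execute any query.
    """
    names = list(toggles)
    runs = []
    for fmt in file_formats:
        # Cartesian product over every value of every switch.
        for values in product(*(toggles[name] for name in names)):
            statements = [
                f"SET {name}={value};" for name, value in zip(names, values)
            ]
            runs.append({"format": fmt, "set_statements": statements})
    return runs


if __name__ == "__main__":
    runs = build_runs()
    print(len(runs), "configurations")
    # Show the first configuration as it would be pasted into a Hive session.
    print("-- table STORED AS", runs[0]["format"])
    for statement in runs[0]["set_statements"]:
        print(statement)
```

Running the same query under every configuration and recording which switches were on makes it easier to attribute a speedup to one feature, which is the difficulty raised in the thread about comparing Hive and Pig directly. A full cross product grows quickly, so a real study might vary one switch at a time from a fixed baseline instead.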