Re: Hive Vs Pig: Master's thesis

Edward Capriolo Sat, 03 May 2014 10:57:25 -0700

These days Pig and Hive are designed to play together more; it is not a "vs."
thing. As for benchmarks, I barely read them any more. They are typically
done by a few types of entities:
1) Vendors that clearly have something to gain by presenting one side of
the issue.
2) People who are not familiar with the intricacies of either project and
usually do not figure out how to use one or both of the projects
efficiently.
3) In-depth analyses that prove only temporary results (Pig is faster than
Hive at X). With a different data set the opposite is true, or after 3
months both code bases have changed significantly and the analysis would
need to be redone (but others continue using the result for years as if it
were some permanent fact).


I would think an approach like this is more interesting.

The design: Hive is SQL-like, a declarative language. Pig, while still
somewhat declarative, is more imperative: the user has to deal with the
data flow. For example, is the where clause/filter applied before the group
or after? What are the benefits of one vs. the other? If the same
transformation (say, group and count with a where clause) is an 8-line Pig
script vs. a 1-line Hive query, are there cases where that is better and
cases where it is worse?
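
To make the flow question concrete, here is a minimal sketch in plain Python
(not Pig Latin or HiveQL). The records, the "ok" status value and all function
names are made up for illustration; it only shows how the position of a
filter relative to a group-and-count step changes what is computed.

```python
from collections import Counter

# Hypothetical sample records: (user, status) pairs, invented for this sketch.
rows = [
    ("alice", "ok"),
    ("alice", "error"),
    ("bob", "ok"),
    ("bob", "ok"),
    ("carol", "error"),
]


def filter_then_group(records):
    """Filter rows first, then count per user (a WHERE before GROUP BY)."""
    kept = [(user, status) for user, status in records if status == "ok"]
    return dict(Counter(user for user, _ in kept))


def group_then_filter(records, min_count):
    """Count per user first, then drop small groups (a HAVING-style filter)."""
    counts = Counter(user for user, _ in records)
    return {user: n for user, n in counts.items() if n >= min_count}


def filter_on_key_then_group(records, wanted_user):
    """Keep one user's rows, then count: the grouping step sees fewer rows."""
    kept = [record for record in records if record[0] == wanted_user]
    return dict(Counter(user for user, _ in kept))


def group_then_filter_on_key(records, wanted_user):
    """Count every user, then keep one: same answer, but every row is grouped."""
    counts = Counter(user for user, _ in records)
    return {user: n for user, n in counts.items() if user == wanted_user}


if __name__ == "__main__":
    print(filter_then_group(rows))
    print(group_then_filter(rows, 2))
    print(filter_on_key_then_group(rows, "bob"))
    print(group_then_filter_on_key(rows, "bob"))
```

Filtering on row contents and filtering on group size answer different
questions, while a filter on the grouping key gives the same answer in either
position and only changes how many rows the grouping step has to touch.
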
Third party:
How does the system support plugins, i.e., is there support for getting data
from MongoDB or, who knows, Access/Excel? What about user functions to trim
data or reshape XML, etc.? What are the pluggable points of both systems?




On Sat, May 3, 2014 at 1:12 PM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:

> Thanks for the suggestion. Can you please explain a little what you mean
> by "focusing on the design, the implementation with third party tools"? Do
> you mean comparing them? And by scripts, do you mean scripts of UDFs,
> SerDes and Loaders?
>
>
>
>
> Regards,
> Sarfraz Rasheed Ramay (DIT)
> Dublin, Ireland.
>
>
> On Sat, May 3, 2014 at 4:23 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> IMHO, comparing the "performance" is boring and has been done umpteen
>> times before. The world won't get much out of another performance
>> benchmark, other than a bunch of fanboys saying "Look, ours is faster,
>> hahahahah" and then the other side saying "but in this case ours is faster
>> and that is the more important case". Benchmarks are easy to bias and
>> manipulate, and comparing two similar but not identical systems is hard.
>> For example, you will see Impala "winning" benchmarks HPC by re-writing
>> queries, and then someone on the Tez side re-writes them another way,
>> tunes a setting, and then they are "winning" the benchmark.
>>
>> You would be better off focusing on the design, the implementation with
>> third party tools (UDFs, SerDes, loaders), and the nuances of a more
>> procedural language than a declarative one. Look in the world for scripts
>> and see who is deploying them effectively.
>>
>>
>>
>>
>>
>> On Sat, May 3, 2014 at 4:46 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>>
>>> Thanks, Thejas, for your input! These are interesting and very specific,
>>> which is exactly what is required for a master's thesis.
>>>
>>> Are there any publications on Hive and the evaluation of its performance
>>> that I can use to compare?
>>>
>>> Regards,
>>> Sarfraz Rasheed Ramay (DIT)
>>> Dublin, Ireland.
>>>
>>>
>>> On Sat, May 3, 2014 at 3:07 AM, Thejas Nair <the...@hortonworks.com> wrote:
>>>
>>>> The primary difference between Hive and Pig is the language. There are
>>>> implementation differences that will result in performance
>>>> differences, but it will be hard to figure out which aspect of the
>>>> implementation is responsible for which improvement.
>>>>
>>>> I think a more interesting project would be to compare the impact of
>>>> various performance improvements in Hive. There are many features that
>>>> you can turn on and off.
>>>>
>>>> Examples:
>>>> - Hive vectorization
>>>> - file format: text vs RCFile vs ORC
>>>> - compressed vs uncompressed
>>>> - MapReduce vs Tez execution engine
>>>> - stats-optimized queries
>>>>
>>>>
>>>>
>>>> On Thu, May 1, 2014 at 5:47 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> It seems that both Hive and Pig are used for managing large data sets.
>>>> >> Hive is more SQL oriented whereas Pig is more for data flows. I am
>>>> >> doing a master's thesis on the performance evaluation of both. Can
>>>> >> someone please provide a list of tasks that would make for an
>>>> >> interesting comparison?
>>>> >>
>>>> >>
>>>> >> What is Hive good at?
>>>> >>
>>>> >> What is Pig good at?
>>>> >>
>>>> >> Ideally, I would like to take what Hive is good at and test it in Pig
>>>> >> and vice versa. The competitive characteristics would make for an
>>>> >> interesting comparison.
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Regards,
>>>> >> Sarfraz Rasheed Ramay (DIT)
>>>> >> Dublin, Ireland.
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>
