Re: Hive Vs Pig: Master's thesis

Sarfraz Ramay Sat, 03 May 2014 11:15:07 -0700

Thanks for the explanation and the good suggestions.

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.



On Sat, May 3, 2014 at 6:56 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

>
> These days Pig and Hive are designed to work together more; it is not a "vs."
> thing. I barely read benchmarks any more. They are typically produced by a
> few types of entities:
> 1) vendors that clearly have something to gain by presenting one side of
> the issue
> 2) people who are not familiar with the intricacies of either project and
> usually do not figure out how to use one or both of the projects
> efficiently
> 3) in-depth analyses that prove transient results (Pig is faster than Hive
> at X), where with a different data set the opposite is true, or after 3 months
> both code bases have changed significantly and the analysis would need to
> be redone (but others continue using the result for years as if it were
> some permanent fact)
>
> I would think an approach like this is more interesting.
>
> The design: Hive is SQL-like, a declarative language. Pig, while still
> being declarative, is more imperative. The user has to deal with flow. For
> example: is the where clause/filter applied before the group or after? What
> are the benefits of one vs. the other? If the same transformation, such as a
> group and count with a where clause, is an 8-line Pig script vs. a 1-line
> Hive query, are there cases where that is better and worse?
> Third party:
> How does the system support plugins, i.e., is there support for getting data
> from Mongo or, who knows, Access/Excel? What about user functions to trim
> data or reshape XML? What are the pluggable points of both systems?
>
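A minimal sketch of the filter-before-group versus filter-after-group question above. It uses plain Python rather than Hive or Pig, and the sample rows and function names are made up for illustration. Both orderings give the same counts, but the second one builds groups that are thrown away afterwards, which is the kind of trade-off a data-flow script makes explicit and a declarative query leaves to the planner.

```python
from collections import Counter

# Hypothetical sample rows of (country, status); not taken from the thread.
rows = [
    ("IE", "active"),
    ("IE", "inactive"),
    ("US", "active"),
    ("US", "active"),
    ("DE", "inactive"),
]


def filter_then_group(rows):
    """Filter first, then group and count (filter placed before the group)."""
    active = (country for country, status in rows if status == "active")
    return dict(Counter(active))


def group_then_filter(rows):
    """Group on (country, status) first, then keep only the active groups."""
    grouped = Counter(rows)  # one count per (country, status) pair
    return {
        country: n
        for (country, status), n in grouped.items()
        if status == "active"
    }


early = filter_then_group(rows)
late = group_then_filter(rows)

print(early == late)       # same answer either way
print(len(Counter(rows)))  # number of groups the second ordering materializes
```

On this toy input the early filter only ever builds the groups it needs, while the late filter builds every (country, status) group first. Whether that difference matters on real data is exactly the kind of question the thread suggests studying.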
>
>
>
> On Sat, May 3, 2014 at 1:12 PM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>
>> Thanks for the suggestion. Can you please explain a little more about
>> "focusing on the design, the implementation with third party tools"? Do you
>> mean comparing them? And by scripts, do you mean scripts for UDFs, SerDes
>> and loaders?
>>
>>
>>
>>
>> Regards,
>> Sarfraz Rasheed Ramay (DIT)
>> Dublin, Ireland.
>>
>>
>> On Sat, May 3, 2014 at 4:23 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>>> IMHO, comparing the "performance" is boring and has been done umpteen
>>> times before. The world won't get much out of another performance
>>> benchmark, other than a bunch of fanboys saying "Look, ours is faster,
>>> hahahahah" and then the other side saying "but in this case ours is faster,
>>> and that is the more important case." Benchmarks are easy to bias and
>>> manipulate, and comparing two similar but not identical systems is hard. For
>>> example, you will see Impala "winning" benchmarks HPC by re-writing queries,
>>> and then someone in Tez re-writes them another way, tunes a setting, and
>>> then they are "winning" the benchmark.
>>>
>>> You would be better off focusing on the design, the implementation with
>>> third-party tools (UDFs, SerDes, loaders), and the nuances of a more
>>> procedural language than a declarative one. Look in the world for scripts
>>> and see who is deploying them effectively.
>>>
>>>
>>>
>>>
>>>
>>> On Sat, May 3, 2014 at 4:46 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>>>
>>>> Thanks, Thejas, for your input! These are interesting and very specific,
>>>> which is exactly what is required for a master's thesis.
>>>>
>>>> Are there any publications on Hive and the evaluation of its
>>>> performance that I can use for comparison?
>>>>
>>>> Regards,
>>>> Sarfraz Rasheed Ramay (DIT)
>>>> Dublin, Ireland.
>>>>
>>>>
>>>> On Sat, May 3, 2014 at 3:07 AM, Thejas Nair <the...@hortonworks.com> wrote:
>>>>
>>>>> The primary difference between Hive and Pig is the language. There are
>>>>> implementation differences that will result in performance
>>>>> differences, but it will be hard to figure out which aspect of the
>>>>> implementation is responsible for which improvement.
>>>>>
>>>>> I think a more interesting project would be to compare the impact of
>>>>> various performance improvements in Hive. There are many features that
>>>>> you can turn on and off.
>>>>>
>>>>> Examples:
>>>>> - Hive vectorization
>>>>> - file format: text vs RCFile vs ORC
>>>>> - compressed vs uncompressed
>>>>> - MapReduce vs Tez execution engine
>>>>> - stats-optimized queries
>>>>>
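One way to turn the feature list above into an experiment plan, sketched in plain Python. The dimension names and option labels below are illustrative only; they are not actual Hive configuration keys, and mapping them to real settings is left to the reader. The sketch compares a full matrix of all combinations with a smaller plan that changes one feature at a time from a baseline.

```python
from itertools import product

# Dimensions taken from the list above. The keys and option labels are
# illustrative names, not real Hive configuration properties.
dimensions = {
    "vectorization": ["off", "on"],
    "file_format": ["text", "RCFile", "ORC"],
    "compression": ["uncompressed", "compressed"],
    "engine": ["mapreduce", "tez"],
    "stats_optimized": ["off", "on"],
}


def full_matrix(dims):
    """Every combination of options: one benchmark run per combination."""
    names = list(dims)
    return [dict(zip(names, combo)) for combo in product(*dims.values())]


def one_factor_at_a_time(dims):
    """A baseline run (first option everywhere) plus runs that change
    exactly one dimension relative to that baseline."""
    baseline = {name: options[0] for name, options in dims.items()}
    runs = [baseline]
    for name, options in dims.items():
        for option in options[1:]:
            runs.append({**baseline, name: option})
    return runs


all_runs = full_matrix(dimensions)
ofat_runs = one_factor_at_a_time(dimensions)

print(len(all_runs), "runs in the full matrix")
print(len(ofat_runs), "runs when changing one feature at a time")
```

The full matrix can expose interactions between features (for example, whether a file format helps more under one engine than another) at the cost of many more runs; the one-at-a-time plan is cheaper but only measures each feature against the baseline.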
>>>>>
>>>>>
>>>>> On Thu, May 1, 2014 at 5:47 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>>>>> >>
>>>>> >> Hi,
>>>>> >>
>>>>> >> It seems that both Hive and Pig are used for managing large data sets.
>>>>> >> Hive is more SQL-oriented, whereas Pig is more for data flows. I am doing
>>>>> >> a master's thesis on the performance evaluation of both. Can someone
>>>>> >> please provide a list of tasks that would make for an interesting
>>>>> >> comparison?
>>>>> >>
>>>>> >>
>>>>> >> What is Hive good at?
>>>>> >>
>>>>> >> What is Pig good at?
>>>>> >>
>>>>> >> Ideally, I would like to take what Hive is good at and test it in Pig and
>>>>> >> vice versa. The competitive characteristics would make for an interesting
>>>>> >> comparison.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Regards,
>>>>> >> Sarfraz Rasheed Ramay (DIT)
>>>>> >> Dublin, Ireland.
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
