Re: Pig on Spark

Mayur Rustagi Wed, 23 Apr 2014 13:44:34 -0700

UDFs, Generate, and many more are not working :)

Several of them do work: joins, filters, group by, etc.
I am translating the ones we need and would be happy to get help on the others.
I will open a JIRA to track them if you are interested.



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj <suman....@gmail.com> wrote:

> Are all the features available in Pig working in Spork? For example,
> UDFs?
>
> Thanks.
>
>
> On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> There are two benefits I get as of now:
>> 1. Most of the time, a lot of customers don't want the full power; they
>> want something dead simple, a DSL they can work in. They end up using
>> Hive for a lot of ETL just because it's SQL and they understand it. Pig is
>> close: it hides a lot of framework-level semantics from the user and lets
>> them focus on the data flow.
>> 2. Some have codebases in Pig already and are just looking to run them faster.
>> I am yet to benchmark that with Pig on Spark.
>>
>> I agree that Pig on Spark cannot solve a lot of problems, but it can solve
>> some without forcing the end customer to do anything even close to coding.
>> I believe there is quite some value in making Spark accessible to a larger
>> audience.
>> At the end of the day, to each his own :)
>>
>> Regards
>> Mayur
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <mundlap...@gmail.com>
>> wrote:
>>
>>> This seems like an interesting question.
>>>
>>> I love Apache Pig. It is so natural and the language flows with nice
>>> syntax.
>>>
>>> While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot
>>> for analytics and gave feedback to the Pig team asking for much more
>>> functionality when it was at version 0.7. A lot of new functionality has
>>> been added since.
>>>
>>> At the end of the day, Pig is a DSL for data flows. There will always be
>>> gaps and enhancements. I have often wondered: is a DSL the right way to
>>> solve data flow problems? Maybe not; we need a complete language construct.
>>> We may have found the answer: Scala. With Scala's dynamic compilation, we
>>> can write much more powerful constructs than any DSL can provide.
>>>
>>> If I were a new organization beginning to choose, I would go with
>>> Scala.
>>>
>>> Here is an example:
>>>
>>> #!/bin/sh
>>> exec scala "$0" "$@"
>>> !#
>>> YOUR DSL GOES HERE BUT IN SCALA!
>>>
>>> You have DSL-like scripting, functional style, and complete language power!
>>> If we can improve the first 3 lines, there you go: you have the most
>>> powerful DSL to solve data problems.
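
As a rough sketch of what could go under those three header lines (not from the thread: the Visit record, its fields, and the sample rows are all invented for illustration), plain Scala collections can express a Pig-style FILTER / GROUP BY / FOREACH flow without Spark:

```scala
// Hypothetical record type and sample data, invented for this sketch.
case class Visit(user: String, url: String, millis: Long)

val visits = List(
  Visit("alice", "/home", 120L),
  Visit("bob", "/home", 300L),
  Visit("alice", "/cart", 80L),
  Visit("bob", "/cart", 40L)
)

// Roughly: slow = FILTER visits BY millis >= 50;
val slowEnough = visits.filter(_.millis >= 50L)

// Roughly: grouped = GROUP slow BY user;
val byUser = slowEnough.groupBy(_.user)

// Roughly: FOREACH grouped GENERATE group, COUNT(slow), SUM(slow.millis);
val summary = byUser.map { case (user, vs) =>
  (user, (vs.size, vs.map(_.millis).sum))
}

// Print one tab-separated row per user, sorted by user name.
summary.toSeq.sortBy(_._1).foreach { case (user, (n, total)) =>
  println(s"$user\t$n\t$total")
}
```

With the sample rows above, this keeps two visits for alice (200 ms in total) and one for bob (300 ms). The same filter / groupBy / map shape should carry over to a Spark RDD, though that is an assumption here and not something benchmarked or shown in this thread.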
>>>
>>> -Bharath
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> Hi Sameer,
>>>>
>>>> Lin (cc'ed) could also give you some updates about Pig on Spark
>>>> development on her side.
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>> > Hi Mayur,
>>>> > We are planning to upgrade our distribution from MR1 to MR2 (YARN), and
>>>> > the goal is to get Spork set up next month. I will keep you posted. Can
>>>> > you please keep me informed about your progress as well?
>>>> >
>>>> > ________________________________
>>>> > From: mayur.rust...@gmail.com
>>>> > Date: Mon, 10 Mar 2014 11:47:56 -0700
>>>> >
>>>> > Subject: Re: Pig on Spark
>>>> > To: user@spark.apache.org
>>>> >
>>>> >
>>>> > Hi Sameer,
>>>> > Did you make any progress on this? My team is also trying it out and
>>>> > would love to know some details on your progress.
>>>> >
>>>> > Mayur Rustagi
>>>> > Ph: +1 (760) 203 3257
>>>> > http://www.sigmoidanalytics.com
>>>> > @mayur_rustagi
>>>> >
>>>> >
>>>> >
>>>> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>> >
>>>> > Hi Aniket,
>>>> > Many thanks! I will check this out.
>>>> >
>>>> > ________________________________
>>>> > Date: Thu, 6 Mar 2014 13:46:50 -0800
>>>> > Subject: Re: Pig on Spark
>>>> > From: aniket...@gmail.com
>>>> > To: user@spark.apache.org; tgraves...@yahoo.com
>>>> >
>>>> >
>>>> > There is some work to make this run on YARN at
>>>> > https://github.com/aniket486/pig. (So, compile Pig with ant
>>>> > -Dhadoopversion=23)
>>>> >
>>>> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
>>>> > find out what sort of env variables you need (sorry, I haven't been able
>>>> > to clean this up; it is in progress). There are a few known issues with
>>>> > this; I will work on fixing them soon.
>>>> >
>>>> > Known issues:
>>>> > 1. Limit does not work (spork-fix)
>>>> > 2. Foreach requires turning off schema-tuple-backend (should be a Pig JIRA)
>>>> > 3. Algebraic UDFs don't work (spork-fix in progress)
>>>> > 4. Group by needs rework (to avoid OOMs)
>>>> > 5. UDF classloader issue (requires SPARK-1053; then you can put
>>>> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the UDF jars)
>>>> >
>>>> > ~Aniket
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>>>> >
>>>> > I had asked a similar question on the dev mailing list a while back (Jan
>>>> > 22nd).
>>>> >
>>>> > See the archives:
>>>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser
>>>> > and look for spork.
>>>> >
>>>> > Basically Matei said:
>>>> >
>>>> > Yup, that was it, though I believe people at Twitter picked it up again
>>>> > recently. I'd suggest asking Dmitriy if you know him. I've seen interest
>>>> > in this from several other groups, and if there's enough of it, maybe we
>>>> > can start another open source repo to track it. The work in that repo you
>>>> > pointed to was done over one week, and already had most of Pig's
>>>> > operators working. (I helped out with this prototype over Twitter's hack
>>>> > week.) That work also calls the Scala API directly, because it was done
>>>> > before we had a Java API; it should be easier with the Java one.
>>>> >
>>>> >
>>>> > Tom
>>>> >
>>>> >
>>>> >
>>>> > On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>> > Hi everyone,
>>>> >
>>>> > We are using Pig to build our data pipeline. I came across Spork (Pig
>>>> > on Spark) at https://github.com/dvryaboy/pig and am not sure if it is
>>>> > still active.
>>>> >
>>>> > Can someone please let me know the status of Spork or any other effort
>>>> > that will let us run Pig on Spark? We can benefit significantly by using
>>>> > Spark, but we would like to keep using the existing Pig scripts.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > "...:::Aniket:::... Quetzalco@tl"
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>
