Re: Pig on Spark

suman bharadwaj Wed, 23 Apr 2014 14:17:33 -0700

We are currently in the process of converting Pig and Java MapReduce jobs
to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was
checking whether we can leverage Spork without converting them to Spark jobs.


And is there any way I can port my existing Java MR jobs to Spark?
I know this thread has a different subject; let me know if I need to ask this
question in a separate thread.

Thanks in advance.
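
For readers following the porting question, here is a minimal sketch of how
the two phases of a Java MR word count line up with a chained, dataflow-style
pipeline. It uses only plain Java 8 streams and no Spark API at all. The class
and method names (MapReduceShape, mapPhase, wordCount) are hypothetical and
chosen for illustration. The assumption is that the job's Mapper logic maps
onto a flatMap-style step and its Reducer logic onto a per-key merge, which is
roughly the shape Spark's flatMap and reduceByKey operations expect.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Pig, Spork, or Spark.
public class MapReduceShape {

    // Map phase: one input line becomes zero or more (word, 1) pairs,
    // which is the role Mapper.map() plays in a MapReduce job.
    static Stream<Map.Entry<String, Integer>> mapPhase(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .<Map.Entry<String, Integer>>map(
                        word -> new AbstractMap.SimpleEntry<>(word, 1));
    }

    // Shuffle + reduce phase: pairs with the same key are merged by summing,
    // which is the role Reducer.reduce() plays for a word count.
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(MapReduceShape::mapPhase)
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        Integer::sum,
                        TreeMap::new)); // sorted keys, for stable printing
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("pig on spark", "spark on yarn");
        System.out.println(wordCount(lines));
    }
}
```

This only shows the shape of the translation on an in-memory list. A real port
would also need to replace the job's InputFormat/OutputFormat handling and any
combiner or partitioner logic, none of which is covered here.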


On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> UDF
> Generate
> & many many more are not working :)
>
> Several of them work. Joins, filters, group by etc.
> I am translating the ones we need, would be happy to get help on others.
> Will host a JIRA to track them if you are interested.
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj <suman....@gmail.com> wrote:
>
>> Are all the features available in Pig working in Spork? For example,
>> UDFs?
>>
>> Thanks.
>>
>>
>> On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>
>>> There are two benefits I get as of now:
>>> 1. Most of the time, a lot of customers don't want the full power; they
>>> want something dead simple, a DSL they can work in. They end up using
>>> Hive for a lot of ETL just because it's SQL and they understand it. Pig is
>>> close: it abstracts a lot of framework-level semantics away from the user
>>> and lets them focus on the data flow.
>>> 2. Some have codebases in Pig already and are just looking to run them
>>> faster. I have yet to benchmark that with Pig on Spark.
>>>
>>> I agree that Pig on Spark cannot solve a lot of problems, but it can solve
>>> some without forcing the end customer to do anything even close to coding.
>>> I believe there is quite some value in making Spark accessible to a larger
>>> audience.
>>> At the end of the day, to each his own :)
>>>
>>> Regards
>>> Mayur
>>>
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <
>>> mundlap...@gmail.com> wrote:
>>>
>>>> This seems like an interesting question.
>>>>
>>>> I love Apache Pig. It is so natural and the language flows with nice
>>>> syntax.
>>>>
>>>> While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot
>>>> for analytics and gave feedback to the Pig team asking for much more
>>>> functionality when it was at version 0.7. A lot of new functionality is
>>>> offered now.
>>>>
>>>> At the end of the day, Pig is a DSL for data flows. There will always be
>>>> gaps and enhancements. I have often wondered: is a DSL the right way to
>>>> solve data flow problems? Maybe not; we need a complete language
>>>> construct. We may have found the answer: Scala. With Scala's dynamic
>>>> compilation, we can write much more powerful constructs than any DSL can
>>>> provide.
>>>>
>>>> If I am a new organization and beginning to choose, I would go with
>>>> Scala.
>>>>
>>>> Here is the example:
>>>>
>>>> #!/bin/sh
>>>> exec scala "$0" "$@"
>>>> !#
>>>> // YOUR DSL GOES HERE, BUT IN SCALA!
>>>>
>>>> You have DSL-like scripting, functional style, and the power of a
>>>> complete language! If we can improve the first 3 lines, there you go:
>>>> you have the most powerful DSL for solving data problems.
>>>>
>>>> -Bharath
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>
>>>>> Hi Sameer,
>>>>>
>>>>> Lin (cc'ed) could also give you some updates about Pig on Spark
>>>>> development on her side.
>>>>>
>>>>> Best,
>>>>> Xiangrui
>>>>>
>>>>> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>>> > Hi Mayur,
>>>>> > We are planning to upgrade our distribution from MR1 to MR2 (YARN), and
>>>>> > the goal is to get Spork set up next month. I will keep you posted. Can
>>>>> > you please keep me informed about your progress as well?
>>>>> >
>>>>> > ________________________________
>>>>> > From: mayur.rust...@gmail.com
>>>>> > Date: Mon, 10 Mar 2014 11:47:56 -0700
>>>>> >
>>>>> > Subject: Re: Pig on Spark
>>>>> > To: user@spark.apache.org
>>>>> >
>>>>> >
>>>>> > Hi Sameer,
>>>>> > Did you make any progress on this? My team is also trying it out and
>>>>> > would love to know some details on your progress.
>>>>> >
>>>>> > Mayur Rustagi
>>>>> > Ph: +1 (760) 203 3257
>>>>> > http://www.sigmoidanalytics.com
>>>>> > @mayur_rustagi
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>>> >
>>>>> > Hi Aniket,
>>>>> > Many thanks! I will check this out.
>>>>> >
>>>>> > ________________________________
>>>>> > Date: Thu, 6 Mar 2014 13:46:50 -0800
>>>>> > Subject: Re: Pig on Spark
>>>>> > From: aniket...@gmail.com
>>>>> > To: user@spark.apache.org; tgraves...@yahoo.com
>>>>> >
>>>>> >
>>>>> > There is some work to make this work on YARN at
>>>>> > https://github.com/aniket486/pig. (So, compile Pig with ant
>>>>> > -Dhadoopversion=23)
>>>>> >
>>>>> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
>>>>> > find out what sort of env variables you need (sorry, I haven't been able
>>>>> > to clean this up; it is in progress). There are a few known issues with
>>>>> > this; I will work on fixing them soon.
>>>>> >
>>>>> > Known issues:
>>>>> > 1. Limit does not work (spork-fix)
>>>>> > 2. Foreach requires turning off schema-tuple-backend (should be a pig-jira)
>>>>> > 3. Algebraic UDFs don't work (spork-fix in-progress)
>>>>> > 4. Group by rework (to avoid OOMs)
>>>>> > 5. UDF classloader issue (requires SPARK-1053, then you can put
>>>>> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with UDF jars)
>>>>> >
>>>>> > ~Aniket
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>>>>> >
>>>>> > I had asked a similar question on the dev mailing list a while back (Jan
>>>>> > 22nd).
>>>>> >
>>>>> > See the archives:
>>>>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser ->
>>>>> > look for spork.
>>>>> >
>>>>> > Basically Matei said:
>>>>> >
>>>>> > Yup, that was it, though I believe people at Twitter picked it up again
>>>>> > recently. I'd suggest asking Dmitriy if you know him. I've seen interest
>>>>> > in this from several other groups, and if there's enough of it, maybe we
>>>>> > can start another open source repo to track it. The work in that repo you
>>>>> > pointed to was done over one week, and already had most of Pig's
>>>>> > operators working. (I helped out with this prototype over Twitter's hack
>>>>> > week.) That work also calls the Scala API directly, because it was done
>>>>> > before we had a Java API; it should be easier with the Java one.
>>>>> >
>>>>> >
>>>>> > Tom
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > We are using Pig to build our data pipeline. I came across Spork -- Pig
>>>>> > on Spark at: https://github.com/dvryaboy/pig and am not sure if it is
>>>>> > still active.
>>>>> >
>>>>> > Can someone please let me know the status of Spork or any other effort
>>>>> > that will let us run Pig on Spark? We can significantly benefit by using
>>>>> > Spark, but we would like to keep using the existing Pig scripts.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > "...:::Aniket:::... Quetzalco@tl"
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>
