There are two benefits I see as of now:

1. Most of the time a lot of customers don't want the full power; they want something dead simple with which they can express a DSL-style data flow. They end up using Hive for a lot of ETL just because it's SQL and they understand it. Pig is close: it wraps a lot of framework-level semantics away from the user and lets them focus on the data flow (a rough sketch of that gap follows below).

2. Some already have codebases in Pig and are just looking to run them faster. I have yet to benchmark that on Pig on Spark.
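To make point 1 concrete, here is a rough, illustrative sketch (the dataset, paths and app name are made up; this is not code from our Pig-on-Spark work): the customer-facing Pig Latin is a handful of declarative lines, while the hand-written Spark equivalent already asks them to deal with contexts, tuples and closures.

// What the customer would write in Pig Latin (shown as comments for comparison):
//   logs   = LOAD 'logs.tsv' AS (user:chararray, url:chararray);
//   grpd   = GROUP logs BY user;
//   counts = FOREACH grpd GENERATE group, COUNT(logs);
//   STORE counts INTO 'counts';
// A hand-written Spark (Scala) equivalent of the same data flow:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit conversions for reduceByKey etc.

object CountsByUser {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "CountsByUser")
    val counts = sc.textFile("logs.tsv")
      .map(_.split("\t"))                 // split each line into (user, url) fields
      .map(fields => (fields(0), 1L))     // key every record by user
      .reduceByKey(_ + _)                 // count records per user
    counts.map { case (user, n) => user + "\t" + n }
          .saveAsTextFile("counts")
    sc.stop()
  }
}

Nothing here is beyond Pig; the point is only how much plumbing the DSL hides from the end customer.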
I agree that Pig on Spark cannot solve a lot of problems, but it can solve some without forcing the end customer to do anything even close to coding. I believe there is quite some value in making Spark accessible to a larger audience. At the end of the day, to each his own :)

Regards,
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <mundlap...@gmail.com> wrote:

> This seems like an interesting question.
>
> I love Apache Pig. It is so natural and the language flows with nice
> syntax.
>
> While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot
> for analytics and gave the Pig team feedback to add much more
> functionality when it was at version 0.7. Lots of new functionality is
> offered now.
>
> At the end of the day, Pig is a DSL for data flows. There will always be
> gaps and enhancements. I often wondered: is a DSL the right way to solve
> data flow problems? Maybe not; we need complete language constructs. We may
> have found the answer - Scala. With Scala's dynamic compilation, we can
> write much more powerful constructs than any DSL can provide.
>
> If I were a new organization beginning to choose, I would go with Scala.
>
> Here is the example:
>
> #!/bin/sh
> exec scala "$0" "$@"
> !#
> YOUR DSL GOES HERE BUT IN SCALA!
>
> You have DSL-like scripting, functional style, and complete language
> power! If we can improve the first 3 lines, here you go, you have the most
> powerful DSL to solve data problems.
>
> -Bharath
>
>
> On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Hi Sameer,
>>
>> Lin (cc'ed) could also give you some updates about Pig on Spark
>> development on her side.
>>
>> Best,
>> Xiangrui
>>
>> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
>> > Hi Mayur,
>> > We are planning to upgrade our distribution from MR1 to MR2 (YARN), and
>> > the goal is to get Spork set up next month. I will keep you posted. Can
>> > you please keep me informed about your progress as well?
>> >
>> > ________________________________
>> > From: mayur.rust...@gmail.com
>> > Date: Mon, 10 Mar 2014 11:47:56 -0700
>> > Subject: Re: Pig on Spark
>> > To: user@spark.apache.org
>> >
>> > Hi Sameer,
>> > Did you make any progress on this? My team is also trying it out and
>> > would love to know some details on your progress.
>> >
>> > Mayur Rustagi
>> > Ph: +1 (760) 203 3257
>> > http://www.sigmoidanalytics.com
>> > @mayur_rustagi
>> >
>> >
>> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com> wrote:
>> >
>> > Hi Aniket,
>> > Many thanks! I will check this out.
>> >
>> > ________________________________
>> > Date: Thu, 6 Mar 2014 13:46:50 -0800
>> > Subject: Re: Pig on Spark
>> > From: aniket...@gmail.com
>> > To: user@spark.apache.org; tgraves...@yahoo.com
>> >
>> > There is some work to make this work on YARN at
>> > https://github.com/aniket486/pig. (So, compile Pig with ant
>> > -Dhadoopversion=23.)
>> >
>> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
>> > find out what sort of env variables you need (sorry, I haven't been able
>> > to clean this up - in progress). There are a few known issues with this;
>> > I will work on fixing them soon.
>> >
>> > Known issues:
>> > 1. Limit does not work (spork-fix)
>> > 2. Foreach requires turning off schema-tuple-backend (should be a pig-jira)
>> > 3. Algebraic UDFs don't work (spork-fix in progress)
>> > 4. Group by rework (to avoid OOMs)
>> > 5. UDF classloader issue (requires SPARK-1053; then you can put
>> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the UDF jars)
>> >
>> > ~Aniket
>> >
>> >
>> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>> >
>> > I had asked a similar question on the dev mailing list a while back
>> > (Jan 22nd).
>> >
>> > See the archives:
>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser -
>> > look for spork.
>> >
>> > Basically Matei said:
>> >
>> > Yup, that was it, though I believe people at Twitter picked it up again
>> > recently. I'd suggest asking Dmitriy if you know him. I've seen interest
>> > in this from several other groups, and if there's enough of it, maybe we
>> > can start another open source repo to track it. The work in that repo you
>> > pointed to was done over one week, and already had most of Pig's operators
>> > working. (I helped out with this prototype over Twitter's hack week.) That
>> > work also calls the Scala API directly, because it was done before we had
>> > a Java API; it should be easier with the Java one.
>> >
>> > Tom
>> >
>> >
>> > On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ssti...@live.com> wrote:
>> > Hi everyone,
>> >
>> > We are using Pig to build our data pipeline. I came across Spork -- Pig
>> > on Spark -- at https://github.com/dvryaboy/pig and am not sure whether it
>> > is still active.
>> >
>> > Can someone please let me know the status of Spork, or of any other
>> > effort that will let us run Pig on Spark? We can significantly benefit
>> > from using Spark, but we would like to keep using our existing Pig scripts.
>> >
>> >
>> > --
>> > "...:::Aniket:::... Quetzalco@tl"
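A quick note below the quoted thread, for anyone else trying Aniket's branch: my rough understanding of his classloader point (a sketch under my own assumptions, not code from his repo) is that once SPARK-1053 is in, the Pig runtime jar and the UDF jars just need to reach the executors, whether through the SPARK_JARS setting he mentions or programmatically when the SparkContext is created. All paths and names below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object SporkLauncherSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder paths -- not the actual layout of the spork branch.
    val jars = Seq(
      "/opt/pig/pig-withouthadoop.jar",  // Pig runtime, per Aniket's note
      "/opt/pig/udfs/my-udfs.jar"        // user-defined function jars
    )
    val conf = new SparkConf()
      .setMaster("yarn-client")          // assuming the ant -Dhadoopversion=23 build on YARN
      .setAppName("pig-on-spark")
      .setJars(jars)                     // ship the jars to the executors
    val sc = new SparkContext(conf)
    // ... the translated Pig plan would run here as RDD operations ...
    sc.stop()
  }
}

Happy to be corrected if the branch wires this up differently.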