There are two benefits I see as of now:

1. Most of the time a lot of customers don't want the full power; they want something dead simple with which they can express a DSL-style data flow. They end up using Hive for a lot of ETL just because it's SQL and they understand it. Pig is close: it wraps a lot of framework-level semantics away from the user and lets them focus on the data flow (a rough sketch of that gap follows below).

2. Some already have codebases in Pig and are just looking to run them faster. I have yet to benchmark that on Pig on Spark.
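To make point 1 concrete, here is a rough, illustrative sketch (the dataset, paths and app name are made up; this is not code from our Pig-on-Spark work): the customer-facing Pig Latin is a handful of declarative lines, while the hand-written Spark equivalent already asks them to deal with contexts, tuples and closures.

// What the customer would write in Pig Latin (shown as comments for comparison):
//   logs   = LOAD 'logs.tsv' AS (user:chararray, url:chararray);
//   grpd   = GROUP logs BY user;
//   counts = FOREACH grpd GENERATE group, COUNT(logs);
//   STORE counts INTO 'counts';
// A hand-written Spark (Scala) equivalent of the same data flow:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit conversions for reduceByKey etc.

object CountsByUser {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "CountsByUser")
    val counts = sc.textFile("logs.tsv")
      .map(_.split("\t"))                 // split each line into (user, url) fields
      .map(fields => (fields(0), 1L))     // key every record by user
      .reduceByKey(_ + _)                 // count records per user
    counts.map { case (user, n) => user + "\t" + n }
          .saveAsTextFile("counts")
    sc.stop()
  }
}

Nothing here is beyond Pig; the point is only how much plumbing the DSL hides from the end customer.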
I agree that Pig on Spark cannot solve a lot of problems, but it can solve some without forcing the end customer to do anything even close to coding. I believe there is quite some value in making Spark accessible to a larger audience. At the end of the day, to each his own :)

Regards,
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <mundlap...@gmail.com> wrote:

> This seems like an interesting question.
>
> I love Apache Pig. It is so natural and the language flows with nice
> syntax.
>
> While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot
> for analytics and gave the Pig team feedback to add much more
> functionality when it was at version 0.7. Lots of new functionality is
> offered now.
>
> At the end of the day, Pig is a DSL for data flows. There will always be
> gaps and enhancements. I often wondered: is a DSL the right way to solve
> data flow problems? Maybe not; we need complete language constructs. We may
> have found the answer - Scala. With Scala's dynamic compilation, we can
> write much more powerful constructs than any DSL can provide.
>
> If I were a new organization beginning to choose, I would go with Scala.
>
> Here is the example:
>
> #!/bin/sh
> exec scala "$0" "$@"
> !#
> YOUR DSL GOES HERE BUT IN SCALA!
>
> You have DSL-like scripting, functional style, and complete language
> power! If we can improve the first 3 lines, here you go, you have the most
> powerful DSL to solve data problems.
>
> -Bharath
>
>
> On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Hi Sameer,
>>
>> Lin (cc'ed) could also give you some updates about Pig on Spark
>> development on her side.
>>
>> Best,
>> Xiangrui
>>
>> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
>> > Hi Mayur,
>> > We are planning to upgrade our distribution from MR1 to MR2 (YARN), and
>> > the goal is to get Spork set up next month. I will keep you posted. Can
>> > you please keep me informed about your progress as well?
>> >
>> > ________________________________
>> > From: mayur.rust...@gmail.com
>> > Date: Mon, 10 Mar 2014 11:47:56 -0700
>> > Subject: Re: Pig on Spark
>> > To: user@spark.apache.org
>> >
>> > Hi Sameer,
>> > Did you make any progress on this? My team is also trying it out and
>> > would love to know some details on your progress.
>> >
>> > Mayur Rustagi
>> > Ph: +1 (760) 203 3257
>> > http://www.sigmoidanalytics.com
>> > @mayur_rustagi
>> >
>> >
>> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com> wrote:
>> >
>> > Hi Aniket,
>> > Many thanks! I will check this out.
>> >
>> > ________________________________
>> > Date: Thu, 6 Mar 2014 13:46:50 -0800
>> > Subject: Re: Pig on Spark
>> > From: aniket...@gmail.com
>> > To: user@spark.apache.org; tgraves...@yahoo.com
>> >
>> > There is some work to make this work on YARN at
>> > https://github.com/aniket486/pig. (So, compile Pig with ant
>> > -Dhadoopversion=23.)
>> >
>> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
>> > find out what sort of env variables you need (sorry, I haven't been able
>> > to clean this up - in progress). There are a few known issues with this;
>> > I will work on fixing them soon.
>> >
>> > Known issues:
>> > 1. Limit does not work (spork-fix)
>> > 2. Foreach requires turning off schema-tuple-backend (should be a pig-jira)
>> > 3. Algebraic UDFs don't work (spork-fix in progress)
>> > 4. Group by rework (to avoid OOMs)
>> > 5. UDF classloader issue (requires SPARK-1053; then you can put
>> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the UDF jars)
>> >
>> > ~Aniket
>> >
>> >
>> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>> >
>> > I had asked a similar question on the dev mailing list a while back
>> > (Jan 22nd).
>> >
>> > See the archives:
>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser -
>> > look for spork.
>> >
>> > Basically Matei said:
>> >
>> > Yup, that was it, though I believe people at Twitter picked it up again
>> > recently. I'd suggest asking Dmitriy if you know him. I've seen interest
>> > in this from several other groups, and if there's enough of it, maybe we
>> > can start another open source repo to track it. The work in that repo you
>> > pointed to was done over one week, and already had most of Pig's operators
>> > working. (I helped out with this prototype over Twitter's hack week.) That
>> > work also calls the Scala API directly, because it was done before we had
>> > a Java API; it should be easier with the Java one.
>> >
>> > Tom
>> >
>> >
>> > On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ssti...@live.com> wrote:
>> > Hi everyone,
>> >
>> > We are using Pig to build our data pipeline. I came across Spork -- Pig
>> > on Spark -- at https://github.com/dvryaboy/pig and am not sure whether it
>> > is still active.
>> >
>> > Can someone please let me know the status of Spork, or of any other
>> > effort that will let us run Pig on Spark? We can significantly benefit
>> > from using Spark, but we would like to keep using our existing Pig scripts.
>> >
>> >
>> > --
>> > "...:::Aniket:::... Quetzalco@tl"
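A quick note below the quoted thread, for anyone else trying Aniket's branch: my rough understanding of his classloader point (a sketch under my own assumptions, not code from his repo) is that once SPARK-1053 is in, the Pig runtime jar and the UDF jars just need to reach the executors, whether through the SPARK_JARS setting he mentions or programmatically when the SparkContext is created. All paths and names below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object SporkLauncherSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder paths -- not the actual layout of the spork branch.
    val jars = Seq(
      "/opt/pig/pig-withouthadoop.jar",  // Pig runtime, per Aniket's note
      "/opt/pig/udfs/my-udfs.jar"        // user-defined function jars
    )
    val conf = new SparkConf()
      .setMaster("yarn-client")          // assuming the ant -Dhadoopversion=23 build on YARN
      .setAppName("pig-on-spark")
      .setJars(jars)                     // ship the jars to the executors
    val sc = new SparkContext(conf)
    // ... the translated Pig plan would run here as RDD operations ...
    sc.stop()
  }
}

Happy to be corrected if the branch wires this up differently.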