For a mix of the two, check out Pig on Spark, Spork: https://github.com/sigmoidanalytics/spork
Both Pig and Spark share the same data-flow abstraction and operators, so you have to think the same way; only the actual syntax of Spark/PySpark and Pig differs. Pig with Python mixes well with PySpark. Pig can run via MapReduce, Spark, or Tez.

Check out these three books on Pig (two of which I wrote):

http://chimera.labs.oreilly.com/books/1234000001811/index.html
http://shop.oreilly.com/product/mobile/0636920025054.do
http://shop.oreilly.com/product/mobile/0636920039006.do

Programming Pig introduces Pig Latin. Agile Data Science uses Pig to build applications. Big Data for Chimps teaches analytic patterns in Pig.

On Sunday, July 19, 2015, Yang <[email protected]> wrote:

> Spark is very hot now, but after reading the paper, I found it surprisingly
> similar to Pig's concept: the RDD is just a Relation/set in Pig's
> terminology.
>
> I think a great strength of Spark is that it tries to merge multiple
> "narrow dependency" stages together to avoid too much I/O. Does Pig do that
> too? Otherwise, I can't figure out what other major design differences
> would lead to a huge performance difference, if Spark also uses on-disk
> storage. The overhead to start an MR task should not be that big.

--
Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com
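[Editor's note] The "narrow dependency" fusion Yang asks about can be illustrated with a rough pure-Python sketch. This is only an analogy under the stated assumption, not Spark's or Pig's actual machinery: chaining lazy generators fuses several per-record (narrow) transformations into a single pass over the data, with no intermediate collection materialized between stages.

```python
# Sketch: fusing "narrow dependency" stages via lazy generators.
# Each stage consumes the previous one record-by-record, so the whole
# chain runs as one pass -- the analogue of pipelining map-like stages
# instead of writing intermediate results to disk between them.

def map_stage(fn, records):
    for r in records:
        yield fn(r)

def filter_stage(pred, records):
    for r in records:
        if pred(r):
            yield r

lines = ["hello world", "hello pig", "hello spark"]

# Three narrow stages, fused into a single pass over `lines`:
words = map_stage(str.split, lines)            # split each line
flat = (w for ws in words for w in ws)         # flatten lists of words
hellos = filter_stage(lambda w: w == "hello", flat)

count = sum(1 for _ in hellos)
print(count)  # -> 3
```

Nothing runs until the final `sum()` pulls records through the chain, which is why no temporary storage is needed between the split, flatten, and filter steps. A wide dependency (e.g. a group-by) would break this fusion, since it needs all records for a key before emitting anything.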
