For a mix of the two, check out Pig on Spark, Spork:
https://github.com/sigmoidanalytics/spork

Both Pig and Spark share the same data flow abstraction and operators, so you
think about problems the same way; only the actual syntax of Spark/PySpark and
Pig Latin differs. Pig with Python mixes well with PySpark. Pig can run on
MapReduce, Spark, or Tez.
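The correspondence is easy to see on a word count: Pig's FOREACH/GROUP map
onto Spark's flatMap/groupByKey. A minimal sketch in plain Python (standing in
for either engine; the rough Pig Latin and PySpark equivalents are noted in
the comments, and the relation/RDD names are made up for illustration):

```python
from itertools import groupby

# Input relation / RDD: one record per line.
lines = ["mary had a little lamb", "little lamb little lamb"]

# Pig:   words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
# Spark: words = lines.flatMap(lambda line: line.split())
words = [w for line in lines for w in line.split()]

# Pig:   grouped = GROUP words BY word;
# Spark: grouped = words.groupBy(lambda w: w)
grouped = groupby(sorted(words))

# Pig:   counts = FOREACH grouped GENERATE group, COUNT(words);
# Spark: counts = grouped.mapValues(len)
counts = {word: len(list(ws)) for word, ws in grouped}

print(counts)  # {'a': 1, 'had': 1, 'lamb': 3, 'little': 3, 'mary': 1}
```

Same shape either way: a flatten, a group, and an aggregate over each group.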

Check out these three books on Pig (two of which I wrote):

http://chimera.labs.oreilly.com/books/1234000001811/index.html
http://shop.oreilly.com/product/mobile/0636920025054.do
http://shop.oreilly.com/product/mobile/0636920039006.do

Programming Pig introduces Pig Latin. Agile Data Science uses Pig to build
applications. Big Data for Chimps teaches analytic patterns in Pig.


On Sunday, July 19, 2015, Yang <[email protected]> wrote:

> Spark is very hot now, but after reading the paper, I found it surprisingly
> similar to PIG's concept: the RDD is just Relation/set in PIG's
> terminology.
>
> I think a great strength of Spark is that it tries to merge multiple
> "narrow dependency" stages together to avoid too much IO. Does Pig do that
> too? Otherwise, I can't figure out what other major design differences
> would lead to a huge performance difference, if Spark also uses on-disk
> storage. The overhead to start an MR task should not be that big.
>
>


-- 
Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com
