Hi Pig users,

I have a question about how to handle a large bag of data in the reduce
step. After I run the script below, each group ends up with about 100GB
of data to process. The bag spills continuously and the job is very
slow. What would you recommend for speeding up processing when you find
yourself with a large bag of data (over 100GB)?

A = LOAD '/tmp/data';
B = GROUP A BY $0;
C = FOREACH B GENERATE FLATTEN($1); -- this takes very, very long because of the large bag
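For reference, the only change I have found so far is adding an explicit PARALLEL clause to the GROUP (the reducer count of 20 below is just an arbitrary example value). As I understand it, this only spreads distinct groups across reducers, so it would not help when a single group holds 100GB, which is the heart of my question:

A = LOAD '/tmp/data';
B = GROUP A BY $0 PARALLEL 20; -- 20 reducers: arbitrary example value
C = FOREACH B GENERATE FLATTEN($1);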

Best Regards,

Jerry
