Hi Pig users,

I have a question about how to handle a large bag of data in the reduce
step. After I run the script below, each group ends up with about 100GB
of data to process. The bag spills continuously and the job is very
slow. What would you recommend for speeding up processing when you find
yourself with a large bag of data (over 100GB)?

A = LOAD '/tmp/data';
B = GROUP A BY $0;
C = FOREACH B GENERATE FLATTEN($1); -- this takes very, very long because of the large bag
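For reference, the only change I have found so far is adding an explicit PARALLEL clause to the GROUP (the reducer count of 20 below is just an arbitrary example value). As I understand it, this only spreads distinct groups across reducers, so it would not help when a single group holds 100GB, which is the heart of my question:

A = LOAD '/tmp/data';
B = GROUP A BY $0 PARALLEL 20; -- 20 reducers: arbitrary example value
C = FOREACH B GENERATE FLATTEN($1);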

Best Regards,

Jerry
