Sorry, I made a mistake in the code above for the new query. It should look like this:
A = LOAD '/tmp/data';
D = FOREACH A generate $0 as key, FLATTEN($1); -- notice that I moved the FLATTEN to an earlier stage (from reduce-side to map-side flattening)
B = GROUP D by key;
STORE B into 'tmp/out';

On Mon, Jul 22, 2013 at 1:15 PM, Jerry Lam <[email protected]> wrote:

> Hi Pradeep,
>
> Although this query looks too simplistic, it is very close to the real one. :)
> The actual one looks like:
>
> A = LOAD '/tmp/data';
> C = FOREACH (GROUP A by $0) {
>     generate FLATTEN(A.$1); -- this takes very, very long because of a large bag
> }
>
> I did try increasing pig.cached.bag.memusage to 0.5, but it is still very slow.
> I followed all the recommendations, but they didn't help much.
>
> The above query can run for 8 hours because it is bottlenecked by 1 reducer
> that has 100GB of data. The databag of a group in that particular reducer
> spills continuously.
>
> I changed the above query to something like the one below:
>
> A = LOAD '/tmp/data';
> D = FOREACH A generate FLATTEN($1); -- notice that I moved the FLATTEN to an earlier stage (from reduce-side to map-side flattening)
> B = GROUP D by $0;
> STORE B into 'tmp/out';
>
> This query finishes in 2 hours. Contrary to the usual best practice, it is
> better not to flatten the data in the reduce step if the data size is too
> big, because of the spill-to-disk behavior.
>
> I wonder if this is a performance issue in the spill-to-disk algorithm?
>
> Best Regards,
>
> Jerry
>
>
> On Mon, Jul 22, 2013 at 10:12 AM, Pradeep Gollakota <[email protected]> wrote:
>
>> There's only one thing that comes to mind for this particular toy example.
>>
>> From the "Programming Pig" book, the "pig.cached.bag.memusage" property is
>> the "Percentage of the heap that Pig will allocate for all of the bags in
>> a map or reduce task. Once the bags fill up this amount, the data is
>> spilled to disk. Setting this to a higher value will reduce spills to disk
>> during execution but increase the likelihood of a task running out of heap."
>> The default value of this property is 0.1.
>>
>> So, you can try setting this to a higher value to see if it improves
>> performance.
>>
>> Other than the above setting, I can only quote the basic patterns for
>> optimizing performance (also from Programming Pig):
>> Filter early and often
>> Project early and often
>> Set up your joins properly
>> etc.
>>
>>
>> On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam <[email protected]> wrote:
>>
>> > Hi Pig users,
>> >
>> > I have a question regarding how to handle a large bag of data in the
>> > reduce step.
>> > It happens that after I do the following (see below), each group has
>> > about 100GB of data to process. The bag is spilled continuously and the
>> > job is very slow. What is your recommendation for speeding up the
>> > processing when you find yourself with a large bag of data (over 100GB)
>> > to process?
>> >
>> > A = LOAD '/tmp/data';
>> > B = GROUP A by $0;
>> > C = FOREACH B generate FLATTEN($1); -- this takes very, very long because of a large bag
>> >
>> > Best Regards,
>> >
>> > Jerry
>> >
>> >
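
For completeness, a minimal sketch of how the pig.cached.bag.memusage property
from Pradeep's reply can be raised from inside the rewritten script. The 0.5
value is simply the figure Jerry reports trying (tune it to the available task
heap), and the rest of the script just mirrors the corrected query at the top
of this message, so treat it as illustrative rather than the exact job that was
run:

set pig.cached.bag.memusage '0.5'; -- default is 0.1; a higher value means fewer bag spills but more heap pressure

A = LOAD '/tmp/data';
D = FOREACH A generate $0 as key, FLATTEN($1); -- map-side flatten, as in the corrected query
B = GROUP D by key;
STORE B into 'tmp/out';

The same property can also be passed on the command line (for example,
pig -Dpig.cached.bag.memusage=0.5 script.pig) when editing the script is not
convenient.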
