Sorry, I made a mistake in the code above for the new query. It should look like this:
A = LOAD '/tmp/data';
D = FOREACH A generate $0 as key, FLATTEN($1); -- notice that I moved the FLATTEN to an earlier stage (from reduce-side to map-side flattening)
B = GROUP D by key;
STORE B into 'tmp/out';

On Mon, Jul 22, 2013 at 1:15 PM, Jerry Lam <[email protected]> wrote:

> Hi Pradeep,
>
> Although this query looks too simplistic, it is very close to the real one. :)
> The actual one looks like:
>
> A = LOAD '/tmp/data';
> C = FOREACH (GROUP A by $0) {
>     generate FLATTEN(A.$1); -- this takes very, very long because of a large bag
> }
>
> I did try increasing pig.cached.bag.memusage to 0.5, but it is still very slow.
> I followed all the recommendations, but they didn't help much.
>
> The above query can run for 8 hours because it is bottlenecked by 1 reducer
> that has 100GB of data. The databag of a group in that particular reducer
> spills continuously.
>
> I changed the above query to something like the one below:
>
> A = LOAD '/tmp/data';
> D = FOREACH A generate FLATTEN($1); -- notice that I moved the FLATTEN to an earlier stage (from reduce-side to map-side flattening)
> B = GROUP D by $0;
> STORE B into 'tmp/out';
>
> This query finishes in 2 hours. Contrary to the usual best practice, it is
> better not to flatten the data in the reduce step if the data size is too
> big, because of the spill-to-disk behavior.
>
> I wonder if this is a performance issue in the spill-to-disk algorithm?
>
> Best Regards,
>
> Jerry
>
>
> On Mon, Jul 22, 2013 at 10:12 AM, Pradeep Gollakota <[email protected]> wrote:
>
>> There's only one thing that comes to mind for this particular toy example.
>>
>> From the "Programming Pig" book, the "pig.cached.bag.memusage" property is
>> the "Percentage of the heap that Pig will allocate for all of the bags in
>> a map or reduce task. Once the bags fill up this amount, the data is
>> spilled to disk. Setting this to a higher value will reduce spills to disk
>> during execution but increase the likelihood of a task running out of heap."
>> The default value of this property is 0.1.
>>
>> So, you can try setting this to a higher value to see if it improves
>> performance.
>>
>> Other than the above setting, I can only quote the basic patterns for
>> optimizing performance (also from Programming Pig):
>> Filter early and often
>> Project early and often
>> Set up your joins properly
>> etc.
>>
>>
>> On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam <[email protected]> wrote:
>>
>> > Hi Pig users,
>> >
>> > I have a question regarding how to handle a large bag of data in the
>> > reduce step.
>> > It happens that after I do the following (see below), each group has
>> > about 100GB of data to process. The bag is spilled continuously and the
>> > job is very slow. What is your recommendation for speeding up the
>> > processing when you find yourself with a large bag of data (over 100GB)
>> > to process?
>> >
>> > A = LOAD '/tmp/data';
>> > B = GROUP A by $0;
>> > C = FOREACH B generate FLATTEN($1); -- this takes very, very long because of a large bag
>> >
>> > Best Regards,
>> >
>> > Jerry
>> >
>> >
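
For completeness, a minimal sketch of how the pig.cached.bag.memusage property
from Pradeep's reply can be raised from inside the rewritten script. The 0.5
value is simply the figure Jerry reports trying (tune it to the available task
heap), and the rest of the script just mirrors the corrected query at the top
of this message, so treat it as illustrative rather than the exact job that was
run:

set pig.cached.bag.memusage '0.5'; -- default is 0.1; a higher value means fewer bag spills but more heap pressure

A = LOAD '/tmp/data';
D = FOREACH A generate $0 as key, FLATTEN($1); -- map-side flatten, as in the corrected query
B = GROUP D by key;
STORE B into 'tmp/out';

The same property can also be passed on the command line (for example,
pig -Dpig.cached.bag.memusage=0.5 script.pig) when editing the script is not
convenient.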
