There's only one thing that comes to mind for this particular toy example. From the "Programming Pig" book, the "pig.cached.bag.memusage" property is the "Percentage of the heap that Pig will allocate for all of the bags in a map or reduce task. Once the bags fill up this amount, the data is spilled to disk. Setting this to a higher value will reduce spills to disk during execution but increase the likelihood of a task running out of heap." The default value of this property is 0.1.
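For example, you could try raising it either at the top of the script or when launching the job (0.4 here is just an illustrative value, not a tuned recommendation; you'd want to watch heap usage as you increase it):

    -- inside the Pig script
    set pig.cached.bag.memusage 0.4;

    # or on the command line, before the script name
    pig -Dpig.cached.bag.memusage=0.4 myscript.pig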
So, you can try setting this to a higher value and see whether it reduces the spilling. Other than that setting, I can only quote the basic patterns for optimizing performance (also from Programming Pig):

- Filter early and often
- Project early and often
- Set up your joins properly
- etc.

There is a sketch of the first two patterns applied to your script below the quoted message.

On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam <[email protected]> wrote:

> Hi Pig users,
>
> I have a question regarding how to handle a large bag of data in reduce
> step.
> It happens that after I do the following (see below), each group has about
> 100GB of data to process. The bag is spilled continuously and the job is
> very slow. What is your recommendation of speeding the processing when you
> find yourself a large bag of data (over 100GB) to process?
>
> A = LOAD '/tmp/data';
> B = GROUP A by $0;
> C = FOREACH B generate FLATTEN($1); -- this takes very very long because of
> a large bag
>
> Best Regards,
>
> Jerry
>
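As a minimal sketch of "filter early, project early" against your script: the filter condition and the choice of columns below are pure assumptions for illustration, so substitute whatever actually applies to your data:

    A = LOAD '/tmp/data';
    -- assumption: downstream only needs $0 and $2, and null $2 rows can be dropped
    A1 = FILTER A BY $2 IS NOT NULL;     -- discard unneeded rows before the shuffle
    A2 = FOREACH A1 GENERATE $0, $2;     -- discard unneeded columns before the GROUP
    B  = GROUP A2 BY $0;
    C  = FOREACH B GENERATE FLATTEN($1); -- bags are now smaller, so less spilling

Shrinking the bytes that reach each bag before the GROUP is usually the biggest win here, since the spill cost grows with bag size.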
