I have a Pig script that works well on small test data sets but fails on a run 
over realistic-sized data. The logs show:
  INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201106061024_0331 has failed!
  …
  job_201106061024_0331   CitedItemsGrpByDocId,DedupTCPerDocId    GROUP_BY,COMBINER       Message: Job failed!
  …
  attempt_201106061024_0331_m_000198_0  […]   Error: java.lang.OutOfMemoryError: Java heap space
  and similarly for all attempts at a few of the other (many) map tasks for 
this job.

I believe this job corresponds to these lines in my Pig script:

 -- for each cited doc, count the distinct docs that cite it
 CitedItemsGrpByDocId = group CitedItems by citeddocid;
 DedupTCPerDocId =
     foreach CitedItemsGrpByDocId {
         -- dedup the bag of citing docids within each group
         CitingDocids = CitedItems.citingdocid;
         UniqCitingDocids = distinct CitingDocids;
         generate group, COUNT(UniqCitingDocids) as tc;
     };

I tried increasing mapred.child.java.opts, but then the job failed in a setup 
stage with:
  Error occurred during initialization of VM
  Could not reserve enough space for object heap
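
For reference, I was setting it at the top of the script, roughly like this 
(the heap size shown is illustrative, not necessarily the value I used):

 -- heap size below is illustrative
 set mapred.child.java.opts '-Xmx4096m';

My guess is that whatever -Xmx I requested was more than the task nodes could 
actually allocate, hence the "Could not reserve enough space for object heap" 
at VM startup.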

Are there Hadoop or Pig job configuration parameters I can set to get around 
this? Is there a Pig Latin circumlocution, or a better way to express what I 
want, that is not as memory-hungry?
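
For example, would something along these lines, which pulls the DISTINCT out 
of the nested foreach into its own top-level operation, be less memory-hungry? 
(The intermediate relation names below are made up for illustration.)

 -- dedup (citeddocid, citingdocid) pairs globally first,
 -- then group and count the already-unique pairs
 CitingPairs  = foreach CitedItems generate citeddocid, citingdocid;
 UniqPairs    = distinct CitingPairs;
 PairsByDocId = group UniqPairs by citeddocid;
 DedupTCPerDocId = foreach PairsByDocId generate group, COUNT(UniqPairs) as tc;

My (possibly wrong) understanding is that a top-level DISTINCT runs as its own 
MapReduce job and can spill to disk, while the nested DISTINCT has to hold 
each group's bag of citing docids in memory.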

Thanks in advance,

Will

William F Dowling
Sr Technical Specialist, Software Engineering

