Found the culprit.  Pig 0.8 has a new feature that combines input splits to 
reduce their number and speed up the whole job.  Since 
ColumnFamilyInputFormat reports the input size as zero, this feature 
combines all of the splits into a single split.

The workaround is to disable this feature for jobs that use CassandraStorage by 
setting -Dpig.splitCombination=false in the pig_cassandra script.
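
If you launch Pig through PigServer instead of the wrapper script, the same 
property can be set programmatically.  A minimal sketch (the class name and 
the keyspace/column family names here are just placeholders):

    import java.util.Properties;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class NoSplitCombination {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Same property the -D flag sets; "false" keeps Pig 0.8 from
            // combining the zero-length Cassandra splits into one.
            props.setProperty("pig.splitCombination", "false");

            PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
            // Placeholder keyspace/column family.
            pig.registerQuery("rows = LOAD 'cassandra://Keyspace1/Standard1' "
                    + "USING CassandraStorage();");
            // ... rest of the script ...
        }
    }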

Hope somebody finds this useful; you wouldn't believe how many dead ends I 
ran down trying to figure this out.

-Matt 
On Feb 2, 2011, at 4:34 PM, Matthew E. Kennedy wrote:

> 
> I noticed in the jobtracker log that when the Pig job kicks off, I get the 
> following info message:
> 
> 2011-02-02 09:13:07,269 INFO org.apache.hadoop.mapred.JobInProgress: Input 
> size for job job_201101241634_0193 = 0. Number of splits = 1
> 
> So I looked at the job.split file that is created for the Pig job and 
> compared it to the job.split file created for the map-reduce job.  The 
> map-reduce job's file contains an entry for each split, whereas the Pig 
> job's job.split file contains just the one split.
> 
> I added some code to ColumnFamilyInputFormat to log the input splits it 
> computes for the Pig job, and the call to getSplits() appears to return 
> the correct list of splits.  What I can't figure out is where things go 
> wrong between that call and the splits being written to the job.split 
> file.
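> 
> For reference, here is a minimal sketch of that kind of instrumentation, 
> written as a wrapper subclass instead of an in-place edit (the class name 
> and output format are just illustrative):
> 
>     import java.io.IOException;
>     import java.util.List;
> 
>     import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hadoop.mapreduce.JobContext;
> 
>     // Logs whatever getSplits() returns so it can be compared against
>     // the splits that actually end up in the job.split file.
>     public class LoggingCFInputFormat extends ColumnFamilyInputFormat {
>         @Override
>         public List<InputSplit> getSplits(JobContext context)
>                 throws IOException {
>             List<InputSplit> splits = super.getSplits(context);
>             System.err.println("getSplits() returned " + splits.size()
>                     + " split(s)");
>             for (InputSplit split : splits)
>                 System.err.println("  " + split);
>             return splits;
>         }
>     }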
> 
> Does anybody know the specific class responsible for creating that file in 
> a Pig job, and why it might be affected by using the Pig CassandraStorage 
> module?
> 
> Is anyone else successfully running Pig jobs against a 0.7 cluster?
> 
> Thanks,
> Matt
