Pig not reading all cassandra data

Matthew E. Kennedy Tue, 01 Feb 2011 11:07:31 -0800

I'm running Cassandra 0.7 and I'm trying to get Pig integration to work
correctly.  I'm using Pig 0.8 running against Hadoop 0.20.2; I've also tried
this running against CDH2.


I can log into the grunt shell and execute scripts, but when they run, they
don't read all of the data from Cassandra.  The job creates only one mapper,
and that mapper reads only a small fraction of the data on a node.  I don't see
any obvious error messages anywhere, so I'm not sure how to pinpoint the
problem.

To confirm that I had the cluster set up correctly, I wrote a simple map reduce
job in Java that seems to use the ColumnFamily input format correctly and
appears to distribute the job correctly across all the nodes in the cluster.  I
had a small number of killed jobs at the end of the process though, and I'm not
sure whether that is a symptom of something.  It looked like the Map phase
would have been much faster if those jobs weren't waiting to be killed.  But
the output was correct: I compared it to a job that operated on the source data
that I used to populate the cluster, and the output was identical.  In case it's
interesting, this data is 134 million records; the Cassandra Map Reduce job ran
in 14 minutes, and the same calculation running on the raw data in HDFS took
three minutes.

I suspected at first that I was not correctly connecting the grunt shell to the
cluster, but when I start grunt it shows the correct URLs for HDFS and the job
tracker.

When the job appears in the job tracker web UI, it is only executing one map.

What's really interesting is that Pig reports that it read 65k input records.
When I multiply 65k by the number of maps spawned by the Java Map Reduce job
that actually works, I get 134 million, which is the number of records I'm
reading.  So it looks like the input split size is being calculated correctly,
but only one of the maps gets executed.  That has me kind of stumped.
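
The arithmetic in the paragraph above can be checked with a minimal Python
sketch.  It uses only the two figures quoted in this message (roughly 134
million records in total, and the 64985 records Pig reports reading); the
function and constant names are hypothetical, and nothing here talks to
Cassandra or Hadoop.  Because the 134 million figure is approximate, the
implied map count is approximate too.

```python
# Figures quoted in this message; both are taken as given, not measured here.
TOTAL_RECORDS = 134_000_000   # approximate total ("134 million records")
RECORDS_PER_MAP = 64_985      # records Pig reports reading in its single map


def implied_map_count(total_records: int, records_per_map: int) -> int:
    """Maps needed to cover total_records if each reads records_per_map rows."""
    # Ceiling division: a final, partially filled split still needs its own map.
    return -(-total_records // records_per_map)


def fraction_read(total_records: int, records_read: int) -> float:
    """Share of the data set covered by the records actually read."""
    return records_read / total_records


if __name__ == "__main__":
    maps = implied_map_count(TOTAL_RECORDS, RECORDS_PER_MAP)
    share = fraction_read(TOTAL_RECORDS, RECORDS_PER_MAP)
    print(f"implied number of maps: {maps}")
    print(f"share of data read by the single Pig map: {share:.4%}")
```

If the map count shown for the working Java job in the job tracker is close to
the implied number, that supports the reading that one split's worth of rows
is being read and the remaining splits are never scheduled.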

Here is the grunt session with line numbers prepended:

  1 cassandra@rdcl000:~/benchmark/cassandra-0.7.0/contrib/pig$ bin/pig_cassandra
  2 2011-02-01 12:47:02,353 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/cassandra/benchmark/cassandra-0.7.0/contrib/pig/pig_1296582422349.log
  3 2011-02-01 12:47:02,538 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://rdcl000:9000
  4 2011-02-01 12:47:02,644 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: rdcl000:9001
  5 grunt> register /home/hadoop/local/pig/pig-0.8.0-core.jar; register /home/cassandra/benchmark/cassandra-0.7.0/lib/libthrift-0.5.jar;
  6 grunt> rows = LOAD 'cassandra://rdclks/mycftest' USING CassandraStorage();
  7 grunt> countthis = GROUP rows ALL;
  8 grunt> countedrows = FOREACH countthis GENERATE COUNT(rows.$0);
  9 grunt> dump countedrows;
 10 2011-02-01 12:47:31,219 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY
 11 2011-02-01 12:47:31,219 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
 12 2011-02-01 12:47:31,397 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: countedrows: Store(hdfs://rdcl000:9000/tmp/temp-1188844399/tmp1986503871:org.apache.pig.impl.io.InterStorage) - scope-10 Operator Key: scope-10)
 13 2011-02-01 12:47:31,408 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
 14 2011-02-01 12:47:31,419 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
 15 2011-02-01 12:47:31,447 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
 16 2011-02-01 12:47:31,447 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
 17 2011-02-01 12:47:31,478 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
 18 2011-02-01 12:47:31,491 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
 19 2011-02-01 12:47:35,418 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
 20 2011-02-01 12:47:35,478 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
 21 2011-02-01 12:47:35,980 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
 22 2011-02-01 12:47:35,995 [Thread-13] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
 23 2011-02-01 12:47:36,750 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201101241634_0183
 24 2011-02-01 12:47:36,750 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://rdcl000:50030/jobdetails.jsp?jobid=job_201101241634_0183
 25 2011-02-01 12:47:57,793 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
 26 2011-02-01 12:48:16,346 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
 27 2011-02-01 12:48:16,347 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Script Statistics:
 28
 29 HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
 30 0.20.2  0.8.0   cassandra       2011-02-01 12:47:31     2011-02-01 12:48:16     GROUP_BY
 31
 32 Success!
 33
 34 Job Stats (time in seconds):
 35 JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature Outputs
 36 job_201101241634_0183   1       1       18      18      18      12      12      12      countedrows,countthis,rows      GROUP_BY,COMBINER       hdfs://rdcl000:9000/tmp/temp-1188844399/tmp1986503871,
 37
 38 Input(s):
 39 Successfully read 64985 records from: "cassandra://rdcl/famstest"
 40
 41 Output(s):
 42 Successfully stored 1 records (14 bytes) in: "hdfs://rdcl000:9000/tmp/temp-1188844399/tmp1986503871"
 43
 44 Counters:
 45 Total records written : 1
 46 Total bytes written : 14
 47 Spillable Memory Manager spill count : 0
 48 Total bags proactively spilled: 0
 49 Total records proactively spilled: 0
 50
 51 Job DAG:
 52 job_201101241634_0183
 53
 54
 55 2011-02-01 12:48:16,352 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
 56 2011-02-01 12:48:16,374 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
 57 2011-02-01 12:48:16,374 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
 58 (64985)

Any help is really appreciated.

-Matt Kennedy
