I'm running Cassandra 0.7 and I'm trying to get Pig integration to work correctly. I'm using Pig 0.8 running against Hadoop 20.2, I've also tried this running against CDH2.
I can log into the grunt shell, and execute scripts, but when they run, they don't read all of the data from Cassandra. The job only results in one mapper being created, and that only reads a small fraction of the data on a node. I don't see any obvious error messages anywhere, so I'm not sure how to pinpoint the problem. To confirm that I had the cluster set up correctly, I wrote a simple map reduce job in Java that seems to use the ColumnFamily input format correctly and appears to distribute the job correctly across all the nodes in the cluster. I had a small number of killed jobs at the end of the process though, and I'm not sure whether that is a symptom if something. It looked like the Map phase would have been much faster if those jobs weren't waiting to be killed. But the output was correct, I compared it to a job that operated on the source data that I used to populate the cluster and the output was identical. In case its interesting, this data is 134 million records, the Cassandra Map Reduce Job ran in 14 minutes and the same calculation running on the raw data in HDFS took three minutes. I suspected at first that I was not correctly connecting the grunt shell to the cluster, but when I start grunt it correctly indicates the correct URLs for HDFS and the job tracker. When the job appears in the job tracker web UI, it is only executing one map. What's really interesting, is that Pig reports that it read 65k input records. When I multiply 65k, by the number of maps spawned by the Java Map Reduce job that actually works, I get 134 million, which is the number of records I'm reading. So it looks like the input split size is being calculated correctly, but only one of the maps gets executed. That has me kind of stumped. Here is the grunt session with line numbers prepended: 1 cassandra@rdcl000:~/benchmark/cassandra-0.7.0/contrib/pig$ bin/pig_cassandra 2 2011-02-01 12:47:02,353 [main] INFO org.apache.pig.Main - Logging error messages to: /home/cassandra/benchmark/cassandra-0.7.0/contrib/pig/pig_1296582422349.log 3 2011-02-01 12:47:02,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://rdcl000:9000 4 2011-02-01 12:47:02,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: rdcl000:9001 5 grunt> register /home/hadoop/local/pig/pig-0.8.0-core.jar; register /home/cassandra/benchmark/cassandra-0.7.0/lib/libthrift-0.5.jar; 6 grunt> rows = LOAD 'cassandra://rdclks/mycftest' USING CassandraStorage(); 7 grunt> countthis = GROUP rows ALL; 8 grunt> countedrows = FOREACH countthis GENERATE COUNT(rows.$0); 9 grunt> dump countedrows; 10 2011-02-01 12:47:31,219 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY 11 2011-02-01 12:47:31,219 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used. 12 2011-02-01 12:47:31,397 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: countedrows: Store(hdfs://rdcl000:9000/tmp/temp-1188844399/tmp1986503871:org.apache.pig.im pl.io.InterStorage) - scope-10 Operator Key: scope-10) 13 2011-02-01 12:47:31,408 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 14 2011-02-01 12:47:31,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner 15 2011-02-01 12:47:31,447 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 16 2011-02-01 12:47:31,447 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 17 2011-02-01 12:47:31,478 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job 18 2011-02-01 12:47:31,491 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 19 2011-02-01 12:47:35,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 20 2011-02-01 12:47:35,478 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission. 21 2011-02-01 12:47:35,980 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 22 2011-02-01 12:47:35,995 [Thread-13] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1 23 2011-02-01 12:47:36,750 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201101241634_0183 24 2011-02-01 12:47:36,750 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://rdcl000:50030/jobdetails.jsp?jobid=job_201101241634_0 183 25 2011-02-01 12:47:57,793 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 26 2011-02-01 12:48:16,346 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 27 2011-02-01 12:48:16,347 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics: 28 29 HadoopVersion PigVersion UserId StartedAt FinishedAt Features 30 0.20.2 0.8.0 cassandra 2011-02-01 12:47:31 2011-02-01 12:48:16 GROUP_BY 31 32 Success! 33 34 Job Stats (time in seconds): 35 JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs 36 job_201101241634_0183 1 1 18 18 18 12 12 12 countedrows,countthis,rows GROUP_BY,COMBINER hdfs://rdcl000:9000/tmp/temp-1188844399/tmp1986503871, 37 38 Input(s): 39 Successfully read 64985 records from: "cassandra://rdcl/famstest" 40 41 Output(s): 42 Successfully stored 1 records (14 bytes) in: "hdfs://rdcl000:9000/tmp/temp-1188844399/tmp1986503871" 43 44 Counters: 45 Total records written : 1 46 Total bytes written : 14 47 Spillable Memory Manager spill count : 0 48 Total bags proactively spilled: 0 49 Total records proactively spilled: 0 50 51 Job DAG: 52 job_201101241634_0183 53 54 55 2011-02-01 12:48:16,352 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 56 2011-02-01 12:48:16,374 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 57 2011-02-01 12:48:16,374 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 58 (64985) Any help is really appreciated. -Matt Kennedy