I did not set the consistency level because I didn't find that option in the ConfigHelper class. I assume it defaults to consistency level ONE.
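[Editor's note: in the 0.8.x line the Hadoop ColumnFamilyRecordReader does appear to read at ConsistencyLevel.ONE with no ConfigHelper knob; later Cassandra releases added an explicit setter. A hedged fragment, assuming a version where that setter exists (it is not in 0.8.2):

    // Assumption: ConfigHelper.setReadConsistencyLevel(Configuration, String)
    // is available in your Cassandra version; check the ConfigHelper javadoc.
    ConfigHelper.setReadConsistencyLevel(job.getConfiguration(), "ONE");

Verify against your version before relying on it.]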
Actually, I only tweaked the word count example a bit. Here is the code snippet:

    getConf().set(CONF_COLUMN_NAME, columnName);

    Job job = new Job(getConf(), KEYSPACE);
    job.setJarByClass(WorkIdFinder.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(ReducerToFilesystem.class);
    job.setReducerClass(ReducerToFilesystem.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_PREFIX + columnFamily));

    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setRpcPort(job.getConfiguration(), "9260");
    ConfigHelper.setInitialAddress(job.getConfiguration(), "dnjsrcha01");
    ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, columnFamily);
    ConfigHelper.setRangeBatchSize(job.getConfiguration(), batchSize);

    SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes(columnName)));
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

Yes, I have one task tracker running on each Cassandra node.

Thanks,

John

On Thu, Jul 28, 2011 at 3:51 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:

> Just wondering - what consistency level are you using for hadoop reads?
> Also, do you have task trackers running on the cassandra nodes so that
> reads will be local?
>
> On Jul 28, 2011, at 2:46 PM, Jian Fang wrote:
>
> > I changed the rpc_timeout_in_ms to 30000 and 40000, then changed the
> > cassandra.range.batch.size from 4096 to 1024, but still 40% of the
> > tasks got timeout exceptions.
> >
> > Not sure if this is caused by Cassandra read performance (8G heap size
> > for about 100G of data) or by the way the Cassandra-Hadoop integration
> > is implemented. I rarely saw any timeout exceptions when I used Hector
> > to get back data.
> >
> > Thanks,
> >
> > John
> >
> > On Thu, Jul 28, 2011 at 12:45 PM, Jian Fang <jian.fang.subscr...@gmail.com> wrote:
> >
> > My current setting is 10000. I will try 30000.
> >
> > Thanks,
> >
> > John
> >
> > On Thu, Jul 28, 2011 at 12:39 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> >
> > See http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting - I
> > would probably start with setting your rpc_timeout_in_ms to something
> > like 30000.
> >
> > On Jul 28, 2011, at 11:09 AM, Jian Fang wrote:
> >
> > > Hi,
> > >
> > > I run Cassandra 0.8.2 and Hadoop 0.20.2 on three nodes; each node
> > > includes a Cassandra instance and a Hadoop data node.
> > > I created a simple Hadoop job to scan a Cassandra column value in a
> > > column family and write it to a file system if it meets some conditions.
> > > I keep getting the following timeout exceptions. Is this caused by my
> > > settings in Cassandra? Or how could I change the timeout value on the
> > > Cassandra Hadoop API to get around this problem?
> > >
> > > 11/07/28 12:02:47 INFO mapred.JobClient: Task Id : attempt_201107281151_0001_m_000052_0, Status : FAILED
> > > java.lang.RuntimeException: TimedOutException()
> > >         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:265)
> > >         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:279)
> > >         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:177)
> > >         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
> > >         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
> > >         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:136)
> > >         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
> > >         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> > >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> > >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > >         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > Caused by: TimedOutException()
> > >         at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12590)
> > >         at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:762)
> > >         at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:734)
> > >         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:243)
> > >         ... 11 more
> > >
> > > Thanks in advance,
> > >
> > > John
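[Editor's note for readers hitting the same TimedOutException: the two knobs discussed in this thread live in different places. cassandra.range.batch.size is a client-side Hadoop job property, set via ConfigHelper.setRangeBatchSize as in the snippet at the top of the thread, while rpc_timeout_in_ms is a server-side setting in cassandra.yaml on every node. A minimal fragment matching the values tried in the thread, assuming 0.8.x-era configuration names:

    # cassandra.yaml (each node, restart required): how long the coordinator
    # waits on replicas before surfacing TimedOutException to clients such
    # as the Hadoop ColumnFamilyRecordReader
    rpc_timeout_in_ms: 30000

Raising the server timeout treats the symptom; shrinking the range batch size reduces how much work each get_range_slices call does, so each call is more likely to finish within the timeout.]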