Joseph, the stack trace suggests that it's Thrift that's timing out, not the Task.
Gabriele, I believe that your problem is caused by too much load on
Cassandra. get_range_slices is presently an expensive operation. I had some
success in reducing (although, it turns out, not eliminating) this problem by
requesting smaller batches from get_range_slices; see
ConfigHelper.setRangeBatchSize() and the rough sketch below.

joost
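To make that concrete, here is a minimal sketch of a job setup in the style of
the word_count contrib, assuming the 0.6 Hadoop integration. The keyspace,
column family, column name, and the batch value of 256 are placeholders to
adjust for your own schema and hardware:

import java.io.IOException;
import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RangeBatchSizeExample
{
    // Configures a job much like the word_count contrib does, but asks
    // get_range_slices for fewer rows per call so each Thrift request
    // has a better chance of finishing before the server-side rpc timeout.
    public static Job buildJob() throws IOException
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "selection-filter");

        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Placeholder keyspace / column family / column name.
        ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList("text".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        // Fetch fewer rows per get_range_slices call than the default.
        ConfigHelper.setRangeBatchSize(job.getConfiguration(), 256);

        return job;
    }
}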
On Fri, May 7, 2010 at 8:49 AM, Joseph Stein <crypt...@gmail.com> wrote:
> The problem could be that you are crunching more data than can be
> completed within the expiry interval setting.
>
> In Hadoop you need to tell the task tracker that you are still doing
> stuff, which is done by setting a status or incrementing a counter on
> the Reporter object.
>
> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>
> "In your Java code there is a little trick to help the job be “aware”
> within the cluster of tasks that are not dead but just working hard.
> During execution of a task there is no built-in reporting that the job
> is running as expected if it is not writing out. So this means that
> if your tasks are taking up a lot of time doing work, it is possible
> the cluster will see that task as failed (based on the
> mapred.task.tracker.expiry.interval setting).
>
> Have no fear, there is a way to tell the cluster that your task is doing
> just fine. You have two ways to do this: you can either report the status
> or increment a counter. Both of these will cause the task tracker to
> know that the task is OK, and this will in turn be seen by the jobtracker.
> Both of these options are explained in the JavaDoc:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html"
>
> Hope this helps
>
> On Fri, May 7, 2010 at 4:47 AM, gabriele renzi <rff....@gmail.com> wrote:
>> Hi everyone,
>>
>> I am trying to develop a mapreduce job that does a simple
>> selection+filter on the rows in our store.
>> Of course it is mostly based on the WordCount example :)
>>
>> Sadly, while the app seems to run fine on a test keyspace with little
>> data, when run on a larger test index (but still on a single node) I
>> reliably see this error in the logs:
>>
>> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.RuntimeException: TimedOutException()
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
>>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
>>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
>> Caused by: TimedOutException()
>>         at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
>>         at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>>         at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
>>         ... 11 more
>>
>> and after that the job seems to finish "normally", but no results are
>> produced.
>>
>> FWIW this is on 0.6.0 (we didn't move to 0.6.1 yet because, well, if
>> it ain't broke don't fix it).
>>
>> The single node has a data directory of about 127GB in two column
>> families, of which the one used in the mapred job is about 100GB.
>> The Cassandra server is run with 6GB of heap on a box with 8GB
>> available and no swap enabled. Read/write latencies from cfstats are:
>>
>> Read Latency: 0.8535837762577986 ms.
>> Write Latency: 0.028849603764075547 ms.
>>
>> The row cache is not enabled and the key cache percentage is the
>> default. Load on the machine is basically zero when the job is not
>> running.
>>
>> As my code is 99% that of the wordcount contrib, I should note that
>> in 0.6.1's contrib (and trunk) there is a RING_DELAY constant that we
>> can supposedly change, but it's apparently not used anywhere; as I
>> said, running on a single node this should not be an issue anyway.
>>
>> Does anyone have suggestions, or has anyone seen this error before? On
>> the other hand, have people run this kind of job in similar conditions
>> flawlessly, so that I can consider it just my problem?
>>
>> Thanks in advance for any help.
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
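For completeness, here is a minimal sketch of the status/counter keep-alive
that Joseph describes above, written against the new org.apache.hadoop.mapreduce
API that the word_count-derived job already uses (Context plays the role of the
old Reporter there). The key/value types, the counter names, and the 1000-row
interval are placeholders:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeepAliveMapper extends Mapper<Text, Text, Text, Text>
{
    private long rowsSeen = 0;

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException
    {
        // ... long-running per-row work goes here ...

        // Remind the tasktracker that we are still alive, either by
        // updating the status string or by bumping a counter; doing it
        // only every N rows keeps the overhead negligible.
        if (++rowsSeen % 1000 == 0)
        {
            context.setStatus("processed " + rowsSeen + " rows");
            context.getCounter("selection-filter", "rows-processed").increment(1000);
        }
    }
}

As Joost points out at the top of the thread, though, this only guards against
the tasktracker expiry; it will not prevent the Thrift-level TimedOutException
shown in the stack trace.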