The whole point is to parallelize so you can use the available capacity across multiple machines. Past that point (easy to exceed when you have a single machine), you're just contending for resources, not making things faster.
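Concretely, on a classic TaskTracker-based Hadoop deployment (the 0.20-era generation in use at the time of this thread), the number of map tasks running simultaneously on each node can be capped in mapred-site.xml. The property name below is the real one for that Hadoop generation; the value of 2 is only an illustrative starting point, not a recommendation:

```xml
<!-- mapred-site.xml: cap concurrent map tasks per TaskTracker so a
     single-node job doesn't overwhelm the co-located Cassandra instance.
     The value 2 is illustrative; tune it to your hardware. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
```

With local-mode runs (LocalJobRunner, as in the stack trace below) tasks execute serially anyway, so this setting matters once the job moves to a real cluster.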
On Fri, May 7, 2010 at 7:48 AM, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> Huh? Isn't that the whole point of using Map/Reduce?
>
> On Fri, May 7, 2010 at 8:44 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> Sounds like you need to configure Hadoop not to create a whole bunch
>> of Map tasks at once.
>>
>> On Fri, May 7, 2010 at 3:47 AM, gabriele renzi <rff....@gmail.com> wrote:
>>> Hi everyone,
>>>
>>> I am trying to develop a mapreduce job that does a simple
>>> selection+filter on the rows in our store.
>>> Of course it is mostly based on the WordCount example :)
>>>
>>> Sadly, while the app seems to run fine on a test keyspace with little
>>> data, when run against a larger test index (still on a single node) I
>>> reliably see this error in the logs:
>>>
>>> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
>>> java.lang.RuntimeException: TimedOutException()
>>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
>>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
>>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
>>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
>>>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>>     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
>>> Caused by: TimedOutException()
>>>     at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
>>>     at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>>>     at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
>>>     ... 11 more
>>>
>>> After that the job seems to finish "normally", but no results are
>>> produced.
>>>
>>> FWIW this is on 0.6.0 (we haven't moved to 0.6.1 yet because, well, if
>>> it ain't broke don't fix it).
>>>
>>> The single node has a data directory of about 127GB in two column
>>> families, of which the one used in the mapred job is about 100GB.
>>> The cassandra server runs with 6GB of heap on a box with 8GB
>>> available and no swap enabled. Read/write latencies from cfstats are:
>>>
>>> Read Latency: 0.8535837762577986 ms.
>>> Write Latency: 0.028849603764075547 ms.
>>>
>>> The row cache is not enabled and the key cache percentage is at its
>>> default. Load on the machine is basically zero when the job is not
>>> running.
>>>
>>> Since my code is 99% that of the wordcount contrib, I should note that
>>> 0.6.1's contrib (and trunk) has a RING_DELAY constant that we can
>>> supposedly change, but it's apparently not used anywhere; in any case,
>>> running on a single node this should not be an issue anyway.
>>>
>>> Does anyone have suggestions, or has anyone seen this error before?
>>> Conversely, have people run this kind of job in similar conditions
>>> flawlessly, so that I can consider it just my problem?
>>>
>>> Thanks in advance for any help.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
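On the timeout itself: the TimedOutException in the trace is raised when the get_range_slices call issued by ColumnFamilyRecordReader exceeds the server's RPC timeout. As a stopgap (it does not fix resource contention, only tolerates slower scans), the timeout can be raised in Cassandra 0.6's storage-conf.xml. The element name below matches the 0.6-era sample config; the value of 30 seconds is purely illustrative, so verify both against your own file:

```xml
<!-- storage-conf.xml (Cassandra 0.6.x): allow wide get_range_slices
     scans more time before the coordinator gives up with
     TimedOutException. 30000 ms is an illustrative value, not a
     recommendation; raising it hides slowness rather than curing it. -->
<RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
```

If timeouts persist after this, reducing the number of rows fetched per Thrift call in the Hadoop input split (and upgrading off 0.6.0, which had several early Hadoop-integration fixes land in later 0.6.x releases) would be the next things to try.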