Thank you Jeremy, I've already changed the max.*.failures settings to 20; that helps the jobs finish, but it doesn't address the source of the timeouts. I'll try the other tips.
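For reference, this is roughly what that tolerance change looks like when set in the job itself rather than in mapred-site.xml. It is only a minimal sketch using the old-API JobConf setters; the value 20 is simply what I used, and the raw property names in the comments are the usual 0.20-era ones, so adjust for your Hadoop version.

    import org.apache.hadoop.mapred.JobConf;

    public class FailureTolerance {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Allow each map/reduce task to be retried more times before the
            // whole job fails (the default is 4 attempts).
            // Raw properties: mapred.map.max.attempts, mapred.reduce.max.attempts
            conf.setMaxMapAttempts(20);
            conf.setMaxReduceAttempts(20);

            // Tolerate more task failures on a single tasktracker before that
            // tracker is blacklisted for the job (default is 4).
            // Raw property: mapred.max.tracker.failures
            conf.setMaxTaskFailuresPerTracker(20);

            System.out.println("mapred.map.max.attempts = "
                    + conf.get("mapred.map.max.attempts"));
        }
    }

The same values can of course go into mapred-site.xml or be passed with -D on the command line.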
Regards,
Patrik

On Wed, Dec 7, 2011 at 17:29, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> If you're getting lots of timeout exceptions with mapreduce, you might take a
> look at http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting
> We saw that and tweaked a variety of things - all of which are listed there.
> Ultimately, we also boosted Hadoop's tolerance for them so that it could retry
> more, and that worked just fine. A coworker had the same experience running
> Hadoop over Elasticsearch - he also had to raise that tolerance. An example
> configuration for modifying it is shown in the link above.
>
> Hopefully that will help for your mapreduce jobs at least. We've had good
> luck with MR/Pig over Cassandra, but only after some lessons learned about
> configuring both Cassandra and Hadoop.
>
> On Dec 6, 2011, at 3:50 AM, Patrik Modesto wrote:
>
>> Hi,
>>
>> I'm quite desperate about Cassandra's performance in our production
>> cluster. We have 8 physical nodes, each with a 32-core CPU, 32 GB of memory
>> and 4 disks in RAID 10, running Cassandra 0.8.8 with RF=3 and Hadoop.
>> We have four keyspaces; the large one has 2 CFs, one serving as a kind of
>> index, the other holding the data. There are about 7 million rows with a
>> mean row size of 7 kB. We run several mapreduce jobs; most of them just
>> read from Cassandra and write to HDFS, but one fetches rows from Cassandra,
>> computes something and writes it back: for each row we compute three new
>> JSON values, about 1 kB each (they get overwritten the next round).
>>
>> We got lots and lots of Timeout exceptions, LiveSSTablesCount is over 100,
>> repair doesn't finish even in 24 hours, and reads from the other keyspaces
>> time out as well. We set compaction_throughput_mb_per_sec: 0 but it didn't
>> help.
>>
>> Did we choose the wrong DB for our use case?
>>
>> Regards,
>> Patrik
>>
>> This is from one node:
>>
>> INFO 10:28:40,035 Pool Name                    Active   Pending   Blocked
>> INFO 10:28:40,036 ReadStage                        96       695         0
>> INFO 10:28:40,037 RequestResponseStage              0         0         0
>> INFO 10:28:40,037 ReadRepairStage                   0         0         0
>> INFO 10:28:40,037 MutationStage                     1         1         0
>> INFO 10:28:40,038 ReplicateOnWriteStage             0         0         0
>> INFO 10:28:40,038 GossipStage                       0         0         0
>> INFO 10:28:40,038 AntiEntropyStage                  0         0         0
>> INFO 10:28:40,039 MigrationStage                    0         0         0
>> INFO 10:28:40,039 StreamStage                       0         0         0
>> INFO 10:28:40,040 MemtablePostFlusher               0         0         0
>> INFO 10:28:40,040 FlushWriter                       0         0         0
>> INFO 10:28:40,040 MiscStage                         0         0         0
>> INFO 10:28:40,041 FlushSorter                       0         0         0
>> INFO 10:28:40,041 InternalResponseStage             0         0         0
>> INFO 10:28:40,041 HintedHandoff                     1         5         0
>> INFO 10:28:40,042 CompactionManager               n/a        27
>> INFO 10:28:40,042 MessagingService                n/a   0,16559
>>
>> And here is the nodetool ring output:
>>
>> 10.2.54.91  NG  RAC1  Up  Normal  118.04 GB  12.50%  0
>> 10.2.54.92  NG  RAC1  Up  Normal  102.74 GB  12.50%  21267647932558653966460912964485513216
>> 10.2.54.93  NG  RAC1  Up  Normal   76.95 GB  12.50%  42535295865117307932921825928971026432
>> 10.2.54.94  NG  RAC1  Up  Normal   56.97 GB  12.50%  63802943797675961899382738893456539648
>> 10.2.54.95  NG  RAC1  Up  Normal   75.55 GB  12.50%  85070591730234615865843651857942052864
>> 10.2.54.96  NG  RAC1  Up  Normal  102.57 GB  12.50%  106338239662793269832304564822427566080
>> 10.2.54.97  NG  RAC1  Up  Normal   68.03 GB  12.50%  127605887595351923798765477786913079296
>> 10.2.54.98  NG  RAC1  Up  Normal  194.6 GB   12.50%  148873535527910577765226390751398592512
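PS: For the archives, the input-side tweaks from the troubleshooting page that I plan to try next would look roughly like this in our job setup. This is only a sketch: the property names (cassandra.range.batch.size, cassandra.input.split.size) and the values are my reading of the 0.8.x org.apache.cassandra.hadoop ConfigHelper defaults and should be verified against your own version; the server-side counterpart is rpc_timeout_in_ms in cassandra.yaml.

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class InputTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Fewer rows per get_range_slices call, so each Thrift request does
            // less work and is less likely to exceed rpc_timeout_in_ms.
            // I believe the default is 4096; 1024 is only an illustrative value.
            conf.setInt("cassandra.range.batch.size", 1024);

            // Smaller input splits give more, smaller map tasks, so a retried
            // task repeats less work. 16384 is again only illustrative, chosen
            // to be below the usual default.
            conf.setInt("cassandra.input.split.size", 16384);

            Job job = new Job(conf, "cassandra-mr-with-smaller-batches");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            // ... keyspace/CF, slice predicate and mapper/reducer setup as before ...
        }
    }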