A follow-up for anyone who may land on this conversation later: I kept trying, and neither changing the number of concurrent map tasks nor the slice size helped. Finally, I found a screw-up in our logging system, which had prevented us from noticing a couple of recurring errors in the logs:
ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328 DebuggableThreadPoolExecutor.java (line 101) Error in ThreadPoolExecutor
java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: corrupt sstable
        at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73)
        at org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907)
        at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000)
        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
        ... 4 more
Caused by: java.io.FileNotFoundException: /path/to/data/Keyspace/CF-123-Index.db (Too many open files)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
        at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:143)
        at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:138)
        at org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414)
        at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62)
        ... 7 more

and the related:

WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190) Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
        at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
        at org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
        at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
Caused by: java.net.SocketException: Too many open files
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
        at java.net.ServerSocket.accept(ServerSocket.java:421)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
        ... 5 more

The client was reporting timeouts in this case. The max fd limit on the process was in fact not especially high (1024), and raising it seems to have solved the problem (I've also appended a small fd-watching snippet at the bottom of this mail). Anyway, it still seems that there may be two issues:

- Since we had never seen this error before with normal (as in: non-Hadoop) client connections, is it possible that the Cassandra/Hadoop layer is not closing sockets properly between one connection and the next, or not reusing connections efficiently? For example, TSocket seems to have a close() method, but I don't see it used in ColumnFamilyInputFormat (getSubSplits, getRangeMap); it may well be hidden inside CassandraClient. Anyway, judging by lsof's output I can only see about a hundred TCP connections, and those from the Hadoop jobs always seem to stay below 60, so this may just be a wrong impression on my part. (A sketch of the open/close discipline I mean is appended below.)
- Is it possible that such server-side errors show up on the client as timeout errors when they could be reported more explicitly? That would probably help other people diagnose and report internal errors in the future. Thanks again to everyone for the help with this; I promise I'll put the discussion on the wiki for future reference :)
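As mentioned under the first point, here is roughly the open/close discipline I would expect around the per-split Thrift connections. This is only a sketch against the Thrift and Cassandra client APIs as I understand them, not the actual ColumnFamilyInputFormat code; SplitFetcher and fetchSplits are made-up names for illustration:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransportException;

public class SplitFetcher
{
    public static void fetchSplits(String host, int port) throws TTransportException
    {
        TSocket socket = new TSocket(host, port);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();
        try
        {
            // ... issue the describe_splits / get_range_slices calls via `client` here ...
        }
        finally
        {
            // always release the underlying fd, even when the call above throws;
            // without this, every split computation can leak one socket
            socket.close();
        }
    }
}

Even if CassandraClient does close the transport on the happy path, an exception between open() and close() would still leak the descriptor unless the close sits in a finally block.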
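And since the root cause here was fd exhaustion that our broken logging hid from us, a small diagnostic that would have caught it earlier: on a Sun JVM on Unix, the process's open-fd count is exposed through com.sun.management.UnixOperatingSystemMXBean, so a periodic log line makes the trend visible before the limit is hit. FdWatcher and logFdUsage are again just names I made up:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdWatcher
{
    public static void logFdUsage()
    {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // the Unix-specific bean is a Sun JDK extension, hence the instanceof check
        if (os instanceof UnixOperatingSystemMXBean)
        {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.err.println("open fds: " + unix.getOpenFileDescriptorCount()
                               + " / limit: " + unix.getMaxFileDescriptorCount());
        }
    }
}

Calling that from a timer (or just before opening each connection) would have made the slow climb toward the 1024 limit obvious in the logs.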