Looking over the code, this is in fact an issue in 0.6. It's fixed in trunk/0.7: connections will be reused and closed properly. See https://issues.apache.org/jira/browse/CASSANDRA-1017 for more details.
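For illustration, the close-on-finish pattern that the fix boils down to looks roughly like this (just a sketch against the Thrift API, not the actual CASSANDRA-1017 patch; the class and method names below are made up for the example):

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    // Sketch only: open one TSocket per request and always close it in a
    // finally block, so a finished or failed call can't leak a file descriptor.
    public final class SplitQueryExample
    {
        public static void queryNode(String host, int port) throws Exception
        {
            TTransport transport = new TSocket(host, port); // one fd per open socket
            try
            {
                transport.open();
                Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
                // ... issue the Thrift calls for this split here ...
            }
            finally
            {
                if (transport.isOpen())
                    transport.close(); // releases the fd even when the call throws
            }
        }
    }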
We can either backport that patch or at least make 0.6 close the connections properly. Can you open a ticket for this bug?

/Johan

On 12 May 2010, at 12.11, gabriele renzi wrote:

> A follow-up for anyone who may end up on this conversation again:
>
> I kept trying, and neither changing the number of concurrent map tasks
> nor the slice size helped.
> Finally, I found a screw-up in our logging system which had prevented
> us from noticing a couple of recurring errors in the logs:
>
> ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328 DebuggableThreadPoolExecutor.java (line 101) Error in ThreadPoolExecutor
> java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
>     at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53)
>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.RuntimeException: corrupt sstable
>     at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73)
>     at org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907)
>     at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000)
>     at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
>     ... 4 more
> Caused by: java.io.FileNotFoundException: /path/to/data/Keyspace/CF-123-Index.db (Too many open files)
>     at java.io.RandomAccessFile.open(Native Method)
>     at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
>     at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
>     at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:143)
>     at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:138)
>     at org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414)
>     at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62)
>     ... 7 more
>
> and the related
>
> WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190) Transport error occurred during acceptance of message.
> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>     at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
>     at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
>     at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>     at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
>     at org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
>     at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
> Caused by: java.net.SocketException: Too many open files
>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>     at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>     at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>     at java.net.ServerSocket.accept(ServerSocket.java:421)
>     at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
>     ... 5 more
>
> The client was reporting timeouts in this case.
>
> The max fd limit on the process was in fact not exceedingly high
> (1024), and raising it seems to have solved the problem.
>
> Anyway, it still seems that there may be two issues:
>
> - Since we had never seen this error before with normal client
>   connections (as in: non-Hadoop), is it possible that the
>   Cassandra/Hadoop layer is not closing sockets properly between one
>   connection and the next, or not reusing connections efficiently?
>   E.g. TSocket seems to have a close() method, but I don't see it used
>   in ColumnFamilyInputFormat (getSubSplits, getRangeMap); it may well
>   be inside CassandraClient, though.
>
>   Anyway, judging by lsof's output I can only see about a hundred TCP
>   connections, but those from the Hadoop jobs seem to always stay
>   below 60, so this may just be my wrong impression.
>
> - Is it possible that such errors show up on the client side as
>   timeout errors when they could be reported better? This would
>   probably help other people diagnose and report internal errors in
>   the future.
>
> Thanks again to everyone for the help with this; I promise I'll put
> the discussion on the wiki for future reference :)
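On the fd limit point above: besides lsof, you can ask the JVM itself how many descriptors it has open. A minimal sketch (Sun/Oracle JDK only, since it relies on com.sun.management; FdUsage is just a hypothetical helper, not something that ships with Cassandra):

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    // Hypothetical helper: print how close the process is to its fd limit,
    // so "Too many open files" can be spotted before it surfaces as
    // client-side timeouts.
    public final class FdUsage
    {
        public static void main(String[] args)
        {
            Object os = ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean)
            {
                UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
                System.out.printf("open fds: %d / max %d%n",
                                  unix.getOpenFileDescriptorCount(),
                                  unix.getMaxFileDescriptorCount());
            }
        }
    }

Run inside the server JVM (or polled over JMX), something like this would have shown the count creeping toward the 1024 limit well before the corrupt-sstable errors started appearing.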