If there is an OOM, it will be in the logs.

On Aug 5, 2014 8:17 PM, "Clint Kelly" <clint.ke...@gmail.com> wrote:
> Hi everyone,
>
> For some integration tests, we start up a CassandraDaemon in a
> separate process (using the Java 7 ProcessBuilder API). All of my
> integration tests run beautifully on my laptop, but one of them fails
> on our Jenkins cluster.
>
> The failing integration test does around 10k writes to different rows
> and then 10k reads. After running some number of reads, the job dies
> with this error:
>
> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.10:58209 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
>
> This error appears to have occurred because the Cassandra process has
> stopped. The logs for the Cassandra process show some warnings during
> batch writes (the batches are too big), no activity for a few minutes
> (I assume this is because all of the read operations were proceeding
> smoothly), and then look like the following:
>
> INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
> INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
> INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
> INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
> INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread
>
> Does anyone have any ideas about how to debug this? Looking around on
> google I found some threads suggesting that this could occur from an
> OOM error (
> http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors
> ).
> Wouldn't such an error be logged, however?
>
> The test that fails is a test of our MapReduce Hadoop InputFormat and
> as such it does some pretty big queries across multiple rows (over a
> range of partitioning key tokens). The default fetch size I believe
> is 5000 rows, and the values in the rows I am fetching are just simple
> strings, so I would not think the amount of data in a single read
> would be too big.
>
> FWIW I don't see any log messages about garbage collection for at
> least 3min before the process shuts down (and no GC messages after the
> test stops doing writes and starts doing reads).
>
> I'd greatly appreciate any help before my team kills me for breaking
> our Jenkins build so consistently! :)
>
> Best regards,
> Clint
>
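
One way to narrow this down, as a rough sketch only: have the test harness capture the child JVM's stdout/stderr and its exit status, since anything the JVM itself prints, or an external kill, may never reach Cassandra's own log file. The class name `ChildProcessWatcher`, both helper methods, and the `java -version` stand-in command are hypothetical; the real CassandraDaemon command line is not shown in this thread. The exit-code hints assume a Unix-like host, where a process ended by a signal is reported as 128 plus the signal number.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cassandra or the DataStax driver.
class ChildProcessWatcher {

    /** Turns an exit status into a hint, assuming the Unix 128 + signal convention. */
    static String describeExit(int code) {
        if (code == 0) {
            return "clean exit";
        }
        if (code == 137) {
            return "killed by SIGKILL (128 + 9); on Linux, check the kernel log for the OOM killer";
        }
        if (code == 143) {
            return "terminated by SIGTERM (128 + 15); JVM shutdown hooks would have run";
        }
        if (code > 128) {
            return "terminated by signal " + (code - 128);
        }
        return "exited with status " + code;
    }

    /** Starts the command, appends its stdout and stderr to logFile, and waits for it to exit. */
    static int runAndWait(List<String> command, File logFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));
        Process process = pb.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> command = new ArrayList<String>();
        command.add(System.getProperty("java.home") + File.separator + "bin" + File.separator + "java");
        // HotSpot flag: write a heap dump if the child JVM throws OutOfMemoryError.
        command.add("-XX:+HeapDumpOnOutOfMemoryError");
        // Stand-in for the real daemon's classpath and main class (assumption).
        command.add("-version");

        File logFile = File.createTempFile("child-jvm", ".log");
        int code = runAndWait(command, logFile);
        System.out.println("Child output captured in " + logFile.getAbsolutePath());
        System.out.println("Exit status " + code + ": " + describeExit(code));
    }
}
```

A status of 143 would be consistent with the StorageServiceShutdownHook lines quoted above, because a SIGTERM lets shutdown hooks run, whereas 137 would point at a hard kill that leaves no trace in the application log. Neither reading is confirmed by anything in this thread.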