Hi all, TL;DR - I think my unit tests are sometimes failing because of read timeouts to an EmbeddedCassandraService when dropping a table triggers a compaction on a highly-loaded build slave. Does this sound reasonable? What options should I change in my Cluster.Builder (or elsewhere) to prevent this from happening?
Longer version of the question: We use the EmbeddedCassandraService for unit testing and we're seeing non-deterministic failures on some machines. The sequence of events that cause the failures look something like this: - We have a single EmbeddedCassandraService that runs for all of our unit tests - In every test class, we create a new keyspace, then create a bunch of tables within that keyspace and run our tests - When a given test class is finished, we execute some tear-down code that deletes the tables in the keyspace and then drops the keyspace itself - All of the unit tests share a single Session object Our tests always fail when we are executing the tear-down code. We always get an error that looks like: <error message="All host(s) tried for query failed (tried: localhost/ 127.0.0.1:57905 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))" type="com.datastax.driver.core.exceptions.NoHostAvailableException">com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:57905 (com.datastax.driver.core.exceptions.DriverException: Timeout during read)) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65) at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256) at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:36) I captured the output of the EmbeddedCassandraService to a log file, and the last few lines look like: 14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0. Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead 14/09/04 04:14:34 INFO org.kiji.schema.impl.cassandra.CassandraAdmin: Deleting table "kiji_testGet_4"."schema_id" 14/09/04 04:14:34 INFO org.apache.cassandra.service.MigrationManager: Drop ColumnFamily 'kiji_testGet_4/schema_id' 14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0. Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead 14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0. Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead 14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0. Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead 14/09/04 04:14:34 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_keyspaces@822600288(138/1380 serialized/live bytes, 3 ops) 14/09/04 04:14:34 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_keyspaces@822600288(138/1380 serialized/live bytes, 3 ops) 14/09/04 04:14:35 INFO org.apache.cassandra.db.compaction.CompactionTask: Compacted 4 sstables to [target/cassandra/data/system/local/system-local-jb-77,]. 6,147 bytes to 5,713 (~92% of original) in 497ms = 0.010962MB/s. 4 total partitions merged to 1. Partition merge counts were {4:1, } 14/09/04 04:14:35 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-jb-151-Data.db (167 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=619631) 14/09/04 04:14:35 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_columnfamilies@1313116194(0/0 serialized/live bytes, 2 ops) 14/09/04 04:14:35 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_columnfamilies@1313116194(0/0 serialized/live bytes, 2 ops) 14/09/04 04:14:36 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-jb-142-Data.db (68 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=620056) 14/09/04 04:14:36 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_columns@1575679153(0/0 serialized/live bytes, 4 ops) 14/09/04 04:14:36 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_columns@1575679153(0/0 serialized/live bytes, 4 ops) 14/09/04 04:14:37 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/system/schema_columns/system-schema_columns-jb-142-Data.db (117 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=620056) 14/09/04 04:14:37 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_id@1307640647(17258/172580 serialized/live bytes, 18 ops) 14/09/04 04:14:37 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_id@1307640647(17258/172580 serialized/live bytes, 18 ops) 14/09/04 04:14:47 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/kiji_testGet_4/schema_id/kiji_testGet_4-schema_id-jb-1-Data.db (6908 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=620056) In the greater stack trace, the last line of *our* code that gets called before the exception is a call to Session#execute that is asking Cassandra to delete the table "kiji_testGet_4"."schema_id", which you can see from the stack trace actually gets dropped: 14/09/04 04:14:34 INFO org.apache.cassandra.service.MigrationManager: Drop ColumnFamily 'kiji_testGet_4/schema_id' Any ideas about what is going on here? I know that the EmbeddedCassandraService is not dying silently, because many other test classes continue to function fine after this. My guess is that our machine is highly loaded (we are doing a lot of builds in parallel), and dropping this table appears to cause some kind of compaction. Is the EmbeddedCassandraService just timing out while it does the compaction, since there is only a single thread? What options should I change in my code to get the Session object to try reconnecting, or to give it a high timeout? Should I change the RetryPolicy in the Cluster.Builder? Or should I use SocketOptions#setConnectTimeoutMillis to something higher than the default? Any help would be greatly appreciated! Best regards, Clint