nondeterministic NoHostAvailableException occurs while dropping a table

Clint Kelly Fri, 05 Sep 2014 21:39:37 -0700

Hi all,

TL;DR - I think my unit tests are sometimes failing because of read
timeouts to an EmbeddedCassandraService when dropping a table triggers a
compaction on a highly-loaded build slave.  Does this sound reasonable?
What options should I change in my Cluster.Builder (or elsewhere) to
prevent this from happening?


Longer version of the question:

We use the EmbeddedCassandraService for unit testing and we're seeing
non-deterministic failures on some machines.  The sequence of events that
causes the failures looks something like this:


   - We have a single EmbeddedCassandraService that runs for all of our
   unit tests
   - In every test class, we create a new keyspace, then create a bunch of
   tables within that keyspace and run our tests
   - When a given test class is finished, we execute some tear-down code
   that deletes the tables in the keyspace and then drops the keyspace itself
   - All of the unit tests share a single Session object

When the tests fail, it is always while executing the tear-down code, and
always with an error that looks like:

    <error message="All host(s) tried for query failed (tried: localhost/127.0.0.1:57905 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))" type="com.datastax.driver.core.exceptions.NoHostAvailableException">com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:57905 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
      at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
      at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
      at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172)
      at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
      at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:36)

I captured the output of the EmbeddedCassandraService to a log file, and
the last few lines look like:

14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0.   Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE;  upgrade to the Sun JRE instead
14/09/04 04:14:34 INFO org.kiji.schema.impl.cassandra.CassandraAdmin: Deleting table "kiji_testGet_4"."schema_id"
14/09/04 04:14:34 INFO org.apache.cassandra.service.MigrationManager: Drop ColumnFamily 'kiji_testGet_4/schema_id'
14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0.   Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE;  upgrade to the Sun JRE instead
14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0.   Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE;  upgrade to the Sun JRE instead
14/09/04 04:14:34 ERROR org.apache.cassandra.db.Memtable: MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0.   Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE;  upgrade to the Sun JRE instead
14/09/04 04:14:34 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_keyspaces@822600288(138/1380 serialized/live bytes, 3 ops)
14/09/04 04:14:34 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_keyspaces@822600288(138/1380 serialized/live bytes, 3 ops)
14/09/04 04:14:35 INFO org.apache.cassandra.db.compaction.CompactionTask: Compacted 4 sstables to [target/cassandra/data/system/local/system-local-jb-77,].  6,147 bytes to 5,713 (~92% of original) in 497ms = 0.010962MB/s.  4 total partitions merged to 1.  Partition merge counts were {4:1, }
14/09/04 04:14:35 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-jb-151-Data.db (167 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=619631)
14/09/04 04:14:35 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_columnfamilies@1313116194(0/0 serialized/live bytes, 2 ops)
14/09/04 04:14:35 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_columnfamilies@1313116194(0/0 serialized/live bytes, 2 ops)
14/09/04 04:14:36 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-jb-142-Data.db (68 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=620056)
14/09/04 04:14:36 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_columns@1575679153(0/0 serialized/live bytes, 4 ops)
14/09/04 04:14:36 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_columns@1575679153(0/0 serialized/live bytes, 4 ops)
14/09/04 04:14:37 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/system/schema_columns/system-schema_columns-jb-142-Data.db (117 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=620056)
14/09/04 04:14:37 INFO org.apache.cassandra.db.ColumnFamilyStore: Enqueuing flush of Memtable-schema_id@1307640647(17258/172580 serialized/live bytes, 18 ops)
14/09/04 04:14:37 INFO org.apache.cassandra.db.Memtable: Writing Memtable-schema_id@1307640647(17258/172580 serialized/live bytes, 18 ops)
14/09/04 04:14:47 INFO org.apache.cassandra.db.Memtable: Completed flushing target/cassandra/data/kiji_testGet_4/schema_id/kiji_testGet_4-schema_id-jb-1-Data.db (6908 bytes) for commitlog position ReplayPosition(segmentId=1409829145687, position=620056)

In the full stack trace, the last line of *our* code that gets called
before the exception is a call to Session#execute that is asking Cassandra
to delete the table "kiji_testGet_4"."schema_id", which you can see from
the log output above actually gets dropped:

14/09/04 04:14:34 INFO org.apache.cassandra.service.MigrationManager: Drop
ColumnFamily 'kiji_testGet_4/schema_id'

Any ideas about what is going on here?  I know that the
EmbeddedCassandraService is not dying silently, because many other test
classes continue to function fine after this.  My guess is that our machine
is highly loaded (we are doing a lot of builds in parallel), and dropping
this table appears to cause some kind of compaction.  Is the
EmbeddedCassandraService just timing out while it does the compaction,
since there is only a single thread?
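
The timestamps in the log above give a rough check on this guess. The sketch
below (plain Java 8, java.time only) computes the gap between the "Drop
ColumnFamily" line and the final "Completed flushing" line, and the duration
of the last flush on its own. The class name LogGap is hypothetical, and the
12-second figure is an assumption: it is what I understand the 2.x Java
driver's default read timeout to be, so check SocketOptions in the driver
version in use.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: measures elapsed seconds between two log timestamps.
final class LogGap {
    // Timestamp layout used by the log lines above, e.g. "14/09/04 04:14:34".
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("uu/MM/dd HH:mm:ss");

    // Assumed driver default read timeout, in seconds (not taken from the log).
    static final long ASSUMED_READ_TIMEOUT_SECONDS = 12;

    static long secondsBetween(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(from, to).getSeconds();
    }

    public static void main(String[] args) {
        // "Drop ColumnFamily" to the last "Completed flushing" line.
        long total = secondsBetween("14/09/04 04:14:34", "14/09/04 04:14:47");
        // "Writing Memtable-schema_id" to its "Completed flushing" line.
        long lastFlush = secondsBetween("14/09/04 04:14:37", "14/09/04 04:14:47");
        System.out.println("drop to last flush: " + total + "s");   // 13s
        System.out.println("schema_id flush:    " + lastFlush + "s"); // 10s
        System.out.println("exceeds assumed read timeout: "
            + (total > ASSUMED_READ_TIMEOUT_SECONDS));
    }
}
```

If that default is right, the 13 seconds between the drop request and the
last flush would be enough on its own to produce "Timeout during read", and
most of it (10 seconds) is the flush of a 6908-byte sstable, which suggests
disk contention on the build machine rather than the compaction itself.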

What options should I change in my code to get the Session object to try
reconnecting, or to give it a higher timeout?  Should I change the
RetryPolicy in the Cluster.Builder?  Or should I set
SocketOptions#setConnectTimeoutMillis to something higher than the default?
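
Independent of the driver settings, one option on the test side is to retry
the tear-down statement a few times with a growing pause. The following is a
minimal sketch using only the JDK, not anything provided by the driver; the
class and method names (TearDownRetry, retry) are hypothetical.

```java
import java.util.concurrent.Callable;

// Hypothetical test-side helper: retries an action with linear backoff.
final class TearDownRetry {
    static <T> T retry(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    // Wait a little longer after each failed attempt.
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last; // every attempt failed; surface the last error
    }
}
```

In the tear-down code this would wrap the call that drops the table, for
example retry(() -> session.execute(dropStatement), 3, 1000), where
dropStatement is a placeholder. Because the log shows the first drop taking
effect on the server even though the client timed out, a retried statement
would probably need to be idempotent (for instance DROP TABLE IF EXISTS, if
the Cassandra version supports it), otherwise the second attempt could fail
for a different reason.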

Any help would be greatly appreciated!

Best regards,
Clint
