I've posted the following to the Datastax Java Driver user forum, but no one has responded, so I thought I'd try here, too.
We have a service that writes to a few legacy (pre-CQL) counter column families in a Cassandra 2.1.11 cluster. We've been trying to migrate this service from Astyanax to the Datastax Java Driver (version 2.1.10.1). We've been testing the new version in a "shadow" deployment in a production environment, using the same Cassandra cluster as the production version, but writing to a testing-only keyspace.

Occasionally, unlogged batches of counter updates in the same partition will fail with the following error from the coordinator:

    com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)

We've only observed these errors in the service version that uses the Datastax Driver, not the version that uses Astyanax. These batches are written with CL=LOCAL_QUORUM, so the CL in the error message doesn't match the one we requested.

This resembles the symptoms of the issue described in CASSANDRA-10041 ("timeout during write query at consistency ONE" when updating counter at consistency QUORUM and 2 of 3 nodes alive). In that issue, the error occurs when a node is abruptly terminated. However, we've also seen the error occur when all Cassandra nodes appeared to be healthy.

There are a few possible explanations for why the errors only occur with the Datastax Driver, but I'm not sure which is correct:

a) There is a problem with how we're using the Datastax Driver to compose batches of counter updates.
b) The implementation of counter updates differs between the native protocol and the Thrift protocol, such that the error is reported to native clients but not to Thrift clients.
c) There is a difference between the keyspace/column family definitions of the production and testing keyspaces.
d) The Astyanax/Thrift version is getting the error but is ignoring it for some reason.
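
To make (a) concrete, here is a minimal sketch of the shape such a batch takes in CQL. It is illustrative only, not our actual code: the class and method names, the keyspace and table name (test_ks.page_counters), and the sample keys are hypothetical, and the column names key, column1 and value assume the default CQL view of a Thrift-created counter column family with text keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders the CQL text of an unlogged batch of counter
// updates that all target the same partition (row key).
final class CounterBatchSketch {

    // Doubles single quotes so a value can sit inside a CQL string literal.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // deltas maps a counter column name to the amount to add (may be negative).
    // Assumes the legacy CF is exposed as columns key / column1 / value.
    static String buildCounterBatch(String table, String rowKey, Map<String, Long> deltas) {
        StringBuilder cql = new StringBuilder("BEGIN UNLOGGED BATCH\n");
        for (Map.Entry<String, Long> entry : deltas.entrySet()) {
            long delta = entry.getValue();
            char op = delta < 0 ? '-' : '+';
            cql.append("  UPDATE ").append(table)
               .append(" SET value = value ").append(op).append(' ').append(Math.abs(delta))
               .append(" WHERE key = ").append(quote(rowKey))
               .append(" AND column1 = ").append(quote(entry.getKey()))
               .append(";\n");
        }
        return cql.append("APPLY BATCH;").toString();
    }

    public static void main(String[] args) {
        Map<String, Long> deltas = new LinkedHashMap<>();
        deltas.put("clicks", 1L);
        deltas.put("errors", -2L);
        // Hypothetical keyspace, table and row key.
        System.out.println(buildCounterBatch("test_ks.page_counters", "page-42", deltas));
    }
}
```

Real code would normally use prepared statements with bound values rather than inlined literals. Note also that the consistency level is not part of the CQL text; with the driver it is set on the statement itself, so LOCAL_QUORUM has to be applied to the batch as a whole.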
I doubt (c) is the reason; we've made an effort to ensure that the keyspace and CF configurations are the same. Also, (d) seems unlikely because we've seen other errors (such as unavailable exceptions) reported correctly. So, I'm betting that either (a) or (b) is the reason.

Would someone please suggest which of these explanations is likely to be correct, and what we might do to avoid the problem?

-- Steven