We recently upgraded C* from 2.0.5 to 2.0.9.

We have some data that is partitioned in tables created periodically (once a 
day). This morning, this automated process timed out because the schema did not 
reach agreement quickly enough after we created a new empty table.
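
For anyone with a similar periodic job, here is a minimal sketch of the kind of
wait loop involved. It is not our actual code. get_versions is a hypothetical
callable you would supply yourself (for example, something that asks each node
for its schema version), and the timeout and interval values are arbitrary.

```python
import time


def wait_for_schema_agreement(get_versions, timeout=120.0, interval=1.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Poll until every node reports the same schema version.

    get_versions is a caller-supplied (hypothetical) callable returning a
    dict of node address -> schema version string.
    Returns True on agreement, False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        versions = get_versions()
        # Agreement means at least one node answered and all answers match.
        if versions and len(set(versions.values())) == 1:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The clock and sleep parameters exist only so the loop can be exercised in a
test without actually waiting. A job whose timeout is shorter than the roughly
one-minute delay described below would fail the way ours did.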

I was able to reproduce this manually via cqlsh. When I created the table and 
ran nodetool describecluster, it showed 3 nodes on the old schema and 3 nodes 
on the new schema instantly (or as quickly as I could run nodetool 
describecluster). It took almost exactly a minute for the other nodes to switch.
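
To make that check repeatable, the describecluster output can be grouped by
schema version with something like the sketch below. It assumes the output has
a "Schema versions:" section followed by lines of the form
"<version>: [<node>, <node>, ...]", which is how I remember it looking on 2.0.x;
adjust the parsing if your version prints something different. The sample text
and addresses are made up.

```python
import re


def parse_schema_versions(describecluster_output):
    """Map schema version -> list of node addresses.

    Only lines after the 'Schema versions' header are considered.
    """
    versions = {}
    in_section = False
    for line in describecluster_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Schema versions"):
            in_section = True
            continue
        if not in_section:
            continue
        # Expected shape (an assumption): "<version>: [addr1, addr2]"
        match = re.match(r"^([^:\s]+):\s*\[(.*)\]$", stripped)
        if match:
            nodes = [n.strip() for n in match.group(2).split(",") if n.strip()]
            versions[match.group(1)] = nodes
    return versions


# Made-up sample resembling a cluster split across two schema versions.
SAMPLE = """Cluster Information:
    Name: Example Cluster
    Schema versions:
        aaaa-1: [10.0.0.1, 10.0.0.2, 10.0.0.3]
        bbbb-2: [10.0.0.4, 10.0.0.5, 10.0.0.6]
"""

split = parse_schema_versions(SAMPLE)
in_agreement = len(split) == 1
```

More than one key in the result means the cluster has not yet agreed on a
schema.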

The nodes weren’t busy, the machines were healthy, the network was healthy, and 
the JVMs were healthy - nodetool status, gossipinfo and OpsCenter all looked 
happy. We never saw this issue in beta on 2.0.9 or anywhere on 2.0.5, and 
yesterday on 2.0.9, after the upgrade, it worked correctly.

The only clue I have is that, in this case, the nodes which were slow to update 
called DefsTables.mergeSchema from InternalResponseStage, not MigrationStage 
(which is the stage it is called on as I test it now).
Looking at the logs, these InternalResponseStage calls happened eerily close 
(within a second) to exactly a minute after the schema change.

Having discovered nothing else wrong, I restarted one of the “slow” nodes, and 
the problem went away (for that node). The whole cluster has since been through 
a rolling restart and is proceeding fine.

Anyway, I will dig a little deeper into why (when all nodes think each other 
are up) the migration verb might not get executed (there were no errors in any 
logs)… mostly wondering if this rings a bell with anyone.
