Hi,

While upgrading our production cluster from C* 3.11.14 to 4.1.3, we ran into an issue where some SELECT queries failed because supposedly no replica was available. The system logs on the C* nodes were full of messages like the following one:
ERROR [ReadStage-1] 2023-12-11 13:53:57,278 JVMStabilityInspector.java:68 - Exception in thread Thread[ReadStage-1,5,SharedPool]
java.lang.IllegalStateException: [channel_data_id, control_system_type, server_id, decimation_levels] is not a subset of [channel_data_id]
	at org.apache.cassandra.db.Columns$Serializer.encodeBitmap(Columns.java:593)
	at org.apache.cassandra.db.Columns$Serializer.serializeSubset(Columns.java:523)
	at org.apache.cassandra.db.rows.UnfilteredSerializer.serializeRowBody(UnfilteredSerializer.java:231)
	at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:205)
	at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
	at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:125)
	at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:140)
	at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:95)
	at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:80)
	at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:308)
	at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
	at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
	at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:182)
	at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
	at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
	at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
	at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
	at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
	at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
	at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
	at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
	at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

This problem only persisted while the cluster had a mix of 3.11.14 and 4.1.3 nodes. As soon as the last node was upgraded, the problem disappeared immediately, so I suspect it was somehow caused by the schema inconsistency that is unavoidable during the upgrade. I just wanted to give everyone who hasn't upgraded yet a heads-up, so that they are aware this problem might exist.

Interestingly, not all queries involving the affected table seemed to trigger the error. As far as I am aware, no schema changes have ever been made to that table, so I am fairly certain the schema inconsistencies were purely related to the upgrade process.

We hadn't noticed this problem when testing the upgrade on our test cluster, because there we first completed the upgrade and only then ran the test workload. So, if you are worried you might be affected by this problem as well, you might want to run your workload on the test cluster while it still has mixed versions.

I did not investigate the cause further, because simply completing the upgrade seemed like the quickest way to get the cluster fully operational again.

Cheers,
Sebastian
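
P.S.: For anyone wondering what "not all queries were affected" might look like in practice, here is a purely illustrative CQL sketch. Only the four column names come from the exception message above; the table name, types, and primary-key layout are my hypothetical reconstruction, and I have not verified which query shapes actually trigger the error:

    -- Hypothetical table, reconstructed only from the column names in the
    -- exception; the real schema is likely to differ.
    CREATE TABLE channels (
        channel_data_id uuid,
        control_system_type text,
        server_id uuid,
        decimation_levels set<int>,
        PRIMARY KEY (channel_data_id)
    );

    -- Guess: a query touching only channel_data_id corresponds to the
    -- smaller [channel_data_id] column set from the error message and
    -- may have kept working ...
    SELECT channel_data_id FROM channels WHERE channel_data_id = ?;

    -- ... while a query pulling the remaining columns corresponds to the
    -- larger [channel_data_id, control_system_type, server_id,
    -- decimation_levels] set that the mixed-version serializer rejected.
    SELECT channel_data_id, control_system_type, server_id, decimation_levels
    FROM channels WHERE channel_data_id = ?;

If nothing else, comparing the two column lists in the IllegalStateException against your own queries might help you decide which parts of your workload to exercise on a mixed-version test cluster.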