I don't recognise those names:
* channel_data_id
* control_system_type
* server_id
* decimation_levels
I assume these are column names of a non-system table.
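In case it helps to confirm which table they belong to, those names should show up in system_schema.columns. Below is a quick sketch of such a lookup, assuming the DataStax Java driver 4.x and a node reachable on the default 127.0.0.1:9042; the class name is made up and the column name is just one of the four from the error message:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class FindColumnOwner
{
    public static void main(String[] args)
    {
        // With no explicit contact point, the driver connects to 127.0.0.1:9042.
        try (CqlSession session = CqlSession.builder().build())
        {
            // ALLOW FILTERING is needed because column_name is not a partition key
            // of system_schema.columns.
            ResultSet rs = session.execute(
                "SELECT keyspace_name, table_name FROM system_schema.columns " +
                "WHERE column_name = 'decimation_levels' ALLOW FILTERING");
            for (Row row : rs)
                System.out.printf("%s.%s%n",
                                  row.getString("keyspace_name"),
                                  row.getString("table_name"));
        }
    }
}

A plain DESCRIBE TABLE in cqlsh on whatever it reports would then show the full column list.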
From the stack trace, this looks like an error from a node that was
running 4.1.3 and was not the coordinator for this query.
I did some research and found these bug reports which may be related:
* CASSANDRA-15899 <https://issues.apache.org/jira/browse/CASSANDRA-15899>
  Dropping a column can break queries until the schema is fully propagated
* CASSANDRA-16735 <https://issues.apache.org/jira/browse/CASSANDRA-16735>
  Adding columns via ALTER TABLE can generate corrupt sstables
The solution for CASSANDRA-16735 was to revert CASSANDRA-15899,
according to the comments in the ticket.
It does look like CASSANDRA-15899 is back, but I can't see why it only
happened while the nodes were running mixed versions and then stopped
once all nodes were upgraded.
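To make the exception a bit more concrete: roughly speaking, the row serializer
encodes which columns are present in a row as a bitmap relative to a superset of
columns the responding node expects for that read (derived from its own view of
the schema), so a row carrying a column that is missing from that superset cannot
be encoded at all. The following is only a minimal, made-up sketch of that idea,
not the real Columns$Serializer code, and all class and variable names are
invented; it just reproduces the same message for a node whose local view only
contains channel_data_id:

import java.util.Arrays;
import java.util.List;

public class SubsetBitmapSketch
{
    // Encode which of the locally known columns appear in a row, one bit per
    // column; a column missing from the local superset cannot be encoded.
    static long encodeBitmap(List<String> rowColumns, List<String> localSuperset)
    {
        long bitmap = 0;
        for (String column : rowColumns)
        {
            int index = localSuperset.indexOf(column);
            if (index < 0)
                throw new IllegalStateException(rowColumns + " is not a subset of " + localSuperset);
            bitmap |= 1L << index;
        }
        return bitmap;
    }

    public static void main(String[] args)
    {
        // Columns carried by the rows being returned for the read.
        List<String> rowColumns = Arrays.asList("channel_data_id", "control_system_type",
                                                "server_id", "decimation_levels");
        // What a replica with a stale or divergent schema thinks the table has.
        List<String> localSuperset = Arrays.asList("channel_data_id");
        // Throws "[...] is not a subset of [channel_data_id]", like the trace below.
        encodeBitmap(rowColumns, localSuperset);
    }
}

If that mental model is right, it would at least fit the observation that the
errors stopped as soon as every node agreed on the schema again.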
On 12/12/2023 16:28, Sebastian Marsching wrote:
Hi,
while upgrading our production cluster from C* 3.11.14 to 4.1.3, we ran into an
issue where some SELECT queries failed because supposedly no replica was
available. The system logs on the C* nodes were full of messages like the
following one:
ERROR [ReadStage-1] 2023-12-11 13:53:57,278 JVMStabilityInspector.java:68 - Exception in thread Thread[ReadStage-1,5,SharedPool]
java.lang.IllegalStateException: [channel_data_id, control_system_type, server_id, decimation_levels] is not a subset of [channel_data_id]
    at org.apache.cassandra.db.Columns$Serializer.encodeBitmap(Columns.java:593)
    at org.apache.cassandra.db.Columns$Serializer.serializeSubset(Columns.java:523)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.serializeRowBody(UnfilteredSerializer.java:231)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:205)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:125)
    at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:140)
    at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:95)
    at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:80)
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:308)
    at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
    at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
    at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:182)
    at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
    at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
    at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
    at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
    at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
    at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
    at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
    at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
This problem only occurred while the cluster had a mix of 3.11.14 and 4.1.3
nodes. As soon as the last node was upgraded, the problem disappeared
immediately, so I suspect that it was somehow caused by the unavoidable schema
inconsistency during the upgrade.
I just wanted to give everyone who hasn’t upgraded yet a heads-up, so that they
are aware that this problem might exist. Interestingly, it seems that not all
queries involving the affected table triggered this error. As far as I am aware,
no schema changes have ever been made to that table, so I am pretty certain that
the schema inconsistencies were purely related to the upgrade process.
We hadn’t noticed this problem when testing the upgrade on our test cluster
because there we completed the upgrade first and only ran the test workload
afterwards. So, if you are worried that you might be affected by this problem as
well, you might want to run your workload on the test cluster while it is still
running mixed versions.
I did not investigate the cause further because simply completing the upgrade
process seemed like the quickest option to get the cluster fully operational
again.
Cheers,
Sebastian