We just upgraded our 33-node cluster from 2.0.10 to 2.1.15 and then to 3.0.12. The upgrade from 2.0.10 to 2.1.15 went very smoothly, both the rolling software update and the subsequent ‘nodetool upgradesstables’. However, we have had a number of issues with the 3.0.12 upgrade:

The first issue we noticed (well into the rolling upgrade) was that the schema versions for the upgraded nodes were changing with each new node added. Perhaps related to https://issues.apache.org/jira/browse/CASSANDRA-13274 ? This seemed innocuous until we noticed that commit logs seemed to be growing without bound. We attempted to force replay of commit logs by restarting an upgraded node. However, upon restart we hit:

ERROR [main] 2017-03-27 19:22:38,929 CommitLogReplayer.java:677 - Ignoring commit log replay error
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Unexpected error deserializing mutation; saved to /tmp/mutation1621984759842154734dat. This may be caused by replaying a mutation against a table with the same name but incompatible schema. Exception follows: java.io.IOError: java.io.IOException: Corrupt empty row found in unfiltered partition

We were up against the wall with disks filling and had to get nodes restarted or face an outage, so we used the cassandra.commitlog.ignorereplayerrors=true JVM option to get Cassandra nodes restarted. Once we had all nodes upgraded, the schema version stabilized and we stopped seeing any issues with commitlog replay.
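
For anyone who wants to watch for the same schema churn during a rolling upgrade, here is a minimal sketch of one way to check it. This is not something we ran during the upgrade. It assumes the output of ‘nodetool describecluster’ has a "Schema versions:" section with one "<uuid>: [host, host, ...]" line per version; the function names are made up for illustration.

```python
import re
from typing import Dict, List

# Matches lines such as "<uuid>: [10.0.0.1, 10.0.0.2]" or "UNREACHABLE: [10.0.0.3]".
# The exact layout is an assumption about `nodetool describecluster` output.
_VERSION_LINE = re.compile(r"^([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]$")


def parse_schema_versions(describecluster_output: str) -> Dict[str, List[str]]:
    """Map each schema version listed in the output to the hosts reporting it."""
    versions: Dict[str, List[str]] = {}
    in_section = False
    for line in describecluster_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Schema versions:"):
            in_section = True
            continue
        if not in_section:
            continue
        match = _VERSION_LINE.match(stripped)
        if match:
            hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
            versions[match.group(1)] = hosts
    return versions


def schema_agrees(versions: Dict[str, List[str]]) -> bool:
    """True when all reachable nodes report exactly one schema version."""
    live = [v for v in versions if v != "UNREACHABLE"]
    return len(live) == 1
```

Feeding it the captured text of ‘nodetool describecluster’ after each node restart and logging the result of schema_agrees() would show whether the version keeps changing, as it did for us, or settles.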

However, we are still getting a small number of these seemingly related errors (3-4 per hour):

ERROR [SharedPool-Worker-3] 2017-03-28 23:50:33,998 Message.java:621 - Unexpected exception during request; channel = [id: 0x2366241f, L:/10.7.150.165:9042 - R:/10.179.229.62:32119]
java.io.IOError: java.io.IOException: Corrupt empty row found in unfiltered partition
    at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:222) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:210) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:129) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.transform.FilteredRows.isEmpty(FilteredRows.java:50) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.transform.Filter.closeIfEmpty(Filter.java:73) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.transform.Filter.applyToPartition(Filter.java:43) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.transform.Filter.applyToPartition(Filter.java:26) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:96) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.statements.SelectStatement.process(SelectStatement.java:707) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.statements.SelectStatement.processResults(SelectStatement.java:400) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:353) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:227) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:76) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:206) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:237) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:222) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.0.12.jar:3.0.12]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:348) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-3.0.12.jar:3.0.12]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
Caused by: java.io.IOException: Corrupt empty row found in unfiltered partition
    at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:382) ~[apache-cassandra-3.0.12.jar:3.0.12]
    at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:217) ~[apache-cassandra-3.0.12.jar:3.0.12]
    ... 27 common frames omitted

Next we tried to run ‘nodetool upgradesstables’. The vast majority of tables upgraded OK, but for a very small minority the nodetool command died with the error below (nothing in system.log):

java.lang.AssertionError
    at org.apache.cassandra.db.rows.Rows.collectStats(Rows.java:70)
    at org.apache.cassandra.io.sstable.format.big.BigTableWriter$StatsCollector.applyToRow(BigTableWriter.java:197)
    at org.apache.cassandra.db.transform.BaseRows.applyOne(BaseRows.java:116)
    at org.apache.cassandra.db.transform.BaseRows.add(BaseRows.java:107)
    at org.apache.cassandra.db.transform.UnfilteredRows.add(UnfilteredRows.java:41)
    at org.apache.cassandra.db.transform.Transformation.add(Transformation.java:156)
    at org.apache.cassandra.db.transform.Transformation.apply(Transformation.java:122)
    at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:147)
    at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:125)
    at org.apache.cassandra.db.compaction.writers.DefaultCompactionWriter.realAppend(DefaultCompactionWriter.java:57)
    at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.append(CompactionAwareWriter.java:109)
    at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:195)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:89)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:61)
    at org.apache.cassandra.db.compaction.CompactionManager$5.execute(CompactionManager.java:415)
    at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:307)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
    at java.lang.Thread.run(Thread.java:745)

Could this be related to https://issues.apache.org/jira/browse/CASSANDRA-13320 ?

Right now it seems like we are stuck with 2.1-format SSTables for a very small minority of our tables. I’m not sure what the implications of that are, or of the small number of “Corrupt empty row” errors we are seeing in system.log. Would appreciate advice.

Thanks!