Hi everyone,

I was finally able to sort out my problem in an "interesting" manner that I think is worth sharing on the list!

What I did is the following: on each node, I stopped Cassandra, deleted all of the column family's data files, started Cassandra again and issued a repair for this column family. The process took time since the cluster consists of 40 nodes, but once it was done, the nodes didn't exhibit this assertion error anymore!

I believe this was triggered by my tweaking of the "sstable_size_in_mb" parameter. Somehow I ended up with data files of different sizes, and that confused Cassandra.

So, problem solved now :-)

Cheers,
Reynald

On Fri, Aug 31, 2018 at 7:45 AM Reynald Borer <reynald.bo...@gmail.com> wrote:

> Hi everyone,
>
> I'm running a Cassandra 1.2.19 cluster of 40 nodes and compactions of a
> specific column family are sporadically raising an AssertionError like this
> (full stack trace visible under
> https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a):
>
> ERROR [CompactionExecutor:9137] 2018-08-27 11:43:05,197
> org.apache.cassandra.service.CassandraDaemon - Exception in thread
> Thread[CompactionExecutor:9137,1,main]
> java.lang.AssertionError: 2
>         at
> org.apache.cassandra.db.compaction.LeveledManifest.replace(LeveledManifest.java:267)
>
> The data written in this column family can be seen as wide rows, that is,
> rows with lots of columns. Each column has a TTL of 7 days though.
>
> Whenever this happens, it seems to block compactions of this column family
> (I see the pending compactions increasing) until I restart the failing node.
>
> I have searched JIRA and this mailing list for this issue without
> much luck. I suspect it may be related to
> https://issues.apache.org/jira/browse/CASSANDRA-6563 although it's hard
> for me to confirm.
>
> I know this version is pretty old, but does this issue ring a bell for
> any of you?
>
> Here are some more details about my cluster:
>
> - it is composed of 40 nodes
> - it is pretty old and I'm in the process of upgrading it; it previously
>   ran without issues under versions 1.0.12 and 1.1.12
> - the issue really affects only a single column family (schema can be seen at
>   https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a#file-schema-txt)
> - my cluster is set up with RandomPartitioner (inherited from when it was
>   set up on version 0.7) and a replication factor of 3
> - it's running weekly repairs (and this assertion happens mostly during
>   repairs)
> - I also noted that since the cluster was upgraded to 1.2.19 the
>   disk size of this column family keeps increasing (it went from 400G to
>   1.2T!)
>
> Thanks in advance for your help.
>
> Best regards,
> Reynald
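
For anyone who wants to repeat the per-node procedure from the reply at the top of this message (stop Cassandra, delete the column family's data files, start Cassandra, repair that column family), here is a minimal sketch in Python that only builds and prints the commands, without running anything. The host names, keyspace and column family names are hypothetical placeholders. The `service cassandra` commands and the `/var/lib/cassandra/data/<keyspace>/<column family>` layout are assumptions that depend on how Cassandra was installed, so check them against your own nodes first.

```python
from typing import List


def rebuild_plan(host: str, keyspace: str, column_family: str,
                 data_dir: str = "/var/lib/cassandra/data") -> List[str]:
    """Return the shell commands for one node, in order (dry run only).

    Assumptions, not taken from the original post:
    - Cassandra is managed through `service cassandra stop/start`;
    - data files live directly in <data_dir>/<keyspace>/<column_family>.
    """
    cf_dir = f"{data_dir}/{keyspace}/{column_family}"
    return [
        # 1. Stop Cassandra on the node.
        f"ssh {host} sudo service cassandra stop",
        # 2. Delete only the files directly in the column family directory;
        #    subdirectories (e.g. snapshots) are left untouched.
        f"ssh {host} sudo find {cf_dir} -maxdepth 1 -type f -delete",
        # 3. Start Cassandra again and wait until the node is back up.
        f"ssh {host} sudo service cassandra start",
        # 4. Repair this column family so its data is streamed back
        #    from the other replicas.
        f"ssh {host} nodetool repair {keyspace} {column_family}",
    ]


# Hypothetical node list; handle one node at a time so that the other
# replicas stay available while a node is being rebuilt.
for node in ["node01", "node02"]:
    for command in rebuild_plan(node, "my_keyspace", "my_column_family"):
        print(command)
```

Going one node at a time, and letting each repair finish before moving on, matters here: with a replication factor of 3, the emptied node can only get its data back from replicas that still hold it.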