I have seen the same error in Cassandra 3.x too, in fact quite a few
times. On a few occasions I opened the corrupted commit log file in a
hex editor, and it was filled with a lot of 0x00 bytes. I believe it was
caused by a combination of the way Cassandra flushes the commit log,
the way XFS handles metadata in its journal, an unexpected power cut, and
the SSD write-back cache. I have never experienced this again since we
moved all Cassandra servers to ZFS.
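
For anyone who wants to repeat the hex-editor check without opening each file by hand, here is a minimal sketch that counts the zero bytes at the tail of a file. The function name and chunk size are made up for illustration. Note the assumption: a tail of zeros is not proof of corruption on its own, since a segment file may simply be preallocated and only partly written. The interesting case is when the zero run starts before the offset reported in the replay error.

```python
import os
import tempfile


def trailing_zero_run(path, chunk_size=1 << 20):
    """Return (file_size, trailing_zero_bytes) for the file at path.

    Reads backwards in chunks so large segment files are not loaded whole.
    """
    size = os.path.getsize(path)
    zeros = 0
    with open(path, "rb") as f:
        pos = size
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            stripped = chunk.rstrip(b"\x00")
            zeros += len(chunk) - len(stripped)
            if stripped:
                # Found a non-zero byte; the zero run ends here.
                break
            pos = start
    return size, zeros


if __name__ == "__main__":
    # Demo on a throwaway file standing in for a commit log segment.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"some-mutations" + b"\x00" * 4096)
    size, zeros = trailing_zero_run(tmp.name)
    print(f"{size} bytes total, zero-filled from offset {size - zeros}")
    os.unlink(tmp.name)
```

If the offset printed is lower than the position in the "Mutation checksum failure at ..." message, the failing mutation lies inside the zero-filled region.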
On 26/07/2021 23:11, Leon Zaruvinsky wrote:
And for completeness, a sample stack trace:
ERROR [2021-07-21T02:11:01.994Z] org.apache.cassandra.db.commitlog.CommitLog:
Failed commit log replay. Commit disk failure policy is stop_on_startup;
terminating thread (throwable0_message: Mutation checksum failure at 15167277
in CommitLog-5-1626828286977.log)
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
    at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
    at org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
    at org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)
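
To make the failure mode in this trace concrete, here is a rough sketch of checksum-verified log replay in general. The record layout (4-byte length, payload, 4-byte CRC32) and the names `append_record` and `replay` are assumptions invented for this example. This is not Cassandra's on-disk format or its replayer. It only shows why a zero-filled or torn region after a valid header surfaces as a checksum failure at a specific offset, and how a reader can still keep the valid prefix.

```python
import struct
import zlib


def append_record(buf, payload):
    """Append one record to a bytearray: length, payload, CRC32(length + payload)."""
    header = struct.pack(">I", len(payload))
    crc = zlib.crc32(header + payload) & 0xFFFFFFFF
    buf.extend(header + payload + struct.pack(">I", crc))


def replay(data):
    """Return (payloads, stop_offset, reason) for the readable prefix of data."""
    records, pos = [], 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, pos)
        if length == 0:
            # Zeroed (for example preallocated) space: treat as a clean end.
            return records, pos, "end-of-log marker"
        end = pos + 4 + length + 4
        if end > len(data):
            return records, pos, "truncated record"
        (stored,) = struct.unpack_from(">I", data, pos + 4 + length)
        if zlib.crc32(data[pos : pos + 4 + length]) & 0xFFFFFFFF != stored:
            # Header looked plausible but the body does not match its checksum.
            return records, pos, "checksum failure"
        records.append(bytes(data[pos + 4 : pos + 4 + length]))
        pos = end
    return records, pos, "end of data"


if __name__ == "__main__":
    log = bytearray()
    append_record(log, b"mutation-1")
    append_record(log, b"mutation-2")
    # Simulate a torn write: the second payload is replaced by zeros.
    second = 4 + len(b"mutation-1") + 4
    log[second + 4 : second + 4 + 10] = b"\x00" * 10
    print(replay(bytes(log)))
```

In this toy format, a reader that stops at the first bad record still recovers everything before it. Whether stopping there is safe in a real system depends on whether the bad record was ever acknowledged to a client.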
On Mon, Jul 26, 2021 at 6:08 PM Leon Zaruvinsky
<leonzaruvin...@gmail.com> wrote:
Currently we're using batch commitlog sync:
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
commitlog_segment_size_in_mb: 32
durable_writes is also true.
Unfortunately we are still using Cassandra 2.2.x :( Though I'd be
curious if much in this space has changed since then (I've looked
through the changelogs and nothing stood out).
On Mon, Jul 26, 2021 at 5:20 PM Jeff Jirsa <jji...@gmail.com> wrote:
What commitlog settings are you using?
Default is periodic with 10s sync. That leaves you a 10s
window on hard poweroff/crash.
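
The difference between the two sync modes can be sketched with a toy log. This is an illustration only: `TinyLog` and its fields are hypothetical and do not mirror Cassandra's classes. The `sync_before_ack=True` path stands in for batch mode (the write returns only after fsync), and the other path stands in for periodic mode, where a timer would call `sync()` every sync period.

```python
import os
import tempfile


class TinyLog:
    """Toy append-only log; NOT Cassandra's implementation."""

    def __init__(self, path, sync_before_ack):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.sync_before_ack = sync_before_ack
        self.acked_unsynced = 0  # bytes acknowledged but not yet fsynced

    def write(self, payload):
        """Append payload and return True, which plays the role of the ack."""
        os.write(self.fd, payload)
        if self.sync_before_ack:
            os.fsync(self.fd)  # batch-like: durable before the ack
        else:
            self.acked_unsynced += len(payload)  # periodic-like: ack first
        return True

    def sync(self):
        """What a periodic timer would call, e.g. every 10 s by default."""
        os.fsync(self.fd)
        self.acked_unsynced = 0

    def close(self):
        os.close(self.fd)


if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    periodic = TinyLog(os.path.join(tmpdir, "periodic.log"), sync_before_ack=False)
    periodic.write(b"x" * 100)
    print("at risk before sync:", periodic.acked_unsynced)
    periodic.sync()
    print("at risk after sync:", periodic.acked_unsynced)
    periodic.close()
```

Even in the sync-before-ack case, durability still depends on the storage stack honouring the flush. A device with a volatile write-back cache may report data as written before it is actually persistent.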
I would also expect Cassandra to clean up and start cleanly.
Which version are you running?
On Mon, Jul 26, 2021 at 1:00 PM Leon Zaruvinsky
<leonzaruvin...@gmail.com> wrote:
Hi Cassandra community,
We (and others) regularly run into commit log corruptions
that are caused by Cassandra, or the underlying
infrastructure, being hard restarted. I suspect that this
is because it happens in the middle of a commitlog file
write to disk.
Could anyone point me at resources / code to understand
why this is happening? Shouldn't Cassandra hold off on acking
writes until the commitlog is safely written to disk? I
would expect that on startup, Cassandra should be able to
clean up bad commitlog files and recover gracefully.
I've seen various references online to this issue as
something that will be fixed in the future - so I'm
curious if there is any movement or thoughts there.
Thanks a bunch,
Leon