[
https://issues.apache.org/jira/browse/KAFKA-13773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533588#comment-17533588
]
Luke Chen commented on KAFKA-13773:
-----------------------------------
PR: https://github.com/apache/kafka/pull/12136
> Data loss after recovery from crash due to full hard disk
> ---------------------------------------------------------
>
> Key: KAFKA-13773
> URL: https://issues.apache.org/jira/browse/KAFKA-13773
> Project: Kafka
> Issue Type: Bug
> Components: log
> Affects Versions: 2.8.0, 3.1.0, 2.8.1
> Reporter: Tim Alkemade
> Assignee: Luke Chen
> Priority: Critical
> Attachments: DiskAndOffsets.png, kafka-.zip, kafka-2.7.0vs2.8.0.zip,
> kafka-2.8.0-crash.zip, kafka-logfiles.zip, kafka-start-to-finish.zip
>
>
> While testing Kafka on Kubernetes, the data disk for Kafka filled up,
> causing all 3 nodes to crash. I increased the disk size for all three
> nodes and started Kafka up again one node at a time, waiting for the
> previous node to become available before starting the next one. After a
> little while, two out of the three nodes had no data anymore.
> According to the logs, the log cleaner kicked in and decided that the latest
> timestamp on those partitions was '0' (i.e. 1970-01-01), which is older
> than the 2-week retention limit specified on the topic.
>
> {code:java}
> 2022-03-28 12:17:19,740 INFO [LocalLog partition=audit-trail-0,
> dir=/var/lib/kafka/data-0/kafka-log1] Deleting segment files
> LogSegment(baseOffset=0, size=249689733, lastModifiedTime=1648460888636,
> largestRecordTimestamp=Some(0)) (kafka.log.LocalLog$) [kafka-scheduler-0]
> 2022-03-28 12:17:19,753 INFO Deleted log
> /var/lib/kafka/data-0/kafka-log1/audit-trail-0/00000000000000000000.log.deleted.
> (kafka.log.LogSegment) [kafka-scheduler-0]
> 2022-03-28 12:17:19,754 INFO Deleted offset index
> /var/lib/kafka/data-0/kafka-log1/audit-trail-0/00000000000000000000.index.deleted.
> (kafka.log.LogSegment) [kafka-scheduler-0]
> 2022-03-28 12:17:19,754 INFO Deleted time index
> /var/lib/kafka/data-0/kafka-log1/audit-trail-0/00000000000000000000.timeindex.deleted.
> (kafka.log.LogSegment) [kafka-scheduler-0]{code}
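For context (my own illustration, not Kafka's actual classes), the time-based retention decision is essentially a comparison of a segment's largest record timestamp against "now minus the retention window". A minimal sketch of that check shows why a bogus timestamp of 0 makes a freshly written segment look decades stale:

```java
// Simplified sketch of a time-based retention check (hypothetical names,
// not the real kafka.log implementation). With largestTimestamp == 0
// (the epoch), any segment appears far older than the retention limit.
public class RetentionCheck {
    static boolean shouldDelete(long largestTimestampMs, long nowMs, long retentionMs) {
        // A segment is eligible for deletion when its newest record is
        // older than the retention window.
        return nowMs - largestTimestampMs > retentionMs;
    }

    public static void main(String[] args) {
        long twoWeeksMs = 14L * 24 * 60 * 60 * 1000;
        long now = 1648470000000L; // roughly 2022-03-28, matching the report

        // Correct timestamp from the segment: well inside retention.
        System.out.println(shouldDelete(1648460888636L, now, twoWeeksMs)); // false
        // The bogus timestamp 0 read from the empty index: deleted.
        System.out.println(shouldDelete(0L, now, twoWeeksMs)); // true
    }
}
```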
> Using kafka-dump-log.sh I was able to determine that the greatest timestamp
> in that file (before deletion) was actually 1648460888636 (2022-03-28,
> 09:48:08 UTC, which is today). However, since this segment was the
> 'latest/current' segment, much of the file is empty. The code that determines
> the last entry (TimeIndex.lastEntryFromIndexFile) doesn't seem to account for
> this and simply reads the last position in the file; because the file is
> mostly empty, it reads 0 at that position.
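To illustrate the failure mode described above (my own sketch, not the real `kafka.log.TimeIndex` code): a time index is preallocated, so the tail of the file is zero-filled until entries are written. Naively reading the final slot of such a buffer yields a timestamp of 0:

```java
import java.nio.ByteBuffer;

// Sketch of why reading the "last" slot of a preallocated time index
// returns timestamp 0. Each entry here is 12 bytes: an 8-byte timestamp
// followed by a 4-byte relative offset (hypothetical layout for the demo).
public class SparseIndexDemo {
    static final int ENTRY_SIZE = 12;

    static long lastTimestamp(ByteBuffer index) {
        int entries = index.capacity() / ENTRY_SIZE;
        // Naively read the final slot, ignoring how many entries
        // were actually written.
        return index.getLong((entries - 1) * ENTRY_SIZE);
    }

    public static void main(String[] args) {
        // A preallocated index with only one real entry at the front.
        ByteBuffer index = ByteBuffer.allocate(1024 * ENTRY_SIZE);
        index.putLong(0, 1648460888636L); // genuine timestamp
        index.putInt(8, 42);              // relative offset

        // The unwritten tail is all zeros, so the "last" timestamp is 0.
        System.out.println(lastTimestamp(index)); // 0
    }
}
```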
> The cleaner code seems to take this into account, since
> UnifiedLog.deleteOldSegments is never supposed to delete the current segment,
> judging by the scaladoc; however, in this case the check doesn't seem to do
> its job. Perhaps the detected highWatermark is wrong?
> I've attached the logs and the zipped data directories (the data files are
> over 3 GB in size when unzipped).
>
> I've encountered this problem with both kafka 2.8.1 and 3.1.0.
> I've also tried changing min.insync.replicas to 2: The issue still occurs.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)