[GitHub] [kafka] junrao commented on a change in pull request #8936: KAFKA-10207: Fixed padded timeindex causing premature data deletion

GitBox Tue, 07 Jul 2020 10:25:24 -0700


junrao commented on a change in pull request #8936:
URL: https://github.com/apache/kafka/pull/8936#discussion_r451024109




##########
File path: core/src/main/scala/kafka/log/LogSegment.scala
##########
@@ -87,6 +87,12 @@ class LogSegment private[log] (val log: FileRecords,
       // we will recover the segments above the recovery point in recoverLog()
       // in any case so sanity checking them here is redundant.
       txnIndex.sanityCheck()
+      // Failing to sanity check the timeIndex can result in a scenario where 
log segments are
+      // prematurely deleted (before breaching retention periods) if the index 
file was not resized
+      // to disk successfully.
+      // KAFKA-10207
+      timeIndex.sanityCheck()
+      offsetIndex.sanityCheck()

Review comment:
       @Johnny-Malizia : Thanks for reporting this. A couple of comments on 
this.
   
   (1) It seems that there are cases where the index file is not trimmed 
properly. Every rolled segment calls LogSegment.onBecomeInactiveSegment(), 
which does trimming. So, we probably should just fix the trim logic there.
   
   (2) The intention of KIP-263 is to avoid unnecessary loading (and the sanity 
check) of indexes never used. This is mostly achieved through lazy indexes 
(actual index file only opened lazily on first use). Since the biggest cost of 
loading an index file is probably the I/O part, and the sanity check is 
computational only and relatively cheap, we can probably just do the index 
sanity check on index opening. This way, in the common case, we only pay the 
index sanity checking cost on active segments.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] junrao commented on a change in pull request #8936: KAFKA-10207: Fixed padded timeindex causing premature data deletion

Reply via email to