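Hi all,

I just started reading the Kafka source code. The current OffsetCheckpoint.write() does not look safe to me. After the temporary file is renamed over the checkpoint file, it still needs an fsync (of the parent directory, as far as I can tell) so that the rename itself survives a crash.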
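A minimal sketch of the write sequence this implies, not the actual Kafka code: write to a temporary file, fsync it, rename it over the target, then fsync the parent directory. The class name DurableCheckpointWrite, the method writeDurably, and the ".tmp" suffix are hypothetical names chosen for illustration. It also assumes a POSIX-like file system where a directory can be opened read-only and flushed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of Kafka.
final class DurableCheckpointWrite {

    /** Replaces `target` with `content` so that the update survives a crash. */
    public static void writeDurably(Path target, String content) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = dir.resolve(target.getFileName() + ".tmp");

        // 1. Write the new contents to a temporary file and fsync the data.
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true);
        }

        // 2. Rename the temporary file over the target. Whether an existing
        //    target is replaced under ATOMIC_MOVE is implementation specific;
        //    on POSIX systems rename() replaces it.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);

        // 3. fsync the parent directory so the rename itself is durable.
        try (FileChannel dirCh = FileChannel.open(dir, StandardOpenOption.READ)) {
            dirCh.force(true);
        } catch (IOException e) {
            // Assumption: some platforms (e.g. Windows) cannot open a
            // directory this way, so this sketch skips the flush there.
        }
    }
}
```

Without step 3, the file contents are on disk but the directory entry pointing at them may not be, so after a crash the old checkpoint (or none at all) could reappear.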
In addition, it should maintain a checksum for each checkpoint, and recovery should verify that checksum to detect corruption. Ideally, it should also keep two checkpoints for each partition, so that at least one valid checkpoint always exists.

Let me know if my concerns are valid. I think this talk might help most people understand the issue:
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai

Thanks,
Xiao Li
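
To make the checksum and two-checkpoint idea concrete, here is a rough sketch under stated assumptions. The class ChecksummedCheckpoint, its methods, and the "<crc32>\n<payload>" layout are all hypothetical and are not Kafka's on-disk format. CRC32 is just one possible checksum choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.zip.CRC32;

// Hypothetical illustration; not Kafka's checkpoint format.
final class ChecksummedCheckpoint {

    /** CRC32 of the UTF-8 bytes of the payload. */
    public static long crc(String payload) {
        CRC32 c = new CRC32();
        c.update(payload.getBytes(StandardCharsets.UTF_8));
        return c.getValue();
    }

    /** Encodes a checkpoint as "<crc32>\n<payload>" (assumed layout). */
    public static String encode(String payload) {
        return crc(payload) + "\n" + payload;
    }

    /** Returns the payload if its checksum matches, otherwise empty. */
    public static Optional<String> decode(String stored) {
        int nl = stored.indexOf('\n');
        if (nl < 0) {
            return Optional.empty();
        }
        try {
            long expected = Long.parseLong(stored.substring(0, nl));
            String payload = stored.substring(nl + 1);
            return crc(payload) == expected ? Optional.of(payload) : Optional.empty();
        } catch (NumberFormatException e) {
            // A damaged checksum line counts as corruption.
            return Optional.empty();
        }
    }

    /** Recovery: prefer the newer copy, fall back to the older one. */
    public static Optional<String> recover(String newer, String older) {
        Optional<String> result = decode(newer);
        return result.isPresent() ? result : decode(older);
    }
}
```

With two copies written alternately, a crash in the middle of one write can damage at most that copy, and recovery falls back to the other one.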