Hi all,

I just started reading the Kafka source code. The current
OffsetCheckpoint.write() looks unsafe to me: after the file rename, it
still needs an fsync (on the parent directory, as far as I understand) so
that the rename itself survives a crash.
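
To make the concern concrete, here is a rough sketch of the write sequence I have in mind. It is not Kafka's actual code; the class name DurableCheckpointWriter and the ".tmp" suffix are hypothetical. It assumes a POSIX-like file system where a directory can be opened read-only and fsynced, which may fail on other platforms such as Windows.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not Kafka's OffsetCheckpoint.
final class DurableCheckpointWriter {

    static void write(Path target, String contents) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = dir.resolve(target.getFileName() + ".tmp");

        // 1. Write the new contents to a temporary file and fsync the file.
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(contents.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true);
        }

        // 2. Atomically rename the temporary file over the old checkpoint.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);

        // 3. fsync the parent directory so the rename itself is durable.
        //    Assumption: the platform allows opening a directory for reading.
        try (FileChannel dirCh = FileChannel.open(dir, StandardOpenOption.READ)) {
            dirCh.force(true);
        }
    }
}
```

Without step 3, the new directory entry may still be sitting in the page cache when the machine loses power, so the rename can be lost even though the file data was synced.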

In addition, it should store a checksum with each checkpoint. The checksum
should be verified during recovery so that a corrupted checkpoint is detected.
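
A minimal sketch of the checksum idea, assuming a CRC32 stored on a header line in front of the payload. The layout and the class name ChecksummedCheckpoint are my own invention for illustration, not an existing Kafka format.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical on-disk layout: "<crc32>\n<payload>".
final class ChecksummedCheckpoint {

    // Prefix the payload with its CRC32 so corruption can be detected later.
    static String encode(String payload) {
        return crcOf(payload) + "\n" + payload;
    }

    // Verify the stored CRC32 against the payload; throw if they disagree.
    static String decode(String stored) {
        int nl = stored.indexOf('\n');
        if (nl < 0) {
            throw new IllegalStateException("missing checksum header");
        }
        long expected;
        try {
            expected = Long.parseLong(stored.substring(0, nl));
        } catch (NumberFormatException e) {
            throw new IllegalStateException("unreadable checksum header", e);
        }
        String payload = stored.substring(nl + 1);
        if (crcOf(payload) != expected) {
            throw new IllegalStateException("checksum mismatch");
        }
        return payload;
    }

    private static long crcOf(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

On recovery, a failed decode() would mean the checkpoint cannot be trusted and has to be rebuilt or replaced from another copy.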

Ideally, it should maintain two checkpoints for each partition. That way, at
least one valid checkpoint always exists, even if a crash corrupts the other.
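
One possible shape for the two-checkpoint idea, as a sketch only: alternate writes between two slot files, tag each write with a sequence number and a CRC32, and on recovery pick the valid slot with the highest sequence number. The file names, header layout, and class name DualSlotCheckpoint are all hypothetical, and the fsync steps are left out to keep the example short.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.zip.CRC32;

// Hypothetical layout per slot file: "<seq> <crc32>\n<payload>".
final class DualSlotCheckpoint {
    private static final class Entry {
        final long seq;
        final String payload;
        Entry(long seq, String payload) { this.seq = seq; this.payload = payload; }
    }

    private final Path[] slots;
    private long nextSeq;

    DualSlotCheckpoint(Path dir) throws IOException {
        slots = new Path[] { dir.resolve("checkpoint.0"), dir.resolve("checkpoint.1") };
        long max = 0;
        for (Path p : slots) {
            Entry e = load(p);
            if (e != null) max = Math.max(max, e.seq);
        }
        // The next write lands in the slot that does NOT hold the newest valid entry.
        nextSeq = max + 1;
    }

    // Durability steps (fsync of file and directory) are omitted in this sketch.
    void write(String payload) throws IOException {
        long seq = nextSeq++;
        String stored = seq + " " + crcOf(seq, payload) + "\n" + payload;
        Files.write(slots[(int) (seq % 2)], stored.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the payload of the newest slot that passes validation, if any.
    Optional<String> readLatest() throws IOException {
        Entry best = null;
        for (Path p : slots) {
            Entry e = load(p);
            if (e != null && (best == null || e.seq > best.seq)) best = e;
        }
        return best == null ? Optional.empty() : Optional.of(best.payload);
    }

    // Returns null when the slot is missing, malformed, or fails its checksum.
    private static Entry load(Path p) throws IOException {
        if (!Files.exists(p)) return null;
        String stored = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        int nl = stored.indexOf('\n');
        if (nl < 0) return null;
        String[] header = stored.substring(0, nl).split(" ");
        if (header.length != 2) return null;
        try {
            long seq = Long.parseLong(header[0]);
            long crc = Long.parseLong(header[1]);
            String payload = stored.substring(nl + 1);
            return crcOf(seq, payload) == crc ? new Entry(seq, payload) : null;
        } catch (NumberFormatException ex) {
            return null;
        }
    }

    private static long crcOf(long seq, String payload) {
        CRC32 crc = new CRC32();
        crc.update((seq + "\n" + payload).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

Because a write only ever touches the slot that does not hold the newest valid entry, a crash in the middle of a write should leave the previous checkpoint intact.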

Let me know if my concerns are valid. 

I think this talk might help in understanding the issue:
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai

Thanks, 

Xiao Li
