[ https://issues.apache.org/jira/browse/KAFKA-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729087#comment-13729087 ]
Jay Kreps commented on KAFKA-615:
---------------------------------

Ah thanks for the detailed review:

50.2 Yes, nice.

50.3 I thought of this but don't think it is a problem. Flushes are always up to a particular recovery point. So let us say that we are flushing on every offset and flush(100) and flush(101) are reordered since they are async. That is actually okay: flush(100) will not actually write any data, and the check that updates the recovery point is always done in a lock to ensure it doesn't get clobbered by out-of-order flushes (a rough sketch is included after the quoted description below). Let me know if you see something I am missing.

51. Yup.

52.1 Good point.

52.2 Yeah, I thought of this too. My claim is that it is okay as long as the usage is something like (1) stop writes, (2) flush the checkpoints, (3) take new writes. If we do this then there are two cases: (1) the metadata write that truncated the file occurred, or (2) it did not occur. If it did not occur, then it is no different than if we crashed prior to the truncate. If it did occur and the log end is before the recovery point, that is still fine, because that region is on stable storage (we just masked off the end of the log), so we don't need to recover it. The troublesome case is if we take writes before flushing the checkpoint; then we are in trouble. The question is whether those assumptions actually hold. However, I think there is at least one bug that you led me to, which is that I need to ensure the recovery point is then reduced to the log end, or else after we append data we would think it was flushed (also sketched below). Let me know if you buy this analysis.

FWIW I actually think our truncate logic may have a hole in 0.8, because we always recover the last segment only. However, consider a case where we truncate off the last segment, making a previous segment active, then we take some writes (but no flush) and then crash. In this case it is possible that the segment we truncated off reappears and also that we have partial writes and old data in the prior segment. But on recovery we will only check the zombie segment and ignore the prior segment. One way to simplify some of the reasoning would just be to fsync on truncate, which it doesn't look like we do now. That would help us out of a lot of corner cases. The downside is that it may add a lot of time to the becomeFollower path because of the burst of writes.

53 Will do.

Let me know what you think about those two discussion points. I would rather fully think this through now than chase one-in-a-million bugs later.

> Avoid fsync on log segment roll
> -------------------------------
>
>                 Key: KAFKA-615
>                 URL: https://issues.apache.org/jira/browse/KAFKA-615
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jay Kreps
>            Assignee: Neha Narkhede
>         Attachments: KAFKA-615-v1.patch, KAFKA-615-v2.patch, KAFKA-615-v3.patch, KAFKA-615-v4.patch, KAFKA-615-v5.patch, KAFKA-615-v6.patch
>
> It still isn't feasible to run without an application-level fsync policy. This is a problem because fsync locks the file, and tuning such a policy so that the flushes aren't so frequent that seeks reduce throughput, yet not so infrequent that the fsync writes so much data that there is a noticeable jump in latency, is very challenging.
> The remaining problem is the way that log recovery works. Our current policy is that if a clean shutdown occurs we do no recovery. If an unclean shutdown occurs we recover the last segment of all logs. To make this correct we need to ensure that each segment is fsync'd before we create a new segment. Hence the fsync during roll.
> Obviously if the fsync during roll is the only time fsync occurs, then it will potentially write out the entire segment, which for a 1GB segment at 50MB/sec might take many seconds. The goal of this JIRA is to eliminate this and make it possible to run with no application-level fsyncs at all, depending entirely on replication and background writeback for durability.
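
A minimal sketch of the reasoning in 50.3, using a hypothetical SketchLog class rather than the real kafka.log.Log code (all names and fields here are made up for illustration): flushes are always "up to" a target offset, and the recovery point is only ever advanced under a lock, so a reordered flush(100) arriving after flush(101) writes nothing and cannot move the recovery point backwards.

{code:scala}
import java.util.concurrent.locks.ReentrantLock

// Hypothetical stand-in for the real log; names are illustrative only.
class SketchLog {
  private val lock = new ReentrantLock()

  // Everything strictly below this offset is known to be on disk.
  @volatile var recoveryPoint: Long = 0L
  // Offset of the next message to be appended.
  @volatile var logEndOffset: Long = 0L

  // Appends just advance the log end; no fsync happens here.
  def append(numMessages: Long): Unit =
    logEndOffset += numMessages

  // Flush (fsync) everything up to, but not including, targetOffset.
  // The check-and-update of the recovery point happens under the lock and
  // only moves it forward, so a late flush(100) arriving after flush(101)
  // finds nothing to write and leaves the recovery point at 101.
  def flush(targetOffset: Long): Unit = {
    if (targetOffset > recoveryPoint) {
      // ... fsync the segment data in [recoveryPoint, targetOffset) here ...
      lock.lock()
      try {
        if (targetOffset > recoveryPoint)
          recoveryPoint = targetOffset
      } finally {
        lock.unlock()
      }
    }
  }
}
{code}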
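
And a sketch of the fix raised in 52.2, written as a method added to the same hypothetical SketchLog: on truncate the recovery point is pulled back to the new log end, so that data appended afterwards is not mistakenly treated as already flushed. The optional fsync-on-truncate simplification mentioned in the comment appears only as a code comment.

{code:scala}
  // Added to the SketchLog class above: truncate the log back to targetOffset.
  def truncateTo(targetOffset: Long): Unit = {
    lock.lock()
    try {
      if (targetOffset < logEndOffset) {
        // ... drop segment data at and above targetOffset here; optionally
        // fsync the surviving segment to close the zombie-segment corner case ...
        logEndOffset = targetOffset
        // The 52.2 fix: without this, data appended after the truncate would
        // sit below the old recovery point and be skipped by crash recovery.
        recoveryPoint = math.min(recoveryPoint, targetOffset)
      }
    } finally {
      lock.unlock()
    }
  }
{code}

The invariant the discussion leans on is then simply that recoveryPoint <= logEndOffset and everything below recoveryPoint is on stable storage, so recovery only has to examine the range above the recovery point.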