[ https://issues.apache.org/jira/browse/KAFKA-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729087#comment-13729087 ]
Jay Kreps commented on KAFKA-615:
---------------------------------

Ah thanks for the detailed review:

50.2 Yes, nice.

50.3 I thought of this but don't think it is a problem. Flushes are always up to a particular recovery point. So let us say that we are flushing on every offset and flush(100) and flush(101) are reordered since they are async. That is actually okay: flush(100) will not actually write any data, and the check that updates the recovery point is always done in a lock to ensure it doesn't get clobbered by out-of-order flushes (a rough sketch is included after the quoted description below). Let me know if you see something I am missing.

51. Yup.

52.1 Good point.

52.2 Yeah, I thought of this too. My claim is that it is okay as long as the usage is something like (1) stop writes, (2) flush the checkpoints, (3) take new writes. If we do this then there are two cases: (1) the metadata write that truncated the file occurred, or (2) it did not occur. If it did not occur, then it is no different than if we crashed prior to the truncate. If it did occur and the log end is before the recovery point, that is still fine, because that region is on stable storage (we just masked off the end of the log), so we don't need to recover it. The troublesome case is if we take writes before flushing the checkpoint; then we are in trouble. The question is whether those assumptions actually hold. However, I think there is at least one bug that you led me to, which is that I need to ensure the recovery point is then reduced to the log end, or else after we append data we would think it was flushed (also sketched below). Let me know if you buy this analysis.

FWIW I actually think our truncate logic may have a hole in 0.8, because we always recover the last segment only. However, consider a case where we truncate off the last segment, making a previous segment active, then we take some writes (but no flush) and then crash. In this case it is possible that the segment we truncated off reappears and also that we have partial writes and old data in the prior segment. But on recovery we will only check the zombie segment and ignore the prior segment. One way to simplify some of the reasoning would just be to fsync on truncate, which it doesn't look like we do now. That would help us out of a lot of corner cases. The downside is that it may add a lot of time to the becomeFollower path because of the burst of writes.

53 Will do.

Let me know what you think about those two discussion points. I would rather fully think this through now than chase one-in-a-million bugs later.

> Avoid fsync on log segment roll
> -------------------------------
>
>                 Key: KAFKA-615
>                 URL: https://issues.apache.org/jira/browse/KAFKA-615
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jay Kreps
>            Assignee: Neha Narkhede
>         Attachments: KAFKA-615-v1.patch, KAFKA-615-v2.patch, KAFKA-615-v3.patch, KAFKA-615-v4.patch, KAFKA-615-v5.patch, KAFKA-615-v6.patch
>
> It still isn't feasible to run without an application-level fsync policy. This is a problem because fsync locks the file, and tuning such a policy so that the flushes aren't so frequent that seeks reduce throughput, yet not so infrequent that the fsync writes so much data that there is a noticeable jump in latency, is very challenging.
> The remaining problem is the way that log recovery works. Our current policy is that if a clean shutdown occurs we do no recovery. If an unclean shutdown occurs we recover the last segment of all logs. To make this correct we need to ensure that each segment is fsync'd before we create a new segment. Hence the fsync during roll.
> Obviously if the fsync during roll is the only time fsync occurs, then it will potentially write out the entire segment, which for a 1GB segment at 50MB/sec might take many seconds. The goal of this JIRA is to eliminate this and make it possible to run with no application-level fsyncs at all, depending entirely on replication and background writeback for durability.
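
A minimal sketch of the reasoning in 50.3, using a hypothetical SketchLog class rather than the real kafka.log.Log code (all names and fields here are made up for illustration): flushes are always "up to" a target offset, and the recovery point is only ever advanced under a lock, so a reordered flush(100) arriving after flush(101) writes nothing and cannot move the recovery point backwards.

{code:scala}
import java.util.concurrent.locks.ReentrantLock

// Hypothetical stand-in for the real log; names are illustrative only.
class SketchLog {
  private val lock = new ReentrantLock()

  // Everything strictly below this offset is known to be on disk.
  @volatile var recoveryPoint: Long = 0L
  // Offset of the next message to be appended.
  @volatile var logEndOffset: Long = 0L

  // Appends just advance the log end; no fsync happens here.
  def append(numMessages: Long): Unit =
    logEndOffset += numMessages

  // Flush (fsync) everything up to, but not including, targetOffset.
  // The check-and-update of the recovery point happens under the lock and
  // only moves it forward, so a late flush(100) arriving after flush(101)
  // finds nothing to write and leaves the recovery point at 101.
  def flush(targetOffset: Long): Unit = {
    if (targetOffset > recoveryPoint) {
      // ... fsync the segment data in [recoveryPoint, targetOffset) here ...
      lock.lock()
      try {
        if (targetOffset > recoveryPoint)
          recoveryPoint = targetOffset
      } finally {
        lock.unlock()
      }
    }
  }
}
{code}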
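
And a sketch of the fix raised in 52.2, written as a method added to the same hypothetical SketchLog: on truncate the recovery point is pulled back to the new log end, so that data appended afterwards is not mistakenly treated as already flushed. The optional fsync-on-truncate simplification mentioned in the comment appears only as a code comment.

{code:scala}
  // Added to the SketchLog class above: truncate the log back to targetOffset.
  def truncateTo(targetOffset: Long): Unit = {
    lock.lock()
    try {
      if (targetOffset < logEndOffset) {
        // ... drop segment data at and above targetOffset here; optionally
        // fsync the surviving segment to close the zombie-segment corner case ...
        logEndOffset = targetOffset
        // The 52.2 fix: without this, data appended after the truncate would
        // sit below the old recovery point and be skipped by crash recovery.
        recoveryPoint = math.min(recoveryPoint, targetOffset)
      }
    } finally {
      lock.unlock()
    }
  }
{code}

The invariant the discussion leans on is then simply that recoveryPoint <= logEndOffset and everything below recoveryPoint is on stable storage, so recovery only has to examine the range above the recovery point.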