[ 
https://issues.apache.org/jira/browse/KUDU-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213134#comment-16213134
 ] 

Adar Dembo commented on KUDU-2195:
----------------------------------

It's important to note that while both cmeta and wal flushes are conditioned on 
log_force_fsync_all, tablet meta flushes are not; they always use "strong" 
durability semantics.
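
To make that concrete, here is a rough model of which flushes issue an explicit fsync under each setting. It is only a sketch of the rule as stated above: the function and the string labels are made up for illustration and are not Kudu code, and it ignores any implicit syncing the filesystem might do.

```python
# Hypothetical model, not Kudu code: which flushes issue an explicit fsync.
# Assumption (from the comment above): cmeta and wal flushes are conditioned
# on log_force_fsync_all, while tablet meta flushes always fsync.

def flush_fsyncs(kind, log_force_fsync_all):
    """Return True if a flush of the given kind issues an explicit fsync."""
    if kind == "tablet_meta":
        return True  # always "strong" durability
    if kind in ("cmeta", "wal"):
        return log_force_fsync_all  # conditioned on the flag
    raise ValueError("unknown flush kind: %r" % (kind,))


# The two ordered pairs from the issue description: (first, second).
PAIRS = [("cmeta", "wal"), ("wal", "tablet_meta")]


def describe(log_force_fsync_all):
    """For each pair, report whether the first and second flush fsync."""
    return [
        (first, flush_fsyncs(first, log_force_fsync_all),
         second, flush_fsyncs(second, log_force_fsync_all))
        for first, second in PAIRS
    ]


if __name__ == "__main__":
    for flag in (True, False):
        print("log_force_fsync_all=%s" % flag)
        for first, f_sync, second, s_sync in describe(flag):
            print("  %s fsync=%s -> %s fsync=%s" % (first, f_sync, second, s_sync))
```

With the flag off, this model reports no fsync for either flush in the first pair, and an fsync only for tablet meta in the second.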

bq. [With weaker semantics and] if using the same disk for both the wal and 
data then the invariants are still preserved, as buffers will be flushed in the 
right order.
I don't understand this. In 1) neither flush is going to fsync, and in 2) only 
the tablet meta flush will fsync. Given that fsyncs act as barriers, don't both 
flushes in each pair need to fsync in order to establish a "happened before" 
relationship?
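
For illustration, here is a small sketch of the barrier idea using plain files. It assumes the simplest possible reading: the order is only enforced if the first file has been fsynced before the second one is written at all. All names are hypothetical and none of this is Kudu code. It also skips syncing the containing directory, which a real implementation would likely need for newly created files.

```python
import os
import tempfile


def write_file(path, data, do_fsync, log):
    """Write data to path, optionally fsync it, and record each step in log."""
    name = os.path.basename(path)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # hand the bytes to the kernel; this is not durability
        log.append(("write", name))
        if do_fsync:
            os.fsync(f.fileno())  # ask the kernel to persist the file contents
            log.append(("fsync", name))


def flush_pair(first_path, second_path, fsync_first, fsync_second):
    """Flush two files in order and return the log of operations performed."""
    log = []
    write_file(first_path, b"first", fsync_first, log)
    write_file(second_path, b"second", fsync_second, log)
    return log


def first_durable_before_second(log, first_name, second_name):
    """True only if the first file was fsynced before the second was written."""
    try:
        return log.index(("fsync", first_name)) < log.index(("write", second_name))
    except ValueError:
        # The first file was never fsynced, so nothing orders the two writes.
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        wal = os.path.join(d, "wal")
        meta = os.path.join(d, "tablet_meta")
        # Case 2) with weaker semantics: only the tablet meta flush fsyncs.
        log = flush_pair(wal, meta, fsync_first=False, fsync_second=True)
        print(log, first_durable_before_second(log, "wal", "tablet_meta"))
```

Under that reading, an fsync on the second file alone says nothing about whether the first file reached its own disk, which is the situation in case 2) when the wal and data live on different disks.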

I also wanted to take a step back and consider the bigger picture. When 
log_force_fsync_all=true (and, I'm assuming, with 
enable_data_block_fsync=true), we agree that we think Kudu behaves correctly 
and provides strong durability in the event of a crash at any point in time. 
But what exactly are our expectations when log_force_fsync_all=false? If I'm 
understanding you correctly, you expect that in the worst case, a crash will 
result in lost operations from the wal which will be fixed automatically as the 
replica catches up. Whereas what's reported here suggests that a crash can 
hopelessly corrupt a replica, forcing it to be rereplicated elsewhere.

So here's a dumb question: is the potential loss of these "weaker durability" 
semantics actually that bad? Users who want durability should be using 
log_force_fsync_all=true, right? KUDU-2182 already suggests changing its 
default value to true for masters; we could entertain doing the same for 
tservers too, with the understanding that users who favor performance over 
durability can keep it false.

> Enforce durability happened before relationships on multiple disks
> ------------------------------------------------------------------
>
>                 Key: KUDU-2195
>                 URL: https://issues.apache.org/jira/browse/KUDU-2195
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tablet
>            Reporter: David Alves
>
> When using weaker durability semantics (e.g. when log_force_fsync_all is off) we 
> should still enforce certain happened-before relationships, which are not 
> currently being enforced when using different disks for the wal and data.
> The two cases that come to mind where this is relevant are:
> 1) cmeta (c) -> wal (w): We flush cmeta before flushing the wal (for 
> instance on term change) with the intention that either {}, {c} or {c, w} 
> is made durable.
> 2) wal (w) -> tablet meta (t): We flush the wal before tablet metadata to 
> make sure that all commit messages that refer to on-disk row sets (and 
> deltas) are on disk before the row sets they point to, i.e. with the 
> intention that either {}, {w} or {w, t} is made durable.
> With strong durability semantics these are always made durable in the right 
> order. With weaker semantics that is not the case, though. If using the same 
> disk for both the wal and data then the invariants are still preserved, as 
> buffers will be flushed in the right order. But if using different disks for 
> the wal and data (and because cmeta is stored with the data), that is not 
> always the case.
> Case 1) is actually safe on ext4, because we perform an fsync when flushing 
> cmeta (indirectly: rename() implies fsync in ext4). It is not safe on xfs.
> Case 2) is not safe on either filesystem.
> --- Possible solutions ---
> For 1): Store cmeta with the wal; actually always fsync cmeta.
> For 2): Store tablet meta with the wal; always fsync the wal before flushing 
> tablet meta.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
