[jira] [Commented] (KUDU-2195) Enforce durability happened before relationships on multiple disks

Todd Lipcon (JIRA) Mon, 23 Oct 2017 15:19:42 -0700

    [ https://issues.apache.org/jira/browse/KUDU-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215952#comment-16215952 ]


Todd Lipcon commented on KUDU-2195:
-----------------------------------

I think we can't really have any expectations of durability when fsync is 
disabled. For example, there is no requirement that the WAL be "timeline 
truncated": without an fsync, the blocks at the end of a file can be flushed 
in arbitrary order, so I'm not sure we can expect to recover.

That said, I think silently divergent replicas are the worst case. I'd much 
rather that we detect that the local replica is borked, because then everything 
will kick in to re-replicate from elsewhere (assuming this wasn't a cluster-wide 
power outage or somesuch that correlated failures across nodes). So, we should 
do our best to avoid such issues.
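
To make the "detect that the local replica is borked" point concrete, here is a rough Python sketch of scanning a log whose records carry a length and a CRC32. The record layout and the names (HEADER, encode_record, scan_log) are made up for illustration and are not Kudu's actual WAL format. The idea is only that a block which never reached disk shows up as a checksum or length failure rather than as silently wrong data.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte payload length, 4-byte CRC32, payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload with its length and CRC32."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan_log(data: bytes):
    """Return (records, clean). clean is False as soon as damage is found."""
    records = []
    offset = 0
    while offset < len(data):
        if offset + HEADER.size > len(data):
            return records, False  # torn header at the tail
        length, crc = HEADER.unpack_from(data, offset)
        if length == 0:
            # This sketch never writes empty records, so a zero length is
            # treated as a zero-filled (never flushed) region.
            return records, False
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            return records, False  # short or corrupted payload
        records.append(payload)
        offset = start + length
    return records, True
```

A caller would treat clean == False as "this replica cannot be trusted past the last good record" and fall back to re-replication, instead of replaying whatever happens to parse.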

> Enforce durability happened before relationships on multiple disks
> ------------------------------------------------------------------
>
>                 Key: KUDU-2195
>                 URL: https://issues.apache.org/jira/browse/KUDU-2195
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tablet
>            Reporter: David Alves
>
> When using weaker durability semantics (e.g. when log_force_fsync is off) we 
> should still enforce certain happened-before relationships, which are not 
> currently being enforced when using different disks for the wal and data.
> The two cases that come to mind where this is relevant are:
> 1) cmeta (c) -> wal (w): We flush cmeta before flushing the wal (for 
> instance on term change) with the intention that either {}, {c} or {c, w} 
> were made durable.
> 2) wal (w) -> tablet meta (t): We flush the wal before tablet metadata to 
> make sure that all commit messages that refer to on-disk row sets (and 
> deltas) are on disk before the row sets they point to, i.e. with the 
> intention that either {}, {w} or {w, t} were made durable.
> With strong durability semantics these are always made durable in the right 
> order. With weaker semantics that is not the case, though. If the wal and 
> data share a disk, the invariants are still preserved, as buffers will be 
> flushed in the right order. But if the wal and data are on different disks 
> (and because cmeta is stored with the data), that is not always the case.
> Case 1) is actually safe on ext4, because we perform an fsync (indirectly: 
> rename() implies fsync in ext4) when flushing cmeta. But it is not safe on xfs.
> Case 2) is not safe on either filesystem.
> --- Possible solutions ---
> For 1): Store cmeta with the wal, or actually always fsync cmeta.
> For 2): Store tablet meta with the wal, or always fsync the wal before 
> flushing tablet meta.
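
For reference, the "always fsync the wal before flushing tablet meta" option amounts to something like the following Python sketch. It assumes a POSIX system (fsync on a directory fd), and the helper names and file layout are hypothetical; this is not how Kudu's C++ code is structured. The point is only the ordering: the wal append is fsynced before the metadata file is replaced, so the durable states are limited to {}, {w} or {w, t} even when the two paths live on different disks.

```python
import os
import tempfile


def durable_append(path: str, payload: bytes) -> None:
    """Append payload and fsync it before returning."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # push the kernel's buffers to the device


def fsync_dir(dirpath: str) -> None:
    """fsync a directory so a rename inside it is durable (POSIX only)."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_replace(path: str, payload: bytes) -> None:
    """Write a temp file, fsync it, rename it over path, fsync the directory."""
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # explicit, rather than relying on rename()
    os.replace(tmp, path)
    fsync_dir(dirpath)


def flush_wal_then_tablet_meta(wal_path, wal_record, meta_path, meta_bytes):
    """Enforce w -> t: the wal record is durable before the metadata changes."""
    durable_append(wal_path, wal_record)
    durable_replace(meta_path, meta_bytes)
```

The same durable_replace pattern would cover the "actually always fsync cmeta" option for case 1), since it does not depend on the filesystem treating rename() as an implicit fsync.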



