On 20.04.2016 at 03:56, Ric Wheeler wrote:
> On 04/19/2016 10:09 AM, Jeff Cody wrote:
> >On Tue, Apr 19, 2016 at 08:18:39AM -0400, Ric Wheeler wrote:
> >>On 04/19/2016 08:07 AM, Jeff Cody wrote:
> >>>Bug fixes for gluster; third patch is to prevent
> >>>a potential data loss when trying to recover from
> >>>a recoverable error (such as ENOSPC).
> >>Hi Jeff,
> >>
> >>Just a note, I have been talking to some of the disk drive people
> >>here at LSF (the kernel summit for file and storage people) and got
> >>a non-public confirmation that individual storage devices (s-ata
> >>drives or scsi) can also dump cache state when a synchronize cache
> >>command fails. Also followed up with Rik van Riel - in the page
> >>cache in general, when we fail to write back dirty pages, they are
> >>simply marked "clean" (which means effectively that they get
> >>dropped).
> >>
> >>Long winded way of saying that I think that this scenario is not
> >>unique to gluster - any failed fsync() to a file (or block device)
> >>might be an indication of permanent data loss.
> >>
> >Ric,
> >
> >Thanks.
> >
> >I think you are right, we likely do need to address how QEMU handles fsync
> >failures across the board in QEMU at some point (2.7?). Another point to
> >consider is that QEMU is cross-platform - so not only do we have different
> >protocols, and filesystems, but also different underlying host OSes as well.
> >It is likely, like you said, that there are other non-gluster scenarios where
> >we have non-recoverable data loss on fsync failure.
> >
> >With Gluster specifically, if we look at just ENOSPC, does this mean that
> >even if Gluster retains its cache after fsync failure, we still won't know
> >that there was no permanent data loss? If we hit ENOSPC during an fsync, I
> >presume that means Gluster itself may have encountered ENOSPC from a fsync to
> >the underlying storage. In that case, does Gluster just pass the error up
> >the stack?
> >
> >Jeff
>
> I still worry that in many non-gluster situations we will have
> permanent data loss here. Specifically, the way the page cache
> works, if we fail to write back cached data *at any time*, a future
> fsync() will get a failure.

And this is actually what saves the semantic correctness. If you threw
away data, any following fsync() must fail. This is of course
inconvenient because you won't be able to resume a VM that is configured
to stop on errors, and it means some data loss, but it's safe because we
never tell the guest that the data is on disk when it really isn't.

gluster's behaviour (without resync-failed-syncs-after-fsync set) is
different, if I understand correctly. It will throw away the data and
then happily report success on the next fsync() call. And this is what
causes not only data loss, but corruption.

[ Hm, or having read what's below... Did I misunderstand and Linux
  returns failure only for a single fsync() and on the next one it
  returns success again? That would be bad. ]

> That failure could be because of a thinly provisioned backing store,
> but in the interim, the page cache is free to drop the pages that
> had failed. In effect, we end up with data loss in part or in whole
> without a way to detect which bits got dropped.
>
> Note that this is not a gluster issue, this is for any file system
> on top of thinly provisioned storage (i.e., we would see this with
> xfs on thin storage or ext4 on thin storage). In effect, if gluster
> has written the data back to xfs and that is on top of a thinly
> provisioned target, the kernel might drop that data before you can
> try an fsync again. Even if you retry the fsync(), the pages are
> marked clean so they will not be pushed back to storage on that
> second fsync().

I'm wondering... Marking the page clean means that it can be evicted
from the cache, right? Which happens whenever something more useful can
be done with the memory, i.e. possibly at any time. Does this mean that
two consecutive reads of the same block can return different data even
though no process has written to the file in between?

Also, O_DIRECT bypasses the problem, right? With O_DIRECT, the write
request itself would already fail, not only the fsync().
We recommend that for production environments anyway.

> Same issue with link loss - if we lose connection to a storage
> target, it is likely to take time to detect that, more time to
> reconnect. In the interim, any page cache data is very likely to get
> dropped under memory pressure.
>
> In both of these cases, fsync() failure is effectively a signal of a
> high chance of data that has been already lost. A retry will not
> save the day.
>
> At LSF/MM today, we discussed an option that would allow the page
> cache to hang on to data - for re-tryable errors only for example -
> so that this would not happen. The impact of this is also
> potentially huge (page cache/physical memory could be exhausted
> while waiting for an admin to fix the issue) so it would have to be
> a non-default option.

Is memory pressure the most common case, though?

The odd effect that I see is that calling fsync() could actually make
data less safe than it was if the call fails. With the kernel marking
the pages clean on failure, instead of evicting "really clean" pages, we
can now evict "dirty, but failed writeout" pages even without any real
memory pressure, just because they can't be distinguished any more. Or
maybe they aren't even evicted, but the admin fixes the problem and we
could now write them to the disk if only they were still marked dirty
and wouldn't be ignored in the writeout.

I'm sure there are solutions that are more intelligent than the extremes
of "mark clean on error" and "keep failed pages indefinitely" and that
cover a large part of use cases where qemu wants to resume a VM after a
failure (for local files perhaps most commonly resuming after ENOSPC).

Even just evicting pages immediately on a failure would probably be an
improvement because reads would then be consistent. And keeping the data
around until we *really* need memory might solve the problem for all
practical purposes.
If we do eventually need the memory and throw away
data, fsync() consistently returning an error after throwing away data
is still safe, but we have a much better behaviour in the average case.

> I think that we will need some discussions with the kernel memory
> management team (and some storage kernel people) to see what seems
> reasonable here.

It's a good discussion to have, but for the network protocols (like with
gluster) we tend to use the native libraries and don't even go through
the kernel page cache. So I think we shouldn't stop discussing the
semantics of these protocols and APIs while talking about the kernel
page cache.

Network protocols are also where errors like "network is down" become
more relevant, so if anything, we want to have better error recovery
than on local files there.

Kevin
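
P.S. To make the "any following fsync() must fail" rule from this thread
concrete, here is a minimal Python sketch. It is not QEMU or gluster
code: SafeImage, StickyFlushError and make_flaky_backend are
hypothetical names invented for illustration, and the backend is only a
simulation of a store that fails one flush with ENOSPC and then reports
success again (the unsafe behaviour described above).

```python
import errno


class StickyFlushError(OSError):
    """Raised on every flush after an earlier flush has failed."""


class SafeImage:
    """Hypothetical wrapper that remembers the first flush failure."""

    def __init__(self, backend_flush):
        # backend_flush: hypothetical callable that raises OSError on failure.
        self._backend_flush = backend_flush
        self._failed_errno = None

    def flush(self):
        if self._failed_errno is not None:
            # The backend may already have discarded the data, so never
            # report success again, whatever the backend says now.
            raise StickyFlushError(self._failed_errno,
                                   "earlier flush failed; data may be lost")
        try:
            self._backend_flush()
        except OSError as exc:
            # Record the failure so that it stays visible to later callers.
            self._failed_errno = exc.errno or errno.EIO
            raise


def make_flaky_backend(fail_on_call):
    """Simulated backend: fails exactly one flush with ENOSPC, then
    silently 'succeeds' on all later calls."""
    calls = {"n": 0}

    def backend_flush():
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            raise OSError(errno.ENOSPC, "No space left on device")

    return backend_flush
```

With make_flaky_backend(1), the first SafeImage.flush() raises the
backend's ENOSPC error and every later flush() raises StickyFlushError,
even though the simulated backend itself would report success. A real
implementation would also need a way to clear the state once the image
has been reopened or the lost writes have been replayed; that is left
out here.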