On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <da...@fromorbit.com> wrote:
> On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
>> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <da...@fromorbit.com> wrote:
>> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
>> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even
>> >> for filesystems, since there's nothing which write-protects pages
>> >> that are writeably mapped. In the normal path, page writeback does
>> >> this, but that doesn't happen for DAX. I remember we once talked
>> >> about this but it got lost. We need something like walking all
>> >> filesystem inodes during fs freeze and write-protecting all pages
>> >> that are mapped. But that's going to be slow...
>> >
>> > fsync() has the same problem - we have no record of the pages that
>> > need to be committed and then write-protected when fsync() is
>> > called after write()...
>>
>> I know Ross is still working on that implementation. However, I had
>> a thought on the flight to ksummit that maybe we shouldn't worry
>> about tracking dirty state on a per-page basis. For small / frequent
>> synchronizations an application really should be using the nvml
>> library [1] to issue cache flushes and pcommit from userspace on a
>> per-cacheline basis. That leaves unmodified apps that want to be
>> correct in the presence of DAX mappings. Two things we can do to
>> mitigate that case:
>>
>> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
>> flag. Applications shouldn't silently become incorrect simply
>> because the fs is mounted with -o dax. If an app doesn't understand
>> DAX mappings it should get page-cache semantics. This also protects
>> apps that are not expecting DAX semantics on raw block device
>> mappings.
>
> Which is the complete opposite of what we are trying to achieve with
> DAX, i.e. that existing applications "just work" with DAX without
> modification. So this is a non-starter.

The list of things DAX breaks is getting shorter, but certainly the
page-less paths will not be without surprises for quite a while yet...
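To make the nvml reference above concrete: the model is that the app
maps the file directly and makes its own stores durable per-cacheline,
rather than round-tripping through msync()/fsync(). A minimal sketch,
assuming nvml's libpmem as it stands today - pmem_map() and
pmem_persist() are the library calls, while update_record() and the
lack of unmap/size checking are purely illustrative:

#include <fcntl.h>
#include <string.h>
#include <libpmem.h>

/* Store a buffer into a file on a DAX mount and make it durable. */
int update_record(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_RDWR);
	void *pmem;

	if (fd < 0)
		return -1;
	pmem = pmem_map(fd);		/* MAP_SHARED mmap of the whole file */
	close(fd);
	if (!pmem)
		return -1;

	memcpy(pmem, buf, len);		/* stores go straight to media */
	pmem_persist(pmem, len);	/* cache flush per line + pcommit;
					   no syscall, no page cache */
	return 0;
}

No page dirtying, no writeback - which is exactly why the kernel has
no record of what an unmodified fsync() caller would need flushed.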
> Also, DAX access isn't a property of mmap - it's a property of the
> inode. We cannot do DAX access via mmap while mixing page cache
> based access through file descriptor based interfaces. This is why
> I'm adding an inode attribute (on disk) to enable per-file DAX
> capabilities - either everything is via the DAX paths, or nothing
> is.

Per-inode control sounds very useful; I'll look at a similar mechanism
for the raw block case. However, I'm still not quite convinced that
page-cache control is an inode-only property, especially when direct
I/O is not an inode property. That said, I agree the complexity of
handling mixed mappings of the same file is prohibitive.

>> 2/ Even if we get a new flag that lets the kernel know the app
>> understands DAX mappings, we shouldn't leave fsync broken. Can we
>> instead get by with a simple / big hammer solution? I.e.
>
> Because we don't physically have to write back data the problem is
> both simpler and more complex. The simplest solution is for the
> underlying block device to implement blkdev_issue_flush() correctly.
>
> i.e. if blkdev_issue_flush() behaves according to its required
> semantics - that all volatile cached data is flushed to stable
> storage - then fsync-on-DAX will work appropriately. As it is, this
> is needed for journal-based filesystems to work correctly, as they
> are assuming that their journal writes are being treated correctly
> as REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
> ordering is maintained....
>
> So, to begin with, this problem needs to be solved at the block
> device level. That's the simple, brute-force, big-hammer solution to
> the problem, and it requires no changes at the filesystem level at
> all.
>
> However, to avoid having to flush the entire block device range on
> fsync we need a much more complex solution that tracks the dirty
> ranges of the file and hence what needs committing when fsync is
> run....
>
>> Disruptive, yes, but if an app cares about efficient persistent
>> memory synchronization fsync is already the wrong API.
>
> I don't really care about efficiency right now - correctness comes
> first. Fundamentally, the app should not care whether it is writing
> to persistent memory or spinning rust - the filesystem needs to
> provide the application with exactly the same integrity guarantees
> regardless of the underlying storage.

Sounds good: get blkdev_issue_flush() functional first and then worry
about building a more efficient solution on top.
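For that brute-force step, a minimal model of what the pmem driver's
REQ_FLUSH handling would amount to is just "write back every cacheline
in the device's range". Userspace sketch for illustration only -
flush_whole_range() and CACHELINE are made-up names, only the
clflush/fence semantics are x86's:

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

#define CACHELINE 64	/* x86 cacheline size */

/* Write back every cacheline in [addr, addr + len). */
static void flush_whole_range(void *addr, size_t len)
{
	char *p = (char *)((uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1));
	char *end = (char *)addr + len;

	for (; p < end; p += CACHELINE)
		_mm_clflush(p);	/* evict/write back one line */
	_mm_sfence();		/* harmless with clflush; required if
				   this were clflushopt/clwb */
}

Dumb and O(device size) on every fsync, but correct - it gives the
filesystems the REQ_FLUSH semantics they already rely on while the
dirty-range tracking gets sorted out.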