On 25/04/21 11:27AM, Darrick J. Wong wrote: > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote: > > Subject: famfs: port into fuse > > > > This is the initial RFC for the fabric-attached memory file system (famfs) > > integration into fuse. In order to function, this requires a related patch > > to libfuse [1] and the famfs user space [2]. > > > > This RFC is mainly intended to socialize the approach and get feedback from > > the fuse developers and maintainers. There is some dax work that needs to > > be done before this should be merged (see the "poisoned page|folio problem" > > below). > > Note that I'm only looking at the fuse and iomap aspects of this > patchset. I don't know the devdax code at all. > > > This patch set fully works with Linux 6.14 -- passing all existing famfs > > smoke and unit tests -- and I encourage existing famfs users to test it. > > > > This is really two patch sets mashed up: > > > > * The patches with the dev_dax_iomap: prefix fill in missing functionality > > for > > devdax to host an fs-dax file system. > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively > > unchanged since last year. > > > > Because this is not ready to merge yet, I have felt free to leave some debug > > prints in place because we still find them useful; those will be cleaned up > > in a subsequent revision. > > > > Famfs Overview > > > > Famfs exposes shared memory as a file system. Famfs consumes shared memory > > from dax devices, and provides memory-mappable files that map directly to > > the memory - no page cache involvement. Famfs differs from conventional > > file systems in fs-dax mode, in that it handles in-memory metadata in a > > sharable way (which begins with never caching dirty shared metadata). > > > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM > > 2024 [5] was that it should be ported into fuse - and this RFC is the first > > public evidence that I've been working on that. > > This is very timely, as I just started looking into how I might connect > iomap to fuse so that most of the hot IO path continues to run in the > kernel, and userspace block device filesystem drivers merely supply the > file mappings to the kernel. In other words, we kick the metadata > parsing craziness out of the kernel.
Coool! > > > The key performance requirement is that famfs must resolve mapping faults > > without upcalls. This is achieved by fully caching the file-to-devdax > > metadata for all active files. This is done via two fuse client/server > > message/response pairs: GET_FMAP and GET_DAXDEV. > > Heh, just last week I finally got around to laying out how I think I'd > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end > upcalls to a fuse server. Note that I've done zero prototyping but > "upload all the mappings at open time" seems like a reasonable place for > me to start looking, especially for a filesystem with static mappings. > > I think what I want to try to build is an in-kernel mapping cache (sort > of like the one you built), only with upcalls to the fuse server when > there is no mapping information for a given IO. I'd probably want to > have a means for the fuse server to put new mappings into the cache, or > invalidate existing mappings. > > (famfs obviously is a simple corner-case of that grandiose vision, but I > still have a long way to get to my larger vision so don't take my words > as any kind of requirement.) > > > Famfs remains the first fs-dax file system that is backed by devdax rather > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups). > > > > Notes > > > > * Once the dev_dax_iomap patches land, I suspect it may make sense for > > virtiofs to update to use the improved interface. > > > > * I'm currently maintaining compatibility between the famfs user space and > > both the standalone famfs kernel file system and this new fuse > > implementation. In the near future I'll be running performance comparisons > > and sharing them - but there is no reason to expect significant > > degradation > > with fuse, since famfs caches entire "fmaps" in the kernel to resolve > > I'm curious to hear what you find, performance-wise. :) > > > faults with no upcalls. This patch has a bit too much debug turned on to > > to that testing quite yet. A branch > > A branch ... what? I trail off sometimes... ;) > > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV. > > > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a > > GET_FMAP message and response. The "fmap" is the full file-to-dax mapping, > > allowing the fuse/famfs kernel code to handle read/write/fault without any > > upcalls. > > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading > mappings into the kernel. That may be a better approach. Miklos and I discussed it during LPC last year, and thought both were options. Having implemented it at LOOKUP time, I think moving it to open might avoid my READDIRPLUS problem (which is that RDP is a mashup of READDIR and LOOKUP), therefore might need to add the GET_FMAP payload. Moving GET_FMAP to open time, would break that connection in a good way, I think. > > > * After each GET_FMAP, the fmap is checked for extents that reference > > previously-unknown daxdevs. Each such occurence is handled with a > > GET_DAXDEV message and response. > > I hadn't figured out how this part would work for my silly prototype. > Just out of curiosity, does the famfs fuse server hold an open fd to the > storage, in which case the fmap(ping) could just contain the open fd? > > Where are the mappings that are sent from the fuse server? Is that > struct fuse_famfs_simple_ext? See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. Famfs currently supports either simple extents (daxdev, offset, length) or interleaved ones (which describe each "strip" as a simple extent). I think the explanation in famfs_kfmap.h is pretty clear. A key question is whether any additional basic metadata abstractions would be needed - because the kernel needs to understand the full scheme. With disaggregated memory, the interleave approach is nice because it gets aggregated performance and resolving a file offset to daxdev offset is order 1. Oh, and there are two fmap formats (ok, more, but the others are legacy ;). The fmaps-in-messages structs are currently in the famfs section of include/uapi/linux/fuse.h. And the in-memory version is in fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface. (ugh...) > > > * Daxdevs are stored in a table (which might become an xarray at some > > point). > > When entries are added to the table, we acquire exclusive access to the > > daxdev via the fs_dax_get() call (modeled after how fs-dax handles this > > with pmem devices). famfs provides holder_operations to devdax, providing > > a notification path in the event of memory errors. > > > > * If devdax notifies famfs of memory errors on a dax device, famfs currently > > bocks all subsequent accesses to data on that device. The recovery is to > > re-initialize the memory and file system. Famfs is memory, not storage... > > Ouch. :) Cautious initial approach (i.e. I'm trying not to scare people too much ;) > > > * Because famfs uses backing (devdax) devices, only privileged mounts are > > supported. > > > > * The famfs kernel code never accesses the memory directly - it only > > facilitates read, write and mmap on behalf of user processes. As such, > > the RAS of the shared memory affects applications, but not the kernel. > > > > * Famfs has backing device(s), but they are devdax (char) rather than > > block. Right now there is no way to tell the vfs layer that famfs has a > > char backing device (unless we say it's block, but it's not). Currently > > we use the standard anonymous fuse fs_type - but I'm not sure that's > > ultimately optimal (thoughts?) > > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the > fuse_args object? fuse2fs does that, though I don't recall if that's a > reasonable thing to do. The kernel needs to "own" the dax devices. fs-dax on pmem/block calls fs_dax_get_by_bdev() and passes in holder_operations - which are used for error upcalls, but also effect exclusive ownership. I added fs_dax_get() since the bdev version wasn't really right or char devdax. But same holder_operations. I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to span multiple daxdevs, in order to interleave for performance. The approach of retrieving them with GET_DAXDEV handles the generalized case, so "-o" just amounts to a second way to do the same thing. "But wait"... I thought. Doesn't the "-o" approach get the primary daxdev locked up sooner, which might be good? Well, no, because famfs creates a couple of meta files during mount .meta/.superblock and .meta/.log - and those are guaranteed to reference the primary daxdev. So I concluded the -o approach wasn't worth the trouble (though it's not *much* trouble). > > > The "poisoned page|folio problem" > > > > * Background: before doing a kernel mount, the famfs user space [2] > > validates > > the superblock and log. This is done via raw mmap of the primary devdax > > device. If valid, the file system is mounted, and the superblock and log > > get exposed through a pair of files (.meta/.superblock and .meta/.log) - > > because we can't be using raw device mmap when a file system is mounted > > on the device. But this exposes a devdax bug and warning... > > > > * Pages that have been memory mapped via devdax are left in a permanently > > problematic state. Devdax sets page|folio->mapping when a page is accessed > > via raw devdax mmap (as famfs does before mount), but never cleans it up. > > When the pages of the famfs superblock and log are accessed via the "meta" > > files after mount, we see a WARN_ONCE() in dax_insert_entry(), which > > notices that page|folio->mapping is still set. I intend to address this > > prior to asking for the famfs patches to be merged. > > > > * Alistair Popple's recent dax patch series [6], which has been merged > > for 6.15, addresses some dax issues, but sadly does not fix the poisoned > > page|folio problem - its enhanced refcount checking turns the warning into > > an error. > > > > * This 6.14 patch set disables the warning; a proper fix will be required > > for > > famfs to work at all in 6.15. Dan W. and I are actively discussing how to > > do > > this properly... > > > > * In terms of the correct functionality of famfs, the warning can be > > ignored. > > > > References > > > > [1] - https://github.com/libfuse/libfuse/pull/1200 > > [2] - https://github.com/cxl-micron-reskit/famfs > > Thanks for posting links, I'll have a look there too. > > --D > I'm happy to talk if you wanna kick ideas around. Cheers, John