On 26/04/06 10:43AM, Joanne Koong wrote: > On Tue, Mar 31, 2026 at 5:37 AM John Groves <[email protected]> wrote: > > > > From: John Groves <[email protected]> > > > > NOTE: this series depends on the famfs dax series in Ira's for-7.1/dax-famfs > > branch [0] > > > > Changes v9 -> v10 > > - Rebased to Ira's for-7.1/dax-famfs branch [0], which contains the required > > dax patches > > - Add parentheses to FUSE_IS_VIRTIO_DAX() macro, in case something bad is > > passed in as fuse_inode (thanks Jonathan's AI) > > > > Description: > > > > This patch series introduces famfs into the fuse file system framework. > > Famfs depends on the bundled dax patch set. > > > > The famfs user space code can be found at [1]. > > > > Fuse Overview: > > > > Famfs started as a standalone file system, but this series is intended to > > permanently supersede that implementation. At a high level, famfs adds > > two new fuse server messages: > > > > GET_FMAP - Retrieves a famfs fmap (the file-to-dax map for a famfs > > file) > > GET_DAXDEV - Retrieves the details of a particular daxdev that was > > referenced by an fmap > > > > Famfs Overview > > > > Famfs exposes shared memory as a file system. Famfs consumes shared > > memory from dax devices, and provides memory-mappable files that map > > directly to the memory - no page cache involvement. Famfs differs from > > conventional file systems in fs-dax mode, in that it handles in-memory > > metadata in a sharable way (which begins with never caching dirty shared > > metadata). > > > > Famfs started as a standalone file system [2,3], but the consensus at > > LSFMM was that it should be ported into fuse [4,5]. > > > > The key performance requirement is that famfs must resolve mapping faults > > without upcalls. This is achieved by fully caching the file-to-devdax > > metadata for all active files. This is done via two fuse client/server > > message/response pairs: GET_FMAP and GET_DAXDEV. > > > > Famfs remains the first fs-dax file system that is backed by devdax > > rather than pmem in fs-dax mode (hence the need for the new dax mode). > > > > Notes > > > > - When a file is opened in a famfs mount, the OPEN is followed by a > > GET_FMAP message and response. The "fmap" is the full file-to-dax > > mapping, allowing the fuse/famfs kernel code to handle > > read/write/fault without any upcalls. > > > > - After each GET_FMAP, the fmap is checked for extents that reference > > previously-unknown daxdevs. Each such occurrence is handled with a > > GET_DAXDEV message and response. > > > > - Daxdevs are stored in a table (which might become an xarray at some > > point). When entries are added to the table, we acquire exclusive > > access to the daxdev via the fs_dax_get() call (modeled after how > > fs-dax handles this with pmem devices). Famfs provides > > holder_operations to devdax, providing a notification path in the > > event of memory errors or forced reconfiguration. > > > > - If devdax notifies famfs of memory errors on a dax device, famfs > > currently blocks all subsequent accesses to data on that device. The > > recovery is to re-initialize the memory and file system. Famfs is > > memory, not storage... > > > > - Because famfs uses backing (devdax) devices, only privileged mounts are > > supported (i.e. the fuse server requires CAP_SYS_RAWIO). > > > > - The famfs kernel code never accesses the memory directly - it only > > facilitates read, write and mmap on behalf of user processes, using > > fmap metadata provided by its privileged fuse server. As such, the > > RAS of the shared memory affects applications, but not the kernel. > > > > - Famfs has backing device(s), but they are devdax (char) rather than > > block. Right now there is no way to tell the vfs layer that famfs has a > > char backing device (unless we say it's block, but it's not). Currently > > we use the standard anonymous fuse fs_type - but I'm not sure that's > > ultimately optimal (thoughts?) > > > > Changes v8 -> v9 > > - Kconfig: fs/fuse/Kconfig:CONFIG_FUSE_FAMFS_DAX now depends on the > > new CONFIG_DEV_DAX_FSDEV (from drivers/dax/Kconfig) rather than > > just CONFIG_DEV_DAX and CONFIG_FS_DAX. (CONFIG_FUSE_FAMFS_DAX > > depends on those...) > > > > Changes v7 -> v8 > > - Moved to inline __free declaration in fuse_get_fmap() and > > famfs_fuse_meta_alloc(), famfs_teardown() > > - Adopted FIELD_PREP() macro rather than manual bitfield manipulation > > - Minor doc edits > > - I dropped adding magic numbers to include/uapi/linux/magic.h. That > > can be done later if appropriate > > > > Changes v6 -> v7 > > - Fixed a regression in famfs_interleave_fileofs_to_daxofs() that > > was reported by Intel's kernel test robot > > - Added a check in __fsdev_dax_direct_access() for negative return > > from pgoff_to_phys(), which would indicate an out-of-range offset > > - Fixed a bug in __famfs_meta_free(), where not all interleaved > > extents were freed > > - Added chunksize alignment checks in famfs_fuse_meta_alloc() and > > famfs_interleave_fileofs_to_daxofs() as interleaved chunks must > > be PTE or PMD aligned > > - Simplified famfs_file_init_dax() a bit > > - Re-ran CM's kernel code review prompts on the entire series and > > fixed several minor issues > > > > Changes v4 -> v5 -> v6 > > - None. Re-sending due to technical difficulties > > > > Changes v3 [9] -> v4 > > - The patch "dax: prevent driver unbind while filesystem holds device" > > has been dropped. Dan Williams indicated that the favored behavior is > > for a file system to stop working if an underlying driver is unbound, > > rather than preventing the unbind. > > - The patch "famfs_fuse: Famfs mount opt: -o shadow=<shadowpath>" has > > been dropped. Found a way for the famfs user space to do without the > > -o opt (via getxattr). > > - Squashed the fs/fuse/Kconfig patch into the first subsequent patch > > that needed the change > > ("famfs_fuse: Basic fuse kernel ABI enablement for famfs") > > - Many review comments addressed. > > - Addressed minor kerneldoc infractions reported by test robot. > > > > Changes v2 [7] -> v3 > > - Dax: Completely new fsdev driver (drivers/dax/fsdev.c) replaces the > > dev_dax_iomap modifications to bus.c/device.c. Devdax devices can now > > be switched among 'devdax', 'famfs' and 'system-ram' modes via daxctl > > or sysfs. > > - Dax: fsdev uses MEMORY_DEVICE_FS_DAX type and leaves folios at order-0 > > (no vmemmap_shift), allowing fs-dax to manage folio lifecycles > > dynamically like pmem does. > > - Dax: The "poisoned page" problem is properly fixed via > > fsdev_clear_folio_state(), which clears stale mapping/compound state > > when fsdev binds. The temporary WARN_ON_ONCE workaround in fs/dax.c > > has been removed. > > - Dax: Added dax_set_ops() so fsdev can set dax_operations at bind time > > (and clear them on unbind), since the dax_device is created before we > > know which driver will bind. > > - Dax: Added custom bind/unbind sysfs handlers; unbind return -EBUSY if a > > filesystem holds the device, preventing unbind while famfs is mounted. > > - Fuse: Famfs mounts now require that the fuse server/daemon has > > CAP_SYS_RAWIO because they expose raw memory devices. > > - Fuse: Added DAX address_space_operations with noop_dirty_folio since > > famfs is memory-backed with no writeback required. > > - Rebased to latest kernels, fully compatible with Alistair Popple > > et. al's recent dax refactoring. > > - Ran this series through Chris Mason's code review AI prompts to check > > for issues - several subtle problems found and fixed. > > - Dropped RFC status - this version is intended to be mergeable. > > > > Changes v1 [8] -> v2: > > > > - The GET_FMAP message/response has been moved from LOOKUP to OPEN, as > > was the pretty much unanimous consensus. > > - Made the response payload to GET_FMAP variable sized (patch 12) > > - Dodgy kerneldoc comments cleaned up or removed. > > - Fixed memory leak of fc->shadow in patch 11 (thanks Joanne) > > - Dropped many pr_debug and pr_notice calls > > > > > > References > > > > [0] - https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/ > > [1] - https://famfs.org (famfs user space) > > [2] - > > https://lore.kernel.org/linux-cxl/[email protected]/ > > [3] - > > https://lore.kernel.org/linux-cxl/[email protected]/ > > [4] - https://lwn.net/Articles/983105/ (lsfmm 2024) > > [5] - https://lwn.net/Articles/1020170/ (lsfmm 2025) > > [6] - > > https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apop...@nvidia.com/ > > [7] - > > https://lore.kernel.org/linux-fsdevel/[email protected]/ > > (famfs fuse v2) > > [8] - > > https://lore.kernel.org/linux-fsdevel/[email protected]/ > > (famfs fuse v1) > > [9] - > > https://lore.kernel.org/linux-fsdevel/[email protected]/T/#mb2c868801be16eca82dab239a1d201628534aea7 > > (famfs fuse v3) > > > > > > John Groves (10): > > famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ > > famfs_fuse: Basic fuse kernel ABI enablement for famfs > > famfs_fuse: Plumb the GET_FMAP message/response > > famfs_fuse: Create files with famfs fmaps > > famfs_fuse: GET_DAXDEV message and daxdev_table > > famfs_fuse: Plumb dax iomap and fuse read/write/mmap > > famfs_fuse: Add holder_operations for dax notify_failure() > > famfs_fuse: Add DAX address_space_operations with noop_dirty_folio > > famfs_fuse: Add famfs fmap metadata documentation > > famfs_fuse: Add documentation > > > > Documentation/filesystems/famfs.rst | 142 ++++ > > Documentation/filesystems/index.rst | 1 + > > MAINTAINERS | 10 + > > fs/fuse/Kconfig | 13 + > > fs/fuse/Makefile | 1 + > > fs/fuse/dir.c | 2 +- > > fs/fuse/famfs.c | 1180 +++++++++++++++++++++++++++ > > fs/fuse/famfs_kfmap.h | 167 ++++ > > fs/fuse/file.c | 45 +- > > fs/fuse/fuse_i.h | 116 ++- > > fs/fuse/inode.c | 35 +- > > fs/fuse/iomode.c | 2 +- > > fs/namei.c | 1 + > > include/uapi/linux/fuse.h | 88 ++ > > 14 files changed, 1790 insertions(+), 13 deletions(-) > > create mode 100644 Documentation/filesystems/famfs.rst > > create mode 100644 fs/fuse/famfs.c > > create mode 100644 fs/fuse/famfs_kfmap.h > > > > > > base-commit: 2ae624d5a555d47a735fb3f4d850402859a4db77 > > -- > > 2.53.0 > > > > Hi John, > > I’m curious to hear your thoughts on whether you think it makes sense > for the famfs-specific logic in this series to be moved to a bpf > program that goes through a generic fuse iomap dax layer. > > Based on [1], this gives feature-parity with the famfs logic in this > series. In my opinion, having famfs go through a generic fuse iomap > dax layer makes the fuse kernel code more extensible for future > servers that will also want to use dax iomap, and keeps the fuse code > cleaner by not having famfs-specific logic hardcoded in and having to > introduce new fuse uapis for something famfs-specific. In my > understanding of it, fuse is meant to be generic and it feels like > adding server-specific logic goes against that design philosophy and > sets a precedent for other servers wanting similar special-casing in > the future. I'd like to explore whether the bpf and generic fuse iomap > dax layer approach can preserve that philosophy while still giving > famfs the flexibility it needs. > > I think moving the famfs logic to bpf benefits famfs as well: > - Instead of needing to issue a FUSE_GET_FMAP request after a file is > opened, the server can directly populate the metadata map from > userspace with the mapping info when it processes the FUSE_OPEN > request, which gets rid of the roundtrip cost > - The server can dynamically update the metadata / bpf maps during > runtime from userspace if any mapping info needs to change > - Future code changes / updates for famfs are all server-side and can > be deployed immediately instead of needing to go through the upstream > kernel mailing list process > - Famfs updates / new releases can ship independently of kernel releases > > I'd appreciate the chance to discuss tradeoffs or if you'd rather > discuss this at the fuse BoF at lsf, that sounds great too. > > Thanks, > Joanne >
Hi Joanne, I'm definitely up for discussing it, and talking before LSFMM would be good because then I'd have some time to think about before we discuss at LSFMM. I have not had a chance to really study this, in part since I've never even written a "hello world" BPF program. I'll ping off-list about times to talk. However... I would object vehemently to this sort of re-write prior to going upstream, as would users and vendors who need famfs now that the memory products it enables have started to ship. This work started over 3 years ago, initial patches over 2 years ago, community decision that it should go into fuse 2 years ago, first fuse patches a year ago. This implementation is pretty much exactly in line with expectation-setting starting two years ago. Famfs is a complicated orchestration between dax, fuse, ndctl (for daxctl), libfuse and the extensive famfs user space. Famfs has a fairly small kernel footprint, but its user space is much larger. This could set it back a year if we re-write now. Two things are true at once: I think this is a serious idea worth considering, and I think it's too late to make this sort of change before going upstream. Products need this enablement, and quite a long process has run in order to make it available in a timely fashion (which means soon now). I hope we can avoid making the perfect the enemy of the good. I believe the risk of merging famfs soon is quite low, because famfs will not affect anybody who doesn't use it. I hope we can run this discussion and analysis in parallel with merging the current implementation of famfs soon. Thank you, John

