Re: [PATCH v5] tracing: Support to dump instance traces by ftrace_dump_on_oops
On 2/23/2024 9:47 AM, Steven Rostedt wrote:
> On Thu, 8 Feb 2024 21:18:14 +0800 Huang Yiwei wrote:
>> Currently ftrace only dumps the global trace buffer on an OOPs. For
>> debugging a production usecase, instance trace will be helpful to
>> check specific problems since global trace buffer may be used for
>> other purposes.
>>
>> This patch extend the ftrace_dump_on_oops parameter to dump a specific
>> or multiple trace instances:
>>
>>   - ftrace_dump_on_oops=0: as before -- don't dump
>>   - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer
>>     on all CPUs
>>   - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global
>>     trace buffer on CPU that triggered the oops
>>   - ftrace_dump_on_oops=: new behavior -- dump the tracing instance
>>     matching
>>   - ftrace_dump_on_oops[=2/orig_cpu],[=2/orig_cpu],
>>     [=2/orig_cpu]: new behavior -- dump the global trace buffer and
>>     multiple instance buffer on all CPUs, or only dump on CPU that
>>     triggered the oops if =2 or =orig_cpu is given
>>
>> Also, the sysctl node can handle the input accordingly.
>>
>> Cc: Ross Zwisler
>> Signed-off-by: Joel Fernandes (Google)
>> Signed-off-by: Huang Yiwei
>
> This patch failed with the following warning:
>
> kernel/trace/trace.c:10029:6: warning: no previous prototype for ‘ftrace_dump_one’ [-Wmissing-prototypes]
>
> -- Steve

My bad, will add the missing 'static' keyword in next patch.

Regards,
Huang Yiwei
[PATCH v6] tracing: Support to dump instance traces by ftrace_dump_on_oops
Currently ftrace only dumps the global trace buffer on an OOPs. For debugging a production usecase, instance trace will be helpful to check specific problems since global trace buffer may be used for other purposes. This patch extend the ftrace_dump_on_oops parameter to dump a specific or multiple trace instances: - ftrace_dump_on_oops=0: as before -- don't dump - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer on all CPUs - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global trace buffer on CPU that triggered the oops - ftrace_dump_on_oops=: new behavior -- dump the tracing instance matching - ftrace_dump_on_oops[=2/orig_cpu],[=2/orig_cpu], [=2/orig_cpu]: new behavior -- dump the global trace buffer and multiple instance buffer on all CPUs, or only dump on CPU that triggered the oops if =2 or =orig_cpu is given Also, the sysctl node can handle the input accordingly. Cc: Ross Zwisler Signed-off-by: Huang Yiwei --- .../admin-guide/kernel-parameters.txt | 26 ++- Documentation/admin-guide/sysctl/kernel.rst | 30 +++- include/linux/ftrace.h| 4 +- include/linux/kernel.h| 1 + kernel/sysctl.c | 4 +- kernel/trace/trace.c | 156 +- kernel/trace/trace_selftest.c | 2 +- 7 files changed, 168 insertions(+), 55 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 31b3a25680d0..3d6ea8e80c2f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1561,12 +1561,28 @@ The above will cause the "foo" tracing instance to trigger a snapshot at the end of boot up. - ftrace_dump_on_oops[=orig_cpu] + ftrace_dump_on_oops[=2(orig_cpu) | =][, | + ,=2(orig_cpu)] [FTRACE] will dump the trace buffers on oops. - If no parameter is passed, ftrace will dump - buffers of all CPUs, but if you pass orig_cpu, it will - dump only the buffer of the CPU that triggered the - oops. 
+ If no parameter is passed, ftrace will dump global + buffers of all CPUs, if you pass 2 or orig_cpu, it + will dump only the buffer of the CPU that triggered + the oops, or the specific instance will be dumped if + its name is passed. Multiple instance dump is also + supported, and instances are separated by commas. Each + instance supports only dump on CPU that triggered the + oops by passing 2 or orig_cpu to it. + + ftrace_dump_on_oops=foo=orig_cpu + + The above will dump only the buffer of "foo" instance + on CPU that triggered the oops. + + ftrace_dump_on_oops,foo,bar=orig_cpu + + The above will dump global buffer on all CPUs, the + buffer of "foo" instance on all CPUs and the buffer + of "bar" instance on CPU that triggered the oops. ftrace_filter=[function-list] [FTRACE] Limit the functions traced by the function diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 6584a1f9bfe3..ea8e5f152edc 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -296,12 +296,30 @@ kernel panic). This will output the contents of the ftrace buffers to the console. This is very useful for capturing traces that lead to crashes and outputting them to a serial console. -= === -0 Disabled (default). -1 Dump buffers of all CPUs. -2 Dump the buffer of the CPU that triggered the oops. -= === - +=== === +0 Disabled (default). +1 Dump buffers of all CPUs. +2(orig_cpu) Dump the buffer of the CPU that triggered the +oops. + Dump the specific instance buffer on all CPUs. +=2(orig_cpu) Dump the specific instance buffer on the CPU +that triggered the oops. +=== === + +Multiple instance dump is also supported, and instances are separated +by commas. If global buffer also needs to be dumped, please specify +the dump mode (1/2/orig_cpu) first for global buffer. + +So for example to dump "foo" and "bar" instance buffer on all CPUs, +user can:: + +
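For readers following the parameter grammar described in the documentation hunks above, here is a minimal user-space sketch of how a value such as "1,foo,bar=orig_cpu" could be split into a global dump mode plus per-instance modes. This is not the code from kernel/trace/trace.c; the names parse_dump_on_oops(), struct dump_request and the DEMO_ limits are made up for illustration, and treating an empty first field as "global buffer on all CPUs" is an assumption drawn from the ",foo,bar=orig_cpu" example.

```c
#include <assert.h>
#include <string.h>

/* Dump modes as documented: 0 = off, 1 = all CPUs, 2/orig_cpu = oops CPU */
enum dump_mode { DUMP_NONE = 0, DUMP_ALL = 1, DUMP_ORIG_CPU = 2 };

#define DEMO_MAX_INSTANCES 8	/* illustrative limit, not from the patch */
#define DEMO_NAME_LEN 32	/* illustrative limit, not from the patch */

struct dump_request {
	enum dump_mode global_mode;
	int nr_instances;
	struct {
		char name[DEMO_NAME_LEN];
		enum dump_mode mode;
	} inst[DEMO_MAX_INSTANCES];
};

/* Returns 1 and sets *mode if tok is a bare mode ("0", "1", "2", "orig_cpu") */
static int parse_mode(const char *tok, enum dump_mode *mode)
{
	if (!strcmp(tok, "0")) {
		*mode = DUMP_NONE;
		return 1;
	}
	if (!strcmp(tok, "1")) {
		*mode = DUMP_ALL;
		return 1;
	}
	if (!strcmp(tok, "2") || !strcmp(tok, "orig_cpu")) {
		*mode = DUMP_ORIG_CPU;
		return 1;
	}
	return 0;
}

/* Hypothetical helper: returns 0 on success, -1 on malformed input */
static int parse_dump_on_oops(const char *value, struct dump_request *req)
{
	const char *p = value;
	int first = 1;

	assert(value && req);
	memset(req, 0, sizeof(*req));

	for (;;) {
		size_t len = strcspn(p, ",");
		char tok[DEMO_NAME_LEN + 16];
		enum dump_mode mode;
		char *eq;

		if (len >= sizeof(tok))
			return -1;
		memcpy(tok, p, len);
		tok[len] = '\0';

		if (first && len == 0 && p[len] == ',') {
			/* assumption: empty first field means global, all CPUs */
			req->global_mode = DUMP_ALL;
		} else if (first && parse_mode(tok, &mode)) {
			/* a leading bare mode applies to the global buffer */
			req->global_mode = mode;
		} else if (len > 0) {
			/* everything else is "<instance>" or "<instance>=2/orig_cpu" */
			if (req->nr_instances == DEMO_MAX_INSTANCES)
				return -1;
			mode = DUMP_ALL;
			eq = strchr(tok, '=');
			if (eq) {
				*eq = '\0';
				if (!parse_mode(eq + 1, &mode) || mode != DUMP_ORIG_CPU)
					return -1;
			}
			if (tok[0] == '\0' || strlen(tok) >= DEMO_NAME_LEN)
				return -1;
			strcpy(req->inst[req->nr_instances].name, tok);
			req->inst[req->nr_instances].mode = mode;
			req->nr_instances++;
		}

		first = 0;
		if (p[len] == '\0')
			break;
		p += len + 1;
	}
	return 0;
}
```

With this sketch, "foo=orig_cpu" yields no global dump and one instance restricted to the oops CPU, which matches the first example in the kernel-parameters.txt hunk.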
[RFC PATCH 00/20] Introduce the famfs shared-memory file system
This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in any way.

* Famfs creates a simple access method for storing and sharing data in
  sharable memory. The memory is exposed and accessed as memory-mappable
  dax files.
* Famfs supports multiple hosts mounting the same file system from the
  same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on either a /dev/pmem device in
  fs-dax mode, or a /dev/dax device in devdax mode (the latter depending
  on patches 2-6 of this series).

The famfs kernel file system is part of the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, shared memory managed by the famfs
framework does not create a RAS "blast radius" problem that should be able
to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
can be expected to kill apps via SIGBUS and cause mounts to be disabled
due to memory failure notifications.

Famfs does not attempt to solve concurrency or coherency problems for
apps, although it does solve these problems in regard to its own data
structures. Apps may encounter hard concurrency problems, but there are
use cases that are eminently useful and uncomplicated from a concurrency
perspective: serial sharing is one (only one host at a time has access),
and read-only concurrent sharing is another (all hosts can read-cache
without worry).

Contents:

* famfs kernel documentation [patch 1]. Note that evolving famfs user
  documentation is at [2]
* dev_dax_iomap patchset [patches 2-6] - This enables fs-dax to use the
  iomap interface via a character /dev/dax device (e.g. /dev/dax0.0).
  For historical reasons the iomap infrastructure was enabled only for
  /dev/pmem devices (which are dax block devices). As famfs is the first
  fs-dax file system that works on /dev/dax, this patch series fills in
  the bare minimum infrastructure to enable iomap api usage with /dev/dax.
* famfs patchset [patches 7-20] - this introduces the kernel component of
  famfs.

IMPORTANT NOTE: There is a developing consensus that /dev/dax requires
some fundamental re-factoring (e.g. [3]) that is related but outside the
scope of this series.

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
  System-ram gets zeroed when it is onlined, so sharing is basically
  nonsense.
* It does not make sense to put struct page's in sharable memory, because
  those can't be shared. However, separately providing non-sharable
  capacity to be used for struct page's might be a sensible approach if
  the size of struct page array for sharable memory is too large to put in
  conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
  order to gain access to data that is already in the memory. Moreover
  the power domain for shared memory is separate from that of the server.
  Having observed that, famfs is not intended for persistent storage. It
  is intended for sharing data sets in memory during a time frame where
  the memory and the compute nodes are expected to remain operational -
  such as during a clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do. It
would probably be possible to put this capability into FUSE, but we think
that keeping famfs separate from FUSE is the simpler approach.
This patch set is available as a branch at [5]

References

[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.st...@dwillia2-xfh.jf.intel.com/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://github.com/cxl-micron-reskit/famfs-linux

John Groves (20):
  famfs: Documentation
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
  famfs: Add include/linux/famfs_ioctl.h
  famfs: Add famfs_internal.h
  famfs: Add super_operations
  famfs: famfs_open_device() & dax_holder_operations
  famfs: Add fs_context_operations
  famfs: Add inode_operations and file_system_type
  famfs: Add iomap_ops
[RFC PATCH 01/20] famfs: Documentation
Introduce Documentation/filesystems/famfs.rst into the Documentation tree Signed-off-by: John Groves --- Documentation/filesystems/famfs.rst | 124 1 file changed, 124 insertions(+) create mode 100644 Documentation/filesystems/famfs.rst diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst new file mode 100644 index ..c2cc50c10d03 --- /dev/null +++ b/Documentation/filesystems/famfs.rst @@ -0,0 +1,124 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _famfs_index: + +== +famfs: The kernel component of the famfs shared memory file system +== + +- Copyright (C) 2024 Micron Technology, Inc. + +Introduction + +Compute Express Link (CXL) provides a mechanism for disaggregated or +fabric-attached memory (FAM). This creates opportunities for data sharing; +clustered apps that would otherwise have to shard or replicate data can +share one copy in disaggregated memory. + +Famfs, which is not CXL-specific in any way, provides a mechanism for +multiple hosts to use data in shared memory, by giving it a file system +interface. With famfs, any app that understands files (which is all of +them, right?) can access data sets in shared memory. Although famfs +supports read and write calls, the real point is to support mmap, which +provides direct (dax) access to the memory - either writable or read-only. + +Shared memory can pose complex coherency and synchronization issues, but +there are also simple cases. Two simple and eminently useful patterns that +occur frequently in data analytics and AI are: + +* Serial Sharing - Only one host or process at a time has access to a file +* Read-only Sharing - Multiple hosts or processes share read-only access + to a file + +The famfs kernel file system is part of the famfs framework; User space +components [1] handle metadata allocation and distribution, and direct the +famfs kernel module to instantiate files that map to specific memory. 
+ +The famfs framework manages coherency of its own metadata and structures, +but does not attempt to manage coherency for applications. + +Famfs also provides data isolation between files. That is, even though +the host has access to an entire memory "device" (as a dax device), apps +cannot write to memory for which the file is read-only, and mapping one +file provides isolation from the memory of all other files. This is pretty +basic, but some experimental shared memory usage patterns provide no such +isolation. + +Principles of Operation +=== + +Without its user space components, the famfs kernel module is just a +semi-functional clone of ramfs with latent fs-dax support. The user space +components maintain superblocks and metadata logs, and use the famfs kernel +component to provide a file system view of shared memory across multiple +hosts. + +Each host has an independent instance of the famfs kernel module. After +mount, files are not visible until the user space component instantiates +them (normally by playing the famfs metadata log). + +Once instantiated, files on each host can point to the same shared memory, +but in-memory metadata (inodes, etc.) is ephemeral on each host that has a +famfs instance mounted. Like ramfs, the famfs in-kernel file system has no +backing store for metadata modifications. If metadata is ever persisted, +that must be done by the user space components. However, mutations to file +data are saved to the shared memory - subject to write permission and +processor cache behavior. + + +Famfs is Not a Conventional File System +--- + +Famfs files can be accessed by conventional means, but there are +limitations. The kernel component of famfs is not involved in the +allocation of backing memory for files at all; the famfs user space +creates files and passes the allocation extent lists into the kernel via +the per-file FAMFSIOC_MAP_CREATE ioctl. A file that lacks this metadata is +treated as invalid by the famfs kernel module. 
As a practical matter files +must be created via the famfs library or cli, but they can be consumed as +if they were conventional files. + +Famfs differs in some important ways from conventional file systems: + +* Files must be pre-allocated by the famfs framework; Allocation is never + performed on write. +* Any operation that changes a file's size is considered to put the file + in an invalid state, disabling access to the data. It may be possible to + revisit this in the future. +* (Typically the famfs user space can restore files to a valid state by + replaying the famfs metadata log.) + +Famfs exists to apply the existing file system abstractions on top of +shared memory so applications and workflows can more easily consume it. + +Key Requirements + + +The primary requirements for famfs are: + +1. Must sup
[RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
This function should be called by fs-dax file systems after opening the devdax device. This adds holder_operations. This function serves the same role as fs_dax_get_by_bdev(), which dax file systems call after opening the pmem block device. Signed-off-by: John Groves --- drivers/dax/super.c | 38 ++ include/linux/dax.h | 5 + 2 files changed, 43 insertions(+) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index f4b635526345..fc96362de237 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder) EXPORT_SYMBOL_GPL(fs_put_dax); #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */ +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP) + +/** + * fs_dax_get() + * + * fs-dax file systems call this function to prepare to use a devdax device for fsdax. + * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there + * is no bdev). The holder makes this exclusive. + * + * @dax_dev: dev to be prepared for fs-dax usage + * @holder: filesystem or mapped device inside the dax_device + * @hops: operations for the inner holder + * + * Returns: 0 on success, -1 on failure + */ +int fs_dax_get( + struct dax_device *dax_dev, + void *holder, + const struct dax_holder_operations *hops) +{ + /* dax_dev->ops should have been populated by devm_create_dev_dax() */ + if (WARN_ON(!dax_dev->ops)) + return -1; + + if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode)) + return -1; + + if (cmpxchg(&dax_dev->holder_data, NULL, holder)) { + pr_warn("%s: holder_data already set\n", __func__); + return -1; + } + dax_dev->holder_ops = hops; + + return 0; +} +EXPORT_SYMBOL_GPL(fs_dax_get); +#endif /* DEV_DAX_IOMAP */ + enum dax_device_flags { /* !alive + rcu grace period == no new operations / mappings */ DAXDEV_ALIVE, diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..e973289bfde3 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -57,7 +57,12 @@ struct 
dax_holder_operations { #if IS_ENABLED(CONFIG_DAX) struct dax_device *alloc_dax(void *private, const struct dax_operations *ops); + +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP) +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops); +#endif void *dax_holder(struct dax_device *dax_dev); +struct dax_device *inode_dax(struct inode *inode); void put_dax(struct dax_device *dax_dev); void kill_dax(struct dax_device *dax_dev); void dax_write_cache(struct dax_device *dax_dev, bool wc); -- 2.43.0
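The "holder makes this exclusive" comment on fs_dax_get() relies on a compare-and-swap of holder_data from NULL to the new holder. As a rough user-space illustration of that pattern (C11 atomics standing in for the kernel's cmpxchg(); struct demo_dev, demo_claim() and demo_release() are invented names, not part of the patch), something like the following could be used to reason about the claim/release semantics. The sketch validates the device pointer before touching any of its fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the holder-related fields of struct dax_device */
struct demo_dev {
	_Atomic(void *) holder_data;
	const void *holder_ops;
};

/* Returns 0 if @holder now owns @dev, -1 if the device is already held */
static int demo_claim(struct demo_dev *dev, void *holder, const void *ops)
{
	void *expected = NULL;

	/* check the pointers before dereferencing anything */
	if (!dev || !holder)
		return -1;

	/* only succeeds if holder_data was still NULL */
	if (!atomic_compare_exchange_strong(&dev->holder_data, &expected, holder))
		return -1;

	dev->holder_ops = ops;
	return 0;
}

/* Drops the claim, but only if @holder is the current owner */
static void demo_release(struct demo_dev *dev, void *holder)
{
	void *expected = holder;

	assert(dev);
	if (atomic_compare_exchange_strong(&dev->holder_data, &expected, NULL))
		dev->holder_ops = NULL;
}
```

A second claim by a different holder fails until the first holder releases, which is the exclusivity the kernel-doc comment describes.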
[RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now
bus.c can't call functions in device.c - that creates a circular linkage dependency. Signed-off-by: John Groves --- drivers/dax/bus.c| 24 drivers/dax/device.c | 23 --- 2 files changed, 24 insertions(+), 23 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 1ff1ab5fa105..664e8c1b9930 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1325,6 +1325,30 @@ static const struct device_type dev_dax_type = { .groups = dax_attribute_groups, }; +/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */ +__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, + unsigned long size) +{ + int i; + + for (i = 0; i < dev_dax->nr_range; i++) { + struct dev_dax_range *dax_range = &dev_dax->ranges[i]; + struct range *range = &dax_range->range; + unsigned long long pgoff_end; + phys_addr_t phys; + + pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1; + if (pgoff < dax_range->pgoff || pgoff > pgoff_end) + continue; + phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start; + if (phys + size - 1 <= range->end) + return phys; + break; + } + return -1; +} +EXPORT_SYMBOL_GPL(dax_pgoff_to_phys); + struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) { struct dax_region *dax_region = data->dax_region; diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 93ebedc5ec8c..40ba660013cf 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -50,29 +50,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, return 0; } -/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */ -__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, - unsigned long size) -{ - int i; - - for (i = 0; i < dev_dax->nr_range; i++) { - struct dev_dax_range *dax_range = &dev_dax->ranges[i]; - struct range *range = &dax_range->range; - unsigned long long pgoff_end; - phys_addr_t phys; - - pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1; - if (pgoff < dax_range->pgoff 
|| pgoff > pgoff_end) - continue; - phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start; - if (phys + size - 1 <= range->end) - return phys; - break; - } - return -1; -} - static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn, unsigned long fault_size) { -- 2.43.0
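For anyone who wants to poke at the range walk in dax_pgoff_to_phys() outside the kernel, the following is a small user-space model of the same arithmetic. It is only a sketch: struct demo_range, DEMO_PAGE_SHIFT and demo_pgoff_to_phys() are illustrative names, a 4 KiB page is assumed, and UINT64_MAX stands in for the kernel function's -1 return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Models one dev_dax_range: a page offset plus an inclusive physical range */
struct demo_range {
	uint64_t pgoff;	/* first device page offset covered by this range */
	uint64_t start;	/* first physical byte */
	uint64_t end;	/* last physical byte (inclusive) */
};

/*
 * Translate a device page offset to a physical address, provided that
 * @size bytes starting there fit inside a single range.
 */
static uint64_t demo_pgoff_to_phys(const struct demo_range *r, size_t nr,
				   uint64_t pgoff, uint64_t size)
{
	size_t i;

	assert(r || nr == 0);

	for (i = 0; i < nr; i++) {
		uint64_t len = r[i].end - r[i].start + 1;
		uint64_t pgoff_end = r[i].pgoff + (len >> DEMO_PAGE_SHIFT) - 1;
		uint64_t phys;

		if (pgoff < r[i].pgoff || pgoff > pgoff_end)
			continue;

		phys = ((pgoff - r[i].pgoff) << DEMO_PAGE_SHIFT) + r[i].start;
		if (phys + size - 1 <= r[i].end)
			return phys;
		break;	/* offset found, but the request crosses the range end */
	}
	return UINT64_MAX;
}
```

Note that a request which starts in one range and spills into the next is rejected rather than split, as in the kernel function.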
[RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap
Save the kva from memremap because we need it for iomap rw support Prior to famfs, there were no iomap users of /dev/dax - so the virtual address from memremap was not needed. Also: in some cases dev_dax_probe() is called with the first dev_dax->range offset past pgmap[0].range. In those cases we need to add the difference to virt_addr in order to have the physaddr's in dev_dax->ranges match dev_dax->virt_addr. Dragons... Signed-off-by: John Groves --- drivers/dax/dax-private.h | 1 + drivers/dax/device.c | 15 +++ 2 files changed, 16 insertions(+) diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 446617b73aea..894eb1c66b4a 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -63,6 +63,7 @@ struct dax_mapping { struct dev_dax { struct dax_region *region; struct dax_device *dax_dev; + u64 virt_addr; unsigned int align; int target_node; bool dyn_id; diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 40ba660013cf..6cd79d00fe1b 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax) struct dax_device *dax_dev = dev_dax->dax_dev; struct device *dev = &dev_dax->dev; struct dev_pagemap *pgmap; + u64 data_offset = 0; struct inode *inode; struct cdev *cdev; void *addr; @@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax) if (IS_ERR(addr)) return PTR_ERR(addr); + /* Detect whether the data is at a non-zero offset into the memory */ + if (pgmap->range.start != dev_dax->ranges[0].range.start) { + u64 phys = (u64)dev_dax->ranges[0].range.start; + u64 pgmap_phys = (u64)dev_dax->pgmap[0].range.start; + u64 vmemmap_shift = (u64)dev_dax->pgmap[0].vmemmap_shift; + + if (!WARN_ON(pgmap_phys > phys)) + data_offset = phys - pgmap_phys; + + pr_notice("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n", + __func__, phys, pgmap_phys, data_offset, vmemmap_shift); + } + dev_dax->virt_addr = (u64)addr + data_offset; + inode = 
dax_inode(dax_dev); cdev = inode->i_cdev; cdev_init(cdev, &dax_fops); -- 2.43.0
[RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
Notes about this commit: * These methods are based somewhat loosely on pmem_dax_ops from drivers/nvdimm/pmem.c * dev_dax_direct_access() is returns the hpa, pfn and kva. The kva was newly stored as dev_dax->virt_addr by dev_dax_probe(). * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used for read/write (dax_iomap_rw()) * dev_dax_recovery_write() and dev_dax_zero_page_range() have not been tested yet. I'm looking for suggestions as to how to test those. Signed-off-by: John Groves --- drivers/dax/bus.c | 107 ++ 1 file changed, 107 insertions(+) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 664e8c1b9930..06fcda810674 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -10,6 +10,12 @@ #include "dax-private.h" #include "bus.h" +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP) +#include +#include +#include +#endif + static DEFINE_MUTEX(dax_bus_lock); #define DAX_NAME_LEN 30 @@ -1349,6 +1355,101 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, } EXPORT_SYMBOL_GPL(dax_pgoff_to_phys); +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP) + +static void write_dax(void *pmem_addr, struct page *page, + unsigned int off, unsigned int len) +{ + unsigned int chunk; + void *mem; + + while (len) { + mem = kmap_local_page(page); + chunk = min_t(unsigned int, len, PAGE_SIZE - off); + memcpy_flushcache(pmem_addr, mem + off, chunk); + kunmap_local(mem); + len -= chunk; + off = 0; + page++; + pmem_addr += chunk; + } +} + +static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, +long nr_pages, enum dax_access_mode mode, void **kaddr, +pfn_t *pfn) +{ + struct dev_dax *dev_dax = dax_get_private(dax_dev); + size_t dax_size = dev_dax_size(dev_dax); + size_t size = nr_pages << PAGE_SHIFT; + size_t offset = pgoff << PAGE_SHIFT; + phys_addr_t phys; + u64 virt_addr = dev_dax->virt_addr + offset; + pfn_t local_pfn; + u64 flags = PFN_DEV|PFN_MAP; + + WARN_ON(!dev_dax->virt_addr); /* virt_addr must be saved for direct_access */ + + 
phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT); + + if (kaddr) + *kaddr = (void *)virt_addr; + + local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */ + if (pfn) + *pfn = local_pfn; + + /* This the valid size at the specified address */ + return PHYS_PFN(min_t(size_t, size, dax_size - offset)); +} + +static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, + size_t nr_pages) +{ + long resid = nr_pages << PAGE_SHIFT; + long offset = pgoff << PAGE_SHIFT; + + /* Break into one write per dax region */ + while (resid > 0) { + void *kaddr; + pgoff_t poff = offset >> PAGE_SHIFT; + long len = __dev_dax_direct_access(dax_dev, poff, + nr_pages, DAX_ACCESS, &kaddr, NULL); + len = min_t(long, len, PAGE_SIZE); + write_dax(kaddr, ZERO_PAGE(0), offset, len); + + offset += len; + resid -= len; + } + return 0; +} + +static long dev_dax_direct_access(struct dax_device *dax_dev, + pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, + void **kaddr, pfn_t *pfn) +{ + return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn); +} + +static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, + void *addr, size_t bytes, struct iov_iter *i) +{ + size_t len, off; + + off = offset_in_page(addr); + len = PFN_PHYS(PFN_UP(off + bytes)); + + return _copy_from_iter_flushcache(addr, bytes, i); +} + +static const struct dax_operations dev_dax_ops = { + .direct_access = dev_dax_direct_access, + .zero_page_range = dev_dax_zero_page_range, + .recovery_write = dev_dax_recovery_write, +}; + +#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */ + struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) { struct dax_region *dax_region = data->dax_region; @@ -1404,11 +1505,17 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) } } +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP) + /* holder_ops currently populated separately in a slightly hacky way */ + dax_dev = alloc_dax(dev_dax, &dev_dax_ops); +#else /* * No 
dax_operations since there is no access to this device outside of * mmap of the resulting character device. */ dax_dev = alloc_dax(dev_dax, NULL); +#endif + if (IS_ERR(dax_dev)) {
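Since the commit notes ask for suggestions on testing, one option is to exercise the chunking logic of write_dax() in user space first. The sketch below models only the loop structure: plain memcpy() stands in for memcpy_flushcache(), an array of fixed-size buffers stands in for struct page plus kmap_local_page(), and the 16-byte DEMO_PAGE_SIZE and the name demo_copy_from_pages() are chosen purely to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny "page" so the example is easy to follow */

/*
 * Copy @len bytes into @dst from consecutive source pages, starting at
 * byte @off of the first page. Each iteration copies at most up to the
 * end of the current page, then continues at offset 0 of the next one.
 */
static void demo_copy_from_pages(unsigned char *dst,
				 unsigned char (*pages)[DEMO_PAGE_SIZE],
				 size_t off, size_t len)
{
	assert(dst && pages);
	assert(off < DEMO_PAGE_SIZE);

	while (len) {
		size_t chunk = DEMO_PAGE_SIZE - off;

		if (chunk > len)
			chunk = len;
		memcpy(dst, *pages + off, chunk);
		len -= chunk;
		off = 0;	/* only the first page starts mid-page */
		pages++;
		dst += chunk;
	}
}
```

The precondition that the starting offset is smaller than the page size matters: the chunk computation underflows otherwise, so callers need to pass an in-page offset rather than an absolute byte offset.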
[RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
Add the CONFIG_DEV_DAX_IOMAP kernel config parameter to control building of the iomap functionality to support fsdax on devdax. Signed-off-by: John Groves --- drivers/dax/Kconfig | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index a88744244149..b1ebcc77120b 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -78,4 +78,10 @@ config DEV_DAX_KMEM Say N if unsure. +config DEV_DAX_IOMAP + depends on DEV_DAX && DAX + def_bool y + help + Support iomap mapping of devdax devices (for FS-DAX file + systems that reside on character /dev/dax devices) endif -- 2.43.0
[RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
Add uapi include file for famfs. The famfs user space uses ioctl on
individual files to pass in mapping information and file size. This would
be hard to do via sysfs or other means, since it's file-specific.

Signed-off-by: John Groves
---
 include/uapi/linux/famfs_ioctl.h | 56
 1 file changed, 56 insertions(+)
 create mode 100644 include/uapi/linux/famfs_ioctl.h

diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
new file mode 100644
index ..6b3e6452d02f
--- /dev/null
+++ b/include/uapi/linux/famfs_ioctl.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_IOCTL_H
+#define FAMFS_IOCTL_H
+
+#include
+#include
+
+#define FAMFS_MAX_EXTENTS 2
+
+enum extent_type {
+	SIMPLE_DAX_EXTENT = 13,
+	INVALID_EXTENT_TYPE,
+};
+
+struct famfs_extent {
+	__u64 offset;
+	__u64 len;
+};
+
+enum famfs_file_type {
+	FAMFS_REG,
+	FAMFS_SUPERBLOCK,
+	FAMFS_LOG,
+};
+
+/**
+ * struct famfs_ioc_map
+ *
+ * This is the metadata that indicates where the memory is for a famfs file
+ */
+struct famfs_ioc_map {
+	enum extent_type extent_type;
+	enum famfs_file_type file_type;
+	__u64 file_size;
+	__u64 ext_list_count;
+	struct famfs_extent ext_list[FAMFS_MAX_EXTENTS];
+};
+
+#define FAMFSIOC_MAGIC 'u'
+
+/* famfs file ioctl opcodes */
+#define FAMFSIOC_MAP_CREATE _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GET _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GETEXT _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
+#define FAMFSIOC_NOP _IO(FAMFSIOC_MAGIC, 4)
+
+#endif /* FAMFS_IOCTL_H */
--
2.43.0
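To make the intended use of struct famfs_ioc_map a bit more concrete, here is a hedged user-space sketch that fills one in for a file backed by one or two extents. The type definitions are copied locally from the header above (with uint64_t in place of __u64) so the example stands alone; demo_build_map() is a hypothetical helper, and the check that the extents cover file_size is an assumption of this sketch, not something the patch states. A real caller would then pass the struct to the per-file FAMFSIOC_MAP_CREATE ioctl, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the uapi definitions, for illustration only */
#define FAMFS_MAX_EXTENTS 2

enum extent_type {
	SIMPLE_DAX_EXTENT = 13,
	INVALID_EXTENT_TYPE,
};

struct famfs_extent {
	uint64_t offset;
	uint64_t len;
};

enum famfs_file_type {
	FAMFS_REG,
	FAMFS_SUPERBLOCK,
	FAMFS_LOG,
};

struct famfs_ioc_map {
	enum extent_type extent_type;
	enum famfs_file_type file_type;
	uint64_t file_size;
	uint64_t ext_list_count;
	struct famfs_extent ext_list[FAMFS_MAX_EXTENTS];
};

/*
 * Hypothetical helper: describe a regular file of @file_size bytes backed
 * by @count extents. Returns 0 on success, -1 if the request cannot be
 * expressed (too many extents, or extents too small for the file).
 */
static int demo_build_map(struct famfs_ioc_map *map, uint64_t file_size,
			  const struct famfs_extent *ext, uint64_t count)
{
	uint64_t total = 0;
	uint64_t i;

	assert(map && ext);

	if (count == 0 || count > FAMFS_MAX_EXTENTS)
		return -1;

	memset(map, 0, sizeof(*map));
	for (i = 0; i < count; i++) {
		map->ext_list[i] = ext[i];
		total += ext[i].len;
	}

	/* assumption of this sketch: extents must cover the whole file */
	if (total < file_size)
		return -1;

	map->extent_type = SIMPLE_DAX_EXTENT;
	map->file_type = FAMFS_REG;
	map->file_size = file_size;
	map->ext_list_count = count;
	return 0;
}
```

Because ext_list is a fixed array of FAMFS_MAX_EXTENTS entries, anything needing more than two extents cannot be described with this version of the interface.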
[RFC PATCH 08/20] famfs: Add famfs_internal.h
Add the famfs_internal.h include file. This contains internal data
structures such as the per-file metadata structure (famfs_file_meta) and
extent formats.

Signed-off-by: John Groves
---
 fs/famfs/famfs_internal.h | 53 +++
 1 file changed, 53 insertions(+)
 create mode 100644 fs/famfs/famfs_internal.h

diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
new file mode 100644
index ..af3990d43305
--- /dev/null
+++ b/fs/famfs/famfs_internal.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_INTERNAL_H
+#define FAMFS_INTERNAL_H
+
+#include
+#include
+
+#define FAMFS_MAGIC 0x87b282ff
+
+#define FAMFS_BLKDEV_MODE (FMODE_READ|FMODE_WRITE)
+
+extern const struct file_operations famfs_file_operations;
+
+/*
+ * Each famfs dax file has this hanging from its inode->i_private.
+ */
+struct famfs_file_meta {
+	int error;
+	enum famfs_file_type file_type;
+	size_t file_size;
+	enum extent_type tfs_extent_type;
+	size_t tfs_extent_ct;
+	struct famfs_extent tfs_extents[]; /* flexible array */
+};
+
+struct famfs_mount_opts {
+	umode_t mode;
+};
+
+extern const struct iomap_ops famfs_iomap_ops;
+extern const struct vm_operations_struct famfs_file_vm_ops;
+
+#define ROOTDEV_STRLEN 80
+
+struct famfs_fs_info {
+	struct famfs_mount_opts mount_opts;
+	struct file *dax_filp;
+	struct dax_device *dax_devp;
+	struct bdev_handle *bdev_handle;
+	struct list_head fsi_list;
+	char *rootdev;
+};
+
+#endif /* FAMFS_INTERNAL_H */
--
2.43.0
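The per-file metadata ends in a flexible array member (tfs_extents[]), so each instance has to be allocated with room for its extent list. A minimal user-space sketch of that sizing follows; struct demo_file_meta and demo_meta_alloc() are illustrative stand-ins rather than anything from the series, and calloc() takes the place of a kernel allocator. In kernel code this size computation is normally written with the struct_size() helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct demo_extent {
	uint64_t offset;
	uint64_t len;
};

/* Reduced stand-in for struct famfs_file_meta */
struct demo_file_meta {
	size_t file_size;
	size_t extent_ct;
	struct demo_extent extents[];	/* flexible array member */
};

/*
 * Allocate metadata with room for @ct extents and copy them in.
 * Returns NULL on overflow or allocation failure; caller frees with free().
 */
static struct demo_file_meta *demo_meta_alloc(size_t file_size,
					      const struct demo_extent *ext,
					      size_t ct)
{
	struct demo_file_meta *m;

	assert(ext || ct == 0);

	/* guard the header + ct * element multiplication against overflow */
	if (ct > (SIZE_MAX - sizeof(*m)) / sizeof(m->extents[0]))
		return NULL;

	m = calloc(1, sizeof(*m) + ct * sizeof(m->extents[0]));
	if (!m)
		return NULL;

	m->file_size = file_size;
	m->extent_ct = ct;
	if (ct)
		memcpy(m->extents, ext, ct * sizeof(ext[0]));
	return m;
}
```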
[RFC PATCH 09/20] famfs: Add super_operations
Introduce the famfs superblock operations Signed-off-by: John Groves --- fs/famfs/famfs_inode.c | 72 ++ 1 file changed, 72 insertions(+) create mode 100644 fs/famfs/famfs_inode.c diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c new file mode 100644 index ..3329aff000d1 --- /dev/null +++ b/fs/famfs/famfs_inode.c @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * famfs - dax file system for shared fabric-attached memory + * + * Copyright 2023-2024 Micron Technology, inc + * + * This file system, originally based on ramfs the dax support from xfs, + * is intended to allow multiple host systems to mount a common file system + * view of dax files that map to shared memory. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "famfs_internal.h" + +#define FAMFS_DEFAULT_MODE 0755 + +static const struct super_operations famfs_ops; +static const struct inode_operations famfs_file_inode_operations; +static const struct inode_operations famfs_dir_inode_operations; + +/** + * famfs super_operations + * + * TODO: implement a famfs_statfs() that shows size, free and available space, etc. + */ + +/** + * famfs_show_options() - Display the mount options in /proc/mounts. + */ +static int famfs_show_options( + struct seq_file *m, + struct dentry *root) +{ + struct famfs_fs_info *fsi = root->d_sb->s_fs_info; + + if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE) + seq_printf(m, ",mode=%o", fsi->mount_opts.mode); + + return 0; +} + +static const struct super_operations famfs_ops = { + .statfs = simple_statfs, + .drop_inode = generic_delete_inode, + .show_options = famfs_show_options, +}; + + +MODULE_LICENSE("GPL"); -- 2.43.0
[RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations
Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces the function that opens a block (pmem) device and the struct dax_holder_operations that are needed for that ABI. In this commit, support for opening character /dev/dax is stubbed. A later commit introduces this capability. Signed-off-by: John Groves --- fs/famfs/famfs_inode.c | 83 ++ 1 file changed, 83 insertions(+) diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c index 3329aff000d1..82c861998093 100644 --- a/fs/famfs/famfs_inode.c +++ b/fs/famfs/famfs_inode.c @@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = { .show_options = famfs_show_options, }; +/*** + * dax_holder_operations for block dax + */ + +static int +famfs_blk_dax_notify_failure( + struct dax_device *dax_devp, + u64 offset, + u64 len, + int mf_flags) +{ + + pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n", + __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags); + return -EOPNOTSUPP; +} + +const struct dax_holder_operations famfs_blk_dax_holder_ops = { + .notify_failure = famfs_blk_dax_notify_failure, +}; + +static int +famfs_open_char_device( + struct super_block *sb, + struct fs_context *fc) +{ + pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n", + __func__, fc->source); + return -ENODEV; +} + +/** + * famfs_open_device() + * + * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device(). + * Otherwise try to open it as a block/pmem device. 
+ */ +static int +famfs_open_device( + struct super_block *sb, + struct fs_context *fc) +{ + struct famfs_fs_info *fsi = sb->s_fs_info; + struct dax_device *dax_devp; + u64 start_off = 0; + struct bdev_handle *handlep; + + if (fsi->dax_devp) { + pr_err("%s: already mounted\n", __func__); + return -EALREADY; + } + + if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */ + return famfs_open_char_device(sb, fc); + + if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */ + pr_err("%s: primary backing dev (%s) is not pmem\n", + __func__, fc->source); + return -EINVAL; + } + + handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops); + if (IS_ERR(handlep)) { /* check the handle itself; it may be an ERR_PTR */ + pr_err("%s: failed bdev_open_by_path(%s)\n", __func__, fc->source); + return PTR_ERR(handlep); + } + + dax_devp = fs_dax_get_by_bdev(handlep->bdev, &start_off, + fsi /* holder */, + &famfs_blk_dax_holder_ops); + if (IS_ERR(dax_devp)) { + pr_err("%s: unable to get daxdev from handlep->bdev\n", __func__); + bdev_release(handlep); + return -ENODEV; + } + fsi->bdev_handle = handlep; + fsi->dax_devp = dax_devp; + + pr_notice("%s: root device is block dax (%s)\n", __func__, fc->source); + return 0; +} + + MODULE_LICENSE("GPL"); -- 2.43.0
[RFC PATCH 11/20] famfs: Add fs_context_operations
This commit introduces the famfs fs_context_operations and famfs_get_inode() which is used by the context operations. Signed-off-by: John Groves --- fs/famfs/famfs_inode.c | 178 + 1 file changed, 178 insertions(+) diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c index 82c861998093..f98f82962d7b 100644 --- a/fs/famfs/famfs_inode.c +++ b/fs/famfs/famfs_inode.c @@ -41,6 +41,50 @@ static const struct super_operations famfs_ops; static const struct inode_operations famfs_file_inode_operations; static const struct inode_operations famfs_dir_inode_operations; +static struct inode *famfs_get_inode( + struct super_block *sb, + const struct inode *dir, + umode_t mode, + dev_t dev) +{ + struct inode *inode = new_inode(sb); + + if (inode) { + struct timespec64 tv; + + inode->i_ino = get_next_ino(); + inode_init_owner(&nop_mnt_idmap, inode, dir, mode); + inode->i_mapping->a_ops = &ram_aops; + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + mapping_set_unevictable(inode->i_mapping); + tv = inode_set_ctime_current(inode); + inode_set_mtime_to_ts(inode, tv); + inode_set_atime_to_ts(inode, tv); + + switch (mode & S_IFMT) { + default: + init_special_inode(inode, mode, dev); + break; + case S_IFREG: + inode->i_op = &famfs_file_inode_operations; + inode->i_fop = &famfs_file_operations; + break; + case S_IFDIR: + inode->i_op = &famfs_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + + /* Directory inodes start off with i_nlink == 2 (for "." 
entry) */ + inc_nlink(inode); + break; + case S_IFLNK: + inode->i_op = &page_symlink_inode_operations; + inode_nohighmem(inode); + break; + } + } + return inode; +} + /** * famfs super_operations * @@ -150,6 +194,140 @@ famfs_open_device( return 0; } +/* + * fs_context_operations + */ +static int +famfs_fill_super( + struct super_block *sb, + struct fs_context *fc) +{ + struct famfs_fs_info *fsi = sb->s_fs_info; + struct inode *inode; + int rc = 0; + + sb->s_maxbytes = MAX_LFS_FILESIZE; + sb->s_blocksize = PAGE_SIZE; + sb->s_blocksize_bits= PAGE_SHIFT; + sb->s_magic = FAMFS_MAGIC; + sb->s_op= &famfs_ops; + sb->s_time_gran = 1; + + rc = famfs_open_device(sb, fc); + if (rc) + goto out; + + inode = famfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0); + sb->s_root = d_make_root(inode); + if (!sb->s_root) + rc = -ENOMEM; + +out: + return rc; +} + +enum famfs_param { + Opt_mode, + Opt_dax, +}; + +const struct fs_parameter_spec famfs_fs_parameters[] = { + fsparam_u32oct("mode",Opt_mode), + fsparam_string("dax", Opt_dax), + {} +}; + +static int famfs_parse_param( + struct fs_context *fc, + struct fs_parameter *param) +{ + struct famfs_fs_info *fsi = fc->s_fs_info; + struct fs_parse_result result; + int opt; + + opt = fs_parse(fc, famfs_fs_parameters, param, &result); + if (opt == -ENOPARAM) { + opt = vfs_parse_fs_param_source(fc, param); + if (opt != -ENOPARAM) + return opt; + + return 0; + } + if (opt < 0) + return opt; + + switch (opt) { + case Opt_mode: + fsi->mount_opts.mode = result.uint_32 & S_IALLUGO; + break; + case Opt_dax: + if (strcmp(param->string, "always")) + pr_notice("%s: invalid dax mode %s\n", + __func__, param->string); + break; + } + + return 0; +} + +static DEFINE_MUTEX(famfs_context_mutex); +static LIST_HEAD(famfs_context_list); + +static int famfs_get_tree(struct fs_context *fc) +{ + struct famfs_fs_info *fsi_entry; + struct famfs_fs_info *fsi = fc->s_fs_info; + + fsi->rootdev = kstrdup(fc->source, GFP_KERNEL); + if (!fsi->rootdev) + return 
-ENOMEM; + + /* Fail if famfs is already mounted from the same device */ + mutex_lock(&famfs_context_mutex); + list_for_each_entry(fsi_entry, &famfs_conte
[RFC PATCH 12/20] famfs: Add inode_operations and file_system_type
This commit introduces the famfs inode_operations. There is nothing really unique to famfs here in the inode_operations.. This commit also introduces the famfs_file_system_type struct and the famfs_kill_sb() function. Signed-off-by: John Groves --- fs/famfs/famfs_inode.c | 132 + 1 file changed, 132 insertions(+) diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c index f98f82962d7b..ab46ec50b70d 100644 --- a/fs/famfs/famfs_inode.c +++ b/fs/famfs/famfs_inode.c @@ -85,6 +85,109 @@ static struct inode *famfs_get_inode( return inode; } +/*** + * famfs inode_operations: these are currently pretty much boilerplate + */ + +static const struct inode_operations famfs_file_inode_operations = { + /* All generic */ + .setattr = simple_setattr, + .getattr = simple_getattr, +}; + + +/* + * File creation. Allocate an inode, and we're done.. + */ +/* SMP-safe */ +static int +famfs_mknod( + struct mnt_idmap *idmap, + struct inode *dir, + struct dentry*dentry, + umode_t mode, + dev_t dev) +{ + struct inode *inode = famfs_get_inode(dir->i_sb, dir, mode, dev); + int error = -ENOSPC; + + if (inode) { + struct timespec64 tv; + + d_instantiate(dentry, inode); + dget(dentry); /* Extra count - pin the dentry in core */ + error = 0; + tv = inode_set_ctime_current(inode); + inode_set_mtime_to_ts(inode, tv); + inode_set_atime_to_ts(inode, tv); + } + return error; +} + +static int famfs_mkdir( + struct mnt_idmap *idmap, + struct inode *dir, + struct dentry*dentry, + umode_t mode) +{ + int retval = famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFDIR, 0); + + if (!retval) + inc_nlink(dir); + + return retval; +} + +static int famfs_create( + struct mnt_idmap *idmap, + struct inode *dir, + struct dentry*dentry, + umode_t mode, + bool excl) +{ + return famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0); +} + +static int famfs_symlink( + struct mnt_idmap *idmap, + struct inode *dir, + struct dentry*dentry, + const char *symname) +{ + struct inode *inode; + int error = -ENOSPC; 
+ + inode = famfs_get_inode(dir->i_sb, dir, S_IFLNK | 0777, 0); + if (inode) { + int l = strlen(symname)+1; + + error = page_symlink(inode, symname, l); + if (!error) { + struct timespec64 tv; + + d_instantiate(dentry, inode); + dget(dentry); + tv = inode_set_ctime_current(inode); + inode_set_mtime_to_ts(inode, tv); + inode_set_atime_to_ts(inode, tv); + } else + iput(inode); + } + return error; +} + +static const struct inode_operations famfs_dir_inode_operations = { + .create = famfs_create, + .lookup = simple_lookup, + .link = simple_link, + .unlink = simple_unlink, + .symlink= famfs_symlink, + .mkdir = famfs_mkdir, + .rmdir = simple_rmdir, + .mknod = famfs_mknod, + .rename = simple_rename, +}; + /** * famfs super_operations * @@ -329,5 +432,34 @@ static int famfs_init_fs_context(struct fs_context *fc) return 0; } +static void famfs_kill_sb(struct super_block *sb) +{ + struct famfs_fs_info *fsi = sb->s_fs_info; + + mutex_lock(&famfs_context_mutex); + list_del(&fsi->fsi_list); + mutex_unlock(&famfs_context_mutex); + + if (fsi->bdev_handle) + bdev_release(fsi->bdev_handle); + if (fsi->dax_devp) + fs_put_dax(fsi->dax_devp, fsi); + if (fsi->dax_filp) /* This only happens if it's char dax */ + filp_close(fsi->dax_filp, NULL); + + if (fsi && fsi->rootdev) + kfree(fsi->rootdev); + kfree(fsi); + kill_litter_super(sb); +} + +#define MODULE_NAME "famfs" +static struct file_system_type famfs_fs_type = { + .name = MODULE_NAME, + .init_fs_context = famfs_init_fs_context, + .parameters = famfs_fs_parameters, + .kill_sb = famfs_kill_sb, + .fs_flags = FS_USERNS_MOUNT, +}; MODULE_LICENSE("GPL"); -- 2.43.0
[RFC PATCH 13/20] famfs: Add iomap_ops
This commit introduces the famfs iomap_ops. When either dax_iomap_fault() or dax_iomap_rw() is called, we get a callback via our iomap_begin() handler. The question being asked is "please resolve (file, offset) to (daxdev, offset)". The function famfs_meta_to_dax_offset() does this. The per-file metadata is just an extent list to the backing dax dev. The order of this resolution is O(N) for N extents. Note with the current user space, files usually have only one extent. Signed-off-by: John Groves --- fs/famfs/famfs_file.c | 245 ++ 1 file changed, 245 insertions(+) create mode 100644 fs/famfs/famfs_file.c diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c new file mode 100644 index ..fc667d5f7be8 --- /dev/null +++ b/fs/famfs/famfs_file.c @@ -0,0 +1,245 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * famfs - dax file system for shared fabric-attached memory + * + * Copyright 2023-2024 Micron Technology, Inc. + * + * This file system, originally based on ramfs the dax support from xfs, + * is intended to allow multiple host systems to mount a common file system + * view of dax files that map to shared memory. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include "famfs_internal.h" + +/* + * iomap_operations + * + * This stuff uses the iomap (dax-related) helpers to resolve file offsets to + * offsets within a dax device. + */ + +/** + * famfs_meta_to_dax_offset() + * + * This function is called by famfs_iomap_begin() to resolve an offset in a file to + * an offset in a dax device. This is upcalled from dax from calls to both + * dax_iomap_fault() and dax_iomap_rw(). 
Dax finishes the job resolving a fault to + * a specific physical page (the fault case) or doing a memcpy variant (the rw case) + * + * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PuD (1GiB) + * (these sizes are for X86; may vary on other cpu architectures + * + * @inode - the file where the fault occurred + * @iomap - struct iomap to be filled in to indicate where to find the right memory, relative + * to a dax device. + * @offset - the offset within the file where the fault occurred (will be page boundary) + * @len- the length of the faulted mapping (will be a page multiple) + * (will be trimmed in *iomap if it's disjoint in the extent list) + * @flags + */ +static int +famfs_meta_to_dax_offset( + struct inode *inode, + struct iomap *iomap, + loff_toffset, + loff_tlen, + unsigned int flags) +{ + struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private; + int i; + loff_t local_offset = offset; + struct famfs_fs_info *fsi = inode->i_sb->s_fs_info; + + iomap->offset = offset; /* file offset */ + + for (i = 0; i < meta->tfs_extent_ct; i++) { + loff_t dax_ext_offset = meta->tfs_extents[i].offset; + loff_t dax_ext_len= meta->tfs_extents[i].len; + + if ((dax_ext_offset == 0) && (meta->file_type != FAMFS_SUPERBLOCK)) + pr_err("%s: zero offset on non-superblock file!!\n", __func__); + + /* local_offset is the offset minus the size of extents skipped so far; +* If local_offset < dax_ext_len, the data of interest starts in this extent +*/ + if (local_offset < dax_ext_len) { + loff_t ext_len_remainder = dax_ext_len - local_offset; + + /*+ +* OK, we found the file metadata extent where this data begins +* @local_offset - The offset within the current extent +* @ext_len_remainder - Remaining length of ext after skipping local_offset +* +* iomap->addr is the offset within the dax device where that data +* starts +*/ + iomap->addr= dax_ext_offset + local_offset; /* dax dev offset */ + iomap->offset = offset; /* file offset */ + iomap->length = 
min_t(loff_t, len, ext_len_remainder); + iomap->dax_dev = fsi->dax_devp; + iomap->type= IOMAP_MAPPED; + iomap->flags = flags; + + return 0; + } + local_offset -= dax_ext_len; /* Get ready for the next extent */ + } + + /* Set iomap to zero length in this case, and return 0 +* This just means that the r/w is past EOF +*/ + iomap->addr= offset; + iomap->offset = offset; /* file offset */ + iomap->length = 0; /* this had better result in no access to dax mem */ + iomap->dax_dev = fsi->dax_devp; + iomap
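Editor's sketch, not part of the patch: the extent walk described in the commit message above can be modelled in plain userspace C. The names `sketch_extent` and `sketch_resolve` are hypothetical stand-ins, not famfs symbols, and the real handler fills in a struct iomap rather than two output pointers. The sketch assumes only what the commit message states, namely that a file's metadata is an ordered extent list on a single dax device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one entry of the per-file extent list. */
struct sketch_extent {
	uint64_t offset;	/* start of the extent within the dax device */
	uint64_t len;		/* extent length in bytes */
};

/*
 * Resolve (file offset, len) to a dax device offset by walking the extent
 * list in order, which is O(N) for N extents. Always returns 0; a zero
 * *map_len means the file offset lies past the last extent.
 */
int sketch_resolve(const struct sketch_extent *ext, size_t ext_ct,
		   uint64_t file_off, uint64_t len,
		   uint64_t *dax_off, uint64_t *map_len)
{
	uint64_t local_off = file_off;	/* offset minus the extents skipped so far */
	size_t i;

	assert(dax_off && map_len);

	for (i = 0; i < ext_ct; i++) {
		if (local_off < ext[i].len) {
			/* The data of interest starts in this extent. */
			uint64_t remainder = ext[i].len - local_off;

			*dax_off = ext[i].offset + local_off;
			/* Trim the mapping so it does not cross into the next extent. */
			*map_len = len < remainder ? len : remainder;
			return 0;
		}
		local_off -= ext[i].len;	/* get ready for the next extent */
	}

	*dax_off = 0;
	*map_len = 0;	/* past EOF: nothing to map */
	return 0;
}
```

With a single extent, which the commit message says is the usual case, the loop exits on its first iteration, so the O(N) walk is effectively constant time.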
[RFC PATCH 14/20] famfs: Add struct file_operations
This commit introduces the famfs file_operations. We call thp_get_unmapped_area() to force PMD page alignment. Our read and write handlers (famfs_dax_read_iter() and famfs_dax_write_iter()) call dax_iomap_rw() to do the work. famfs_file_invalid() checks for various ways a famfs file can be in an invalid state so we can fail I/O or fault resolution in those cases. Those cases include the following: * No famfs metadata * file i_size does not match the originally allocated size * file is not flagged as DAX * errors were detected previously on the file An invalid file can often be fixed by replaying the log, or by umount/mount/log replay - all of which are user space operations. Signed-off-by: John Groves --- fs/famfs/famfs_file.c | 136 ++ 1 file changed, 136 insertions(+) diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c index fc667d5f7be8..5228e9de1e3b 100644 --- a/fs/famfs/famfs_file.c +++ b/fs/famfs/famfs_file.c @@ -19,6 +19,142 @@ #include #include "famfs_internal.h" +/* + * file_operations + */ + +/* Reject I/O to files that aren't in a valid state */ +static ssize_t +famfs_file_invalid(struct inode *inode) +{ + size_t i_size = i_size_read(inode); + struct famfs_file_meta *meta = inode->i_private; + + if (!meta) { + pr_err("%s: un-initialized famfs file\n", __func__); + return -EIO; + } + if (i_size != meta->file_size) { + pr_err("%s: something changed the size from %ld to %ld\n", + __func__, meta->file_size, i_size); + meta->error = 1; + return -ENXIO; + } + if (!IS_DAX(inode)) { + pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode); + meta->error = 1; + return -ENXIO; + } + if (meta->error) { + pr_err("%s: previously detected metadata errors\n", __func__); + meta->error = 1; + return -EIO; + } + return 0; +} + +static ssize_t +famfs_dax_read_iter( + struct kiocb*iocb, + struct iov_iter *to) +{ + struct inode *inode = iocb->ki_filp->f_mapping->host; + size_t i_size = i_size_read(inode); + size_t count= iov_iter_count(to); + size_t 
max_count; + ssize_t rc; + + rc = famfs_file_invalid(inode); + if (rc) + return rc; + + max_count = max_t(size_t, 0, i_size - iocb->ki_pos); + + if (count > max_count) + iov_iter_truncate(to, max_count); + + if (!iov_iter_count(to)) + return 0; + + rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops); + + file_accessed(iocb->ki_filp); + return rc; +} + +/** + * famfs_write_iter() + * + * We need our own write-iter in order to prevent append + */ +static ssize_t +famfs_dax_write_iter( + struct kiocb*iocb, + struct iov_iter *from) +{ + struct inode *inode = iocb->ki_filp->f_mapping->host; + size_t i_size = i_size_read(inode); + size_t count= iov_iter_count(from); + size_t max_count; + ssize_t rc; + + rc = famfs_file_invalid(inode); + if (rc) + return rc; + + /* Starting offset of write is: iocb->ki_pos +* length is iov_iter_count(from) +*/ + max_count = max_t(size_t, 0, i_size - iocb->ki_pos); + + /* If write would go past EOF, truncate it to end at EOF since famfs does not +* alloc-on-write +*/ + if (count > max_count) + iov_iter_truncate(from, max_count); + + if (!iov_iter_count(from)) + return 0; + + return dax_iomap_rw(iocb, from, &famfs_iomap_ops); +} + +static int +famfs_file_mmap( + struct file *file, + struct vm_area_struct *vma) +{ + struct inode*inode = file_inode(file); + ssize_t rc; + + rc = famfs_file_invalid(inode); + if (rc) + return (int)rc; + + file_accessed(file); + vma->vm_ops = &famfs_file_vm_ops; + vm_flags_set(vma, VM_HUGEPAGE); + return 0; +} + +const struct file_operations famfs_file_operations = { + .owner = THIS_MODULE, + + /* Custom famfs operations */ + .write_iter= famfs_dax_write_iter, + .read_iter = famfs_dax_read_iter, + .mmap = famfs_file_mmap, + + /* Force PMD alignment for mmap */ + .get_unmapped_area = thp_get_unmapped_area, + + /* Generic Operations */ + .fsync = noop_fsync, + .splice_read = filemap_splice_read, + .splice_write = iter_file_splice_write, + .llseek= generic_file_llseek, +}; + /* * iomap_operatio
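Editor's sketch, not part of the patch: the EOF clamp used by the read and write handlers can be illustrated in userspace C. `sketch_clamp_to_eof` is a hypothetical helper, not a famfs or kernel function. It compares before subtracting because `max_t(size_t, 0, i_size - iocb->ki_pos)` is evaluated on an unsigned type, so it appears unable to catch a position beyond EOF; treat that reading as an assumption to verify against the full patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes that may be transferred starting at pos without crossing
 * EOF. Checking pos against file_size first avoids unsigned wrap-around
 * when pos is at or past the end of the file.
 */
size_t sketch_clamp_to_eof(uint64_t file_size, int64_t pos, size_t count)
{
	uint64_t room;

	if (pos < 0 || (uint64_t)pos >= file_size)
		return 0;	/* nothing can be read or written here */

	room = file_size - (uint64_t)pos;
	return count < room ? count : (size_t)room;
}
```

A caller would truncate its iterator to the returned length and skip the I/O entirely when the result is zero, which matches the shape of the handlers above.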
[RFC PATCH 15/20] famfs: Add ioctl to file_operations
This commit introduces the per-file ioctl function famfs_file_ioctl() into struct file_operations, and introduces the famfs_file_init_dax() function (which is called by famfs_file_ioctl()). famfs_file_init_dax() associates a dax extent list with a file, making it into a proper famfs file. It is called from the FAMFSIOC_MAP_CREATE ioctl. Starting with an empty file (which is basically a ramfs file), this turns the file into a DAX file backed by the specified extent list. The other ioctls are: FAMFSIOC_NOP - A convenient way for user space to verify it's a famfs file FAMFSIOC_MAP_GET - Get the header of the metadata for a file FAMFSIOC_MAP_GETEXT - Get the extents for a file The latter two, together, are comparable to xfs_bmap. Our user space tools use them primarily in testing. Signed-off-by: John Groves --- fs/famfs/famfs_file.c | 226 ++ 1 file changed, 226 insertions(+) diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c index 5228e9de1e3b..fd42d5966982 100644 --- a/fs/famfs/famfs_file.c +++ b/fs/famfs/famfs_file.c @@ -19,6 +19,231 @@ #include #include "famfs_internal.h" +/** + * famfs_map_meta_alloc() - Allocate famfs file metadata + * @mapp: Pointer to an mcache_map_meta pointer + * @ext_count: The number of extents needed + */ +static int +famfs_meta_alloc( + struct famfs_file_meta **metap, + size_t ext_count) +{ + struct famfs_file_meta *meta; + size_t metasz; + + *metap = NULL; + + metasz = sizeof(*meta) + sizeof(*(meta->tfs_extents)) * ext_count; + + meta = kzalloc(metasz, GFP_KERNEL); + if (!meta) + return -ENOMEM; + + meta->tfs_extent_ct = ext_count; + *metap = meta; + + return 0; +} + +static void +famfs_meta_free( + struct famfs_file_meta *map) +{ + kfree(map); +} + +/** + * famfs_file_init_dax() - FAMFSIOC_MAP_CREATE ioctl handler + * @file: + * @arg: ptr to struct mcioc_map in user space + * + * Setup the dax mapping for a file.
Files are created empty, and then function is called + * (by famfs_file_ioctl()) to setup the mapping and set the file size. + */ +static int +famfs_file_init_dax( + struct file*file, + void __user*arg) +{ + struct famfs_extent*tfs_extents = NULL; + struct famfs_file_meta *meta = NULL; + struct inode *inode; + struct famfs_ioc_mapimap; + struct famfs_fs_info *fsi; + struct super_block *sb; + intalignment_errs = 0; + size_t extent_total = 0; + size_t ext_count; + intrc = 0; + inti; + + rc = copy_from_user(&imap, arg, sizeof(imap)); + if (rc) + return -EFAULT; + + ext_count = imap.ext_list_count; + if (ext_count < 1) { + rc = -ENOSPC; + goto errout; + } + + if (ext_count > FAMFS_MAX_EXTENTS) { + rc = -E2BIG; + goto errout; + } + + inode = file_inode(file); + if (!inode) { + rc = -EBADF; + goto errout; + } + sb = inode->i_sb; + fsi = inode->i_sb->s_fs_info; + + tfs_extents = &imap.ext_list[0]; + + rc = famfs_meta_alloc(&meta, ext_count); + if (rc) + goto errout; + + meta->file_type = imap.file_type; + meta->file_size = imap.file_size; + + /* Fill in the internal file metadata structure */ + for (i = 0; i < imap.ext_list_count; i++) { + size_t len; + off_t offset; + + offset = imap.ext_list[i].offset; + len= imap.ext_list[i].len; + + extent_total += len; + + if (WARN_ON(offset == 0 && meta->file_type != FAMFS_SUPERBLOCK)) { + rc = -EINVAL; + goto errout; + } + + meta->tfs_extents[i].offset = offset; + meta->tfs_extents[i].len= len; + + /* All extent addresses/offsets must be 2MiB aligned, +* and all but the last length must be a 2MiB multiple. 
+*/ + if (!IS_ALIGNED(offset, PMD_SIZE)) { + pr_err("%s: error ext %d hpa %lx not aligned\n", + __func__, i, offset); + alignment_errs++; + } + if (i < (imap.ext_list_count - 1) && !IS_ALIGNED(len, PMD_SIZE)) { + pr_err("%s: error ext %d length %ld not aligned\n", + __func__, i, len); + alignment_errs++; + } + } + + /* +* File size can be <= ext list size, since extent sizes are constrained +* to PMD multiples +*/ + if (imap.file_size > extent_total) { + pr_err("%s: file size %lld larger than ext list size %lld\n", +
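Editor's sketch, not part of the patch: the extent-list checks performed by the FAMFSIOC_MAP_CREATE path can be restated as a small userspace C function. `val_extent`, `SKETCH_PMD_SIZE`, and `sketch_validate_extents` are hypothetical names, the 2MiB constant assumes x86-64, and the patch's extra check that only the superblock file may start at offset 0 is omitted. Unlike the patch, which counts alignment errors and logs each one, this version simply rejects on the first problem.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE (2ULL << 20)	/* 2MiB; assumed PMD size (x86-64) */

/* Hypothetical stand-in for one user-supplied extent. */
struct val_extent {
	uint64_t offset;	/* offset within the dax device */
	uint64_t len;		/* length in bytes */
};

/* Returns 0 if the extent list can back a file of file_size bytes, else -1. */
int sketch_validate_extents(const struct val_extent *ext, size_t ext_ct,
			    uint64_t file_size)
{
	uint64_t total = 0;
	size_t i;

	if (ext_ct < 1)
		return -1;	/* a famfs file needs at least one extent */

	for (i = 0; i < ext_ct; i++) {
		/* Every extent must start on a 2MiB boundary. */
		if (ext[i].offset % SKETCH_PMD_SIZE)
			return -1;
		/* All but the last extent must be a 2MiB multiple in length. */
		if (i < ext_ct - 1 && ext[i].len % SKETCH_PMD_SIZE)
			return -1;
		total += ext[i].len;
	}

	/* The file may be smaller than the extents, never larger. */
	return file_size <= total ? 0 : -1;
}
```

Allowing the last extent to have an unaligned length is what lets the file size be less than or equal to the extent total, as the comment in the patch notes.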
[RFC PATCH 16/20] famfs: Add fault counters
One of the key requirements for famfs is that it service vma faults efficiently. Our metadata helps - the search order is n for n extents, and n is usually 1. But we can still observe gnarly lock contention in mm if PTE faults are happening. This commit introduces fault counters that can be enabled and read via /sys/fs/famfs/... These counters have proved useful in troubleshooting situations where PTE faults were happening instead of PMD. No performance impact when disabled. Signed-off-by: John Groves --- fs/famfs/famfs_file.c | 97 +++ fs/famfs/famfs_internal.h | 73 + 2 files changed, 170 insertions(+) diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c index fd42d5966982..a626f8a89790 100644 --- a/fs/famfs/famfs_file.c +++ b/fs/famfs/famfs_file.c @@ -19,6 +19,100 @@ #include #include "famfs_internal.h" +/*** + * filemap_fault counters + * + * The counters and the fault_count_enable file live at + * /sys/fs/famfs/ + */ +struct famfs_fault_counters ffc; +static int fault_count_enable; + +static ssize_t +fault_count_enable_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", fault_count_enable); +} + +static ssize_t +fault_count_enable_store(struct kobject*kobj, +struct kobj_attribute *attr, +const char*buf, +size_t count) +{ + int value; + int rc; + + rc = sscanf(buf, "%d", &value); + if (rc != 1) + return 0; + + if (value > 0) /* clear fault counters when enabling, but not when disabling */ + famfs_clear_fault_counters(&ffc); + + fault_count_enable = value; + return count; +} + +/* Individual fault counters are read-only */ +static ssize_t +fault_count_pte_show(struct kobject *kobj, +struct kobj_attribute *attr, +char *buf) +{ + return sprintf(buf, "%llu", famfs_pte_fault_ct(&ffc)); +} + +static ssize_t +fault_count_pmd_show(struct kobject *kobj, +struct kobj_attribute *attr, +char *buf) +{ + return sprintf(buf, "%llu", famfs_pmd_fault_ct(&ffc)); +} + +static ssize_t +fault_count_pud_show(struct kobject 
*kobj, +struct kobj_attribute *attr, +char *buf) +{ + return sprintf(buf, "%llu", famfs_pud_fault_ct(&ffc)); +} + +static struct kobj_attribute fault_count_enable_attribute = __ATTR(fault_count_enable, + 0660, + fault_count_enable_show, + fault_count_enable_store); +static struct kobj_attribute fault_count_pte_attribute = __ATTR(pte_fault_ct, + 0440, + fault_count_pte_show, + NULL); +static struct kobj_attribute fault_count_pmd_attribute = __ATTR(pmd_fault_ct, + 0440, + fault_count_pmd_show, + NULL); +static struct kobj_attribute fault_count_pud_attribute = __ATTR(pud_fault_ct, + 0440, + fault_count_pud_show, + NULL); + + +static struct attribute *attrs[] = { + &fault_count_enable_attribute.attr, + &fault_count_pte_attribute.attr, + &fault_count_pmd_attribute.attr, + &fault_count_pud_attribute.attr, + NULL, +}; + +struct attribute_group famfs_attr_group = { + .attrs = attrs, +}; + +/* End fault counters */ + /** * famfs_map_meta_alloc() - Allocate famfs file metadata * @mapp: Pointer to an mcache_map_meta pointer @@ -525,6 +619,9 @@ __famfs_filemap_fault( if (IS_DAX(inode)) { pfn_t pfn; + if (fault_count_enable) + famfs_inc_fault_counter_by_order(&ffc, pe_size); + ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops); if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(vmf, pe_size, pfn); diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h index af3990d43305..987cb172a149 100644 --- a/fs/famfs/famfs_internal.h +++ b/fs/famfs/famfs_internal.h @@ -50,4 +50,77 @@ struct famfs_fs_info {
[RFC PATCH 17/20] famfs: Add module stuff
This commit introduces the module init and exit machinery for famfs. Signed-off-by: John Groves --- fs/famfs/famfs_inode.c | 44 ++ 1 file changed, 44 insertions(+) diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c index ab46ec50b70d..0d659820e8ff 100644 --- a/fs/famfs/famfs_inode.c +++ b/fs/famfs/famfs_inode.c @@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = { .fs_flags = FS_USERNS_MOUNT, }; +/* + * Module stuff + */ +static struct kobject *famfs_kobj; + +static int __init init_famfs_fs(void) +{ + int rc; + +#if defined(CONFIG_DEV_DAX_IOMAP) + pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__); +#else + pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__); +#endif + famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj); + if (!famfs_kobj) { + pr_warn("Failed to create kobject\n"); + return -ENOMEM; + } + + rc = sysfs_create_group(famfs_kobj, &famfs_attr_group); + if (rc) { + kobject_put(famfs_kobj); + pr_warn("%s: Failed to create sysfs group\n", __func__); + return rc; + } + + return register_filesystem(&famfs_fs_type); +} + +static void +__exit famfs_exit(void) +{ + sysfs_remove_group(famfs_kobj, &famfs_attr_group); + kobject_put(famfs_kobj); + unregister_filesystem(&famfs_fs_type); + pr_info("%s: unregistered\n", __func__); +} + + +fs_initcall(init_famfs_fs); +module_exit(famfs_exit); + +MODULE_AUTHOR("John Groves, Micron Technology"); MODULE_LICENSE("GPL"); -- 2.43.0
[RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch
This commit introduces the ability to open a character /dev/dax device instead of a block /dev/pmem device. This rests on the dev_dax_iomap patches earlier in this series. Signed-off-by: John Groves --- fs/famfs/famfs_inode.c | 97 +- 1 file changed, 87 insertions(+), 10 deletions(-) diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c index 0d659820e8ff..7d65ac497147 100644 --- a/fs/famfs/famfs_inode.c +++ b/fs/famfs/famfs_inode.c @@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = { .show_options = famfs_show_options, }; +/*/ + +#if defined(CONFIG_DEV_DAX_IOMAP) + +/* + * famfs dax_operations (for char dax) + */ +static int +famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset, + u64 len, int mf_flags) +{ + pr_err("%s: offset %lld len %llu flags %x\n", __func__, + offset, len, mf_flags); + return -EOPNOTSUPP; +} + +static const struct dax_holder_operations famfs_dax_holder_ops = { + .notify_failure = famfs_dax_notify_failure, +}; + +/*/ + +/** + * famfs_open_char_device() + * + * Open a /dev/dax device. 
This only works in kernels with the dev_dax_iomap patch + */ +static int +famfs_open_char_device( + struct super_block *sb, + struct fs_context *fc) +{ + struct famfs_fs_info *fsi = sb->s_fs_info; + struct dax_device*dax_devp; + struct inode *daxdev_inode; + + int rc = 0; + + pr_notice("%s: Opening character dax device %s\n", __func__, fc->source); + + fsi->dax_filp = filp_open(fc->source, O_RDWR, 0); + if (IS_ERR(fsi->dax_filp)) { + pr_err("%s: failed to open dax device %s\n", + __func__, fc->source); + fsi->dax_filp = NULL; + return PTR_ERR(fsi->dax_filp); + } + + daxdev_inode = file_inode(fsi->dax_filp); + dax_devp = inode_dax(daxdev_inode); + if (IS_ERR(dax_devp)) { + pr_err("%s: unable to get daxdev from inode for %s\n", + __func__, fc->source); + rc = -ENODEV; + goto char_err; + } + + rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops); + if (rc) { + pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__); + goto char_err; + } + + fsi->bdev_handle = NULL; + fsi->dax_devp = dax_devp; + + return 0; + +char_err: + filp_close(fsi->dax_filp, NULL); + return rc; +} + +#else /* CONFIG_DEV_DAX_IOMAP */ +static int +famfs_open_char_device( + struct super_block *sb, + struct fs_context *fc) +{ + pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n", + __func__, fc->source); + return -ENODEV; +} + + +#endif /* CONFIG_DEV_DAX_IOMAP */ + /*** * dax_holder_operations for block dax */ @@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = { .notify_failure = famfs_blk_dax_notify_failure, }; -static int -famfs_open_char_device( - struct super_block *sb, - struct fs_context *fc) -{ - pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n", - __func__, fc->source); - return -ENODEV; -} - /** * famfs_open_device() * -- 2.43.0
[RFC PATCH 19/20] famfs: Update MAINTAINERS file
This patch introduces famfs into the MAINTAINERS file Signed-off-by: John Groves --- MAINTAINERS | 11 +++ 1 file changed, 11 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 73d898383e51..e4e8bf3602bb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8097,6 +8097,17 @@ F: Documentation/networking/failover.rst F: include/net/failover.h F: net/core/failover.c +FAMFS +M: John Groves +M: John Groves +M: John Groves +L: linux-...@vger.kernel.org +L: linux-fsde...@vger.kernel.org +S: Supported +F: Documentation/filesystems/famfs.rst +F: fs/famfs +F: include/uapi/linux/famfs_ioctl.h + FANOTIFY M: Jan Kara R: Amir Goldstein -- 2.43.0
[RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing
Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile Signed-off-by: John Groves --- fs/Kconfig| 2 ++ fs/Makefile | 1 + fs/famfs/Kconfig | 10 ++ fs/famfs/Makefile | 5 + 4 files changed, 18 insertions(+) create mode 100644 fs/famfs/Kconfig create mode 100644 fs/famfs/Makefile diff --git a/fs/Kconfig b/fs/Kconfig index 89fdbefd1075..8a11625a54a2 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -141,6 +141,8 @@ source "fs/autofs/Kconfig" source "fs/fuse/Kconfig" source "fs/overlayfs/Kconfig" +source "fs/famfs/Kconfig" + menu "Caches" source "fs/netfs/Kconfig" diff --git a/fs/Makefile b/fs/Makefile index c09016257f05..382c1ea4f4c3 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS)+= vboxsf/ obj-$(CONFIG_ZONEFS_FS)+= zonefs/ +obj-$(CONFIG_FAMFS) += famfs/ diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig new file mode 100644 index ..e450928d8912 --- /dev/null +++ b/fs/famfs/Kconfig @@ -0,0 +1,10 @@ + + +config FAMFS + tristate "famfs: shared memory file system" + depends on DEV_DAX && FS_DAX + help + Support for the famfs file system. Famfs is a dax file system that +can support scale-out shared access to fabric-attached memory +(e.g. CXL shared memory). Famfs is not a general purpose file system; +it is an enabler for data sets in shared memory. diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile new file mode 100644 index ..8cac90c090a4 --- /dev/null +++ b/fs/famfs/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-$(CONFIG_FAMFS) += famfs.o + +famfs-y := famfs_inode.o famfs_file.o -- 2.43.0
Re: [RFC PATCH 16/20] famfs: Add fault counters
On 2/23/24 09:42, John Groves wrote: > One of the key requirements for famfs is that it service vma faults > efficiently. Our metadata helps - the search order is n for n extents, > and n is usually 1. But we can still observe gnarly lock contention > in mm if PTE faults are happening. This commit introduces fault counters > that can be enabled and read via /sys/fs/famfs/... > > These counters have proved useful in troubleshooting situations where > PTE faults were happening instead of PMD. No performance impact when > disabled. This seems kinda wonky. Why does _this_ specific filesystem need its own fault counters. Seems like something we'd want to do much more generically, if it is needed at all. Was the issue here just that vm_ops->fault() was getting called instead of ->huge_fault()? Or something more subtle?
Re: [RFC PATCH 16/20] famfs: Add fault counters
On 24/02/23 10:23AM, Dave Hansen wrote:
> On 2/23/24 09:42, John Groves wrote:
> > One of the key requirements for famfs is that it service vma faults
> > efficiently. Our metadata helps - the search order is n for n extents,
> > and n is usually 1. But we can still observe gnarly lock contention
> > in mm if PTE faults are happening. This commit introduces fault counters
> > that can be enabled and read via /sys/fs/famfs/...
> >
> > These counters have proved useful in troubleshooting situations where
> > PTE faults were happening instead of PMD. No performance impact when
> > disabled.
>
> This seems kinda wonky. Why does _this_ specific filesystem need its
> own fault counters. Seems like something we'd want to do much more
> generically, if it is needed at all.
>
> Was the issue here just that vm_ops->fault() was getting called instead
> of ->huge_fault()? Or something more subtle?

Thanks for your reply Dave!

First, I'm willing to pull the fault counters out if the brain trust doesn't
like them.

I put them in because we were running benchmarks of computational data
analytics and noted that jobs took 3x as long on famfs as raw dax -
which indicated I was doing something wrong, because it should be equivalent
or very close.

The solution was to call thp_get_unmapped_area() in
famfs_file_operations, and performance doesn't vary significantly from raw
dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.

After that I wanted a way to be double-secret-certain that it was servicing
PMD faults as intended. Which it basically always is, so far. (The smoke
tests in user space check this.)

John
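
For readers following the alignment point above, here is a minimal user-space sketch of what "PMD aligned" means for an mmap address. It is not famfs code and does not call thp_get_unmapped_area(); it only shows the arithmetic. ASSUMED_PMD_SIZE (2 MiB, as on x86-64 with 4 KiB base pages) and both helper names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD mappings (x86-64 with 4 KiB base pages). */
#define ASSUMED_PMD_SIZE (2UL * 1024 * 1024)

/* Hypothetical helper: nonzero if addr sits on an assumed PMD boundary. */
static int is_pmd_aligned(uintptr_t addr)
{
	return (addr & (ASSUMED_PMD_SIZE - 1)) == 0;
}

/* Hypothetical helper: round addr up to the next assumed PMD boundary. */
static uintptr_t pmd_align_up(uintptr_t addr)
{
	/* Guard against wrap-around when adding the rounding term. */
	assert(addr <= UINTPTR_MAX - (ASSUMED_PMD_SIZE - 1));
	return (addr + ASSUMED_PMD_SIZE - 1) & ~(uintptr_t)(ASSUMED_PMD_SIZE - 1);
}
```

A mapping whose start address fails is_pmd_aligned() cannot be served by a single PMD-sized entry at that address, which is consistent with the PTE-fault behaviour described in the mail; a test could apply the same check to the address returned by mmap().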
Re: [RFC PATCH 16/20] famfs: Add fault counters
John Groves wrote: > On 24/02/23 10:23AM, Dave Hansen wrote: > > On 2/23/24 09:42, John Groves wrote: > > > One of the key requirements for famfs is that it service vma faults > > > efficiently. Our metadata helps - the search order is n for n extents, > > > and n is usually 1. But we can still observe gnarly lock contention > > > in mm if PTE faults are happening. This commit introduces fault counters > > > that can be enabled and read via /sys/fs/famfs/... > > > > > > These counters have proved useful in troubleshooting situations where > > > PTE faults were happening instead of PMD. No performance impact when > > > disabled. > > > > This seems kinda wonky. Why does _this_ specific filesystem need its > > own fault counters. Seems like something we'd want to do much more > > generically, if it is needed at all. > > > > Was the issue here just that vm_ops->fault() was getting called instead > > of ->huge_fault()? Or something more subtle? > > Thanks for your reply Dave! > > First, I'm willing to pull the fault counters out if the brain trust doesn't > like them. > > I put them in because we were running benchmarks of computational data > analytics and and noted that jobs took 3x as long on famfs as raw dax - > which indicated I was doing something wrong, because it should be equivalent > or very close. > > The the solution was to call thp_get_unmapped_area() in > famfs_file_operations, and performance doesn't vary significantly from raw > dax now. Prior to that I wasn't making sure the mmap address was PMD aligned. > > After that I wanted a way to be double-secret-certain that it was servicing > PMD faults as intended. Which it basically always is, so far. (The smoke > tests in user space check this.) We had similar unit test regression concerns with fsdax where some upstream change silently broke PMD faults. 
The solution there was trace points in the fault handlers and a basic test that knows a priori that it *should* be triggering a certain number of huge faults:

https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
Re: [RFC PATCH 16/20] famfs: Add fault counters
On 24/02/23 12:04PM, Dan Williams wrote: > John Groves wrote: > > On 24/02/23 10:23AM, Dave Hansen wrote: > > > On 2/23/24 09:42, John Groves wrote: > > > > One of the key requirements for famfs is that it service vma faults > > > > efficiently. Our metadata helps - the search order is n for n extents, > > > > and n is usually 1. But we can still observe gnarly lock contention > > > > in mm if PTE faults are happening. This commit introduces fault counters > > > > that can be enabled and read via /sys/fs/famfs/... > > > > > > > > These counters have proved useful in troubleshooting situations where > > > > PTE faults were happening instead of PMD. No performance impact when > > > > disabled. > > > > > > This seems kinda wonky. Why does _this_ specific filesystem need its > > > own fault counters. Seems like something we'd want to do much more > > > generically, if it is needed at all. > > > > > > Was the issue here just that vm_ops->fault() was getting called instead > > > of ->huge_fault()? Or something more subtle? > > > > Thanks for your reply Dave! > > > > First, I'm willing to pull the fault counters out if the brain trust doesn't > > like them. > > > > I put them in because we were running benchmarks of computational data > > analytics and and noted that jobs took 3x as long on famfs as raw dax - > > which indicated I was doing something wrong, because it should be equivalent > > or very close. > > > > The the solution was to call thp_get_unmapped_area() in > > famfs_file_operations, and performance doesn't vary significantly from raw > > dax now. Prior to that I wasn't making sure the mmap address was PMD > > aligned. > > > > After that I wanted a way to be double-secret-certain that it was servicing > > PMD faults as intended. Which it basically always is, so far. (The smoke > > tests in user space check this.) > > We had similar unit test regression concerns with fsdax where some > upstream change silently broke PMD faults. 
The solution there was trace > points in the fault handlers and a basic test that knows apriori that it > *should* be triggering a certain number of huge faults: > > https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31 Good approach, thanks Dan! My working assumption is that we'll be able to make that approach work in the famfs tests. So the fault counters should go away in the next version. John
Re: [RFC PATCH 16/20] famfs: Add fault counters
On 2/23/24 12:39, John Groves wrote:
>> We had similar unit test regression concerns with fsdax where some
>> upstream change silently broke PMD faults. The solution there was trace
>> points in the fault handlers and a basic test that knows apriori that it
>> *should* be triggering a certain number of huge faults:
>>
>> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> Good approach, thanks Dan! My working assumption is that we'll be able to make
> that approach work in the famfs tests. So the fault counters should go away
> in the next version.

I do really suspect there's something more generic that should be done
here. Maybe we need a generic 'huge_faults' perf event to pair up with
the good ol' faults that we already have:

# perf stat -e faults /bin/ls

 Performance counter stats for '/bin/ls':

               104      faults

       0.001499862 seconds time elapsed

       0.00149 seconds user
       0.0 seconds sys
Re: [RFC PATCH 16/20] famfs: Add fault counters
Dave Hansen wrote:
> On 2/23/24 12:39, John Groves wrote:
> >> We had similar unit test regression concerns with fsdax where some
> >> upstream change silently broke PMD faults. The solution there was trace
> >> points in the fault handlers and a basic test that knows apriori that it
> >> *should* be triggering a certain number of huge faults:
> >>
> >> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> > Good approach, thanks Dan! My working assumption is that we'll be able to
> > make that approach work in the famfs tests. So the fault counters should
> > go away in the next version.
>
> I do really suspect there's something more generic that should be done
> here. Maybe we need a generic 'huge_faults' perf event to pair up with
> the good ol' faults that we already have:
>
> # perf stat -e faults /bin/ls
>
>  Performance counter stats for '/bin/ls':
>
>                104      faults
>
>        0.001499862 seconds time elapsed
>
>        0.00149 seconds user
>        0.0 seconds sys

Certainly something like that would have satisfied this sanity test use
case. I will note that mm_account_fault() would need some help to figure
out the size of the page table entry that got installed. Maybe
extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
from PUD to PMD, or all the way back to PTE.

Then use cases like this could just add a dynamic probe in
mm_account_fault(). No real need for a new tracepoint unless there was a
use case for this outside of regression testing fault handlers, right?
Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote: > This patch set introduces famfs[1] - a special-purpose fs-dax file system > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not > CXL-specific in anyway way. > > * Famfs creates a simple access method for storing and sharing data in > sharable memory. The memory is exposed and accessed as memory-mappable > dax files. > * Famfs supports multiple hosts mounting the same file system from the > same memory (something existing fs-dax file systems don't do). > * A famfs file system can be created on either a /dev/pmem device in fs-dax > mode, or a /dev/dax device in devdax mode (the latter depending on > patches 2-6 of this series). > > The famfs kernel file system is part the famfs framework; additional > components in user space[2] handle metadata and direct the famfs kernel > module to instantiate files that map to specific memory. The famfs user > space has documentation and a reasonably thorough test suite. > > The famfs kernel module never accesses the shared memory directly (either > data or metadata). Because of this, shared memory managed by the famfs > framework does not create a RAS "blast radius" problem that should be able > to crash or de-stabilize the kernel. Poison or timeouts in famfs memory > can be expected to kill apps via SIGBUS and cause mounts to be disabled > due to memory failure notifications. > > Famfs does not attempt to solve concurrency or coherency problems for apps, > although it does solve these problems in regard to its own data structures. > Apps may encounter hard concurrency problems, but there are use cases that > are imminently useful and uncomplicated from a concurrency perspective: > serial sharing is one (only one host at a time has access), and read-only > concurrent sharing is another (all hosts can read-cache without worry). 
Can you do me a favor, curious if you can run a test like this:

fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring -direct=1 --group_reporting=1 --alloc-size=1048576 --filesize=1GiB --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1 --directory=/mnt

What do you get for throughput? The larger the system and its capacity, the better.

Luis
Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
Hi--

On 2/23/24 09:41, John Groves wrote:
> Add uapi include file for famfs. The famfs user space uses ioctl on
> individual files to pass in mapping information and file size. This
> would be hard to do via sysfs or other means, since it's
> file-specific.
>
> Signed-off-by: John Groves
> ---
>  include/uapi/linux/famfs_ioctl.h | 56
>  1 file changed, 56 insertions(+)
>  create mode 100644 include/uapi/linux/famfs_ioctl.h
>
> diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> new file mode 100644
> index ..6b3e6452d02f
> --- /dev/null
> +++ b/include/uapi/linux/famfs_ioctl.h
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs the dax support from xfs,

This is confusing to me. Is it just me? ^

> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +#ifndef FAMFS_IOCTL_H
> +#define FAMFS_IOCTL_H
> +
> +#include
> +#include
> +
> +#define FAMFS_MAX_EXTENTS 2
> +
> +enum extent_type {
> +	SIMPLE_DAX_EXTENT = 13,
> +	INVALID_EXTENT_TYPE,
> +};
> +
> +struct famfs_extent {
> +	__u64 offset;
> +	__u64 len;
> +};
> +
> +enum famfs_file_type {
> +	FAMFS_REG,
> +	FAMFS_SUPERBLOCK,
> +	FAMFS_LOG,
> +};
> +
> +/**

"/**" is used to begin kernel-doc comments, but this comment block is missing
a few entries to make it be kernel-doc compatible. Please either add them
or just use "/*" to begin the comment.

> + * struct famfs_ioc_map
> + *
> + * This is the metadata that indicates where the memory is for a famfs file
> + */
> +struct famfs_ioc_map {
> +	enum extent_type extent_type;
> +	enum famfs_file_type file_type;
> +	__u64 file_size;
> +	__u64 ext_list_count;
> +	struct famfs_extent ext_list[FAMFS_MAX_EXTENTS];
> +};
> +
> +#define FAMFSIOC_MAGIC 'u'

This 'u' value should be documented in
Documentation/userspace-api/ioctl/ioctl-number.rst.

and if possible, you might want to use values like 0x5x or 0x8x
that don't conflict with the ioctl numbers that are already used
in the 'u' space.

> +
> +/* famfs file ioctl opcodes */
> +#define FAMFSIOC_MAP_CREATE _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GET _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GETEXT _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> +#define FAMFSIOC_NOP _IO(FAMFSIOC_MAGIC, 4)
> +
> +#endif /* FAMFS_IOCTL_H */

-- 
#Randy
Re: [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing
Hi, On 2/23/24 09:42, John Groves wrote: > Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile > > Signed-off-by: John Groves > --- > fs/Kconfig| 2 ++ > fs/Makefile | 1 + > fs/famfs/Kconfig | 10 ++ > fs/famfs/Makefile | 5 + > 4 files changed, 18 insertions(+) > create mode 100644 fs/famfs/Kconfig > create mode 100644 fs/famfs/Makefile > > diff --git a/fs/Kconfig b/fs/Kconfig > index 89fdbefd1075..8a11625a54a2 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -141,6 +141,8 @@ source "fs/autofs/Kconfig" > source "fs/fuse/Kconfig" > source "fs/overlayfs/Kconfig" > > +source "fs/famfs/Kconfig" > + > menu "Caches" > > source "fs/netfs/Kconfig" > diff --git a/fs/Makefile b/fs/Makefile > index c09016257f05..382c1ea4f4c3 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ > obj-$(CONFIG_EROFS_FS) += erofs/ > obj-$(CONFIG_VBOXSF_FS) += vboxsf/ > obj-$(CONFIG_ZONEFS_FS) += zonefs/ > +obj-$(CONFIG_FAMFS) += famfs/ > diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig > new file mode 100644 > index ..e450928d8912 > --- /dev/null > +++ b/fs/famfs/Kconfig > @@ -0,0 +1,10 @@ > + > + > +config FAMFS > + tristate "famfs: shared memory file system" > + depends on DEV_DAX && FS_DAX > + help > + Support for the famfs file system. Famfs is a dax file system that > + can support scale-out shared access to fabric-attached memory > + (e.g. CXL shared memory). Famfs is not a general purpose file system; > + it is an enabler for data sets in shared memory. Please use one tab + 2 spaces to indent help text (below the "help" keyword) as documented in Documentation/process/coding-style.rst. > diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile > new file mode 100644 > index ..8cac90c090a4 > --- /dev/null > +++ b/fs/famfs/Makefile > @@ -0,0 +1,5 @@ > +# SPDX-License-Identifier: GPL-2.0 > + > +obj-$(CONFIG_FAMFS) += famfs.o > + > +famfs-y := famfs_inode.o famfs_file.o -- #Randy
Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
On 24/02/23 05:39PM, Randy Dunlap wrote:
> Hi--
>
> On 2/23/24 09:41, John Groves wrote:
> > Add uapi include file for famfs. The famfs user space uses ioctl on
> > individual files to pass in mapping information and file size. This
> > would be hard to do via sysfs or other means, since it's
> > file-specific.
> >
> > Signed-off-by: John Groves
> > ---
> >  include/uapi/linux/famfs_ioctl.h | 56
> >  1 file changed, 56 insertions(+)
> >  create mode 100644 include/uapi/linux/famfs_ioctl.h
> >
> > diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> > new file mode 100644
> > index ..6b3e6452d02f
> > --- /dev/null
> > +++ b/include/uapi/linux/famfs_ioctl.h
> > @@ -0,0 +1,56 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
>
> This is confusing to me. Is it just me? ^

Thanks Randy. I think I was trying to say "based on ramfs *plus* the dax
support from xfs". But I'll try to come up with something more clear than
that...

> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +#ifndef FAMFS_IOCTL_H
> > +#define FAMFS_IOCTL_H
> > +
> > +#include
> > +#include
> > +
> > +#define FAMFS_MAX_EXTENTS 2
> > +
> > +enum extent_type {
> > +	SIMPLE_DAX_EXTENT = 13,
> > +	INVALID_EXTENT_TYPE,
> > +};
> > +
> > +struct famfs_extent {
> > +	__u64 offset;
> > +	__u64 len;
> > +};
> > +
> > +enum famfs_file_type {
> > +	FAMFS_REG,
> > +	FAMFS_SUPERBLOCK,
> > +	FAMFS_LOG,
> > +};
> > +
> > +/**
>
> "/**" is used to begin kernel-doc comments, but this comment block is missing
> a few entries to make it be kernel-doc compatible. Please either add them
> or just use "/*" to begin the comment.

Will do, thanks. And I'll check the whole code base for other instances; I
won't be surprised if I was sloppy about that in more than one place.

> > + * struct famfs_ioc_map
> > + *
> > + * This is the metadata that indicates where the memory is for a famfs file
> > + */
> > +struct famfs_ioc_map {
> > +	enum extent_type extent_type;
> > +	enum famfs_file_type file_type;
> > +	__u64 file_size;
> > +	__u64 ext_list_count;
> > +	struct famfs_extent ext_list[FAMFS_MAX_EXTENTS];
> > +};
> > +
> > +#define FAMFSIOC_MAGIC 'u'
>
> This 'u' value should be documented in
> Documentation/userspace-api/ioctl/ioctl-number.rst.
>
> and if possible, you might want to use values like 0x5x or 0x8x
> that don't conflict with the ioctl numbers that are already used
> in the 'u' space.

Will do. I was trying to be too clever there, invoking "mu" for micron.

> > +
> > +/* famfs file ioctl opcodes */
> > +#define FAMFSIOC_MAP_CREATE _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> > +#define FAMFSIOC_MAP_GET _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> > +#define FAMFSIOC_MAP_GETEXT _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> > +#define FAMFSIOC_NOP _IO(FAMFSIOC_MAGIC, 4)
> > +
> > +#endif /* FAMFS_IOCTL_H */
>
> --
> #Randy

Thank you for taking the time to look it over, Randy.

John
Re: [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing
On 24/02/23 05:50PM, Randy Dunlap wrote: > Hi, > > On 2/23/24 09:42, John Groves wrote: > > Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile > > > > Signed-off-by: John Groves > > --- > > fs/Kconfig| 2 ++ > > fs/Makefile | 1 + > > fs/famfs/Kconfig | 10 ++ > > fs/famfs/Makefile | 5 + > > 4 files changed, 18 insertions(+) > > create mode 100644 fs/famfs/Kconfig > > create mode 100644 fs/famfs/Makefile > > > > diff --git a/fs/Kconfig b/fs/Kconfig > > index 89fdbefd1075..8a11625a54a2 100644 > > --- a/fs/Kconfig > > +++ b/fs/Kconfig > > @@ -141,6 +141,8 @@ source "fs/autofs/Kconfig" > > source "fs/fuse/Kconfig" > > source "fs/overlayfs/Kconfig" > > > > +source "fs/famfs/Kconfig" > > + > > menu "Caches" > > > > source "fs/netfs/Kconfig" > > diff --git a/fs/Makefile b/fs/Makefile > > index c09016257f05..382c1ea4f4c3 100644 > > --- a/fs/Makefile > > +++ b/fs/Makefile > > @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ > > obj-$(CONFIG_EROFS_FS) += erofs/ > > obj-$(CONFIG_VBOXSF_FS)+= vboxsf/ > > obj-$(CONFIG_ZONEFS_FS)+= zonefs/ > > +obj-$(CONFIG_FAMFS) += famfs/ > > diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig > > new file mode 100644 > > index ..e450928d8912 > > --- /dev/null > > +++ b/fs/famfs/Kconfig > > @@ -0,0 +1,10 @@ > > + > > + > > +config FAMFS > > + tristate "famfs: shared memory file system" > > + depends on DEV_DAX && FS_DAX > > + help > > + Support for the famfs file system. Famfs is a dax file system that > > +can support scale-out shared access to fabric-attached memory > > +(e.g. CXL shared memory). Famfs is not a general purpose file system; > > +it is an enabler for data sets in shared memory. > > Please use one tab + 2 spaces to indent help text (below the "help" keyword) > as documented in Documentation/process/coding-style.rst. Will do, thank you! John
Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
Hi John,

On 2/23/24 18:23, John Groves wrote:
>>> +
>>> +#define FAMFSIOC_MAGIC 'u'
>> This 'u' value should be documented in
>> Documentation/userspace-api/ioctl/ioctl-number.rst.
>>
>> and if possible, you might want to use values like 0x5x or 0x8x
>> that don't conflict with the ioctl numbers that are already used
>> in the 'u' space.
> Will do. I was trying to be too clever there, invoking "mu" for
> micron.

I might have been unclear about this one.
It's OK to use 'u' but the values 1-4 below conflict in the 'u' space:

'u'   00-1F  linux/smb_fs.h     gone
'u'   20-3F  linux/uvcvideo.h   USB video class host driver
'u'   40-4f  linux/udmabuf.h

so if you could use
'u'   50-5f
or
'u'   80-8f

then those conflicts wouldn't be there.
HTH.

>>> +
>>> +/* famfs file ioctl opcodes */
>>> +#define FAMFSIOC_MAP_CREATE _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
>>> +#define FAMFSIOC_MAP_GET _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
>>> +#define FAMFSIOC_MAP_GETEXT _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
>>> +#define FAMFSIOC_NOP _IO(FAMFSIOC_MAGIC, 4)

-- 
#Randy
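
To make the numbering point concrete, the following is a small user-space sketch of how the magic and command number are packed into an ioctl value, and how one might check a number against the 'u' ranges listed in the mail above. It is illustrative only: the DEMO_* names, struct demo_ioc_map, and the choice of 0x50 and 0x53 are hypothetical and are not the numbers famfs ended up using. Only _IO, _IOW, _IOC_TYPE and _IOC_NR come from the kernel's uapi headers.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical stand-in payload; not the real struct famfs_ioc_map. */
struct demo_ioc_map {
	__u64 file_size;
	__u64 ext_list_count;
};

#define DEMO_IOC_MAGIC 'u'
/* Hypothetical: command numbers placed in the 0x50-0x5f window. */
#define DEMO_IOC_MAP_CREATE _IOW(DEMO_IOC_MAGIC, 0x50, struct demo_ioc_map)
#define DEMO_IOC_NOP        _IO(DEMO_IOC_MAGIC, 0x53)

/*
 * Nonzero if a 'u' command number falls inside the ranges the mail lists
 * as already taken (00-1F, 20-3F, 40-4f), i.e. anything up to 0x4f.
 */
static int nr_in_listed_u_ranges(unsigned int cmd)
{
	assert(_IOC_TYPE(cmd) == 'u');
	return _IOC_NR(cmd) <= 0x4f;
}
```

With this encoding, _IO('u', 1) through _IO('u', 4) land in the 00-1F range, while the hypothetical 0x50-based values do not, which is the distinction being drawn in the review comment.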
Re: [RFC PATCH 16/20] famfs: Add fault counters
On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote: > Certainly something like that would have satisified this sanity test use > case. I will note that mm_account_fault() would need some help to figure > out the size of the page table entry that got installed. Maybe > extensions to vm_fault_reason to add VM_FAULT_P*D? That compliments > VM_FAULT_FALLBACK to indicate whether, for example, the fallback went > from PUD to PMD, or all the way back to PTE. ugh, no, it's more complicated than that. look at the recent changes to set_ptes(). we can now install PTEs of many different sizes, depending on the architecture. someday i look forward to supporting all the page sizes on parisc (4k, 16k, 64k, 256k, ... 4G)
Re: [RFC PATCH 16/20] famfs: Add fault counters
Matthew Wilcox wrote: > On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote: > > Certainly something like that would have satisified this sanity test use > > case. I will note that mm_account_fault() would need some help to figure > > out the size of the page table entry that got installed. Maybe > > extensions to vm_fault_reason to add VM_FAULT_P*D? That compliments > > VM_FAULT_FALLBACK to indicate whether, for example, the fallback went > > from PUD to PMD, or all the way back to PTE. > > ugh, no, it's more complicated than that. look at the recent changes to > set_ptes(). we can now install PTEs of many different sizes, depending > on the architecture. someday i look forward to supporting all the page > sizes on parisc (4k, 16k, 64k, 256k, ... 4G) Nice! There are enough bits in vm_fault_t to represent many page sizes instead of the entry type as I suggested, but I would defer to you or Dave on how to make "installed pte size" generically traceable per Dave's suggestion.