Re: [PATCH v5] tracing: Support to dump instance traces by ftrace_dump_on_oops

2024-02-23 Thread Huang Yiwei




On 2/23/2024 9:47 AM, Steven Rostedt wrote:

On Thu, 8 Feb 2024 21:18:14 +0800
Huang Yiwei  wrote:


Currently ftrace only dumps the global trace buffer on an oops. For
debugging a production use case, instance traces are helpful for
checking specific problems, since the global trace buffer may be used
for other purposes.

This patch extends the ftrace_dump_on_oops parameter to dump a specific
trace instance or multiple trace instances:

   - ftrace_dump_on_oops=0: as before -- don't dump
   - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer
   on all CPUs
   - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global
   trace buffer on CPU that triggered the oops
   - ftrace_dump_on_oops=<instance>: new behavior -- dump the
   tracing instance matching <instance>
   - ftrace_dump_on_oops[=2/orig_cpu],<instance1>[=2/orig_cpu],
   <instance2>[=2/orig_cpu]: new behavior -- dump the global trace
   buffer and multiple instance buffers on all CPUs, or only dump on the
   CPU that triggered the oops if =2 or =orig_cpu is given

Also, the sysctl node can handle the input accordingly.

Cc: Ross Zwisler 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Huang Yiwei 


This patch failed with the following warning:

   kernel/trace/trace.c:10029:6: warning: no previous prototype for ‘ftrace_dump_one’ [-Wmissing-prototypes]

-- Steve


My bad, will add the missing 'static' keyword in next patch.

Regards,
Huang Yiwei
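
To make the parameter grammar from the commit message concrete, here is a
minimal user-space sketch in C. It is not the code from the patch: the names
dump_request, parse_mode() and parse_dump_spec() are hypothetical, and the
sketch assumes the sysctl-style form in which an optional leading mode
(0/1/2/orig_cpu) selects the global buffer and every later token names an
instance, optionally followed by =2 or =orig_cpu.

```c
#include <assert.h>
#include <string.h>

enum dump_mode { DUMP_NONE = 0, DUMP_ALL = 1, DUMP_ORIG_CPU = 2 };

/* Hypothetical record for one buffer to dump (not a kernel structure). */
struct dump_request {
	char name[32];		/* empty string stands for the global buffer */
	enum dump_mode mode;
};

/* Return 1 and set *mode if s[0..len) is a mode keyword, else return 0. */
static int parse_mode(const char *s, size_t len, enum dump_mode *mode)
{
	if (len == 1 && s[0] == '0')
		*mode = DUMP_NONE;
	else if (len == 1 && s[0] == '1')
		*mode = DUMP_ALL;
	else if ((len == 1 && s[0] == '2') ||
		 (len == 8 && memcmp(s, "orig_cpu", 8) == 0))
		*mode = DUMP_ORIG_CPU;
	else
		return 0;
	return 1;
}

/*
 * Split "mode,inst1,inst2=orig_cpu" into requests.
 * Returns the number of requests stored, or -1 on a malformed token
 * or when more than max tokens are given.
 */
static int parse_dump_spec(const char *spec, struct dump_request *out, int max)
{
	int n = 0;

	while (*spec) {
		size_t len = strcspn(spec, ",");	/* length of this token */
		const char *eq = memchr(spec, '=', len);
		size_t name_len = eq ? (size_t)(eq - spec) : len;
		enum dump_mode mode = DUMP_ALL;		/* default: all CPUs */

		if (n == max)
			return -1;

		if (n == 0 && !eq && parse_mode(spec, len, &mode)) {
			/* A leading bare mode applies to the global buffer. */
			out[n].name[0] = '\0';
		} else {
			if (name_len == 0 || name_len >= sizeof(out[n].name))
				return -1;
			if (eq && !parse_mode(eq + 1, len - name_len - 1, &mode))
				return -1;
			memcpy(out[n].name, spec, name_len);
			out[n].name[name_len] = '\0';
		}
		out[n++].mode = mode;

		spec += len;
		if (*spec == ',')
			spec++;
	}
	return n;
}
```

With this sketch, "1,foo,bar=orig_cpu" yields three requests: the global
buffer on all CPUs, "foo" on all CPUs, and "bar" on the CPU that triggered
the oops.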



[PATCH v6] tracing: Support to dump instance traces by ftrace_dump_on_oops

2024-02-23 Thread Huang Yiwei
Currently ftrace only dumps the global trace buffer on an oops. For
debugging a production use case, instance traces are helpful for
checking specific problems, since the global trace buffer may be used
for other purposes.

This patch extends the ftrace_dump_on_oops parameter to dump a specific
trace instance or multiple trace instances:

  - ftrace_dump_on_oops=0: as before -- don't dump
  - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer
  on all CPUs
  - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global
  trace buffer on CPU that triggered the oops
  - ftrace_dump_on_oops=<instance>: new behavior -- dump the
  tracing instance matching <instance>
  - ftrace_dump_on_oops[=2/orig_cpu],<instance1>[=2/orig_cpu],
  <instance2>[=2/orig_cpu]: new behavior -- dump the global trace
  buffer and multiple instance buffers on all CPUs, or only dump on the
  CPU that triggered the oops if =2 or =orig_cpu is given

Also, the sysctl node can handle the input accordingly.

Cc: Ross Zwisler 
Signed-off-by: Huang Yiwei 
---
 .../admin-guide/kernel-parameters.txt         |  26 ++-
 Documentation/admin-guide/sysctl/kernel.rst   |  30 +++-
 include/linux/ftrace.h                        |   4 +-
 include/linux/kernel.h                        |   1 +
 kernel/sysctl.c                               |   4 +-
 kernel/trace/trace.c                          | 156 +-
 kernel/trace/trace_selftest.c                 |   2 +-
 7 files changed, 168 insertions(+), 55 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 31b3a25680d0..3d6ea8e80c2f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1561,12 +1561,28 @@
The above will cause the "foo" tracing instance to trigger
a snapshot at the end of boot up.
 
-   ftrace_dump_on_oops[=orig_cpu]
+   ftrace_dump_on_oops[=2(orig_cpu) | =<instance>][,<instance> |
+ ,<instance>=2(orig_cpu)]
[FTRACE] will dump the trace buffers on oops.
-   If no parameter is passed, ftrace will dump
-   buffers of all CPUs, but if you pass orig_cpu, it will
-   dump only the buffer of the CPU that triggered the
-   oops.
+   If no parameter is passed, ftrace will dump global
+   buffers of all CPUs, if you pass 2 or orig_cpu, it
+   will dump only the buffer of the CPU that triggered
+   the oops, or the specific instance will be dumped if
+   its name is passed. Multiple instance dump is also
+   supported, and instances are separated by commas. Each
+   instance supports only dump on CPU that triggered the
+   oops by passing 2 or orig_cpu to it.
+
+   ftrace_dump_on_oops=foo=orig_cpu
+
+   The above will dump only the buffer of "foo" instance
+   on CPU that triggered the oops.
+
+   ftrace_dump_on_oops,foo,bar=orig_cpu
+
+   The above will dump global buffer on all CPUs, the
+   buffer of "foo" instance on all CPUs and the buffer
+   of "bar" instance on CPU that triggered the oops.
 
ftrace_filter=[function-list]
[FTRACE] Limit the functions traced by the function
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 6584a1f9bfe3..ea8e5f152edc 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -296,12 +296,30 @@ kernel panic). This will output the contents of the ftrace buffers to
 the console.  This is very useful for capturing traces that lead to
 crashes and outputting them to a serial console.
 
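-= ===================================================
-0 Disabled (default).
-1 Dump buffers of all CPUs.
-2 Dump the buffer of the CPU that triggered the oops.
-= ===================================================
-
+====================== ==============================================
+0                      Disabled (default).
+1                      Dump buffers of all CPUs.
+2(orig_cpu)            Dump the buffer of the CPU that triggered the
+                       oops.
+<instance>             Dump the specific instance buffer on all CPUs.
+<instance>=2(orig_cpu) Dump the specific instance buffer on the CPU
+                       that triggered the oops.
+====================== ==============================================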
+
+Multiple instance dump is also supported, and instances are separated
+by commas. If global buffer also needs to be dumped, please specify
+the dump mode (1/2/orig_cpu) first for global buffer.
+
+So for example to dump "foo" and "bar" instance buffer on all CPUs,
+user can::
+
+ 

[RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-02-23 Thread John Groves
This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in any way.

* Famfs creates a simple access method for storing and sharing data in
  sharable memory. The memory is exposed and accessed as memory-mappable
  dax files.
* Famfs supports multiple hosts mounting the same file system from the
  same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on either a /dev/pmem device in fs-dax
  mode, or a /dev/dax device in devdax mode (the latter depending on
  patches 2-6 of this series).

The famfs kernel file system is part of the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, shared memory managed by the famfs
framework does not create a RAS "blast radius" problem that could crash
or destabilize the kernel. Poison or timeouts in famfs memory
can be expected to kill apps via SIGBUS and cause mounts to be disabled
due to memory failure notifications.

Famfs does not attempt to solve concurrency or coherency problems for apps,
although it does solve these problems in regard to its own data structures.
Apps may encounter hard concurrency problems, but there are use cases that
are eminently useful and uncomplicated from a concurrency perspective:
serial sharing is one (only one host at a time has access), and read-only
concurrent sharing is another (all hosts can read-cache without worry).

Contents:

* famfs kernel documentation [patch 1]. Note that evolving famfs user
  documentation is at [2]
* dev_dax_iomap patchset [patches 2-6] - This enables fs-dax to use the
  iomap interface via a character /dev/dax device (e.g. /dev/dax0.0). For
  historical reasons the iomap infrastructure was enabled only for
  /dev/pmem devices (which are dax block devices). As famfs is the first
  fs-dax file system that works on /dev/dax, this patch series fills in
  the bare minimum infrastructure to enable iomap api usage with /dev/dax.
* famfs patchset [patches 7-20] - this introduces the kernel component of
  famfs.

IMPORTANT NOTE: There is a developing consensus that /dev/dax requires
some fundamental re-factoring (e.g. [3]) that is related but outside the
scope of this series.

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
  System-ram gets zeroed when it is onlined, so sharing is basically
  nonsense.
* It does not make sense to put struct page's in sharable memory, because
  those can't be shared. However, separately providing non-sharable
  capacity to be used for struct page's might be a sensible approach if the
  size of struct page array for sharable memory is too large to put in
  conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
  order to gain access to data that is already in the memory. Moreover
  the power domain for shared memory is separate from that of the server.
  Having observed that, famfs is not intended for persistent storage. It is
  intended for sharing data sets in memory during a time frame where the
  memory and the compute nodes are expected to remain operational - such
  as during a clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do.
It would probably be possible to put this capability into FUSE, but we think
that keeping famfs separate from FUSE is the simpler approach.
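
As an illustration of why a cached extent list lets a fault be resolved
without an upcall, here is a minimal user-space sketch in C. It is not taken
from the series: struct extent only mirrors the offset/len pair of struct
famfs_extent from patch 07, and resolve_offset() is a hypothetical helper
that assumes extents are laid out back to back in file order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the offset/len pair of struct famfs_extent (assumption). */
struct extent {
	uint64_t offset;	/* byte offset into the dax device */
	uint64_t len;		/* extent length in bytes */
};

/*
 * Resolve a file offset to a device offset by walking the cached
 * extent list. On success, *contig_len is the number of bytes that
 * remain in the matching extent. Returns 0 on success, -1 if the
 * offset lies beyond the last extent.
 */
static int resolve_offset(const struct extent *ext, size_t count,
			  uint64_t file_off, uint64_t *dev_off,
			  uint64_t *contig_len)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (file_off < ext[i].len) {
			*dev_off = ext[i].offset + file_off;
			*contig_len = ext[i].len - file_off;
			return 0;
		}
		file_off -= ext[i].len;	/* skip past this extent */
	}
	return -1;
}
```

Because the whole list is in memory, the lookup is a short loop with no
sleeping and no round trip to user space, which is the property the
paragraph above relies on.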

This patch set is available as a branch at [5]

References

[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.st...@dwillia2-xfh.jf.intel.com/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://github.com/cxl-micron-reskit/famfs-linux

John Groves (20):
  famfs: Documentation
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since
both need it now
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
  famfs: Add include/linux/famfs_ioctl.h
  famfs: Add famfs_internal.h
  famfs: Add super_operations
  famfs: famfs_open_device() & dax_holder_operations
  famfs: Add fs_context_operations
  famfs: Add inode_operations and file_system_type
  famfs: Add iomap_ops

[RFC PATCH 01/20] famfs: Documentation

2024-02-23 Thread John Groves
Introduce Documentation/filesystems/famfs.rst into the Documentation
tree

Signed-off-by: John Groves 
---
 Documentation/filesystems/famfs.rst | 124 
 1 file changed, 124 insertions(+)
 create mode 100644 Documentation/filesystems/famfs.rst

diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
new file mode 100644
index ..c2cc50c10d03
--- /dev/null
+++ b/Documentation/filesystems/famfs.rst
@@ -0,0 +1,124 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _famfs_index:
+
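+==================================================================
+famfs: The kernel component of the famfs shared memory file system
+==================================================================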
+
+- Copyright (C) 2024 Micron Technology, Inc.
+
+Introduction
+============
+Compute Express Link (CXL) provides a mechanism for disaggregated or
+fabric-attached memory (FAM). This creates opportunities for data sharing;
+clustered apps that would otherwise have to shard or replicate data can
+share one copy in disaggregated memory.
+
+Famfs, which is not CXL-specific in any way, provides a mechanism for
+multiple hosts to use data in shared memory, by giving it a file system
+interface. With famfs, any app that understands files (which is all of
+them, right?) can access data sets in shared memory. Although famfs
+supports read and write calls, the real point is to support mmap, which
+provides direct (dax) access to the memory - either writable or read-only.
+
+Shared memory can pose complex coherency and synchronization issues, but
+there are also simple cases. Two simple and eminently useful patterns that
+occur frequently in data analytics and AI are:
+
+* Serial Sharing - Only one host or process at a time has access to a file
+* Read-only Sharing - Multiple hosts or processes share read-only access
+  to a file
+
+The famfs kernel file system is part of the famfs framework; user space
+components [1] handle metadata allocation and distribution, and direct the
+famfs kernel module to instantiate files that map to specific memory.
+
+The famfs framework manages coherency of its own metadata and structures,
+but does not attempt to manage coherency for applications.
+
+Famfs also provides data isolation between files. That is, even though
+the host has access to an entire memory "device" (as a dax device), apps
+cannot write to memory for which the file is read-only, and mapping one
+file provides isolation from the memory of all other files. This is pretty
+basic, but some experimental shared memory usage patterns provide no such
+isolation.
+
+Principles of Operation
+=======================
+
+Without its user space components, the famfs kernel module is just a
+semi-functional clone of ramfs with latent fs-dax support. The user space
+components maintain superblocks and metadata logs, and use the famfs kernel
+component to provide a file system view of shared memory across multiple
+hosts.
+
+Each host has an independent instance of the famfs kernel module. After
+mount, files are not visible until the user space component instantiates
+them (normally by playing the famfs metadata log).
+
+Once instantiated, files on each host can point to the same shared memory,
+but in-memory metadata (inodes, etc.) is ephemeral on each host that has a
+famfs instance mounted. Like ramfs, the famfs in-kernel file system has no
+backing store for metadata modifications. If metadata is ever persisted,
+that must be done by the user space components. However, mutations to file
+data are saved to the shared memory - subject to write permission and
+processor cache behavior.
+
+
+Famfs is Not a Conventional File System
+---------------------------------------
+
+Famfs files can be accessed by conventional means, but there are
+limitations. The kernel component of famfs is not involved in the
+allocation of backing memory for files at all; the famfs user space
+creates files and passes the allocation extent lists into the kernel via
+the per-file FAMFSIOC_MAP_CREATE ioctl. A file that lacks this metadata is
+treated as invalid by the famfs kernel module. As a practical matter files
+must be created via the famfs library or cli, but they can be consumed as
+if they were conventional files.
+
+Famfs differs in some important ways from conventional file systems:
+
+* Files must be pre-allocated by the famfs framework; allocation is never
+  performed on write.
+* Any operation that changes a file's size is considered to put the file
+  in an invalid state, disabling access to the data. It may be possible to
+  revisit this in the future.
+* (Typically the famfs user space can restore files to a valid state by
+  replaying the famfs metadata log.)
+
+Famfs exists to apply the existing file system abstractions on top of
+shared memory so applications and workflows can more easily consume it.
+
+Key Requirements
+
+
+The primary requirements for famfs are:
+
+1. Must sup

[RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage

2024-02-23 Thread John Groves
This function should be called by fs-dax file systems after opening the
devdax device. This adds holder_operations.

This function serves the same role as fs_dax_get_by_bdev(), which dax
file systems call after opening the pmem block device.

Signed-off-by: John Groves 
---
 drivers/dax/super.c | 38 ++
 include/linux/dax.h |  5 +
 2 files changed, 43 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index f4b635526345..fc96362de237 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
 EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+/**
+ * fs_dax_get()
+ *
+ * fs-dax file systems call this function to prepare to use a devdax device for fsdax.
+ * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there
+ * is no bdev). The holder makes this exclusive.
+ *
+ * @dax_dev: dev to be prepared for fs-dax usage
+ * @holder: filesystem or mapped device inside the dax_device
+ * @hops: operations for the inner holder
+ *
+ * Returns: 0 on success, -1 on failure
+ */
+int fs_dax_get(
+   struct dax_device *dax_dev,
+   void *holder,
+   const struct dax_holder_operations *hops)
+{
+   if (!dax_dev || !dax_alive(dax_dev))
+   return -1;
+
+   /* dax_dev->ops should have been populated by devm_create_dev_dax() */
+   if (WARN_ON(!dax_dev->ops) || !igrab(&dax_dev->inode))
+   return -1;
+
+   if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
+   pr_warn("%s: holder_data already set\n", __func__);
+   return -1;
+   }
+   dax_dev->holder_ops = hops;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(fs_dax_get);
+#endif /* DEV_DAX_IOMAP */
+
 enum dax_device_flags {
/* !alive + rcu grace period == no new operations / mappings */
DAXDEV_ALIVE,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..e973289bfde3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,7 +57,12 @@ struct dax_holder_operations {
 
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
+#endif
 void *dax_holder(struct dax_device *dax_dev);
+struct dax_device *inode_dax(struct inode *inode);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
-- 
2.43.0
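
The exclusive-holder step in fs_dax_get() relies on a compare-and-exchange
against NULL. The following user-space sketch shows the same idea with C11
atomics. It is only an analogy under stated assumptions: struct
fake_dax_device and claim_holder() are hypothetical stand-ins, and
atomic_compare_exchange_strong() replaces the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the holder fields of struct dax_device. */
struct fake_dax_device {
	_Atomic(void *) holder_data;
	const void *holder_ops;
};

/*
 * Claim the device for one holder. The compare-and-exchange only
 * succeeds while holder_data is still NULL, so a second caller loses.
 * Returns 0 on success, -1 on failure.
 */
static int claim_holder(struct fake_dax_device *dev, void *holder,
			const void *ops)
{
	void *expected = NULL;

	/* Validate the pointer before touching any of its fields. */
	if (!dev || !holder)
		return -1;

	if (!atomic_compare_exchange_strong(&dev->holder_data, &expected,
					    holder))
		return -1;	/* someone else already holds the device */

	dev->holder_ops = ops;
	return 0;
}
```

The first caller wins and every later caller gets -1 until the holder is
cleared, which is what makes the claim exclusive.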




[RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now

2024-02-23 Thread John Groves
bus.c can't call functions in device.c - that creates a circular linkage
dependency.

Signed-off-by: John Groves 
---
 drivers/dax/bus.c| 24 
 drivers/dax/device.c | 23 ---
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ff1ab5fa105..664e8c1b9930 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1325,6 +1325,30 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
 };
 
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c  */
+__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+ unsigned long size)
+{
+   int i;
+
+   for (i = 0; i < dev_dax->nr_range; i++) {
+   struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+   struct range *range = &dax_range->range;
+   unsigned long long pgoff_end;
+   phys_addr_t phys;
+
+   pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+   if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+   continue;
+   phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+   if (phys + size - 1 <= range->end)
+   return phys;
+   break;
+   }
+   return -1;
+}
+EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dax_region *dax_region = data->dax_region;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 93ebedc5ec8c..40ba660013cf 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -50,29 +50,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return 0;
 }
 
-/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
-__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
-   unsigned long size)
-{
-   int i;
-
-   for (i = 0; i < dev_dax->nr_range; i++) {
-   struct dev_dax_range *dax_range = &dev_dax->ranges[i];
-   struct range *range = &dax_range->range;
-   unsigned long long pgoff_end;
-   phys_addr_t phys;
-
-   pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
-   if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
-   continue;
-   phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
-   if (phys + size - 1 <= range->end)
-   return phys;
-   break;
-   }
-   return -1;
-}
-
 static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
  unsigned long fault_size)
 {
-- 
2.43.0
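
For readers who want to trace the arithmetic in dax_pgoff_to_phys(), here is
a user-space sketch of the same range walk. The types are simplified
assumptions: struct sketch_range flattens dev_dax_range and range into one
record, SKETCH_PAGE_SHIFT assumes 4 KiB pages, and UINT64_MAX stands in for
the kernel function's -1 return.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Simplified stand-in for struct dev_dax_range plus struct range. */
struct sketch_range {
	uint64_t pgoff;		/* first page offset covered by this range */
	uint64_t start;		/* first physical byte */
	uint64_t end;		/* last physical byte, inclusive */
};

/*
 * Translate a device page offset to a physical address, requiring
 * that size bytes starting there fit inside the matching range.
 * Returns UINT64_MAX when no range satisfies the request.
 */
static uint64_t pgoff_to_phys(const struct sketch_range *r, int nr,
			      uint64_t pgoff, uint64_t size)
{
	int i;

	for (i = 0; i < nr; i++) {
		uint64_t pages = (r[i].end - r[i].start + 1) >> SKETCH_PAGE_SHIFT;
		uint64_t pgoff_end = r[i].pgoff + pages - 1;
		uint64_t phys;

		if (pgoff < r[i].pgoff || pgoff > pgoff_end)
			continue;
		phys = ((pgoff - r[i].pgoff) << SKETCH_PAGE_SHIFT) + r[i].start;
		if (phys + size - 1 <= r[i].end)
			return phys;
		break;	/* right range, but the request spills past its end */
	}
	return UINT64_MAX;
}
```

Note that a request which starts in one range and would run into the next
is rejected rather than split, matching the break in the original loop.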




[RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap

2024-02-23 Thread John Groves
Save the kva from memremap because we need it for iomap rw support

Prior to famfs, there were no iomap users of /dev/dax - so the virtual
address from memremap was not needed.

Also: in some cases dev_dax_probe() is called with the first
dev_dax->range offset past pgmap[0].range. In those cases we need to
add the difference to virt_addr in order to have the physaddr's in
dev_dax->ranges match dev_dax->virt_addr.

Dragons...

Signed-off-by: John Groves 
---
 drivers/dax/dax-private.h |  1 +
 drivers/dax/device.c  | 15 +++
 2 files changed, 16 insertions(+)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 446617b73aea..894eb1c66b4a 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -63,6 +63,7 @@ struct dax_mapping {
 struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
+   u64 virt_addr;
unsigned int align;
int target_node;
bool dyn_id;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 40ba660013cf..6cd79d00fe1b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
struct dax_device *dax_dev = dev_dax->dax_dev;
struct device *dev = &dev_dax->dev;
struct dev_pagemap *pgmap;
+   u64 data_offset = 0;
struct inode *inode;
struct cdev *cdev;
void *addr;
@@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
if (IS_ERR(addr))
return PTR_ERR(addr);
 
+   /* Detect whether the data is at a non-zero offset into the memory */
+   if (pgmap->range.start != dev_dax->ranges[0].range.start) {
+   u64 phys = (u64)dev_dax->ranges[0].range.start;
+   u64 pgmap_phys = (u64)dev_dax->pgmap[0].range.start;
+   u64 vmemmap_shift = (u64)dev_dax->pgmap[0].vmemmap_shift;
+
+   if (!WARN_ON(pgmap_phys > phys))
+   data_offset = phys - pgmap_phys;
+
+   pr_notice("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
+  __func__, phys, pgmap_phys, data_offset, vmemmap_shift);
+   }
+   dev_dax->virt_addr = (u64)addr + data_offset;
+
inode = dax_inode(dax_dev);
cdev = inode->i_cdev;
cdev_init(cdev, &dax_fops);
-- 
2.43.0
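
The offset adjustment in dev_dax_probe() can be checked with plain integer
arithmetic. This is a user-space sketch only: data_virt_addr() is a
hypothetical helper, and its parameters stand in for the memremap result,
pgmap[0].range.start and ranges[0].range.start from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the virtual address of the first data byte. When the data
 * range starts after the start of the page map, the difference is
 * added to the mapped address; otherwise the offset stays zero, as in
 * the patch.
 */
static uint64_t data_virt_addr(uint64_t mapped_virt, uint64_t pgmap_phys,
			       uint64_t range_phys)
{
	uint64_t data_offset = 0;

	if (range_phys >= pgmap_phys)
		data_offset = range_phys - pgmap_phys;

	return mapped_virt + data_offset;
}
```

For example, a data range that starts 2 MiB into the page map moves the
saved virtual address forward by the same 2 MiB.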




[RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax

2024-02-23 Thread John Groves
Notes about this commit:

* These methods are based somewhat loosely on pmem_dax_ops from
  drivers/nvdimm/pmem.c

* dev_dax_direct_access() returns the hpa, pfn and kva. The kva was
  newly stored as dev_dax->virt_addr by dev_dax_probe().

* The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
  for read/write (dax_iomap_rw())

* dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
  tested yet. I'm looking for suggestions as to how to test those.

Signed-off-by: John Groves 
---
 drivers/dax/bus.c | 107 ++
 1 file changed, 107 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 664e8c1b9930..06fcda810674 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -10,6 +10,12 @@
 #include "dax-private.h"
 #include "bus.h"
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+#include 
+#include 
+#include 
+#endif
+
 static DEFINE_MUTEX(dax_bus_lock);
 
 #define DAX_NAME_LEN 30
@@ -1349,6 +1355,101 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+static void write_dax(void *pmem_addr, struct page *page,
+   unsigned int off, unsigned int len)
+{
+   unsigned int chunk;
+   void *mem;
+
+   while (len) {
+   mem = kmap_local_page(page);
+   chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+   memcpy_flushcache(pmem_addr, mem + off, chunk);
+   kunmap_local(mem);
+   len -= chunk;
+   off = 0;
+   page++;
+   pmem_addr += chunk;
+   }
+}
+
+static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+long nr_pages, enum dax_access_mode mode, void **kaddr,
+pfn_t *pfn)
+{
+   struct dev_dax *dev_dax = dax_get_private(dax_dev);
+   size_t dax_size = dev_dax_size(dev_dax);
+   size_t size = nr_pages << PAGE_SHIFT;
+   size_t offset = pgoff << PAGE_SHIFT;
+   phys_addr_t phys;
+   u64 virt_addr = dev_dax->virt_addr + offset;
+   pfn_t local_pfn;
+   u64 flags = PFN_DEV|PFN_MAP;
+
+   WARN_ON(!dev_dax->virt_addr); /* virt_addr must be saved for direct_access */
+
+   phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+   if (kaddr)
+   *kaddr = (void *)virt_addr;
+
+   local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
+   if (pfn)
+   *pfn = local_pfn;
+
+   /* This is the valid size at the specified address */
+   return PHYS_PFN(min_t(size_t, size, dax_size - offset));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+   size_t nr_pages)
+{
+   long resid = nr_pages << PAGE_SHIFT;
+   long offset = pgoff << PAGE_SHIFT;
+
+   /* Break into one write per dax region */
+   while (resid > 0) {
+   void *kaddr;
+   pgoff_t poff = offset >> PAGE_SHIFT;
+   long len = __dev_dax_direct_access(dax_dev, poff,
+  nr_pages, DAX_ACCESS, 
&kaddr, NULL);
+   len = min_t(long, len, PAGE_SIZE);
+   write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+   offset += len;
+   resid  -= len;
+   }
+   return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+   pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+   void **kaddr, pfn_t *pfn)
+{
+   return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   size_t len, off;
+
+   off = offset_in_page(addr);
+   len = PFN_PHYS(PFN_UP(off + bytes));
+
+   return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+   .direct_access = dev_dax_direct_access,
+   .zero_page_range = dev_dax_zero_page_range,
+   .recovery_write = dev_dax_recovery_write,
+};
+
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dax_region *dax_region = data->dax_region;
@@ -1404,11 +1505,17 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
}
}
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+   /* holder_ops currently populated separately in a slightly hacky way */
+   dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
+#else
/*
 * No dax_operations since there is no access to this device outside of
 * mmap of the resulting character device.
 */
dax_dev = alloc_dax(dev_dax, NULL);
+#endif
+
if (IS_ERR(dax_dev)) {
  

[RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter

2024-02-23 Thread John Groves
Add the CONFIG_DEV_DAX_IOMAP kernel config parameter to control building
of the iomap functionality to support fsdax on devdax.

Signed-off-by: John Groves 
---
 drivers/dax/Kconfig | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..b1ebcc77120b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
 
  Say N if unsure.
 
+config DEV_DAX_IOMAP
+   depends on DEV_DAX && DAX
+   def_bool y
+   help
+ Support iomap mapping of devdax devices (for FS-DAX file
+ systems that reside on character /dev/dax devices)
 endif
-- 
2.43.0




[RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h

2024-02-23 Thread John Groves
Add uapi include file for famfs. The famfs user space uses ioctl on
individual files to pass in mapping information and file size. This
would be hard to do via sysfs or other means, since it's
file-specific.

Signed-off-by: John Groves 
---
 include/uapi/linux/famfs_ioctl.h | 56 
 1 file changed, 56 insertions(+)
 create mode 100644 include/uapi/linux/famfs_ioctl.h

diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
new file mode 100644
index ..6b3e6452d02f
--- /dev/null
+++ b/include/uapi/linux/famfs_ioctl.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_IOCTL_H
+#define FAMFS_IOCTL_H
+
+#include 
+#include 
+
+#define FAMFS_MAX_EXTENTS 2
+
+enum extent_type {
+   SIMPLE_DAX_EXTENT = 13,
+   INVALID_EXTENT_TYPE,
+};
+
+struct famfs_extent {
+   __u64  offset;
+   __u64  len;
+};
+
+enum famfs_file_type {
+   FAMFS_REG,
+   FAMFS_SUPERBLOCK,
+   FAMFS_LOG,
+};
+
+/**
+ * struct famfs_ioc_map
+ *
+ * This is the metadata that indicates where the memory is for a famfs file
+ */
+struct famfs_ioc_map {
+   enum extent_type  extent_type;
+   enum famfs_file_type  file_type;
+   __u64 file_size;
+   __u64 ext_list_count;
+   struct famfs_extent   ext_list[FAMFS_MAX_EXTENTS];
+};
+
+#define FAMFSIOC_MAGIC 'u'
+
+/* famfs file ioctl opcodes */
+#define FAMFSIOC_MAP_CREATE _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GET    _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GETEXT _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
+#define FAMFSIOC_NOP        _IO(FAMFSIOC_MAGIC,  4)
+
+#endif /* FAMFS_IOCTL_H */
-- 
2.43.0
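
To show how user space might populate the mapping record before issuing
FAMFSIOC_MAP_CREATE, here is a self-contained sketch. It does not include
the real header: the sketch_* types merely mirror the layout of
struct famfs_ioc_map and struct famfs_extent above, build_map() is a
hypothetical helper, and the rule that the extents must cover file_size is
an assumption of this sketch, not something the patch states.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_EXTENTS 2	/* mirrors FAMFS_MAX_EXTENTS */

enum sketch_extent_type { SKETCH_SIMPLE_DAX_EXTENT = 13 };
enum sketch_file_type { SKETCH_REG, SKETCH_SUPERBLOCK, SKETCH_LOG };

struct sketch_extent {
	uint64_t offset;
	uint64_t len;
};

/* Local mirror of struct famfs_ioc_map (assumed layout). */
struct sketch_ioc_map {
	enum sketch_extent_type extent_type;
	enum sketch_file_type file_type;
	uint64_t file_size;
	uint64_t ext_list_count;
	struct sketch_extent ext_list[SKETCH_MAX_EXTENTS];
};

/*
 * Fill in a map for a regular file. Returns 0 on success, -1 if the
 * extent count is out of range or the extents are shorter than the
 * file (the coverage check is an assumption of this sketch).
 */
static int build_map(struct sketch_ioc_map *map, uint64_t file_size,
		     const struct sketch_extent *ext, uint64_t count)
{
	uint64_t i, total = 0;

	if (count == 0 || count > SKETCH_MAX_EXTENTS)
		return -1;
	for (i = 0; i < count; i++)
		total += ext[i].len;
	if (total < file_size)
		return -1;

	memset(map, 0, sizeof(*map));
	map->extent_type = SKETCH_SIMPLE_DAX_EXTENT;
	map->file_type = SKETCH_REG;
	map->file_size = file_size;
	map->ext_list_count = count;
	memcpy(map->ext_list, ext, count * sizeof(*ext));
	return 0;
}
```

A real caller would presumably pass the filled-in struct famfs_ioc_map to
ioctl() on the open famfs file with the FAMFSIOC_MAP_CREATE opcode; that
call is omitted here because it needs a mounted famfs instance.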




[RFC PATCH 08/20] famfs: Add famfs_internal.h

2024-02-23 Thread John Groves
Add the famfs_internal.h include file. This contains internal data
structures such as the per-file metadata structure (famfs_file_meta)
and extent formats.

Signed-off-by: John Groves 
---
 fs/famfs/famfs_internal.h | 53 +++
 1 file changed, 53 insertions(+)
 create mode 100644 fs/famfs/famfs_internal.h

diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
new file mode 100644
index ..af3990d43305
--- /dev/null
+++ b/fs/famfs/famfs_internal.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_INTERNAL_H
+#define FAMFS_INTERNAL_H
+
+#include 
+#include 
+
+#define FAMFS_MAGIC 0x87b282ff
+
+#define FAMFS_BLKDEV_MODE (FMODE_READ|FMODE_WRITE)
+
+extern const struct file_operations  famfs_file_operations;
+
+/*
+ * Each famfs dax file has this hanging from its inode->i_private.
+ */
+struct famfs_file_meta {
+   int                   error;
+   enum famfs_file_type  file_type;
+   size_t                file_size;
+   enum extent_type      tfs_extent_type;
+   size_t                tfs_extent_ct;
+   struct famfs_extent   tfs_extents[];  /* flexible array */
+};
+
+struct famfs_mount_opts {
+   umode_t mode;
+};
+
+extern const struct iomap_ops famfs_iomap_ops;
+extern const struct vm_operations_struct  famfs_file_vm_ops;
+
+#define ROOTDEV_STRLEN 80
+
+struct famfs_fs_info {
+   struct famfs_mount_opts  mount_opts;
+   struct file             *dax_filp;
+   struct dax_device       *dax_devp;
+   struct bdev_handle      *bdev_handle;
+   struct list_head         fsi_list;
+   char                    *rootdev;
+};
+
+#endif /* FAMFS_INTERNAL_H */
-- 
2.43.0




[RFC PATCH 09/20] famfs: Add super_operations

2024-02-23 Thread John Groves
Introduce the famfs superblock operations.

Signed-off-by: John Groves 
---
 fs/famfs/famfs_inode.c | 72 ++
 1 file changed, 72 insertions(+)
 create mode 100644 fs/famfs/famfs_inode.c

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
new file mode 100644
index ..3329aff000d1
--- /dev/null
+++ b/fs/famfs/famfs_inode.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "famfs_internal.h"
+
+#define FAMFS_DEFAULT_MODE 0755
+
+static const struct super_operations famfs_ops;
+static const struct inode_operations famfs_file_inode_operations;
+static const struct inode_operations famfs_dir_inode_operations;
+
+/**
+ * famfs super_operations
+ *
+ * TODO: implement a famfs_statfs() that shows size, free and available space, etc.
+ */
+
+/**
+ * famfs_show_options() - Display the mount options in /proc/mounts.
+ */
+static int famfs_show_options(
+   struct seq_file *m,
+   struct dentry   *root)
+{
+   struct famfs_fs_info *fsi = root->d_sb->s_fs_info;
+
+   if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE)
+   seq_printf(m, ",mode=%o", fsi->mount_opts.mode);
+
+   return 0;
+}
+
+static const struct super_operations famfs_ops = {
+   .statfs = simple_statfs,
+   .drop_inode = generic_delete_inode,
+   .show_options   = famfs_show_options,
+};
+
+
+MODULE_LICENSE("GPL");
-- 
2.43.0




[RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations

2024-02-23 Thread John Groves
Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces
the function that opens a block (pmem) device and the struct
dax_holder_operations that are needed for that ABI.

In this commit, support for opening character /dev/dax is stubbed. A
later commit introduces this capability.

Signed-off-by: John Groves 
---
 fs/famfs/famfs_inode.c | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 3329aff000d1..82c861998093 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = {
.show_options   = famfs_show_options,
 };
 
+/***
+ * dax_holder_operations for block dax
+ */
+
+static int
+famfs_blk_dax_notify_failure(
+   struct dax_device   *dax_devp,
+   u64 offset,
+   u64 len,
+   int mf_flags)
+{
+
+   pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
+  __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
+   return -EOPNOTSUPP;
+}
+
+const struct dax_holder_operations famfs_blk_dax_holder_ops = {
+   .notify_failure = famfs_blk_dax_notify_failure,
+};
+
+static int
+famfs_open_char_device(
+   struct super_block *sb,
+   struct fs_context  *fc)
+{
+   pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
+  __func__, fc->source);
+   return -ENODEV;
+}
+
+/**
+ * famfs_open_device()
+ *
+ * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device().
+ * Otherwise try to open it as a block/pmem device.
+ */
+static int
+famfs_open_device(
+   struct super_block *sb,
+   struct fs_context  *fc)
+{
+   struct famfs_fs_info *fsi = sb->s_fs_info;
+   struct dax_device*dax_devp;
+   u64 start_off = 0;
+   struct bdev_handle   *handlep;
+
+   if (fsi->dax_devp) {
+   pr_err("%s: already mounted\n", __func__);
+   return -EALREADY;
+   }
+
+   if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */
+   return famfs_open_char_device(sb, fc);
+
+   if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */
+   pr_err("%s: primary backing dev (%s) is not pmem\n",
+  __func__, fc->source);
+   return -EINVAL;
+   }
+
+   handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);
+   if (IS_ERR(handlep)) {
+   pr_err("%s: failed bdev_open_by_path(%s)\n", __func__, fc->source);
+   return PTR_ERR(handlep);
+   }
+
+   dax_devp = fs_dax_get_by_bdev(handlep->bdev, &start_off,
+ fsi  /* holder */,
+ &famfs_blk_dax_holder_ops);
+   if (IS_ERR(dax_devp)) {
+   pr_err("%s: unable to get daxdev from handlep->bdev\n", __func__);
+   bdev_release(handlep);
+   return -ENODEV;
+   }
+   fsi->bdev_handle = handlep;
+   fsi->dax_devp= dax_devp;
+
+   pr_notice("%s: root device is block dax (%s)\n", __func__, fc->source);
+   return 0;
+}
+
+
 
 MODULE_LICENSE("GPL");
-- 
2.43.0




[RFC PATCH 11/20] famfs: Add fs_context_operations

2024-02-23 Thread John Groves
This commit introduces the famfs fs_context_operations and
famfs_get_inode() which is used by the context operations.

Signed-off-by: John Groves 
---
 fs/famfs/famfs_inode.c | 178 +
 1 file changed, 178 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 82c861998093..f98f82962d7b 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -41,6 +41,50 @@ static const struct super_operations famfs_ops;
 static const struct inode_operations famfs_file_inode_operations;
 static const struct inode_operations famfs_dir_inode_operations;
 
+static struct inode *famfs_get_inode(
+   struct super_block *sb,
+   const struct inode *dir,
+   umode_t mode,
+   dev_t   dev)
+{
+   struct inode *inode = new_inode(sb);
+
+   if (inode) {
+   struct timespec64   tv;
+
+   inode->i_ino = get_next_ino();
+   inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
+   inode->i_mapping->a_ops = &ram_aops;
+   mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+   mapping_set_unevictable(inode->i_mapping);
+   tv = inode_set_ctime_current(inode);
+   inode_set_mtime_to_ts(inode, tv);
+   inode_set_atime_to_ts(inode, tv);
+
+   switch (mode & S_IFMT) {
+   default:
+   init_special_inode(inode, mode, dev);
+   break;
+   case S_IFREG:
+   inode->i_op = &famfs_file_inode_operations;
+   inode->i_fop = &famfs_file_operations;
+   break;
+   case S_IFDIR:
+   inode->i_op = &famfs_dir_inode_operations;
+   inode->i_fop = &simple_dir_operations;
+
+   /* Directory inodes start off with i_nlink == 2 (for "." entry) */
+   inc_nlink(inode);
+   break;
+   case S_IFLNK:
+   inode->i_op = &page_symlink_inode_operations;
+   inode_nohighmem(inode);
+   break;
+   }
+   }
+   return inode;
+}
+
 
/**
  * famfs super_operations
  *
@@ -150,6 +194,140 @@ famfs_open_device(
return 0;
 }
 
+/*
+ * fs_context_operations
+ */
+static int
+famfs_fill_super(
+   struct super_block *sb,
+   struct fs_context  *fc)
+{
+   struct famfs_fs_info *fsi = sb->s_fs_info;
+   struct inode *inode;
+   int rc = 0;
+
+   sb->s_maxbytes  = MAX_LFS_FILESIZE;
+   sb->s_blocksize = PAGE_SIZE;
+   sb->s_blocksize_bits= PAGE_SHIFT;
+   sb->s_magic = FAMFS_MAGIC;
+   sb->s_op= &famfs_ops;
+   sb->s_time_gran = 1;
+
+   rc = famfs_open_device(sb, fc);
+   if (rc)
+   goto out;
+
+   inode = famfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
+   sb->s_root = d_make_root(inode);
+   if (!sb->s_root)
+   rc = -ENOMEM;
+
+out:
+   return rc;
+}
+
+enum famfs_param {
+   Opt_mode,
+   Opt_dax,
+};
+
+const struct fs_parameter_spec famfs_fs_parameters[] = {
+   fsparam_u32oct("mode",Opt_mode),
+   fsparam_string("dax", Opt_dax),
+   {}
+};
+
+static int famfs_parse_param(
+   struct fs_context   *fc,
+   struct fs_parameter *param)
+{
+   struct famfs_fs_info *fsi = fc->s_fs_info;
+   struct fs_parse_result result;
+   int opt;
+
+   opt = fs_parse(fc, famfs_fs_parameters, param, &result);
+   if (opt == -ENOPARAM) {
+   opt = vfs_parse_fs_param_source(fc, param);
+   if (opt != -ENOPARAM)
+   return opt;
+
+   return 0;
+   }
+   if (opt < 0)
+   return opt;
+
+   switch (opt) {
+   case Opt_mode:
+   fsi->mount_opts.mode = result.uint_32 & S_IALLUGO;
+   break;
+   case Opt_dax:
+   if (strcmp(param->string, "always"))
+   pr_notice("%s: invalid dax mode %s\n",
+ __func__, param->string);
+   break;
+   }
+
+   return 0;
+}
+
+static DEFINE_MUTEX(famfs_context_mutex);
+static LIST_HEAD(famfs_context_list);
+
+static int famfs_get_tree(struct fs_context *fc)
+{
+   struct famfs_fs_info *fsi_entry;
+   struct famfs_fs_info *fsi = fc->s_fs_info;
+
+   fsi->rootdev = kstrdup(fc->source, GFP_KERNEL);
+   if (!fsi->rootdev)
+   return -ENOMEM;
+
+   /* Fail if famfs is already mounted from the same device */
+   mutex_lock(&famfs_context_mutex);
+   list_for_each_entry(fsi_entry, &famfs_conte

[RFC PATCH 12/20] famfs: Add inode_operations and file_system_type

2024-02-23 Thread John Groves
This commit introduces the famfs inode_operations. There is nothing really
unique to famfs here in the inode_operations.

This commit also introduces the famfs_file_system_type struct and the
famfs_kill_sb() function.

Signed-off-by: John Groves 
---
 fs/famfs/famfs_inode.c | 132 +
 1 file changed, 132 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index f98f82962d7b..ab46ec50b70d 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -85,6 +85,109 @@ static struct inode *famfs_get_inode(
return inode;
 }
 
+/***
+ * famfs inode_operations: these are currently pretty much boilerplate
+ */
+
+static const struct inode_operations famfs_file_inode_operations = {
+   /* All generic */
+   .setattr   = simple_setattr,
+   .getattr   = simple_getattr,
+};
+
+
+/*
+ * File creation. Allocate an inode, and we're done..
+ */
+/* SMP-safe */
+static int
+famfs_mknod(
+   struct mnt_idmap *idmap,
+   struct inode *dir,
+   struct dentry*dentry,
+   umode_t   mode,
+   dev_t dev)
+{
+   struct inode *inode = famfs_get_inode(dir->i_sb, dir, mode, dev);
+   int error   = -ENOSPC;
+
+   if (inode) {
+   struct timespec64   tv;
+
+   d_instantiate(dentry, inode);
+   dget(dentry);   /* Extra count - pin the dentry in core */
+   error = 0;
+   tv = inode_set_ctime_current(inode);
+   inode_set_mtime_to_ts(inode, tv);
+   inode_set_atime_to_ts(inode, tv);
+   }
+   return error;
+}
+
+static int famfs_mkdir(
+   struct mnt_idmap *idmap,
+   struct inode *dir,
+   struct dentry*dentry,
+   umode_t   mode)
+{
+   int retval = famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFDIR, 0);
+
+   if (!retval)
+   inc_nlink(dir);
+
+   return retval;
+}
+
+static int famfs_create(
+   struct mnt_idmap *idmap,
+   struct inode *dir,
+   struct dentry*dentry,
+   umode_t   mode,
+   bool  excl)
+{
+   return famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
+}
+
+static int famfs_symlink(
+   struct mnt_idmap *idmap,
+   struct inode *dir,
+   struct dentry*dentry,
+   const char   *symname)
+{
+   struct inode *inode;
+   int error = -ENOSPC;
+
+   inode = famfs_get_inode(dir->i_sb, dir, S_IFLNK | 0777, 0);
+   if (inode) {
+   int l = strlen(symname)+1;
+
+   error = page_symlink(inode, symname, l);
+   if (!error) {
+   struct timespec64   tv;
+
+   d_instantiate(dentry, inode);
+   dget(dentry);
+   tv = inode_set_ctime_current(inode);
+   inode_set_mtime_to_ts(inode, tv);
+   inode_set_atime_to_ts(inode, tv);
+   } else
+   iput(inode);
+   }
+   return error;
+}
+
+static const struct inode_operations famfs_dir_inode_operations = {
+   .create = famfs_create,
+   .lookup = simple_lookup,
+   .link   = simple_link,
+   .unlink = simple_unlink,
+   .symlink= famfs_symlink,
+   .mkdir  = famfs_mkdir,
+   .rmdir  = simple_rmdir,
+   .mknod  = famfs_mknod,
+   .rename = simple_rename,
+};
+
 
/**
  * famfs super_operations
  *
@@ -329,5 +432,34 @@ static int famfs_init_fs_context(struct fs_context *fc)
return 0;
 }
 
+static void famfs_kill_sb(struct super_block *sb)
+{
+   struct famfs_fs_info *fsi = sb->s_fs_info;
+
+   mutex_lock(&famfs_context_mutex);
+   list_del(&fsi->fsi_list);
+   mutex_unlock(&famfs_context_mutex);
+
+   if (fsi->bdev_handle)
+   bdev_release(fsi->bdev_handle);
+   if (fsi->dax_devp)
+   fs_put_dax(fsi->dax_devp, fsi);
+   if (fsi->dax_filp) /* This only happens if it's char dax */
+   filp_close(fsi->dax_filp, NULL);
+
+   if (fsi && fsi->rootdev)
+   kfree(fsi->rootdev);
+   kfree(fsi);
+   kill_litter_super(sb);
+}
+
+#define MODULE_NAME "famfs"
+static struct file_system_type famfs_fs_type = {
+   .name = MODULE_NAME,
+   .init_fs_context  = famfs_init_fs_context,
+   .parameters   = famfs_fs_parameters,
+   .kill_sb  = famfs_kill_sb,
+   .fs_flags = FS_USERNS_MOUNT,
+};
 
 MODULE_LICENSE("GPL");
-- 
2.43.0




[RFC PATCH 13/20] famfs: Add iomap_ops

2024-02-23 Thread John Groves
This commit introduces the famfs iomap_ops. When either
dax_iomap_fault() or dax_iomap_rw() is called, we get a callback
via our iomap_begin() handler. The question being asked is
"please resolve (file, offset) to (daxdev, offset)". The function
famfs_meta_to_dax_offset() does this.

The per-file metadata is just an extent list to the
backing dax dev.  The order of this resolution is O(N) for N
extents. Note with the current user space, files usually have
only one extent.
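
The resolution step described above can be sketched in user-space C. This
is only an illustration under stated assumptions, not the patch's code:
struct extent, struct mapping and resolve_offset() are hypothetical
stand-ins for the famfs metadata, struct iomap and
famfs_meta_to_dax_offset(). The list is walked once, hence O(N) for N
extents, and the returned length is trimmed to what remains in the
matching extent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct extent {
	uint64_t offset; /* offset of the extent within the dax device */
	uint64_t len;    /* length of the extent in bytes */
};

struct mapping {
	uint64_t dev_offset; /* resolved offset within the dax device */
	uint64_t length;     /* contiguous bytes at dev_offset; 0 = past EOF */
};

/*
 * Resolve (file_offset, len) against an ordered extent list.
 * Walks the list once, so the cost is O(N) for N extents.
 */
static struct mapping
resolve_offset(const struct extent *ext, size_t ext_count,
	       uint64_t file_offset, uint64_t len)
{
	struct mapping m = { .dev_offset = 0, .length = 0 };
	uint64_t local_offset = file_offset;
	size_t i;

	for (i = 0; i < ext_count; i++) {
		if (local_offset < ext[i].len) {
			/* The data of interest starts in this extent. */
			uint64_t remainder = ext[i].len - local_offset;

			m.dev_offset = ext[i].offset + local_offset;
			m.length = len < remainder ? len : remainder;
			return m;
		}
		local_offset -= ext[i].len; /* skip this extent */
	}
	return m; /* offset is past the last extent: zero-length mapping */
}
```

With the usual single-extent file the loop body runs once. A caller that
gets back a length shorter than it asked for is expected to call again
for the remainder, which is how the iomap helpers iterate.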

Signed-off-by: John Groves 
---
 fs/famfs/famfs_file.c | 245 ++
 1 file changed, 245 insertions(+)
 create mode 100644 fs/famfs/famfs_file.c

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
new file mode 100644
index ..fc667d5f7be8
--- /dev/null
+++ b/fs/famfs/famfs_file.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "famfs_internal.h"
+
+/*
+ * iomap_operations
+ *
+ * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
+ * offsets within a dax device.
+ */
+
+/**
+ * famfs_meta_to_dax_offset()
+ *
+ * This function is called by famfs_iomap_begin() to resolve an offset in a file to
+ * an offset in a dax device. This is upcalled from dax from calls to both
+ * dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving a fault to
+ * a specific physical page (the fault case) or doing a memcpy variant (the rw case)
+ *
+ * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PUD (1GiB)
+ * (these sizes are for X86; may vary on other cpu architectures)
+ *
+ * @inode  - the file where the fault occurred
+ * @iomap  - struct iomap to be filled in to indicate where to find the right memory, relative
+ *   to a dax device.
+ * @offset - the offset within the file where the fault occurred (will be page boundary)
+ * @len    - the length of the faulted mapping (will be a page multiple)
+ *   (will be trimmed in *iomap if it's disjoint in the extent list)
+ * @flags
+ */
+static int
+famfs_meta_to_dax_offset(
+   struct inode *inode,
+   struct iomap *iomap,
+   loff_t        offset,
+   loff_t        len,
+   unsigned int  flags)
+{
+   struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private;
+   int i;
+   loff_t local_offset = offset;
+   struct famfs_fs_info  *fsi = inode->i_sb->s_fs_info;
+
+   iomap->offset = offset; /* file offset */
+
+   for (i = 0; i < meta->tfs_extent_ct; i++) {
+   loff_t dax_ext_offset = meta->tfs_extents[i].offset;
+   loff_t dax_ext_len= meta->tfs_extents[i].len;
+
+   if ((dax_ext_offset == 0) && (meta->file_type != FAMFS_SUPERBLOCK))
+   pr_err("%s: zero offset on non-superblock file!!\n", __func__);
+
+   /* local_offset is the offset minus the size of extents skipped so far;
+* If local_offset < dax_ext_len, the data of interest starts in this extent
+*/
+   if (local_offset < dax_ext_len) {
+   loff_t ext_len_remainder = dax_ext_len - local_offset;
+
+   /*
+* OK, we found the file metadata extent where this data begins
+* @local_offset  - The offset within the current extent
+* @ext_len_remainder - Remaining length of ext after skipping local_offset
+*
+* iomap->addr is the offset within the dax device where that data
+* starts
+*/
+   iomap->addr= dax_ext_offset + local_offset; /* dax dev offset */
+   iomap->offset  = offset; /* file offset */
+   iomap->length  = min_t(loff_t, len, ext_len_remainder);
+   iomap->dax_dev = fsi->dax_devp;
+   iomap->type= IOMAP_MAPPED;
+   iomap->flags   = flags;
+
+   return 0;
+   }
+   local_offset -= dax_ext_len; /* Get ready for the next extent */
+   }
+
+   /* Set iomap to zero length in this case, and return 0
+* This just means that the r/w is past EOF
+*/
+   iomap->addr= offset;
+   iomap->offset  = offset; /* file offset */
+   iomap->length  = 0; /* this had better result in no access to dax mem */
+   iomap->dax_dev = fsi->dax_devp;
+   iomap

[RFC PATCH 14/20] famfs: Add struct file_operations

2024-02-23 Thread John Groves
This commit introduces the famfs file_operations. We call
thp_get_unmapped_area() to force PMD page alignment. Our read and
write handlers (famfs_dax_read_iter() and famfs_dax_write_iter())
call dax_iomap_rw() to do the work.

famfs_file_invalid() checks for various ways a famfs file can be
in an invalid state so we can fail I/O or fault resolution in those
cases. Those cases include the following:

* No famfs metadata
* file i_size does not match the originally allocated size
* file is not flagged as DAX
* errors were detected previously on the file

An invalid file can often be fixed by replaying the log, or by
umount/mount/log replay - all of which are user space operations.
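
The checks listed above, together with the clamp-to-EOF step in the read
and write handlers, can be sketched in user-space C. This is a
hypothetical illustration, not the patch's code: struct file_state,
file_invalid() and clamp_to_eof() are made-up names standing in for the
inode, its famfs_file_meta and the handler logic. The errno values follow
famfs_file_invalid() in the diff below. The clamp compares the position
against the size explicitly instead of using max_t() on unsigned values,
so a position past EOF yields zero rather than a wrapped-around count.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-file state the checks look at. */
struct file_state {
	bool     has_meta;   /* famfs metadata is attached */
	bool     is_dax;     /* file is flagged as DAX */
	int      error;      /* sticky flag: errors were detected earlier */
	uint64_t i_size;     /* current inode size */
	uint64_t alloc_size; /* size recorded when the file was allocated */
};

/* Return 0 if I/O may proceed, otherwise a negative errno. */
static int file_invalid(struct file_state *f)
{
	if (!f->has_meta)
		return -EIO;
	if (f->i_size != f->alloc_size || !f->is_dax) {
		f->error = 1; /* remember the failure for later calls */
		return -ENXIO;
	}
	if (f->error)
		return -EIO;
	return 0;
}

/* Clamp an I/O of count bytes at pos so that it never crosses EOF. */
static uint64_t clamp_to_eof(uint64_t i_size, uint64_t pos, uint64_t count)
{
	uint64_t max_count = pos < i_size ? i_size - pos : 0;

	return count < max_count ? count : max_count;
}
```

Because the error flag is sticky, a file that failed once keeps failing
with -EIO even after the size matches again, which matches the statement
that recovery is a user space operation (log replay or remount).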

Signed-off-by: John Groves 
---
 fs/famfs/famfs_file.c | 136 ++
 1 file changed, 136 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fc667d5f7be8..5228e9de1e3b 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,142 @@
 #include 
 #include "famfs_internal.h"
 
+/*
+ * file_operations
+ */
+
+/* Reject I/O to files that aren't in a valid state */
+static ssize_t
+famfs_file_invalid(struct inode *inode)
+{
+   size_t i_size   = i_size_read(inode);
+   struct famfs_file_meta *meta = inode->i_private;
+
+   if (!meta) {
+   pr_err("%s: un-initialized famfs file\n", __func__);
+   return -EIO;
+   }
+   if (i_size != meta->file_size) {
+   pr_err("%s: something changed the size from  %ld to %ld\n",
+  __func__, meta->file_size, i_size);
+   meta->error = 1;
+   return -ENXIO;
+   }
+   if (!IS_DAX(inode)) {
+   pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
+   meta->error = 1;
+   return -ENXIO;
+   }
+   if (meta->error) {
+   pr_err("%s: previously detected metadata errors\n", __func__);
+   meta->error = 1;
+   return -EIO;
+   }
+   return 0;
+}
+
+static ssize_t
+famfs_dax_read_iter(
+   struct kiocb*iocb,
+   struct iov_iter *to)
+{
+   struct inode *inode = iocb->ki_filp->f_mapping->host;
+   size_t i_size   = i_size_read(inode);
+   size_t count= iov_iter_count(to);
+   size_t max_count;
+   ssize_t rc;
+
+   rc = famfs_file_invalid(inode);
+   if (rc)
+   return rc;
+
+   max_count = max_t(size_t, 0, i_size - iocb->ki_pos);
+
+   if (count > max_count)
+   iov_iter_truncate(to, max_count);
+
+   if (!iov_iter_count(to))
+   return 0;
+
+   rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
+
+   file_accessed(iocb->ki_filp);
+   return rc;
+}
+
+/**
+ * famfs_dax_write_iter()
+ *
+ * We need our own write-iter in order to prevent append
+ */
+static ssize_t
+famfs_dax_write_iter(
+   struct kiocb*iocb,
+   struct iov_iter *from)
+{
+   struct inode *inode = iocb->ki_filp->f_mapping->host;
+   size_t i_size   = i_size_read(inode);
+   size_t count= iov_iter_count(from);
+   size_t max_count;
+   ssize_t rc;
+
+   rc = famfs_file_invalid(inode);
+   if (rc)
+   return rc;
+
+   /* Starting offset of write is: iocb->ki_pos
+* length is iov_iter_count(from)
+*/
+   max_count = max_t(size_t, 0, i_size - iocb->ki_pos);
+
+   /* If write would go past EOF, truncate it to end at EOF since famfs does not
+* alloc-on-write
+*/
+   if (count > max_count)
+   iov_iter_truncate(from, max_count);
+
+   if (!iov_iter_count(from))
+   return 0;
+
+   return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
+}
+
+static int
+famfs_file_mmap(
+   struct file *file,
+   struct vm_area_struct   *vma)
+{
+   struct inode*inode = file_inode(file);
+   ssize_t rc;
+
+   rc = famfs_file_invalid(inode);
+   if (rc)
+   return (int)rc;
+
+   file_accessed(file);
+   vma->vm_ops = &famfs_file_vm_ops;
+   vm_flags_set(vma, VM_HUGEPAGE);
+   return 0;
+}
+
+const struct file_operations famfs_file_operations = {
+   .owner = THIS_MODULE,
+
+   /* Custom famfs operations */
+   .write_iter= famfs_dax_write_iter,
+   .read_iter = famfs_dax_read_iter,
+   .mmap  = famfs_file_mmap,
+
+   /* Force PMD alignment for mmap */
+   .get_unmapped_area = thp_get_unmapped_area,
+
+   /* Generic Operations */
+   .fsync = noop_fsync,
+   .splice_read   = filemap_splice_read,
+   .splice_write  = iter_file_splice_write,
+   .llseek= generic_file_llseek,
+};
+
 /*
  * iomap_operatio

[RFC PATCH 15/20] famfs: Add ioctl to file_operations

2024-02-23 Thread John Groves
This commit introduces the per-file ioctl function famfs_file_ioctl()
into struct file_operations, and introduces the famfs_file_init_dax()
function (which is called by famfs_file_ioctl()).

famfs_file_init_dax() associates a dax extent list with a file, making
it into a proper famfs file. It is called from the FAMFSIOC_MAP_CREATE
ioctl. Starting with an empty file (which is basically a ramfs file),
this turns the file into a DAX file backed by the specified extent list.

The other ioctls are:

FAMFSIOC_NOP - A convenient way for user space to verify it's a famfs file
FAMFSIOC_MAP_GET - Get the header of the metadata for a file
FAMFSIOC_MAP_GETEXT - Get the extents for a file

The latter two, together, are comparable to xfs_bmap. Our user space tools
use them primarily in testing.
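
The constraints that FAMFSIOC_MAP_CREATE places on an extent list can be
sketched in user-space C. This is an assumption-laden illustration, not
the patch's code: struct extent and validate_extents() are hypothetical,
EXT_ALIGN assumes PMD_SIZE is 2MiB (as on x86-64), and the -EINVAL
returned for alignment and size failures is a guess, because the part of
the diff that reports those errors is cut off below. The rules themselves
(offsets 2MiB aligned, every length but the last a 2MiB multiple, file
size no larger than the extent total) are taken from the diff.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_ALIGN (2ULL * 1024 * 1024) /* assumed: 2MiB, PMD_SIZE on x86-64 */

/* Hypothetical stand-in for struct famfs_extent. */
struct extent {
	uint64_t offset; /* offset within the dax device */
	uint64_t len;    /* length in bytes */
};

/* Return 0 if the extent list can back a file of file_size bytes. */
static int
validate_extents(const struct extent *ext, size_t ext_count, uint64_t file_size)
{
	uint64_t total = 0;
	size_t i;

	if (ext_count < 1)
		return -ENOSPC;

	for (i = 0; i < ext_count; i++) {
		/* Every extent must start on a 2MiB boundary. */
		if (ext[i].offset % EXT_ALIGN)
			return -EINVAL;
		/* All but the last length must be a 2MiB multiple. */
		if (i < ext_count - 1 && ext[i].len % EXT_ALIGN)
			return -EINVAL;
		total += ext[i].len;
	}

	/* The file may be smaller than its extents, never larger. */
	return file_size > total ? -EINVAL : 0;
}
```

Unlike this sketch, the kernel function counts alignment errors and
logs each one rather than returning at the first, and it also rejects a
zero offset for any file that is not the superblock.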

Signed-off-by: John Groves 
---
 fs/famfs/famfs_file.c | 226 ++
 1 file changed, 226 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index 5228e9de1e3b..fd42d5966982 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,231 @@
 #include 
 #include "famfs_internal.h"
 
+/**
+ * famfs_map_meta_alloc() - Allocate famfs file metadata
+ * @mapp:   Pointer to an mcache_map_meta pointer
+ * @ext_count:  The number of extents needed
+ */
+static int
+famfs_meta_alloc(
+   struct famfs_file_meta  **metap,
+   size_t                    ext_count)
+{
+   struct famfs_file_meta *meta;
+   size_t  metasz;
+
+   *metap = NULL;
+
+   metasz = sizeof(*meta) + sizeof(*(meta->tfs_extents)) * ext_count;
+
+   meta = kzalloc(metasz, GFP_KERNEL);
+   if (!meta)
+   return -ENOMEM;
+
+   meta->tfs_extent_ct = ext_count;
+   *metap = meta;
+
+   return 0;
+}
+
+static void
+famfs_meta_free(
+   struct famfs_file_meta *map)
+{
+   kfree(map);
+}
+
+/**
+ * famfs_file_init_dax() - FAMFSIOC_MAP_CREATE ioctl handler
+ * @file:
+ * @arg:  ptr to struct famfs_ioc_map in user space
+ *
+ * Setup the dax mapping for a file. Files are created empty, and then this function is called
+ * (by famfs_file_ioctl()) to setup the mapping and set the file size.
+ */
+static int
+famfs_file_init_dax(
+   struct file            *file,
+   void __user            *arg)
+{
+   struct famfs_extent    *tfs_extents = NULL;
+   struct famfs_file_meta *meta = NULL;
+   struct inode           *inode;
+   struct famfs_ioc_map    imap;
+   struct famfs_fs_info   *fsi;
+   struct super_block     *sb;
+   int                     alignment_errs = 0;
+   size_t                  extent_total = 0;
+   size_t                  ext_count;
+   int                     rc = 0;
+   int                     i;
+
+   rc = copy_from_user(&imap, arg, sizeof(imap));
+   if (rc)
+   return -EFAULT;
+
+   ext_count = imap.ext_list_count;
+   if (ext_count < 1) {
+   rc = -ENOSPC;
+   goto errout;
+   }
+
+   if (ext_count > FAMFS_MAX_EXTENTS) {
+   rc = -E2BIG;
+   goto errout;
+   }
+
+   inode = file_inode(file);
+   if (!inode) {
+   rc = -EBADF;
+   goto errout;
+   }
+   sb  = inode->i_sb;
+   fsi = inode->i_sb->s_fs_info;
+
+   tfs_extents = &imap.ext_list[0];
+
+   rc = famfs_meta_alloc(&meta, ext_count);
+   if (rc)
+   goto errout;
+
+   meta->file_type = imap.file_type;
+   meta->file_size = imap.file_size;
+
+   /* Fill in the internal file metadata structure */
+   for (i = 0; i < imap.ext_list_count; i++) {
+   size_t len;
+   off_t  offset;
+
+   offset = imap.ext_list[i].offset;
+   len= imap.ext_list[i].len;
+
+   extent_total += len;
+
+   if (WARN_ON(offset == 0 && meta->file_type != FAMFS_SUPERBLOCK)) {
+   rc = -EINVAL;
+   goto errout;
+   }
+
+   meta->tfs_extents[i].offset = offset;
+   meta->tfs_extents[i].len= len;
+
+   /* All extent addresses/offsets must be 2MiB aligned,
+* and all but the last length must be a 2MiB multiple.
+*/
+   if (!IS_ALIGNED(offset, PMD_SIZE)) {
+   pr_err("%s: error ext %d hpa %lx not aligned\n",
+  __func__, i, offset);
+   alignment_errs++;
+   }
+   if (i < (imap.ext_list_count - 1) && !IS_ALIGNED(len, PMD_SIZE)) {
+   pr_err("%s: error ext %d length %ld not aligned\n",
+  __func__, i, len);
+   alignment_errs++;
+   }
+   }
+
+   /*
+* File size can be <= ext list size, since extent sizes are constrained
+* to PMD multiples
+*/
+   if (imap.file_size > extent_total) {
+   pr_err("%s: file size %lld larger than ext list size %lld\n",
+ 

[RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread John Groves
One of the key requirements for famfs is that it service vma faults
efficiently. Our metadata helps - the search order is n for n extents,
and n is usually 1. But we can still observe gnarly lock contention
in mm if PTE faults are happening. This commit introduces fault counters
that can be enabled and read via /sys/fs/famfs/...

These counters have proved useful in troubleshooting situations where
PTE faults were happening instead of PMD. No performance impact when
disabled.
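
The counting scheme can be sketched in user-space C. The names here
(struct fault_counters, counters_set_enable(), count_fault()) are
hypothetical; the real helpers live in famfs_internal.h, whose hunk is
cut off in this posting, and kernel code would need atomic or per-cpu
counters rather than plain integers. The sketch shows the two behaviors
the patch describes: counters are cleared when counting is enabled but
not when it is disabled, and the fault path pays only for a flag test
while counting is off.

```c
#include <stdint.h>

/* Fault sizes tracked by the counters (PTE, PMD and PUD mappings). */
enum fault_order { FAULT_PTE, FAULT_PMD, FAULT_PUD, FAULT_ORDERS };

/* Hypothetical stand-in for struct famfs_fault_counters plus the enable flag. */
struct fault_counters {
	int      enabled;
	uint64_t count[FAULT_ORDERS];
};

/* Mirror of the sysfs store handler: clear on enable, keep on disable. */
static void counters_set_enable(struct fault_counters *c, int value)
{
	int i;

	if (value > 0)
		for (i = 0; i < FAULT_ORDERS; i++)
			c->count[i] = 0;
	c->enabled = value;
}

/* Called from the fault path; a single branch when counting is off. */
static void count_fault(struct fault_counters *c, enum fault_order order)
{
	if (c->enabled)
		c->count[order]++;
}
```

A nonzero PTE count on a workload that should be mapping 2MiB pages is
the symptom these counters were added to catch.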

Signed-off-by: John Groves 
---
 fs/famfs/famfs_file.c | 97 +++
 fs/famfs/famfs_internal.h | 73 +
 2 files changed, 170 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fd42d5966982..a626f8a89790 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,100 @@
 #include 
 #include "famfs_internal.h"
 
+/***
+ * filemap_fault counters
+ *
+ * The counters and the fault_count_enable file live at
+ * /sys/fs/famfs/
+ */
+struct famfs_fault_counters ffc;
+static int fault_count_enable;
+
+static ssize_t
+fault_count_enable_show(struct kobject *kobj,
+   struct kobj_attribute *attr,
+   char *buf)
+{
+   return sprintf(buf, "%d\n", fault_count_enable);
+}
+
+static ssize_t
+fault_count_enable_store(struct kobject        *kobj,
+                         struct kobj_attribute *attr,
+                         const char            *buf,
+                         size_t                 count)
+{
+   int value;
+   int rc;
+
+   rc = sscanf(buf, "%d", &value);
+   if (rc != 1)
+   return 0;
+
+   if (value > 0) /* clear fault counters when enabling, but not when disabling */
+   famfs_clear_fault_counters(&ffc);
+
+   fault_count_enable = value;
+   return count;
+}
+
+/* Individual fault counters are read-only */
+static ssize_t
+fault_count_pte_show(struct kobject *kobj,
+struct kobj_attribute *attr,
+char *buf)
+{
+   return sprintf(buf, "%llu", famfs_pte_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pmd_show(struct kobject *kobj,
+struct kobj_attribute *attr,
+char *buf)
+{
+   return sprintf(buf, "%llu", famfs_pmd_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pud_show(struct kobject *kobj,
+struct kobj_attribute *attr,
+char *buf)
+{
+   return sprintf(buf, "%llu", famfs_pud_fault_ct(&ffc));
+}
+
+static struct kobj_attribute fault_count_enable_attribute = __ATTR(fault_count_enable,
+                                                                   0660,
+                                                                   fault_count_enable_show,
+                                                                   fault_count_enable_store);
+static struct kobj_attribute fault_count_pte_attribute = __ATTR(pte_fault_ct,
+                                                                0440,
+                                                                fault_count_pte_show,
+                                                                NULL);
+static struct kobj_attribute fault_count_pmd_attribute = __ATTR(pmd_fault_ct,
+                                                                0440,
+                                                                fault_count_pmd_show,
+                                                                NULL);
+static struct kobj_attribute fault_count_pud_attribute = __ATTR(pud_fault_ct,
+                                                                0440,
+                                                                fault_count_pud_show,
+                                                                NULL);
+
+
+static struct attribute *attrs[] = {
+   &fault_count_enable_attribute.attr,
+   &fault_count_pte_attribute.attr,
+   &fault_count_pmd_attribute.attr,
+   &fault_count_pud_attribute.attr,
+   NULL,
+};
+
+struct attribute_group famfs_attr_group = {
+   .attrs = attrs,
+};
+
+/* End fault counters */
+
 /**
  * famfs_map_meta_alloc() - Allocate famfs file metadata
  * @mapp:   Pointer to an mcache_map_meta pointer
@@ -525,6 +619,9 @@ __famfs_filemap_fault(
if (IS_DAX(inode)) {
pfn_t pfn;
 
+   if (fault_count_enable)
+   famfs_inc_fault_counter_by_order(&ffc, pe_size);
+
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
index af3990d43305..987cb172a149 100644
--- a/fs/famfs/famfs_internal.h
+++ b/fs/famfs/famfs_internal.h
@@ -50,4 +50,77 @@ struct famfs_fs_info {
   

[RFC PATCH 17/20] famfs: Add module stuff

2024-02-23 Thread John Groves
This commit introduces the module init and exit machinery for famfs.

Signed-off-by: John Groves 
---
 fs/famfs/famfs_inode.c | 44 ++
 1 file changed, 44 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index ab46ec50b70d..0d659820e8ff 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = {
.fs_flags = FS_USERNS_MOUNT,
 };
 
+/*
+ * Module stuff
+ */
+static struct kobject *famfs_kobj;
+
+static int __init init_famfs_fs(void)
+{
+   int rc;
+
+#if defined(CONFIG_DEV_DAX_IOMAP)
+   pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__);
+#else
+   pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__);
+#endif
+   famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj);
+   if (!famfs_kobj) {
+   pr_warn("Failed to create kobject\n");
+   return -ENOMEM;
+   }
+
+   rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
+   if (rc) {
+   kobject_put(famfs_kobj);
+   pr_warn("%s: Failed to create sysfs group\n", __func__);
+   return rc;
+   }
+
+   return register_filesystem(&famfs_fs_type);
+}
+
+static void
+__exit famfs_exit(void)
+{
+   sysfs_remove_group(famfs_kobj,  &famfs_attr_group);
+   kobject_put(famfs_kobj);
+   unregister_filesystem(&famfs_fs_type);
+   pr_info("%s: unregistered\n", __func__);
+}
+
+
+fs_initcall(init_famfs_fs);
+module_exit(famfs_exit);
+
+MODULE_AUTHOR("John Groves, Micron Technology");
 MODULE_LICENSE("GPL");
-- 
2.43.0




[RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch

2024-02-23 Thread John Groves
This commit introduces the ability to open a character /dev/dax device
instead of a block /dev/pmem device. This rests on the dev_dax_iomap
patches earlier in this series.

Signed-off-by: John Groves 
---
 fs/famfs/famfs_inode.c | 97 +-
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 0d659820e8ff..7d65ac497147 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = {
.show_options   = famfs_show_options,
 };
 
+/*****************************************************************************/
+
+#if defined(CONFIG_DEV_DAX_IOMAP)
+
+/*
+ * famfs dax_operations  (for char dax)
+ */
+static int
+famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
+   u64 len, int mf_flags)
+{
+   pr_err("%s: offset %llu len %llu flags %x\n", __func__,
+  offset, len, mf_flags);
+   return -EOPNOTSUPP;
+}
+
+static const struct dax_holder_operations famfs_dax_holder_ops = {
+   .notify_failure = famfs_dax_notify_failure,
+};
+
+/*/
+
+/**
+ * famfs_open_char_device()
+ *
+ * Open a /dev/dax device. This only works in kernels with the dev_dax_iomap patch
+ */
+static int
+famfs_open_char_device(
+   struct super_block *sb,
+   struct fs_context  *fc)
+{
+   struct famfs_fs_info *fsi = sb->s_fs_info;
+   struct dax_device*dax_devp;
+   struct inode *daxdev_inode;
+
+   int rc = 0;
+
+   pr_notice("%s: Opening character dax device %s\n", __func__, fc->source);
+
+   fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
+   if (IS_ERR(fsi->dax_filp)) {
+   rc = PTR_ERR(fsi->dax_filp); /* save the error before clearing the pointer */
+   pr_err("%s: failed to open dax device %s\n",
+  __func__, fc->source);
+   fsi->dax_filp = NULL;
+   return rc;
+   }
+
+   daxdev_inode = file_inode(fsi->dax_filp);
+   dax_devp = inode_dax(daxdev_inode);
+   if (IS_ERR(dax_devp)) {
+   pr_err("%s: unable to get daxdev from inode for %s\n",
+  __func__, fc->source);
+   rc = -ENODEV;
+   goto char_err;
+   }
+
+   rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops);
+   if (rc) {
+   pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__);
+   goto char_err;
+   }
+
+   fsi->bdev_handle = NULL;
+   fsi->dax_devp = dax_devp;
+
+   return 0;
+
+char_err:
+   filp_close(fsi->dax_filp, NULL);
+   return rc;
+}
+
+#else /* CONFIG_DEV_DAX_IOMAP */
+static int
+famfs_open_char_device(
+   struct super_block *sb,
+   struct fs_context  *fc)
+{
+   pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
+  __func__, fc->source);
+   return -ENODEV;
+}
+
+
+#endif /* CONFIG_DEV_DAX_IOMAP */
+
 
/***
  * dax_holder_operations for block dax
  */
@@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = {
.notify_failure = famfs_blk_dax_notify_failure,
 };
 
-static int
-famfs_open_char_device(
-   struct super_block *sb,
-   struct fs_context  *fc)
-{
-   pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
-  __func__, fc->source);
-   return -ENODEV;
-}
-
 /**
  * famfs_open_device()
  *
-- 
2.43.0




[RFC PATCH 19/20] famfs: Update MAINTAINERS file

2024-02-23 Thread John Groves
This patch introduces famfs into the MAINTAINERS file

Signed-off-by: John Groves 
---
 MAINTAINERS | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 73d898383e51..e4e8bf3602bb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8097,6 +8097,17 @@ F:   Documentation/networking/failover.rst
 F: include/net/failover.h
 F: net/core/failover.c
 
+FAMFS
+M: John Groves 
+M: John Groves 
+M: John Groves 
+L: linux-...@vger.kernel.org
+L: linux-fsde...@vger.kernel.org
+S: Supported
+F: Documentation/filesystems/famfs.rst
+F: fs/famfs
+F: include/uapi/linux/famfs_ioctl.h
+
 FANOTIFY
 M: Jan Kara 
 R: Amir Goldstein 
-- 
2.43.0




[RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing

2024-02-23 Thread John Groves
Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile

Signed-off-by: John Groves 
---
 fs/Kconfig|  2 ++
 fs/Makefile   |  1 +
 fs/famfs/Kconfig  | 10 ++
 fs/famfs/Makefile |  5 +
 4 files changed, 18 insertions(+)
 create mode 100644 fs/famfs/Kconfig
 create mode 100644 fs/famfs/Makefile

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..8a11625a54a2 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -141,6 +141,8 @@ source "fs/autofs/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
 
+source "fs/famfs/Kconfig"
+
 menu "Caches"
 
 source "fs/netfs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index c09016257f05..382c1ea4f4c3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
 obj-$(CONFIG_EROFS_FS) += erofs/
 obj-$(CONFIG_VBOXSF_FS)+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)+= zonefs/
+obj-$(CONFIG_FAMFS) += famfs/
diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
new file mode 100644
index ..e450928d8912
--- /dev/null
+++ b/fs/famfs/Kconfig
@@ -0,0 +1,10 @@
+
+
+config FAMFS
+   tristate "famfs: shared memory file system"
+   depends on DEV_DAX && FS_DAX
+   help
+ Support for the famfs file system. Famfs is a dax file system that
+can support scale-out shared access to fabric-attached memory
+(e.g. CXL shared memory). Famfs is not a general purpose file system;
+it is an enabler for data sets in shared memory.
diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile
new file mode 100644
index ..8cac90c090a4
--- /dev/null
+++ b/fs/famfs/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_FAMFS) += famfs.o
+
+famfs-y := famfs_inode.o famfs_file.o
-- 
2.43.0




Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Dave Hansen
On 2/23/24 09:42, John Groves wrote:
> One of the key requirements for famfs is that it service vma faults
> efficiently. Our metadata helps - the search order is n for n extents,
> and n is usually 1. But we can still observe gnarly lock contention
> in mm if PTE faults are happening. This commit introduces fault counters
> that can be enabled and read via /sys/fs/famfs/...
> 
> These counters have proved useful in troubleshooting situations where
> PTE faults were happening instead of PMD. No performance impact when
> disabled.

This seems kinda wonky.  Why does _this_ specific filesystem need its
own fault counters.  Seems like something we'd want to do much more
generically, if it is needed at all.

Was the issue here just that vm_ops->fault() was getting called instead
of ->huge_fault()?  Or something more subtle?



Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread John Groves
On 24/02/23 10:23AM, Dave Hansen wrote:
> On 2/23/24 09:42, John Groves wrote:
> > One of the key requirements for famfs is that it service vma faults
> > efficiently. Our metadata helps - the search order is n for n extents,
> > and n is usually 1. But we can still observe gnarly lock contention
> > in mm if PTE faults are happening. This commit introduces fault counters
> > that can be enabled and read via /sys/fs/famfs/...
> > 
> > These counters have proved useful in troubleshooting situations where
> > PTE faults were happening instead of PMD. No performance impact when
> > disabled.
> 
> This seems kinda wonky.  Why does _this_ specific filesystem need its
> own fault counters.  Seems like something we'd want to do much more
> generically, if it is needed at all.
> 
> Was the issue here just that vm_ops->fault() was getting called instead
> of ->huge_fault()?  Or something more subtle?

Thanks for your reply Dave!

First, I'm willing to pull the fault counters out if the brain trust doesn't
like them.

I put them in because we were running benchmarks of computational data
analytics and noted that jobs took 3x as long on famfs as raw dax -
which indicated I was doing something wrong, because it should be equivalent
or very close.

The solution was to call thp_get_unmapped_area() in
famfs_file_operations, and performance doesn't vary significantly from raw
dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
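
For readers less familiar with the alignment requirement, here is a minimal user-space sketch of the check described above. It assumes a 2 MiB PMD size (x86-64 with 4 KiB base pages). `PMD_SIZE_ASSUMED`, `addr_is_pmd_aligned()` and `pmd_align_up()` are names made up for this illustration and are not part of famfs or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD mappings, as on x86-64 with 4 KiB base pages. */
#define PMD_SIZE_ASSUMED ((uintptr_t)2 * 1024 * 1024)

/* True if a mapping starting at addr begins on a PMD boundary. */
static bool addr_is_pmd_aligned(uintptr_t addr)
{
	return (addr & (PMD_SIZE_ASSUMED - 1)) == 0;
}

/* Round addr up to the next PMD boundary (no-op if already aligned). */
static uintptr_t pmd_align_up(uintptr_t addr)
{
	return (addr + PMD_SIZE_ASSUMED - 1) & ~(PMD_SIZE_ASSUMED - 1);
}
```

A test program could pass the address returned by mmap() on a famfs file to `addr_is_pmd_aligned()`; an unaligned start address is the situation described above, where faults end up being serviced at PTE granularity.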

After that I wanted a way to be double-secret-certain that it was servicing
PMD faults as intended. Which it basically always is, so far. (The smoke
tests in user space check this.)

John



Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Dan Williams
John Groves wrote:
> On 24/02/23 10:23AM, Dave Hansen wrote:
> > On 2/23/24 09:42, John Groves wrote:
> > > One of the key requirements for famfs is that it service vma faults
> > > efficiently. Our metadata helps - the search order is n for n extents,
> > > and n is usually 1. But we can still observe gnarly lock contention
> > > in mm if PTE faults are happening. This commit introduces fault counters
> > > that can be enabled and read via /sys/fs/famfs/...
> > > 
> > > These counters have proved useful in troubleshooting situations where
> > > PTE faults were happening instead of PMD. No performance impact when
> > > disabled.
> > 
> > This seems kinda wonky.  Why does _this_ specific filesystem need its
> > own fault counters.  Seems like something we'd want to do much more
> > generically, if it is needed at all.
> > 
> > Was the issue here just that vm_ops->fault() was getting called instead
> > of ->huge_fault()?  Or something more subtle?
> 
> Thanks for your reply Dave!
> 
> First, I'm willing to pull the fault counters out if the brain trust doesn't
> like them.
> 
> I put them in because we were running benchmarks of computational data
> analytics and noted that jobs took 3x as long on famfs as raw dax -
> which indicated I was doing something wrong, because it should be equivalent
> or very close.
> 
> The solution was to call thp_get_unmapped_area() in
> famfs_file_operations, and performance doesn't vary significantly from raw
> dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
> 
> After that I wanted a way to be double-secret-certain that it was servicing
> PMD faults as intended. Which it basically always is, so far. (The smoke
> tests in user space check this.)

We had similar unit test regression concerns with fsdax where some
upstream change silently broke PMD faults. The solution there was trace
points in the fault handlers and a basic test that knows a priori that it
*should* be triggering a certain number of huge faults:

https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31



Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread John Groves
On 24/02/23 12:04PM, Dan Williams wrote:
> John Groves wrote:
> > On 24/02/23 10:23AM, Dave Hansen wrote:
> > > On 2/23/24 09:42, John Groves wrote:
> > > > One of the key requirements for famfs is that it service vma faults
> > > > efficiently. Our metadata helps - the search order is n for n extents,
> > > > and n is usually 1. But we can still observe gnarly lock contention
> > > > in mm if PTE faults are happening. This commit introduces fault counters
> > > > that can be enabled and read via /sys/fs/famfs/...
> > > > 
> > > > These counters have proved useful in troubleshooting situations where
> > > > PTE faults were happening instead of PMD. No performance impact when
> > > > disabled.
> > > 
> > > This seems kinda wonky.  Why does _this_ specific filesystem need its
> > > own fault counters.  Seems like something we'd want to do much more
> > > generically, if it is needed at all.
> > > 
> > > Was the issue here just that vm_ops->fault() was getting called instead
> > > of ->huge_fault()?  Or something more subtle?
> > 
> > Thanks for your reply Dave!
> > 
> > First, I'm willing to pull the fault counters out if the brain trust doesn't
> > like them.
> > 
> > I put them in because we were running benchmarks of computational data
> > analytics and noted that jobs took 3x as long on famfs as raw dax -
> > which indicated I was doing something wrong, because it should be equivalent
> > or very close.
> > 
> > The solution was to call thp_get_unmapped_area() in
> > famfs_file_operations, and performance doesn't vary significantly from raw
> > dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
> > 
> > After that I wanted a way to be double-secret-certain that it was servicing
> > PMD faults as intended. Which it basically always is, so far. (The smoke
> > tests in user space check this.)
> 
> We had similar unit test regression concerns with fsdax where some
> upstream change silently broke PMD faults. The solution there was trace
> points in the fault handlers and a basic test that knows a priori that it
> *should* be triggering a certain number of huge faults:
> 
> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31

Good approach, thanks Dan! My working assumption is that we'll be able to make
that approach work in the famfs tests. So the fault counters should go away
in the next version.

John




Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Dave Hansen
On 2/23/24 12:39, John Groves wrote:
>> We had similar unit test regression concerns with fsdax where some
>> upstream change silently broke PMD faults. The solution there was trace
>> points in the fault handlers and a basic test that knows a priori that it
>> *should* be triggering a certain number of huge faults:
>>
>> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> Good approach, thanks Dan! My working assumption is that we'll be able to make
> that approach work in the famfs tests. So the fault counters should go away
> in the next version.

I do really suspect there's something more generic that should be done
here.  Maybe we need a generic 'huge_faults' perf event to pair up with
the good ol' faults that we already have:

# perf stat -e faults /bin/ls

 Performance counter stats for '/bin/ls':

   104  faults


   0.001499862 seconds time elapsed

   0.00149 seconds user
   0.0 seconds sys
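
As a rough illustration of what the existing counter can and cannot tell you, the user-space sketch below reads the per-process fault counts through getrusage(). It only shows the current state of affairs: the minor/major counts it returns say nothing about the size of the mapping each fault installed, which is the gap a 'huge_faults' event would fill. `count_faults_touching()` is a name invented for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Touch npages of fresh anonymous memory and return how many page faults
 * (minor + major) the process took while doing so, or -1 on error.
 */
static long count_faults_touching(size_t npages)
{
	struct rusage before, after;
	long page_size = sysconf(_SC_PAGESIZE);
	volatile char *p;
	size_t len, i;
	void *map;

	if (page_size <= 0 || npages == 0)
		return -1;

	len = npages * (size_t)page_size;
	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	if (getrusage(RUSAGE_SELF, &before) != 0) {
		munmap(map, len);
		return -1;
	}

	/* Write one byte per page so each page is faulted in. */
	p = map;
	for (i = 0; i < npages; i++)
		p[i * (size_t)page_size] = 1;

	if (getrusage(RUSAGE_SELF, &after) != 0) {
		munmap(map, len);
		return -1;
	}
	munmap(map, len);

	return (after.ru_minflt - before.ru_minflt) +
	       (after.ru_majflt - before.ru_majflt);
}
```

The exact count depends on the kernel and on whether transparent huge pages are in play, so the sketch makes no claim about a specific expected value.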






Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Dan Williams
Dave Hansen wrote:
> On 2/23/24 12:39, John Groves wrote:
> >> We had similar unit test regression concerns with fsdax where some
> >> upstream change silently broke PMD faults. The solution there was trace
> >> points in the fault handlers and a basic test that knows a priori that it
> >> *should* be triggering a certain number of huge faults:
> >>
> >> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> > Good approach, thanks Dan! My working assumption is that we'll be able to make
> > that approach work in the famfs tests. So the fault counters should go away
> > in the next version.
> 
> I do really suspect there's something more generic that should be done
> here.  Maybe we need a generic 'huge_faults' perf event to pair up with
> the good ol' faults that we already have:
> 
> # perf stat -e faults /bin/ls
> 
>  Performance counter stats for '/bin/ls':
> 
>104  faults
> 
> 
>0.001499862 seconds time elapsed
> 
>0.00149 seconds user
>0.0 seconds sys

Certainly something like that would have satisfied this sanity test use
case. I will note that mm_account_fault() would need some help to figure
out the size of the page table entry that got installed. Maybe
extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
from PUD to PMD, or all the way back to PTE.

Then use cases like this could just add a dynamic probe in
mm_account_fault(). No real need for a new tracepoint unless there was a
use case for this outside of regression testing fault handlers, right?



Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-02-23 Thread Luis Chamberlain
On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> This patch set introduces famfs[1] - a special-purpose fs-dax file system
> for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> CXL-specific in any way.
> 
> * Famfs creates a simple access method for storing and sharing data in
>   sharable memory. The memory is exposed and accessed as memory-mappable
>   dax files.
> * Famfs supports multiple hosts mounting the same file system from the
>   same memory (something existing fs-dax file systems don't do).
> * A famfs file system can be created on either a /dev/pmem device in fs-dax
>   mode, or a /dev/dax device in devdax mode (the latter depending on
>   patches 2-6 of this series).
> 
> The famfs kernel file system is part of the famfs framework; additional
> components in user space[2] handle metadata and direct the famfs kernel
> module to instantiate files that map to specific memory. The famfs user
> space has documentation and a reasonably thorough test suite.
> 
> The famfs kernel module never accesses the shared memory directly (either
> data or metadata). Because of this, shared memory managed by the famfs
> framework does not create a RAS "blast radius" problem that should be able
> to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> can be expected to kill apps via SIGBUS and cause mounts to be disabled
> due to memory failure notifications.
> 
> Famfs does not attempt to solve concurrency or coherency problems for apps,
> although it does solve these problems in regard to its own data structures.
> Apps may encounter hard concurrency problems, but there are use cases that
> are eminently useful and uncomplicated from a concurrency perspective:
> serial sharing is one (only one host at a time has access), and read-only
> concurrent sharing is another (all hosts can read-cache without worry).

Can you do me a favor, curious if you can run a test like this:

fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring \
    -direct=1 \
    --group_reporting=1 --alloc-size=1048576 --filesize=1GiB \
    --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1 \
    --directory=/mnt

What do you get for throughput?

The larger the system and its capacity, the better.

  Luis



Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h

2024-02-23 Thread Randy Dunlap
Hi--

On 2/23/24 09:41, John Groves wrote:
> Add uapi include file for famfs. The famfs user space uses ioctl on
> individual files to pass in mapping information and file size. This
> would be hard to do via sysfs or other means, since it's
> file-specific.
> 
> Signed-off-by: John Groves 
> ---
>  include/uapi/linux/famfs_ioctl.h | 56 
>  1 file changed, 56 insertions(+)
>  create mode 100644 include/uapi/linux/famfs_ioctl.h
> 
> diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> new file mode 100644
> index ..6b3e6452d02f
> --- /dev/null
> +++ b/include/uapi/linux/famfs_ioctl.h
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs the dax support from xfs,

  This is confusing to me. Is it just me? ^

> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +#ifndef FAMFS_IOCTL_H
> +#define FAMFS_IOCTL_H
> +
> +#include 
> +#include 
> +
> +#define FAMFS_MAX_EXTENTS 2
> +
> +enum extent_type {
> + SIMPLE_DAX_EXTENT = 13,
> + INVALID_EXTENT_TYPE,
> +};
> +
> +struct famfs_extent {
> + __u64  offset;
> + __u64  len;
> +};
> +
> +enum famfs_file_type {
> + FAMFS_REG,
> + FAMFS_SUPERBLOCK,
> + FAMFS_LOG,
> +};
> +
> +/**

"/**" is used to begin kernel-doc comments, but this comment block is missing
a few entries to make it be kernel-doc compatible. Please either add them
or just use "/*" to begin the comment.

> + * struct famfs_ioc_map
> + *
> + * This is the metadata that indicates where the memory is for a famfs file
> + */
> +struct famfs_ioc_map {
> + enum extent_type  extent_type;
> + enum famfs_file_type  file_type;
> + __u64 file_size;
> + __u64 ext_list_count;
> + struct famfs_extent   ext_list[FAMFS_MAX_EXTENTS];
> +};
> +
> +#define FAMFSIOC_MAGIC 'u'

This 'u' value should be documented in
Documentation/userspace-api/ioctl/ioctl-number.rst.

and if possible, you might want to use values like 0x5x or 0x8x
that don't conflict with the ioctl numbers that are already used
in the 'u' space.

> +
> +/* famfs file ioctl opcodes */
> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
> +
> +#endif /* FAMFS_IOCTL_H */

-- 
#Randy



Re: [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing

2024-02-23 Thread Randy Dunlap
Hi,

On 2/23/24 09:42, John Groves wrote:
> Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile
> 
> Signed-off-by: John Groves 
> ---
>  fs/Kconfig|  2 ++
>  fs/Makefile   |  1 +
>  fs/famfs/Kconfig  | 10 ++
>  fs/famfs/Makefile |  5 +
>  4 files changed, 18 insertions(+)
>  create mode 100644 fs/famfs/Kconfig
>  create mode 100644 fs/famfs/Makefile
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 89fdbefd1075..8a11625a54a2 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -141,6 +141,8 @@ source "fs/autofs/Kconfig"
>  source "fs/fuse/Kconfig"
>  source "fs/overlayfs/Kconfig"
>  
> +source "fs/famfs/Kconfig"
> +
>  menu "Caches"
>  
>  source "fs/netfs/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index c09016257f05..382c1ea4f4c3 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS)   += efivarfs/
>  obj-$(CONFIG_EROFS_FS)   += erofs/
>  obj-$(CONFIG_VBOXSF_FS)  += vboxsf/
>  obj-$(CONFIG_ZONEFS_FS)  += zonefs/
> +obj-$(CONFIG_FAMFS) += famfs/
> diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
> new file mode 100644
> index ..e450928d8912
> --- /dev/null
> +++ b/fs/famfs/Kconfig
> @@ -0,0 +1,10 @@
> +
> +
> +config FAMFS
> +   tristate "famfs: shared memory file system"
> +   depends on DEV_DAX && FS_DAX
> +   help
> + Support for the famfs file system. Famfs is a dax file system that
> +  can support scale-out shared access to fabric-attached memory
> +  (e.g. CXL shared memory). Famfs is not a general purpose file system;
> +  it is an enabler for data sets in shared memory.

Please use one tab + 2 spaces to indent help text (below the "help" keyword)
as documented in Documentation/process/coding-style.rst.

> diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile
> new file mode 100644
> index ..8cac90c090a4
> --- /dev/null
> +++ b/fs/famfs/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +obj-$(CONFIG_FAMFS) += famfs.o
> +
> +famfs-y := famfs_inode.o famfs_file.o

-- 
#Randy



Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h

2024-02-23 Thread John Groves
On 24/02/23 05:39PM, Randy Dunlap wrote:
> Hi--
> 
> On 2/23/24 09:41, John Groves wrote:
> > Add uapi include file for famfs. The famfs user space uses ioctl on
> > individual files to pass in mapping information and file size. This
> > would be hard to do via sysfs or other means, since it's
> > file-specific.
> > 
> > Signed-off-by: John Groves 
> > ---
> >  include/uapi/linux/famfs_ioctl.h | 56 
> >  1 file changed, 56 insertions(+)
> >  create mode 100644 include/uapi/linux/famfs_ioctl.h
> > 
> > diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> > new file mode 100644
> > index ..6b3e6452d02f
> > --- /dev/null
> > +++ b/include/uapi/linux/famfs_ioctl.h
> > @@ -0,0 +1,56 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
> 
>   This is confusing to me. Is it just me? ^

Thanks Randy. I think I was trying to say "based on ramfs *plus* the dax
support from xfs". But I'll try to come up with something clearer than
that...

> 
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +#ifndef FAMFS_IOCTL_H
> > +#define FAMFS_IOCTL_H
> > +
> > +#include 
> > +#include 
> > +
> > +#define FAMFS_MAX_EXTENTS 2
> > +
> > +enum extent_type {
> > +   SIMPLE_DAX_EXTENT = 13,
> > +   INVALID_EXTENT_TYPE,
> > +};
> > +
> > +struct famfs_extent {
> > +   __u64  offset;
> > +   __u64  len;
> > +};
> > +
> > +enum famfs_file_type {
> > +   FAMFS_REG,
> > +   FAMFS_SUPERBLOCK,
> > +   FAMFS_LOG,
> > +};
> > +
> > +/**
> 
> "/**" is used to begin kernel-doc comments, but this comment block is missing
> a few entries to make it be kernel-doc compatible. Please either add them
> or just use "/*" to begin the comment.

Will do, thanks. And I'll check the whole code base for other instances;
I won't be surprised if I was sloppy about that in more than one place.

> 
> > + * struct famfs_ioc_map
> > + *
> > + * This is the metadata that indicates where the memory is for a famfs file
> > + */
> > +struct famfs_ioc_map {
> > +   enum extent_type  extent_type;
> > +   enum famfs_file_type  file_type;
> > +   __u64 file_size;
> > +   __u64 ext_list_count;
> > +   struct famfs_extent   ext_list[FAMFS_MAX_EXTENTS];
> > +};
> > +
> > +#define FAMFSIOC_MAGIC 'u'
> 
> This 'u' value should be documented in
> Documentation/userspace-api/ioctl/ioctl-number.rst.
> 
> and if possible, you might want to use values like 0x5x or 0x8x
> that don't conflict with the ioctl numbers that are already used
> in the 'u' space.

Will do. I was trying to be too clever there, invoking "mu" for
micron. 

> 
> > +
> > +/* famfs file ioctl opcodes */
> > +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> > +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> > +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> > +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
> > +
> > +#endif /* FAMFS_IOCTL_H */
> 
> -- 
> #Randy

Thank you for taking the time to look it over, Randy.

John




Re: [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing

2024-02-23 Thread John Groves
On 24/02/23 05:50PM, Randy Dunlap wrote:
> Hi,
> 
> On 2/23/24 09:42, John Groves wrote:
> > Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile
> > 
> > Signed-off-by: John Groves 
> > ---
> >  fs/Kconfig|  2 ++
> >  fs/Makefile   |  1 +
> >  fs/famfs/Kconfig  | 10 ++
> >  fs/famfs/Makefile |  5 +
> >  4 files changed, 18 insertions(+)
> >  create mode 100644 fs/famfs/Kconfig
> >  create mode 100644 fs/famfs/Makefile
> > 
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 89fdbefd1075..8a11625a54a2 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -141,6 +141,8 @@ source "fs/autofs/Kconfig"
> >  source "fs/fuse/Kconfig"
> >  source "fs/overlayfs/Kconfig"
> >  
> > +source "fs/famfs/Kconfig"
> > +
> >  menu "Caches"
> >  
> >  source "fs/netfs/Kconfig"
> > diff --git a/fs/Makefile b/fs/Makefile
> > index c09016257f05..382c1ea4f4c3 100644
> > --- a/fs/Makefile
> > +++ b/fs/Makefile
> > @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
> >  obj-$(CONFIG_EROFS_FS) += erofs/
> >  obj-$(CONFIG_VBOXSF_FS)+= vboxsf/
> >  obj-$(CONFIG_ZONEFS_FS)+= zonefs/
> > +obj-$(CONFIG_FAMFS) += famfs/
> > diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
> > new file mode 100644
> > index ..e450928d8912
> > --- /dev/null
> > +++ b/fs/famfs/Kconfig
> > @@ -0,0 +1,10 @@
> > +
> > +
> > +config FAMFS
> > +   tristate "famfs: shared memory file system"
> > +   depends on DEV_DAX && FS_DAX
> > +   help
> > + Support for the famfs file system. Famfs is a dax file system that
> > +can support scale-out shared access to fabric-attached memory
> > +(e.g. CXL shared memory). Famfs is not a general purpose file system;
> > +it is an enabler for data sets in shared memory.
> 
> Please use one tab + 2 spaces to indent help text (below the "help" keyword)
> as documented in Documentation/process/coding-style.rst.

Will do, thank you!

John




Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h

2024-02-23 Thread Randy Dunlap
Hi John,

On 2/23/24 18:23, John Groves wrote:
>>> +
>>> +#define FAMFSIOC_MAGIC 'u'
>> This 'u' value should be documented in
>> Documentation/userspace-api/ioctl/ioctl-number.rst.
>>
>> and if possible, you might want to use values like 0x5x or 0x8x
>> that don't conflict with the ioctl numbers that are already used
>> in the 'u' space.
> Will do. I was trying to be too clever there, invoking "mu" for
> micron. 

I might have been unclear about this one.
It's OK to use 'u' but the values 1-4 below conflict in the 'u' space:

'u'   00-1F  linux/smb_fs.h          gone
'u'   20-3F  linux/uvcvideo.h        USB video class host driver
'u'   40-4f  linux/udmabuf.h

so if you could use
'u'   50-5f
or
'u'   80-8f

then those conflicts wouldn't be there.
HTH.
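
The conflict can be checked mechanically. The user-space sketch below uses only the three ranges quoted above; it is not a copy of ioctl-number.rst, and `u_ranges_in_use` and `u_nr_conflicts()` are names invented for the illustration. It decodes a command with the standard `_IOC_TYPE()` and `_IOC_NR()` macros from <linux/ioctl.h> and reports whether the sequence number lands in one of those ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/ioctl.h>

/* Sequence-number ranges quoted above as already taken under 'u'. */
struct nr_range {
	unsigned int first;
	unsigned int last;
};

static const struct nr_range u_ranges_in_use[] = {
	{ 0x00, 0x1F },	/* linux/smb_fs.h (gone) */
	{ 0x20, 0x3F },	/* linux/uvcvideo.h */
	{ 0x40, 0x4F },	/* linux/udmabuf.h */
};

/* True if cmd uses magic 'u' and its number falls in a listed range. */
static bool u_nr_conflicts(unsigned int cmd)
{
	unsigned int nr = _IOC_NR(cmd);
	unsigned int i;

	if (_IOC_TYPE(cmd) != 'u')
		return false;

	for (i = 0; i < sizeof(u_ranges_in_use) / sizeof(u_ranges_in_use[0]); i++) {
		if (nr >= u_ranges_in_use[i].first && nr <= u_ranges_in_use[i].last)
			return true;
	}
	return false;
}
```

With this check, `_IO('u', 4)` (the FAMFSIOC_NOP encoding as posted) is reported as conflicting, while a number in 50-5f or 80-8f is not.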

>>> +
>>> +/* famfs file ioctl opcodes */
>>> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
>>> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
>>> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
>>> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)

-- 
#Randy



Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Matthew Wilcox
On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> Certainly something like that would have satisfied this sanity test use
> case. I will note that mm_account_fault() would need some help to figure
> out the size of the page table entry that got installed. Maybe
> extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> from PUD to PMD, or all the way back to PTE.

ugh, no, it's more complicated than that.  look at the recent changes to
set_ptes().  we can now install PTEs of many different sizes, depending
on the architecture.  someday i look forward to supporting all the page
sizes on parisc (4k, 16k, 64k, 256k, ... 4G)



Re: [RFC PATCH 16/20] famfs: Add fault counters

2024-02-23 Thread Dan Williams
Matthew Wilcox wrote:
> On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> > Certainly something like that would have satisfied this sanity test use
> > case. I will note that mm_account_fault() would need some help to figure
> > out the size of the page table entry that got installed. Maybe
> > extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> > VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> > from PUD to PMD, or all the way back to PTE.
> 
> ugh, no, it's more complicated than that.  look at the recent changes to
> set_ptes().  we can now install PTEs of many different sizes, depending
> on the architecture.  someday i look forward to supporting all the page
> sizes on parisc (4k, 16k, 64k, 256k, ... 4G)

Nice!

There are enough bits in vm_fault_t to represent many page sizes instead
of the entry type as I suggested, but I would defer to you or Dave on
how to make "installed pte size" generically traceable per Dave's
suggestion.
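
To make the "enough bits" idea concrete, here is a purely hypothetical user-space sketch of packing an installed-mapping order into spare bits of a 32-bit fault result. The bit positions, `FAULT_ORDER_SHIFT`, `FAULT_ORDER_MASK` and all three helpers are assumptions made for illustration; this is not the kernel's vm_fault_t layout and nothing like it exists upstream as far as this thread shows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: NOT the kernel's vm_fault_t encoding. */
#define FAULT_ORDER_SHIFT 24
#define FAULT_ORDER_MASK  (0x3Fu << FAULT_ORDER_SHIFT)	/* 6 bits: orders 0..63 */

/* Record the order (log2 of base pages) of the mapping that was installed. */
static uint32_t fault_set_order(uint32_t ret, unsigned int order)
{
	return (ret & ~FAULT_ORDER_MASK) |
	       ((uint32_t)(order & 0x3Fu) << FAULT_ORDER_SHIFT);
}

/* Extract the recorded order. */
static unsigned int fault_get_order(uint32_t ret)
{
	return (ret & FAULT_ORDER_MASK) >> FAULT_ORDER_SHIFT;
}

/* Size in bytes of the installed mapping, or 0 if it does not fit in 64 bits. */
static uint64_t fault_mapped_bytes(uint32_t ret, unsigned int page_shift)
{
	unsigned int shift = page_shift + fault_get_order(ret);

	return shift < 64 ? (uint64_t)1 << shift : 0;
}
```

Encoding an order rather than an entry type (PTE/PMD/PUD) is what lets one field cover the many sizes mentioned above: with a 4 KiB base page, order 9 is 2 MiB and order 20 is 4 GiB.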