On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > The system call is currently wired up for x86 arch.
> 
> Building on other architectures (except for arm64 for some reason) yields:
> 
>   CALL    /.../scripts/checksyscalls.sh
>   <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented 
> [-Wcpp]
> 
> Do we care?  It's the only such warning, which makes me think we either need 
> to
> wire this up for all architectures, or explicitly document that it's 
> unsupported.

I'm a bit conservative and prefer enabling only on x86 where we know the
exact usecase. For the warning we can get rid of by changing
scripts/checksyscalls.sh, just like __IGNORE_memfd_secret:

https://lkml.kernel.org/r/20210518072034.31572-7-r...@kernel.org

> 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.p...@linux.intel.com>
> > ---
> 
> ...
> 
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..c2700c5daa43
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> 
> Missing
> 
>  #define _LINUX_RESTRICTEDMEM_H
> 
> which causes fireworks if restrictedmem.h is included more than once.
> 
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> 
> ...
> 
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                                    struct page **pagep, int *order)
> > +{
> > +   return -1;
> 
> This should be a proper -errno, though in the current incarnation of things 
> it's
> a moot point because no stub is needed.  KVM can (and should) easily provide 
> its
> own stub for this one.
> 
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +   return false;
> > +}
> > +
> > +static inline void restrictedmem_error_page(struct page *page,
> > +                                       struct address_space *mapping)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> 
> ...
> 
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..56953c204e5c
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,318 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> 
> Any objection to simply calling this "restrictedmem"?  And then using either 
> "rm"
> or "rmem" for local variable names?  I kept reading "data" as the underyling 
> data
> being written to the page, as opposed to the metadata describing the 
> restrictedmem
> instance.
> 
> > +   struct mutex lock;
> > +   struct file *memfd;
> > +   struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> > +                                      pgoff_t start, pgoff_t end)
> > +{
> > +   struct restrictedmem_notifier *notifier;
> > +
> > +   mutex_lock(&data->lock);
> 
> This can be a r/w semaphore instead of a mutex, that way punching holes at 
> multiple
> points in the file can at least run the notifiers in parallel.  The actual 
> allocation
> by shmem will still be serialized, but I think it's worth the simple 
> optimization
> since zapping and flushing in KVM may be somewhat slow.
> 
> > +   list_for_each_entry(notifier, &data->notifiers, list) {
> > +           notifier->ops->invalidate_start(notifier, start, end);
> 
> Two major design issues that we overlooked long ago:
> 
>   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures 
> a
>      VM with a large number of convertible memslots that are all backed by a
>      single large restrictedmem instance, then converting a single page will
>      result in a linear walk through all memslots.  I don't expect anyone to
>      actually do something silly like that, but I also never expected there 
> to be
>      a legitimate usecase for thousands of memslots.
> 
>   2. This approach fails to provide the ability for KVM to ensure a guest has
>      exclusive access to a page.  As discussed in the past, the kernel can 
> rely
>      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, 
> but
>      only for SNP and TDX VMs.  For VMs where userspace is trusted to some 
> extent,
>      e.g. SEV, there is value in ensuring a 1:1 association.
> 
>      And probably more importantly, relying on hardware for SNP and TDX 
> yields a
>      poor ABI and complicates KVM's internals.  If the kernel doesn't 
> guarantee a
>      page is exclusive to a guest, i.e. if userspace can hand out the same 
> page
>      from a restrictedmem instance to multiple VMs, then failure will occur 
> only
>      when KVM tries to assign the page to the second VM.  That will happen 
> deep
>      in KVM, which means KVM needs to gracefully handle such errors, and it 
> means
>      that KVM's ABI effectively allows plumbing garbage into its memslots.

It may not be a valid usage, but in my TDX environment I do meet below
issue.

kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 
ua=0x7fe1ebfff000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 
ua=0x7fe271579000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 
ua=0x7fe1ec09f000 ret=-22

Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
and slot#2 fails due to below exclusive check.

Currently I changed QEMU code to mark these alias slots as shared
instead of private but I'm not 100% confident this is correct fix.

> 
> Rather than use a simple list of notifiers, this appears to be yet another
> opportunity to use an xarray.  Supporting sharing of restrictedmem will be
> non-trivial, but IMO we should punt that to the future since it's still 
> unclear
> exactly how sharing will work.
> 
> An xarray will solve #1 by notifying only the consumers (memslots) that are 
> bound
> to the affected range.
> 
> And for #2, it's relatively straightforward (knock wood) to detect existing
> entries, i.e. if the user wants exclusive access to memory, then the bind 
> operation
> can be reject if there's an existing entry.
> 
> VERY lightly tested code snippet at the bottom (will provide link to fully 
> worked
> code in cover letter).
> 
> 
> > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int 
> > mode,
> > +                                loff_t offset, loff_t len)
> > +{
> > +   int ret;
> > +   pgoff_t start, end;
> > +   struct file *memfd = data->memfd;
> > +
> > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +           return -EINVAL;
> > +
> > +   start = offset >> PAGE_SHIFT;
> > +   end = (offset + len) >> PAGE_SHIFT;
> > +
> > +   restrictedmem_invalidate_start(data, start, end);
> > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +   restrictedmem_invalidate_end(data, start, end);
> 
> The lock needs to be end for the entire duration of the hole punch, i.e. 
> needs to
> be taken before invalidate_start() and released after invalidate_end().  If a 
> user
> (un)binds/(un)registers after invalidate_state(), it will see an unpaired 
> notification,
> e.g. could leave KVM with incorrect notifier counts.
> 
> > +
> > +   return ret;
> > +}
> 
> What I ended up with for an xarray-based implementation.  I'm very flexible on
> names and whatnot, these are just what made sense to me.
> 
> static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
>                                    loff_t offset, loff_t len)
> {
>       struct restrictedmem_notifier *notifier;
>       struct file *memfd = rm->memfd;
>       unsigned long index;
>       pgoff_t start, end;
>       int ret;
> 
>       if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
>               return -EINVAL;
> 
>       start = offset >> PAGE_SHIFT;
>       end = (offset + len) >> PAGE_SHIFT;
> 
>       /*
>        * Bindings must stable across invalidation to ensure the start+end
>        * are balanced.
>        */
>       down_read(&rm->lock);
> 
>       xa_for_each_range(&rm->bindings, index, notifier, start, end)
>               notifier->ops->invalidate_start(notifier, start, end);
> 
>       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> 
>       xa_for_each_range(&rm->bindings, index, notifier, start, end)
>               notifier->ops->invalidate_end(notifier, start, end);
> 
>       up_read(&rm->lock);
> 
>       return ret;
> }
> 
> int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>                      struct restrictedmem_notifier *notifier, bool exclusive)
> {
>       struct restrictedmem *rm = file->f_mapping->private_data;
>       int ret = -EINVAL;
> 
>       down_write(&rm->lock);
> 
>       /* Non-exclusive mappings are not yet implemented. */
>       if (!exclusive)
>               goto out_unlock;
> 
>       if (!xa_empty(&rm->bindings)) {
>               if (exclusive != rm->exclusive)
>                       goto out_unlock;
> 
>               if (exclusive && xa_find(&rm->bindings, &start, end, 
> XA_PRESENT))
>                       goto out_unlock;
>       }
> 
>       xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
>       rm->exclusive = exclusive;
>       ret = 0;
> out_unlock:
>       up_write(&rm->lock);
>       return ret;
> }
> EXPORT_SYMBOL_GPL(restrictedmem_bind);
> 
> void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
>                         struct restrictedmem_notifier *notifier)
> {
>       struct restrictedmem *rm = file->f_mapping->private_data;
> 
>       down_write(&rm->lock);
>       xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
>       synchronize_rcu();
>       up_write(&rm->lock);
> }
> EXPORT_SYMBOL_GPL(restrictedmem_unbind);

Reply via email to