On Sat, Mar 15, 2025 at 02:41:43PM -0400, James Bottomley wrote:
> On Sat, 2025-03-15 at 11:04 +0100, Christian Brauner wrote:
> [...]
> > Since efivars uses a single global superblock and we know that sfi-
> > >sb is still alive (After all we've just pinned it above.)
> > vfs_kern_mount() will reuse the same superblock.
> > 
> > There's two cases to consider:
> > 
> > (1) vfs_kern_mount() was successful. In this case path->mnt will hold
> > an active superblock reference that will be released asynchronously
> > via __fput(). That is safe and correct.
> > 
> > (2) vfs_kern_mount() fails. That's an issue because you need to call
> > deactivate_super() which will have a similar deadlock problem.
> > If efivarfs_pm_notify() now holds the last reference to the
> > superblock then deactivate_super() super will put that last
> > reference and call efivarfs_kill_super() which in turn will wait for
> > efivarfs_pm_notify() to finish. => deadlock
> > 
> > So in the error case you need to offload the call to
> > deactivate_super() to a workqueue.
> 
> OK, got that (although it did make my head explode a bit ... this is
> certainly subtle stuff).  To do the delayed work for the
> deactivate_super(), I hijacked the superblock destroy_work structure
> which I think is safe because by the time the work structure is
> executed, we own it and so it can be reused for the actual destroy_work
> in deactivate_super().
> 
> However, there's another problem: the mntput after kernel_file_open may
> synchronously call cleanup_mnt() (and thus deactivate_super()) if the
> open fails because it's marked MNT_INTERNAL, which is caused by
> SB_KERNMOUNT.  I fixed this just by not passing the SB_KERNMOUNT flag,
> which feels a bit hacky.

It actually isn't. We know that vfs_kern_mount() will always resurface
the single superblock that's exposed to userspace because we've just
taken a reference to it earlier in efivarfs_pm_notify(). So that
SB_KERNMOUNT flag is ignored because no new superblock is allocated. It
would only matter if we'd end up allocating a new superblock which we
never do.

And if we did it would be a bug because the superblock we allocate could
be reused at any time if a userspace task mounts efivarfs before
efivarfs_pm_notify() has destroyed it (or the respective workqueue).
But that superblock would then have SB_KERNMOUNT for something that's
not supposed to be one.

And whether or not that helper mount has MNT_INTERNAL is immaterial to
what you're doing here afaict.

So not passing the SB_KERNMOUNT flag is the right thing (see devtmpfs as
well). You could slap a comment in here explaining that we never
allocate a new superblock so it's clear to people not familiar with this
particular code.

> 
> I've put together everything at the bottom, however, I can't test the
> error legs of this because trying to trigger and unmount and hybernate
> at exactly the right point is pretty much impossible.  The rest seems
> to work as advertised, although I would like a tested-by from the
> apparmour people because I do run apparmour in my debian test rig but
> don't see the problem.
> 
> Regards,
> 
> James
> 
> ---
> 
> diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
> index 6eae8cf655c1..2d826e98066b 100644
> --- a/fs/efivarfs/super.c
> +++ b/fs/efivarfs/super.c
> @@ -474,12 +474,25 @@ static int efivarfs_check_missing(efi_char16_t *name16, 
> efi_guid_t vendor,
>       return err;
>  }
>  
> +static void efivarfs_deactivate_super_work(struct work_struct *work)
> +{
> +     struct super_block *s = container_of(work, struct super_block,
> +                                          destroy_work);
> +     /*
> +      * note: here s->destroy_work is free for reuse (which
> +      * will happen in deactivate_super)
> +      */
> +     deactivate_super(s);
> +}
> +
> +static struct file_system_type efivarfs_type;
> +
>  static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long 
> action,
>                             void *ptr)
>  {
>       struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
>                                                   pm_nb);
> -     struct path path = { .mnt = NULL, .dentry = sfi->sb->s_root, };
> +     struct path path;
>       struct efivarfs_ctx ectx = {
>               .ctx = {
>                       .actor  = efivarfs_actor,
> @@ -487,6 +500,7 @@ static int efivarfs_pm_notify(struct notifier_block *nb, 
> unsigned long action,
>               .sb = sfi->sb,
>       };
>       struct file *file;
> +     struct super_block *s = sfi->sb;
>       static bool rescan_done = true;
>  
>       if (action == PM_HIBERNATION_PREPARE) {
> @@ -499,11 +513,39 @@ static int efivarfs_pm_notify(struct notifier_block 
> *nb, unsigned long action,
>       if (rescan_done)
>               return NOTIFY_DONE;
>  
> +     /* ensure single superblock is alive and pin it */
> +     if (!atomic_inc_not_zero(&s->s_active))
> +             return NOTIFY_DONE;
> +
>       pr_info("efivarfs: resyncing variable state\n");
>  
> -     /* O_NOATIME is required to prevent oops on NULL mnt */
> +     path.dentry = sfi->sb->s_root;
> +
> +     /* do not add SB_KERNMOUNT which causes MNT_INTERNAL, see below */
> +     path.mnt = vfs_kern_mount(&efivarfs_type, 0,
> +                               efivarfs_type.name, NULL);
> +     if (IS_ERR(path.mnt)) {
> +             pr_err("efivarfs: internal mount failed\n");
> +             /*
> +              * We may be the last pinner of the superblock but
> +              * calling efivarfs_kill_sb from within the notifier
> +              * here would deadlock trying to unregister it
> +              */
> +             INIT_WORK(&s->destroy_work, efivarfs_deactivate_super_work);
> +             schedule_work(&s->destroy_work);
> +             return PTR_ERR(path.mnt);
> +     }
> +
> +     /* path.mnt now has pin on superblock, so this must be above one */
> +     atomic_dec(&s->s_active);
> +
>       file = kernel_file_open(&path, O_RDONLY | O_DIRECTORY | O_NOATIME,
>                               current_cred());
> +     /*
> +      * safe even if last put because no MNT_INTERNAL means this
> +      * will do delayed deactivate_super and not deadlock
> +      */
> +     mntput(path.mnt);
>       if (IS_ERR(file))
>               return NOTIFY_DONE;
>  
> 
> 

Reply via email to