Re: [RFC PATCH v2] vfio: Include No-IOMMU mode

Michael S. Tsirkin Thu, 15 Oct 2015 09:45:25 -0700

On Wed, Oct 14, 2015 at 03:51:18PM -0600, Alex Williamson wrote:
> There is really no way to safely give a user full access to a DMA
> capable device without an IOMMU to protect the host system.  There is
> also no way to provide DMA translation, for use cases such as device
> assignment to virtual machines.  However, there are still those users
> that want userspace drivers even under those conditions.  The UIO
> driver exists for this use case, but does not provide the degree of
> device access and programming that VFIO has.  In an effort to avoid
> code duplication, this introduces a No-IOMMU mode for VFIO.
> 
> This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
> the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
> should make it very clear that this mode is not safe.  Additionally,
> CAP_SYS_RAWIO privileges are necessary to work with groups and
> containers using this mode.  Groups making use of this support are
> named /dev/vfio/noiommu-$GROUP and can only make use of the special
> VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
> binding a device without a native IOMMU group to a VFIO bus driver
> will taint the kernel and should therefore not be considered
> supported.  This patch includes no-iommu support for the vfio-pci bus
> driver only.
> 
> Signed-off-by: Alex Williamson <alex.william...@redhat.com>


FWIW

Acked-by: Michael S. Tsirkin <m...@redhat.com>

> ---
> 
> v2:
> 
> Less code, more features.  We don't really need to register an iommu_ops
> for a bus, we can get away with creating an IOMMU group on-demand when
> we're running in noiommu mode.  That also let's us have a mechanism to
> test for these fake groups, an IOMMU group without iommu_present().  We
> can therefore be more dynamic in allowing this mode and even allowing it
> alongside properly IOMMU backed devices (what do I care, the kernel is
> already tainted).  This also decouples us from pci_bus_type, even though
> vfio-pci is the only VFIO bus driver modified to enable noiommu here.
> Having combined modes means we can't easily relocate the noiommu groups
> to a separate sub-directory, so as noted in the commit log, they're
> prefixed with "noiommu-" in the /dev/vfio/ directory.  Also, rather than
> CAP_SYS_ADMIN, access to these groups require CAP_SYS_RAWIO, as Avi
> suggested.  These should be the only user visible changes from v1.
> 
> I feel a little better about not hijacking iommu_ops for a whole bus,
> but an IOMMU group w/o iommu_ops still leaves us in a strange situation.
> We can (and do) now remove the IOMMU group when unbound from the VFIO
> bus driver, so this group only exists for the window of time the device
> is owned by the VFIO bus driver, so we need to be careful of what IOMMU
> API interfaces are used, but since our IOMMU backend doesn't do
> anything, we should be all set.  Comments or suggestions?
> 
>  drivers/vfio/Kconfig        |   15 ++++
>  drivers/vfio/pci/vfio_pci.c |    8 +-
>  drivers/vfio/vfio.c         |  183 
> ++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/vfio.h        |    3 +
>  include/uapi/linux/vfio.h   |    7 ++
>  5 files changed, 206 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 4540179..b6d3cdc 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -31,5 +31,20 @@ menuconfig VFIO
>  
>         If you don't know what to do here, say N.
>  
> +menuconfig VFIO_NOIOMMU
> +     bool "VFIO No-IOMMU support"
> +     depends on VFIO
> +     help
> +       VFIO is built on the ability to isolate devices using the IOMMU.
> +       Only with an IOMMU can userspace access to DMA capable devices be
> +       considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
> +       devices without IOMMU backing for the purpose of re-using the VFIO
> +       infrastructure in a non-secure mode.  Use of this mode will result
> +       in an unsupportable kernel and will therefore taint the kernel.
> +       Device assignment to virtual machines is also not possible with
> +       this mode since there is no IOMMU to provide DMA translation.
> +
> +       If you don't know what to do here, say N.
> +
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 964ad57..32b88bd 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>       if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
>               return -EINVAL;
>  
> -     group = iommu_group_get(&pdev->dev);
> +     group = vfio_iommu_group_get(&pdev->dev);
>       if (!group)
>               return -EINVAL;
>  
>       vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
>       if (!vdev) {
> -             iommu_group_put(group);
> +             vfio_iommu_group_put(group, &pdev->dev);
>               return -ENOMEM;
>       }
>  
> @@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>       ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
>       if (ret) {
> -             iommu_group_put(group);
> +             vfio_iommu_group_put(group, &pdev->dev);
>               kfree(vdev);
>               return ret;
>       }
> @@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>       if (!vdev)
>               return;
>  
> -     iommu_group_put(pdev->dev.iommu_group);
> +     vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
>       kfree(vdev);
>  
>       if (vfio_pci_is_vga(pdev)) {
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 563c510..7f75d6d 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -61,6 +61,7 @@ struct vfio_container {
>       struct rw_semaphore             group_lock;
>       struct vfio_iommu_driver        *iommu_driver;
>       void                            *iommu_data;
> +     bool                            noiommu;
>  };
>  
>  struct vfio_unbound_dev {
> @@ -83,6 +84,7 @@ struct vfio_group {
>       struct list_head                unbound_list;
>       struct mutex                    unbound_lock;
>       atomic_t                        opened;
> +     bool                            noiommu;
>  };
>  
>  struct vfio_device {
> @@ -94,6 +96,148 @@ struct vfio_device {
>       void                            *device_data;
>  };
>  
> +#ifdef CONFIG_VFIO_NOIOMMU
> +static bool noiommu __read_mostly;
> +module_param_named(enable_unsafe_noiommu_support,
> +                noiommu, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(enable_unsafe_noiommu_mode, "Enable UNSAFE, no-IOMMU mode.  
> This mode provides no device isolation, no DMA translation, no host kernel 
> protection, cannot be used for device assignment to virtual machines, 
> requires RAWIO permissions, and will taint the kernel.  If you do not know 
> what this is for, step away. (default: false)");
> +#endif
> +
> +/*
> + * vfio_iommu_group_{get,put} are only intended for VFIO bus driver probe
> + * and remove functions, any use cases other than acquiring the first
> + * reference for the purpose of calling vfio_add_group_dev() or removing
> + * that symmetric reference after vfio_del_group_dev() should use the raw
> + * iommu_group_{get,put} functions.  In particular, vfio_iommu_group_put()
> + * removes the device from the dummy group and cannot be nested.
> + */
> +struct iommu_group *vfio_iommu_group_get(struct device *dev)
> +{
> +     struct iommu_group *group;
> +     int __maybe_unused ret;
> +
> +     group = iommu_group_get(dev);
> +
> +#ifdef CONFIG_VFIO_NOIOMMU
> +     /*
> +      * With noiommu enabled, an IOMMU group will be created for a device
> +      * that doesn't already have one and doesn't have an iommu_ops on their
> +      * bus.  We use iommu_present() again in the main code to detect these
> +      * fake groups.
> +      */
> +     if (group || !noiommu || iommu_present(dev->bus))
> +             return group;
> +
> +     group = iommu_group_alloc();
> +     if (IS_ERR(group))
> +             return NULL;
> +
> +     iommu_group_set_name(group, "vfio-noiommu");
> +     ret = iommu_group_add_device(group, dev);
> +     iommu_group_put(group);
> +     if (ret)
> +             return NULL;
> +
> +     /*
> +      * Where to taint?  At this point we've added an IOMMU group for a
> +      * device that is not backed by iommu_ops, therefore any iommu_
> +      * callback using iommu_ops can legitimately Oops.  So, while we may
> +      * be about to give a DMA capable device to a user without IOMMU
> +      * protection, which is clearly taint-worthy, let's go ahead and do
> +      * it here.
> +      */
> +     add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> +     pr_warn("VFIO: Adding kernel taint for vfio-noiommu group on device "
> +             "%s\n", dev_name(dev));
> +#endif
> +
> +     return group;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_group_get);
> +
> +void vfio_iommu_group_put(struct iommu_group *group, struct device *dev)
> +{
> +#ifdef CONFIG_VFIO_NOIOMMU
> +     if (!iommu_present(dev->bus))
> +             iommu_group_remove_device(dev);
> +#endif
> +
> +     iommu_group_put(group);
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_group_put);
> +
> +#ifdef CONFIG_VFIO_NOIOMMU
> +static void *vfio_noiommu_open(unsigned long arg)
> +{
> +     if (arg != VFIO_NOIOMMU_IOMMU)
> +             return ERR_PTR(-EINVAL);
> +     if (!capable(CAP_SYS_RAWIO))
> +             return ERR_PTR(-EPERM);
> +
> +     return NULL;
> +}
> +
> +static void vfio_noiommu_release(void *iommu_data)
> +{
> +}
> +
> +static long vfio_noiommu_ioctl(void *iommu_data,
> +                            unsigned int cmd, unsigned long arg)
> +{
> +     if (cmd == VFIO_CHECK_EXTENSION)
> +             return arg == VFIO_NOIOMMU_IOMMU ? 1 : 0;
> +
> +     return -ENOTTY;
> +}
> +
> +static int vfio_iommu_present(struct device *dev, void *unused)
> +{
> +     return iommu_present(dev->bus) ? 1 : 0;
> +}
> +
> +static int vfio_noiommu_attach_group(void *iommu_data,
> +                                  struct iommu_group *iommu_group)
> +{
> +     return iommu_group_for_each_dev(iommu_group, NULL,
> +                                     vfio_iommu_present) ? -EINVAL : 0;
> +}
> +
> +static void vfio_noiommu_detach_group(void *iommu_data,
> +                                   struct iommu_group *iommu_group)
> +{
> +}
> +
> +static struct vfio_iommu_driver_ops vfio_noiommu_ops = {
> +     .name = "vfio-noiommu",
> +     .owner = THIS_MODULE,
> +     .open = vfio_noiommu_open,
> +     .release = vfio_noiommu_release,
> +     .ioctl = vfio_noiommu_ioctl,
> +     .attach_group = vfio_noiommu_attach_group,
> +     .detach_group = vfio_noiommu_detach_group,
> +};
> +
> +static struct vfio_iommu_driver vfio_noiommu_driver = {
> +     .ops = &vfio_noiommu_ops,
> +};
> +
> +/*
> + * Wrap IOMMU drivers, the noiommu driver is the one and only driver for
> + * noiommu groups (and thus containers) and not available for normal groups.
> + */
> +#define vfio_for_each_iommu_driver(con, pos)                         \
> +     for (pos = con->noiommu ? &vfio_noiommu_driver :                \
> +          list_first_entry(&vfio.iommu_drivers_list,                 \
> +                           struct vfio_iommu_driver, vfio_next);     \
> +          (con->noiommu && pos) || (!con->noiommu &&                 \
> +                     &pos->vfio_next != &vfio.iommu_drivers_list);   \
> +           pos = con->noiommu ? NULL : list_next_entry(pos, vfio_next))
> +#else
> +#define vfio_for_each_iommu_driver(con, pos)                         \
> +     list_for_each_entry(pos, &vfio.iommu_drivers_list, vfio_next)
> +#endif
> +
> +
>  /**
>   * IOMMU driver registration
>   */
> @@ -198,7 +342,8 @@ static void vfio_group_unlock_and_free(struct vfio_group 
> *group)
>  /**
>   * Group objects - create, release, get, put, search
>   */
> -static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
> +static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
> +                                         bool noiommu)
>  {
>       struct vfio_group *group, *tmp;
>       struct device *dev;
> @@ -216,6 +361,7 @@ static struct vfio_group *vfio_create_group(struct 
> iommu_group *iommu_group)
>       atomic_set(&group->container_users, 0);
>       atomic_set(&group->opened, 0);
>       group->iommu_group = iommu_group;
> +     group->noiommu = noiommu;
>  
>       group->nb.notifier_call = vfio_iommu_group_notifier;
>  
> @@ -251,7 +397,8 @@ static struct vfio_group *vfio_create_group(struct 
> iommu_group *iommu_group)
>  
>       dev = device_create(vfio.class, NULL,
>                           MKDEV(MAJOR(vfio.group_devt), minor),
> -                         group, "%d", iommu_group_id(iommu_group));
> +                         group, "%s%d", noiommu ? "noiommu-" : "",
> +                         iommu_group_id(iommu_group));
>       if (IS_ERR(dev)) {
>               vfio_free_group_minor(minor);
>               vfio_group_unlock_and_free(group);
> @@ -621,7 +768,8 @@ int vfio_add_group_dev(struct device *dev,
>  
>       group = vfio_group_get_from_iommu(iommu_group);
>       if (!group) {
> -             group = vfio_create_group(iommu_group);
> +             group = vfio_create_group(iommu_group,
> +                                       !iommu_present(dev->bus));
>               if (IS_ERR(group)) {
>                       iommu_group_put(iommu_group);
>                       return PTR_ERR(group);
> @@ -832,8 +980,7 @@ static long vfio_ioctl_check_extension(struct 
> vfio_container *container,
>                */
>               if (!driver) {
>                       mutex_lock(&vfio.iommu_drivers_lock);
> -                     list_for_each_entry(driver, &vfio.iommu_drivers_list,
> -                                         vfio_next) {
> +                     vfio_for_each_iommu_driver(container, driver) {
>                               if (!try_module_get(driver->ops->owner))
>                                       continue;
>  
> @@ -902,7 +1049,7 @@ static long vfio_ioctl_set_iommu(struct vfio_container 
> *container,
>       }
>  
>       mutex_lock(&vfio.iommu_drivers_lock);
> -     list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
> +     vfio_for_each_iommu_driver(container, driver) {
>               void *data;
>  
>               if (!try_module_get(driver->ops->owner))
> @@ -1167,6 +1314,9 @@ static int vfio_group_set_container(struct vfio_group 
> *group, int container_fd)
>       if (atomic_read(&group->container_users))
>               return -EINVAL;
>  
> +     if (group->noiommu && !capable(CAP_SYS_RAWIO))
> +             return -EPERM;
> +
>       f = fdget(container_fd);
>       if (!f.file)
>               return -EBADF;
> @@ -1182,6 +1332,13 @@ static int vfio_group_set_container(struct vfio_group 
> *group, int container_fd)
>  
>       down_write(&container->group_lock);
>  
> +     /* Real groups and fake groups cannot mix */
> +     if (!list_empty(&container->group_list) &&
> +         container->noiommu != group->noiommu) {
> +             ret = -EPERM;
> +             goto unlock_out;
> +     }
> +
>       driver = container->iommu_driver;
>       if (driver) {
>               ret = driver->ops->attach_group(container->iommu_data,
> @@ -1191,6 +1348,7 @@ static int vfio_group_set_container(struct vfio_group 
> *group, int container_fd)
>       }
>  
>       group->container = container;
> +     container->noiommu = group->noiommu;
>       list_add(&group->container_next, &container->group_list);
>  
>       /* Get a reference on the container and mark a user within the group */
> @@ -1221,6 +1379,9 @@ static int vfio_group_get_device_fd(struct vfio_group 
> *group, char *buf)
>           !group->container->iommu_driver || !vfio_group_viable(group))
>               return -EINVAL;
>  
> +     if (group->noiommu && !capable(CAP_SYS_RAWIO))
> +             return -EPERM;
> +
>       device = vfio_device_get_from_name(group, buf);
>       if (!device)
>               return -ENODEV;
> @@ -1351,6 +1512,11 @@ static int vfio_group_fops_open(struct inode *inode, 
> struct file *filep)
>       if (!group)
>               return -ENODEV;
>  
> +     if (group->noiommu && !capable(CAP_SYS_RAWIO)) {
> +             vfio_group_put(group);
> +             return -EPERM;
> +     }
> +
>       /* Do we need multiple instances of the group open?  Seems not. */
>       opened = atomic_cmpxchg(&group->opened, 0, 1);
>       if (opened) {
> @@ -1513,6 +1679,11 @@ struct vfio_group *vfio_group_get_external_user(struct 
> file *filep)
>       if (!atomic_inc_not_zero(&group->container_users))
>               return ERR_PTR(-EINVAL);
>  
> +     if (group->noiommu) {
> +             atomic_dec(&group->container_users);
> +             return ERR_PTR(-EPERM);
> +     }
> +
>       if (!group->container->iommu_driver ||
>                       !vfio_group_viable(group)) {
>               atomic_dec(&group->container_users);
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ddb4409..610a86a 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -44,6 +44,9 @@ struct vfio_device_ops {
>       void    (*request)(void *device_data, unsigned int count);
>  };
>  
> +extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
> +extern void vfio_iommu_group_put(struct iommu_group *group, struct device 
> *dev);
> +
>  extern int vfio_add_group_dev(struct device *dev,
>                             const struct vfio_device_ops *ops,
>                             void *device_data);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9fd7b5d..34f7faf 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -39,6 +39,13 @@
>  #define VFIO_SPAPR_TCE_v2_IOMMU              7
>  
>  /*
> + * The No-IOMMU IOMMU offers no translation or isolation for devices and
> + * supports no iotls

ioctls

> outside of VFIO_CHECK_EXTENSION.  Use of VFIO's No-IOMMU
> + * code will taint the host kernel and should be used with extreme caution.
> + */
> +#define VFIO_NOIOMMU_IOMMU           8
> +
> +/*
>   * The IOCTL interface is designed for extensibility by embedding the
>   * structure length (argsz) and flags into structures passed between
>   * kernel and userspace.  We therefore use the _IO() macro for these
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2] vfio: Include No-IOMMU mode

Reply via email to