from:"Alex Williamson"

Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Alex Williamson

On Thu, 2012-11-22 at 11:56 +, Sethi Varun-B16395 wrote:
> 
> > -----Original Message-----
> > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> > ow...@vger.kernel.org] On Behalf Of Alex Williamson
> > Sent: Tuesday, November 20, 2012 11:50 PM
> > To: Alexey Kardashevskiy
> > Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc-
> > d...@lists.ozlabs.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > David Gibson
> > Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv
> > platform
> > 
> > On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote:
> > > VFIO implements platform independent stuff such as a PCI driver, BAR
> > > access (via read/write on a file descriptor or direct mapping when
> > > possible) and IRQ signaling.
> > > The platform dependent part includes IOMMU initialization and
> > > handling.
> > >
> > > This patch initializes IOMMU groups based on the IOMMU configuration
> > > discovered during the PCI scan, only POWERNV platform is supported at
> > > the moment.
> > >
> > > Also the patch implements an VFIO-IOMMU driver which manages DMA
> > > mapping/unmapping requests coming from the client (now QEMU). It also
> > > returns a DMA window information to let the guest initialize the
> > > device tree for a guest OS properly. Although this driver has been
> > > tested only on POWERNV, it should work on any platform supporting TCE
> > > tables.
> > >
> > > To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option.
> > >
> > > Cc: David Gibson 
> > > Signed-off-by: Alexey Kardashevskiy 
> > > ---
> > >  arch/powerpc/include/asm/iommu.h |6 +
> > >  arch/powerpc/kernel/iommu.c  |  140 +++
> > >  arch/powerpc/platforms/powernv/pci.c |  135 +++
> > >  drivers/iommu/Kconfig|8 ++
> > >  drivers/vfio/Kconfig |6 +
> > >  drivers/vfio/Makefile|1 +
> > > >  drivers/vfio/vfio_iommu_spapr_tce.c  |  247 ++
> > >  include/linux/vfio.h |   20 +++
> > >  8 files changed, 563 insertions(+)
> > >  create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> > >
> > > > diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> > > > index cbfe678..5ba66cb 100644
> > > > --- a/arch/powerpc/include/asm/iommu.h
> > > > +++ b/arch/powerpc/include/asm/iommu.h
> > > > @@ -64,30 +64,33 @@ struct iommu_pool {
> > > >  } cacheline_aligned_in_smp;
> > >
> > >  struct iommu_table {
> > >   unsigned long  it_busno; /* Bus number this table belongs to */
> > >   unsigned long  it_size;  /* Size of iommu table in entries */
> > >   unsigned long  it_offset;/* Offset into global table */
> > >   unsigned long  it_base;  /* mapped address of tce table */
> > >   unsigned long  it_index; /* which iommu table this is */
> > >   unsigned long  it_type;  /* type: PCI or Virtual Bus */
> > > >   unsigned long  it_blocksize; /* Entries in each block (cacheline) */
> > > >   unsigned long  poolsize;
> > > >   unsigned long  nr_pools;
> > > >   struct iommu_pool large_pool;
> > > >   struct iommu_pool pools[IOMMU_NR_POOLS];
> > > >   unsigned long *it_map;   /* A simple allocation bitmap for now */
> > > +#ifdef CONFIG_IOMMU_API
> > > + struct iommu_group *it_group;
> > > +#endif
> > >  };
> > >
> > >  struct scatterlist;
> > >
> > > >  static inline void set_iommu_table_base(struct device *dev, void *base)
> > > >  {
> > > >   dev->archdata.dma_data.iommu_table_base = base;
> > > >  }
> > > >
> > > >  static inline void *get_iommu_table_base(struct device *dev)
> > > >  {
> > > >   return dev->archdata.dma_data.iommu_table_base;
> > > >  }
> > > >
> > > >  /* Frees table for an individual device node */
> > > > @@ -135,17 +138,20 @@ static inline void pci_iommu_init(void) { }
> > > >  extern void alloc_dart_table(void);
> > > >  #if defined(CONFIG_PPC64) && defined(CONFIG_PM)
> > > >  static inline void iommu_save(void)
> > > >  {
> > > >   if (ppc_md.iommu_save)
> > > >   ppc_md.iommu_save();
> > > >  }
> > > >
> > > >  static inline void iommu_restore(void)
> > > >  {
> > > >   if (ppc_md.iommu_restore)
> > > >   ppc_md.iommu_

Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Alex Williamson

On Fri, 2012-11-23 at 13:02 +1100, Alexey Kardashevskiy wrote:
> On 22/11/12 22:56, Sethi Varun-B16395 wrote:
> >
> >
> >> -----Original Message-----
> >> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> >> ow...@vger.kernel.org] On Behalf Of Alex Williamson
> >> Sent: Tuesday, November 20, 2012 11:50 PM
> >> To: Alexey Kardashevskiy
> >> Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc-
> >> d...@lists.ozlabs.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> >> David Gibson
> >> Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv
> >> platform
> >>
> >> On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote:
> >>> VFIO implements platform independent stuff such as a PCI driver, BAR
> >>> access (via read/write on a file descriptor or direct mapping when
> >>> possible) and IRQ signaling.
> >>> The platform dependent part includes IOMMU initialization and
> >>> handling.
> >>>
> >>> This patch initializes IOMMU groups based on the IOMMU configuration
> >>> discovered during the PCI scan, only POWERNV platform is supported at
> >>> the moment.
> >>>
> >>> Also the patch implements an VFIO-IOMMU driver which manages DMA
> >>> mapping/unmapping requests coming from the client (now QEMU). It also
> >>> returns a DMA window information to let the guest initialize the
> >>> device tree for a guest OS properly. Although this driver has been
> >>> tested only on POWERNV, it should work on any platform supporting TCE
> >>> tables.
> >>>
> >>> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option.
> >>>
> >>> Cc: David Gibson 
> >>> Signed-off-by: Alexey Kardashevskiy 
> >>> ---
> >>>   arch/powerpc/include/asm/iommu.h |6 +
> >>>   arch/powerpc/kernel/iommu.c  |  140 +++
> >>>   arch/powerpc/platforms/powernv/pci.c |  135 +++
> >>>   drivers/iommu/Kconfig|8 ++
> >>>   drivers/vfio/Kconfig |6 +
> >>>   drivers/vfio/Makefile|1 +
> >>>   drivers/vfio/vfio_iommu_spapr_tce.c  |  247 ++
> >>>   include/linux/vfio.h |   20 +++
> >>>   8 files changed, 563 insertions(+)
> >>>   create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> >>>
> >>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>> index cbfe678..5ba66cb 100644
> >>> --- a/arch/powerpc/include/asm/iommu.h
> >>> +++ b/arch/powerpc/include/asm/iommu.h
> >>> @@ -64,30 +64,33 @@ struct iommu_pool {
> >>>  } cacheline_aligned_in_smp;
> >>>
> >>>   struct iommu_table {
> >>>   unsigned long  it_busno; /* Bus number this table belongs to */
> >>>   unsigned long  it_size;  /* Size of iommu table in entries */
> >>>   unsigned long  it_offset;/* Offset into global table */
> >>>   unsigned long  it_base;  /* mapped address of tce table */
> >>>   unsigned long  it_index; /* which iommu table this is */
> >>>   unsigned long  it_type;  /* type: PCI or Virtual Bus */
> >>>   unsigned long  it_blocksize; /* Entries in each block (cacheline) */
> >>>   unsigned long  poolsize;
> >>>   unsigned long  nr_pools;
> >>>   struct iommu_pool large_pool;
> >>>   struct iommu_pool pools[IOMMU_NR_POOLS];
> >>>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> >>> +#ifdef CONFIG_IOMMU_API
> >>> + struct iommu_group *it_group;
> >>> +#endif
> >>>   };
> >>>
> >>>   struct scatterlist;
> >>>
> >>>   static inline void set_iommu_table_base(struct device *dev, void *base)
> >>>   {
> >>>   dev->archdata.dma_data.iommu_table_base = base;
> >>>   }
> >>>
> >>>   static inline void *get_iommu_table_base(struct device *dev)
> >>>   {
> >>>   return dev->archdata.dma_data.iommu_table_base;
> >>>   }
> >>>
> >>>   /*

Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Alex Williamson

On Mon, 2012-11-26 at 08:18 -0700, Alex Williamson wrote:
> On Fri, 2012-11-23 at 13:02 +1100, Alexey Kardashevskiy wrote:
> > On 22/11/12 22:56, Sethi Varun-B16395 wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> > >> ow...@vger.kernel.org] On Behalf Of Alex Williamson
> > >> Sent: Tuesday, November 20, 2012 11:50 PM
> > >> To: Alexey Kardashevskiy
> > >> Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc-
> > >> d...@lists.ozlabs.org; linux-kernel@vger.kernel.org; 
> > >> k...@vger.kernel.org;
> > >> David Gibson
> > >> Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv
> > >> platform
> > >>
> > >> On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote:
> > >>> VFIO implements platform independent stuff such as a PCI driver, BAR
> > >>> access (via read/write on a file descriptor or direct mapping when
> > >>> possible) and IRQ signaling.
> > >>> The platform dependent part includes IOMMU initialization and
> > >>> handling.
> > >>>
> > >>> This patch initializes IOMMU groups based on the IOMMU configuration
> > >>> discovered during the PCI scan, only POWERNV platform is supported at
> > >>> the moment.
> > >>>
> > >>> Also the patch implements an VFIO-IOMMU driver which manages DMA
> > >>> mapping/unmapping requests coming from the client (now QEMU). It also
> > >>> returns a DMA window information to let the guest initialize the
> > >>> device tree for a guest OS properly. Although this driver has been
> > >>> tested only on POWERNV, it should work on any platform supporting TCE
> > >>> tables.
> > >>>
> > >>> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option.
> > >>>
> > >>> Cc: David Gibson 
> > >>> Signed-off-by: Alexey Kardashevskiy 
> > >>> ---
> > >>>   arch/powerpc/include/asm/iommu.h |6 +
> > >>>   arch/powerpc/kernel/iommu.c  |  140 +++
> > >>>   arch/powerpc/platforms/powernv/pci.c |  135 +++
> > >>>   drivers/iommu/Kconfig|8 ++
> > >>>   drivers/vfio/Kconfig |6 +
> > >>>   drivers/vfio/Makefile|1 +
> > >>>   drivers/vfio/vfio_iommu_spapr_tce.c  |  247 ++
> > >>>   include/linux/vfio.h |   20 +++
> > >>>   8 files changed, 563 insertions(+)
> > >>>   create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> > >>>
> > >>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> > >>> index cbfe678..5ba66cb 100644
> > >>> --- a/arch/powerpc/include/asm/iommu.h
> > >>> +++ b/arch/powerpc/include/asm/iommu.h
> > >>> @@ -64,30 +64,33 @@ struct iommu_pool {
> > >>>  } cacheline_aligned_in_smp;
> > >>>
> > >>>   struct iommu_table {
> > >>> unsigned long  it_busno; /* Bus number this table belongs to */
> > >>> unsigned long  it_size;  /* Size of iommu table in entries */
> > >>> unsigned long  it_offset;/* Offset into global table */
> > >>> unsigned long  it_base;  /* mapped address of tce table */
> > >>> unsigned long  it_index; /* which iommu table this is */
> > >>> unsigned long  it_type;  /* type: PCI or Virtual Bus */
> > >>> unsigned long  it_blocksize; /* Entries in each block (cacheline) */
> > >>> unsigned long  poolsize;
> > >>> unsigned long  nr_pools;
> > >>> struct iommu_pool large_pool;
> > >>> struct iommu_pool pools[IOMMU_NR_POOLS];
> > >>> unsigned long *it_map;   /* A simple allocation bitmap for now */
> > >>> +#ifdef CONFIG_IOMMU_API
> > >>> +   struct iommu_group *it_group;
> > >>> +#endif
> > >>>   };
> > >>>
> > >>>   struct s

Re: [PATCH 1/2] vfio powerpc: implemented IOMMU driver for VFIO

2012-11-26 Thread Alex Williamson

On Fri, 2012-11-23 at 20:03 +1100, Alexey Kardashevskiy wrote:
> VFIO implements platform independent stuff such as
> a PCI driver, BAR access (via read/write on a file descriptor
> or direct mapping when possible) and IRQ signaling.
> 
> The platform dependent part includes IOMMU initialization
> and handling. This patch implements an IOMMU driver for VFIO
> which does mapping/unmapping pages for the guest IO and
> provides information about DMA window (required by a POWERPC
> guest).
> 
> The counterpart in QEMU is required to support this functionality.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/Kconfig|6 +
>  drivers/vfio/Makefile   |1 +
>  drivers/vfio/vfio_iommu_spapr_tce.c |  247 +++
>  include/linux/vfio.h|   20 +++
>  4 files changed, 274 insertions(+)
>  create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 7cd5dec..b464687 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
>   depends on VFIO
>   default n
>  
> +config VFIO_IOMMU_SPAPR_TCE
> + tristate
> + depends on VFIO && SPAPR_TCE_IOMMU
> + default n
> +
>  menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
>   select VFIO_IOMMU_TYPE1 if X86
> + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
>   help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 2398d4a..72bfabc 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_VFIO) += vfio.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> new file mode 100644
> index 000..46a6298
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -0,0 +1,247 @@
> +/*
> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> + *
> + * Copyright (C) 2012 IBM Corp.  All rights reserved.
> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_iommu_type1.c:
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + * Author: Alex Williamson 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "a...@ozlabs.ru"
> +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
> +
> +static void tce_iommu_detach_group(void *iommu_data,
> + struct iommu_group *iommu_group);
> +
> +/*
> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> + */
> +
> +/*
> + * The container descriptor supports only a single group per container.
> + * Required by the API as the container is not supplied with the IOMMU group
> + * at the moment of initialization.
> + */
> +struct tce_container {
> + struct mutex lock;
> + struct iommu_table *tbl;
> +};
> +
> +static void *tce_iommu_open(unsigned long arg)
> +{
> + struct tce_container *container;
> +
> + if (arg != VFIO_SPAPR_TCE_IOMMU) {
> + printk(KERN_ERR "tce_vfio: Wrong IOMMU type\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + container = kzalloc(sizeof(*container), GFP_KERNEL);
> + if (!container)
> + return ERR_PTR(-ENOMEM);
> +
> + mutex_init(&container->lock);
> +
> + return container;
> +}
> +
> +static void tce_iommu_release(void *iommu_data)
> +{
> + struct tce_container *container = iommu_data;
> +
> + WARN_ON(container->tbl && !container->tbl->it_group);

I think your patch ordering is backwards here.  it_group isn't added
until 2/2.  I'd really like to see the arch/powerpc code approved and
merged by the powerpc maintainer before we add the code that makes use
of it into vfio.  Otherwise we just get lots of churn if interfaces
change or they disapprove of it altogether.

> + if (container->tbl && container->tbl->it_group)

Re: [PATCH] vfio powerpc: enabled and supported on powernv platform

2012-11-26 Thread Alex Williamson

On Tue, 2012-11-27 at 14:28 +1100, Alexey Kardashevskiy wrote:
> On 27/11/12 05:04, Alex Williamson wrote:
> > On Mon, 2012-11-26 at 08:18 -0700, Alex Williamson wrote:
> >> On Fri, 2012-11-23 at 13:02 +1100, Alexey Kardashevskiy wrote:
> >>> On 22/11/12 22:56, Sethi Varun-B16395 wrote:
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> >>>>> ow...@vger.kernel.org] On Behalf Of Alex Williamson
> >>>>> Sent: Tuesday, November 20, 2012 11:50 PM
> >>>>> To: Alexey Kardashevskiy
> >>>>> Cc: Benjamin Herrenschmidt; Paul Mackerras; linuxppc-
> >>>>> d...@lists.ozlabs.org; linux-kernel@vger.kernel.org; 
> >>>>> k...@vger.kernel.org;
> >>>>> David Gibson
> >>>>> Subject: Re: [PATCH] vfio powerpc: enabled and supported on powernv
> >>>>> platform
> >>>>>
> >>>>> On Tue, 2012-11-20 at 11:48 +1100, Alexey Kardashevskiy wrote:
> >>>>>> VFIO implements platform independent stuff such as a PCI driver, BAR
> >>>>>> access (via read/write on a file descriptor or direct mapping when
> >>>>>> possible) and IRQ signaling.
> >>>>>> The platform dependent part includes IOMMU initialization and
> >>>>>> handling.
> >>>>>>
> >>>>>> This patch initializes IOMMU groups based on the IOMMU configuration
> >>>>>> discovered during the PCI scan, only POWERNV platform is supported at
> >>>>>> the moment.
> >>>>>>
> >>>>>> Also the patch implements an VFIO-IOMMU driver which manages DMA
> >>>>>> mapping/unmapping requests coming from the client (now QEMU). It also
> >>>>>> returns a DMA window information to let the guest initialize the
> >>>>>> device tree for a guest OS properly. Although this driver has been
> >>>>>> tested only on POWERNV, it should work on any platform supporting TCE
> >>>>>> tables.
> >>>>>>
> >>>>>> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config option.
> >>>>>>
> >>>>>> Cc: David Gibson 
> >>>>>> Signed-off-by: Alexey Kardashevskiy 
> >>>>>> ---
> >>>>>>arch/powerpc/include/asm/iommu.h |6 +
> >>>>>>arch/powerpc/kernel/iommu.c  |  140 +++
> >>>>>>arch/powerpc/platforms/powernv/pci.c |  135 +++
> >>>>>>drivers/iommu/Kconfig|8 ++
> >>>>>>drivers/vfio/Kconfig |6 +
> >>>>>>drivers/vfio/Makefile|1 +
> >>>>>>drivers/vfio/vfio_iommu_spapr_tce.c  |  247 ++
> >>>>>>include/linux/vfio.h |   20 +++
> >>>>>>8 files changed, 563 insertions(+)
> >>>>>>create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>
> >>>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>>> index cbfe678..5ba66cb 100644
> >>>>>> --- a/arch/powerpc/include/asm/iommu.h
> >>>>>> +++ b/arch/powerpc/include/asm/iommu.h
> >>>>>> @@ -64,30 +64,33 @@ struct iommu_pool {
> >>>>>>  } cacheline_aligned_in_smp;
> >>>>>>
> >>>>>>struct iommu_table {
> >>>>>>unsigned long  it_busno; /* Bus number this table belongs to */
> >>>>>>unsigned long  it_size;  /* Size of iommu table in entries */
> >>>>>>unsigned long  it_offset;/* Offset into global table */
> >>>>>>unsigned long  it_base;  /* mapped address of tce table */
> >>>>>>unsigned long  it_index; /* which iommu table this is */
> >>>>>>unsigned long  it_type;  /* type: PCI or Virtual Bus */
> >>>>>>unsigned long  it_blocksize; /* Entries in each block (cacheline)
> >>>>

Re: [PATCH 2/2] vfio powerpc: enabled on powernv platform

2012-11-26 Thread Alex Williamson

On Fri, 2012-11-23 at 20:03 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |6 ++
>  arch/powerpc/kernel/iommu.c  |  141 ++
>  arch/powerpc/platforms/powernv/pci.c |  135 
>  drivers/iommu/Kconfig|8 ++
>  4 files changed, 290 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index cbfe678..5ba66cb 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,8 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry, uint64_t tce,
> + enum dma_data_direction direction, unsigned long pages);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..c8dad1f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +857,143 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +static struct page *free_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> + struct page *page;
> + unsigned long oldtce;
> +
> + oldtce = ppc_md.tce_get(tbl, entry);
> +
> + if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + return NULL;
> +
> + page = pfn_to_page(oldtce >> PAGE_SHIFT);
> +
> + WARN_ON(!page);
> + if (page && (oldtce & TCE_PCI_WRITE))
> + SetPageDirty(page);
> + ppc_md.tce_free(tbl, entry, 1);
> +
> + return page;
> +}
> +
> +static int put_tce(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction)
> +{
> + int ret;
> + struct page *page = NULL;
> + unsigned long kva, offset;
> +
> + /* Map new TCE */
> + offset = (tce & IOMMU_PAGE_MASK) - (tce & PAGE_MASK);
> + ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> + direction != DMA_TO_DEVICE, &page);

We're locking memory here on behalf of the user, but I don't see where
rlimit gets checked to verify the user has privileges to lock the pages.
I know you're locking a much smaller set of memory than x86 does, but
are we just forgoing that added security?
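
For illustration only, the kind of check being asked about can be sketched
in user-space C. This is not the code from the patch or from any existing
VFIO driver: the helper names, the 4K page shift, and the "privileged" flag
(standing in for a capability such as CAP_IPC_LOCK) are assumptions made for
the sketch. An in-kernel version would presumably compare the pages already
accounted to the task against its RLIMIT_MEMLOCK limit before pinning more.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/resource.h>

/* Assumed page shift for this sketch; kernel code would use PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Pure accounting check (hypothetical helper): may `npages` more pages be
 * pinned when `locked_pages` are already pinned and the limit is
 * `limit_bytes`?  `privileged` stands in for a capability check.
 * Returns 0 if the request fits, -ENOMEM otherwise.
 */
int sketch_check_lock_limit(unsigned long locked_pages,
                            unsigned long npages,
                            unsigned long long limit_bytes,
                            bool privileged)
{
    unsigned long long limit_pages = limit_bytes >> SKETCH_PAGE_SHIFT;

    if (privileged)
        return 0;
    /* Compare by subtraction so locked_pages + npages cannot overflow. */
    if (locked_pages > limit_pages || npages > limit_pages - locked_pages)
        return -ENOMEM;
    return 0;
}

/*
 * Same check against the calling process's soft RLIMIT_MEMLOCK
 * (hypothetical helper).  Returns 0, -ENOMEM, or a negative errno
 * if the limit cannot be read.
 */
int sketch_check_against_rlimit(unsigned long locked_pages,
                                unsigned long npages,
                                bool privileged)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -errno;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 0;
    return sketch_check_lock_limit(locked_pages, npages,
                                   (unsigned long long)rl.rlim_cur,
                                   privileged);
}
```

With a 64K limit and 4K pages, sketch_check_lock_limit(0, 16, 65536, false)
allows the request, while one more page on top of that is refused with
-ENOMEM. Where such a check would sit relative to get_user_pages_fast() in
put_tce() is left open here.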

> + if (ret < 1) {
> + printk(KERN_ERR "tce_vfio: get_user_pages_fast failed tce=%llx ioba=%lx ret=%d\n",
> + tce, entry << IOMMU_PAGE_SHIFT, ret);
> + if (!ret)
> + ret = -EFAULT;
> + return ret;
> + }
> +
> + kva = (unsigned long) page_address(page);
> + kva += offset;
> +
> + /* tce_build receives a virtual address */
> + entry += tbl->it_offset; /* Offset into real TCE table */
> + ret = ppc_md.tce_build(tbl, entry, 1, kva, direction, NULL);
> +
> + /* tce_build() only returns non-zero for transient errors */
> + if (unlikely(ret)) {
> + printk(KERN_ERR "tce_vfio: tce_put failed on tce=%llx ioba=%lx kva=%lx ret=%d\n",
> + tce, entry << IOMMU_PAGE_SHIFT, kva, ret);
> + put_page(page);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +static void tce_flush(struct iommu_table *tbl)
> +{
> + /* Flush/invalidate TLB caches if necessary */
> + if (ppc_md.tce_flush)
> + ppc_md.tce_flush(tbl);
> +
> + /* Make sure updates are seen by hardware */
> + mb();
> +}
> +
> +long iommu_put_tces(struct iommu_table *tbl, unsigned long entry, uint64_t tce,
> + enum dma_data_direction direction,

Re: [PATCH 1/2] vfio powerpc: implemented IOMMU driver for VFIO

2012-11-26 Thread Alex Williamson

On Tue, 2012-11-27 at 15:06 +1100, Alexey Kardashevskiy wrote:
> On 27/11/12 05:20, Alex Williamson wrote:
> > On Fri, 2012-11-23 at 20:03 +1100, Alexey Kardashevskiy wrote:
> >> VFIO implements platform independent stuff such as
> >> a PCI driver, BAR access (via read/write on a file descriptor
> >> or direct mapping when possible) and IRQ signaling.
> >>
> >> The platform dependent part includes IOMMU initialization
> >> and handling. This patch implements an IOMMU driver for VFIO
> >> which does mapping/unmapping pages for the guest IO and
> >> provides information about DMA window (required by a POWERPC
> >> guest).
> >>
> >> The counterpart in QEMU is required to support this functionality.
> >>
> >> Cc: David Gibson 
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>   drivers/vfio/Kconfig|6 +
> >>   drivers/vfio/Makefile   |1 +
> >>   drivers/vfio/vfio_iommu_spapr_tce.c |  247 +++
> >>   include/linux/vfio.h|   20 +++
> >>   4 files changed, 274 insertions(+)
> >>   create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> >>
> >> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> >> index 7cd5dec..b464687 100644
> >> --- a/drivers/vfio/Kconfig
> >> +++ b/drivers/vfio/Kconfig
> >> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
> >>depends on VFIO
> >>default n
> >>
> >> +config VFIO_IOMMU_SPAPR_TCE
> >> +  tristate
> >> +  depends on VFIO && SPAPR_TCE_IOMMU
> >> +  default n
> >> +
> >>   menuconfig VFIO
> >>tristate "VFIO Non-Privileged userspace driver framework"
> >>depends on IOMMU_API
> >>select VFIO_IOMMU_TYPE1 if X86
> >> +  select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
> >>help
> >>  VFIO provides a framework for secure userspace device drivers.
> >>  See Documentation/vfio.txt for more details.
> >> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> >> index 2398d4a..72bfabc 100644
> >> --- a/drivers/vfio/Makefile
> >> +++ b/drivers/vfio/Makefile
> >> @@ -1,3 +1,4 @@
> >>   obj-$(CONFIG_VFIO) += vfio.o
> >>   obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >>   obj-$(CONFIG_VFIO_PCI) += pci/
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> new file mode 100644
> >> index 000..46a6298
> >> --- /dev/null
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -0,0 +1,247 @@
> >> +/*
> >> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> >> + *
> >> + * Copyright (C) 2012 IBM Corp.  All rights reserved.
> >> + * Author: Alexey Kardashevskiy 
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify
> >> + * it under the terms of the GNU General Public License version 2 as
> >> + * published by the Free Software Foundation.
> >> + *
> >> + * Derived from original vfio_iommu_type1.c:
> >> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> >> + * Author: Alex Williamson 
> >> + */
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#define DRIVER_VERSION  "0.1"
> >> +#define DRIVER_AUTHOR   "a...@ozlabs.ru"
> >> +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
> >> +
> >> +static void tce_iommu_detach_group(void *iommu_data,
> >> +  struct iommu_group *iommu_group);
> >> +
> >> +/*
> >> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> >> + */
> >> +
> >> +/*
> >> + * The container descriptor supports only a single group per container.
> >> + * Required by the API as the container is not supplied with the IOMMU group
> >> + * at the moment of initialization.
> >> + */
> >> +struct tce_container {
> >> +  struct mutex lock;
> >> +  struct iommu_table *tbl;
> >> +};
> >> +
> >> +static void *tce_iommu_open(unsigned long arg)
> >> +{
> >> +  struct tce_conta

Re: [PATCH 1/2] vfio powerpc: implemented IOMMU driver for VFIO

2012-11-26 Thread Alex Williamson

On Tue, 2012-11-27 at 15:58 +1100, Alexey Kardashevskiy wrote:
> On 27/11/12 15:29, Alex Williamson wrote:
> > On Tue, 2012-11-27 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >> On 27/11/12 05:20, Alex Williamson wrote:
> >>> On Fri, 2012-11-23 at 20:03 +1100, Alexey Kardashevskiy wrote:
> >>>> VFIO implements platform independent stuff such as
> >>>> a PCI driver, BAR access (via read/write on a file descriptor
> >>>> or direct mapping when possible) and IRQ signaling.
> >>>>
> >>>> The platform dependent part includes IOMMU initialization
> >>>> and handling. This patch implements an IOMMU driver for VFIO
> >>>> which does mapping/unmapping pages for the guest IO and
> >>>> provides information about DMA window (required by a POWERPC
> >>>> guest).
> >>>>
> >>>> The counterpart in QEMU is required to support this functionality.
> >>>>
> >>>> Cc: David Gibson 
> >>>> Signed-off-by: Alexey Kardashevskiy 
> >>>> ---
> >>>>drivers/vfio/Kconfig|6 +
> >>>>drivers/vfio/Makefile   |1 +
> >>>>drivers/vfio/vfio_iommu_spapr_tce.c |  247 +++
> >>>>include/linux/vfio.h|   20 +++
> >>>>4 files changed, 274 insertions(+)
> >>>>create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>
> >>>> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> >>>> index 7cd5dec..b464687 100644
> >>>> --- a/drivers/vfio/Kconfig
> >>>> +++ b/drivers/vfio/Kconfig
> >>>> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
> >>>>  depends on VFIO
> >>>>  default n
> >>>>
> >>>> +config VFIO_IOMMU_SPAPR_TCE
> >>>> +tristate
> >>>> +depends on VFIO && SPAPR_TCE_IOMMU
> >>>> +default n
> >>>> +
> >>>>menuconfig VFIO
> >>>>  tristate "VFIO Non-Privileged userspace driver framework"
> >>>>  depends on IOMMU_API
> >>>>  select VFIO_IOMMU_TYPE1 if X86
> >>>> +select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
> >>>>  help
> >>>>VFIO provides a framework for secure userspace device drivers.
> >>>>See Documentation/vfio.txt for more details.
> >>>> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> >>>> index 2398d4a..72bfabc 100644
> >>>> --- a/drivers/vfio/Makefile
> >>>> +++ b/drivers/vfio/Makefile
> >>>> @@ -1,3 +1,4 @@
> >>>>obj-$(CONFIG_VFIO) += vfio.o
> >>>>obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >>>> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >>>>obj-$(CONFIG_VFIO_PCI) += pci/
> >>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> new file mode 100644
> >>>> index 000..46a6298
> >>>> --- /dev/null
> >>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>> @@ -0,0 +1,247 @@
> >>>> +/*
> >>>> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> >>>> + *
> >>>> + * Copyright (C) 2012 IBM Corp.  All rights reserved.
> >>>> + * Author: Alexey Kardashevskiy 
> >>>> + *
> >>>> + * This program is free software; you can redistribute it and/or modify
> >>>> + * it under the terms of the GNU General Public License version 2 as
> >>>> + * published by the Free Software Foundation.
> >>>> + *
> >>>> + * Derived from original vfio_iommu_type1.c:
> >>>> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> >>>> + * Author: Alex Williamson 
> >>>> + */
> >>>> +
> >>>> +#include 
> >>>> +#include 
> >>>> +#include 
> >>>> +#include 
> >>>> +#include 
> >>>> +#include 
> >>>> +#include 
> >>>> +
> >>>> +#define DRIVER_VERSION  "0.1"
> >>>> +#define DRIVER_AUTHOR   "a...@ozlabs.ru"
> >>

Re: [PATCH] vfio powerpc: implemented IOMMU driver for VFIO

2012-11-28 Thread Alex Williamson

On Wed, 2012-11-28 at 18:21 +1100, Alexey Kardashevskiy wrote:
> VFIO implements platform independent stuff such as
> a PCI driver, BAR access (via read/write on a file descriptor
> or direct mapping when possible) and IRQ signaling.
> 
> The platform dependent part includes IOMMU initialization
> and handling. This patch implements an IOMMU driver for VFIO
> which does mapping/unmapping pages for the guest IO and
> provides information about DMA window (required by a POWERPC
> guest).
> 
> The counterpart in QEMU is required to support this functionality.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/Kconfig|6 +
>  drivers/vfio/Makefile   |1 +
>  drivers/vfio/vfio_iommu_spapr_tce.c |  332 +++
>  include/linux/vfio.h|   33 
>  4 files changed, 372 insertions(+)
>  create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 7cd5dec..b464687 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
>   depends on VFIO
>   default n
>  
> +config VFIO_IOMMU_SPAPR_TCE
> + tristate
> + depends on VFIO && SPAPR_TCE_IOMMU
> + default n
> +
>  menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
>   select VFIO_IOMMU_TYPE1 if X86
> + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
>   help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 2398d4a..72bfabc 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_VFIO) += vfio.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> new file mode 100644
> index 000..b98770e
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -0,0 +1,332 @@
> +/*
> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> + *
> + * Copyright (C) 2012 IBM Corp.  All rights reserved.
> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_iommu_type1.c:
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + * Author: Alex Williamson 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "a...@ozlabs.ru"
> +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
> +
> +static void tce_iommu_detach_group(void *iommu_data,
> + struct iommu_group *iommu_group);
> +
> +/*
> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> + */
> +
> +/*
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> +
> +struct vwork {
> + struct mm_struct*mm;
> + longnpage;
> + struct work_struct  work;
> +};
> +
> +/* delayed decrement/increment for locked_vm */
> +static void lock_acct_bg(struct work_struct *work)
> +{
> + struct vwork *vwork = container_of(work, struct vwork, work);
> + struct mm_struct *mm;
> +
> + mm = vwork->mm;
> + down_write(&mm->mmap_sem);
> + mm->locked_vm += vwork->npage;
> + up_write(&mm->mmap_sem);
> + mmput(mm);
> + kfree(vwork);
> +}
> +
> +static void lock_acct(long npage)
> +{
> + struct vwork *vwork;
> + struct mm_struct *mm;
> +
> + if (!current->mm)
> + return; /* process exited */
> +
> + if (down_write_trylock(&current->mm->mmap_sem)) {
> + current->mm->locked_vm += npage;
> + up_write(&current->mm->mmap_sem);
> + return;
> + }
> +
> + /*
> +  * Couldn't get mmap_sem lock, so must setup to update
> +  * mm->locked_vm later. If locked_vm were atomic, we
> +  * wouldn't need this silliness
> +  */
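The try-lock-or-defer pattern that lock_acct() follows can be shown outside the kernel. This is only a minimal userspace sketch, not the driver code: struct acct, struct defer_queue, lock_acct_sketch() and flush_deferred() are hypothetical names, a C11 atomic_flag stands in for mmap_sem, and a caller-owned list stands in for the kernel work queue.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for mm_struct: a counter guarded by a try-lock. */
struct acct {
	atomic_flag lock; /* stands in for mmap_sem */
	long locked_vm;   /* pages accounted as locked */
};

/* One update that could not be applied inline. */
struct deferred {
	long npage;
	struct deferred *next;
};

/* Caller-owned queue; the kernel code uses schedule_work() instead. */
struct defer_queue {
	struct deferred *head;
};

static void acct_init(struct acct *a)
{
	atomic_flag_clear(&a->lock);
	a->locked_vm = 0;
}

static bool acct_trylock(struct acct *a)
{
	return !atomic_flag_test_and_set(&a->lock);
}

static void acct_unlock(struct acct *a)
{
	atomic_flag_clear(&a->lock);
}

/* Returns 0 if applied inline, 1 if deferred, -1 if allocation failed. */
static int lock_acct_sketch(struct acct *a, struct defer_queue *q, long npage)
{
	struct deferred *d;

	if (acct_trylock(a)) {
		a->locked_vm += npage;
		acct_unlock(a);
		return 0;
	}

	/* Lock is busy: remember the delta and apply it later. */
	d = malloc(sizeof(*d));
	if (!d)
		return -1;
	d->npage = npage;
	d->next = q->head;
	q->head = d;
	return 1;
}

/* Applies all queued deltas, waiting for the lock if necessary. */
static void flush_deferred(struct acct *a, struct defer_queue *q)
{
	while (!acct_trylock(a))
		; /* spin: stands in for a blocking down_write() */

	while (q->head) {
		struct deferred *d = q->head;

		q->head = d->next;
		a->locked_vm += d->npage;
		free(d);
	}
	acct_unlock(a);
}
```

The point of the split is that the caller never blocks on the lock; the cost is that locked_vm is briefly stale until the deferred update runs.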

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-11-28 Thread Alex Williamson

On Wed, 2012-11-28 at 18:18 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |9 +++
>  arch/powerpc/kernel/iommu.c  |  147 
> ++
>  arch/powerpc/platforms/powernv/pci.c |  135 +++
>  drivers/iommu/Kconfig|8 ++
>  4 files changed, 299 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index cbfe678..5c7087a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,11 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages);
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction,
> + unsigned long pages);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..1456b6e 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +857,149 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +static void tce_flush(struct iommu_table *tbl)
> +{
> + /* Flush/invalidate TLB caches if necessary */
> + if (ppc_md.tce_flush)
> + ppc_md.tce_flush(tbl);
> +
> + /* Make sure updates are seen by hardware */
> + mb();
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returned the number of pages
> + * which it called put_page() on.
> + */
> +static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int i, pages_put = 0;
> + unsigned long oldtce;
> + struct page *page;
> +
> + for (i = 0; i < pages; ++i) {
> + oldtce = ppc_md.tce_get(tbl, entry + i);
> + ppc_md.tce_free(tbl, entry + i, 1);
> +
> + if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + continue;
> +
> + page = pfn_to_page(oldtce >> PAGE_SHIFT);
> +
> + WARN_ON(!page);
> + if (!page)
> + continue;
> +
> + if (oldtce & TCE_PCI_WRITE)
> + SetPageDirty(page);
> +
> + ++pages_put;
> + put_page(page);
> + }
> +
> + return pages_put;
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returned the number of released pages
> + */
> +long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int ret;
> + struct iommu_pool *pool = get_pool(tbl, entry);
> +
> + spin_lock(&(pool->lock));
> + ret = clear_tces_nolock(tbl, entry, pages);
> + tce_flush(tbl);
> + spin_unlock(&(pool->lock));
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_clear_tces);
> +
> +static int put_tce(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction)
> +{
> + int ret;
> + struct page *page = NULL;
> + unsigned long kva, offset;
> +
> + /* Map new TCE */
> + offset = (tce & IOMMU_PAGE_MASK) - (tce & PAGE_MASK);
> +
> + ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> + direction != DMA_TO_DEVICE, &page);
> + if (ret < 1) {
> + printk(KERN_ERR "tce_vfio: get_user_pages_fast failed tce=%llx ioba=%lx ret=%d\n",
> + tce, entry << IOMMU_PAGE_SHIFT, ret);
> + if (!ret)
> + ret = -EFAULT;
> + ret

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-11-28 Thread Alex Williamson

On Thu, 2012-11-29 at 14:53 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |9 ++
>  arch/powerpc/kernel/iommu.c  |  159 
> ++
>  arch/powerpc/platforms/powernv/pci.c |  135 +
>  drivers/iommu/Kconfig|8 ++
>  4 files changed, 311 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index cbfe678..5c7087a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,11 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages);
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction,
> + unsigned long pages);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..1225fbb 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +857,161 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +static void tce_flush(struct iommu_table *tbl)
> +{
> + /* Flush/invalidate TLB caches if necessary */
> + if (ppc_md.tce_flush)
> + ppc_md.tce_flush(tbl);
> +
> + /* Make sure updates are seen by hardware */
> + mb();
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returned the number of pages
> + * which it called put_page() on.
> + */
> +static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int i, retpages = 0;
> + unsigned long oldtce;
> + struct page *page;
> +
> + for (i = 0; i < pages; ++i) {
> + oldtce = ppc_md.tce_get(tbl, entry + i);
> + ppc_md.tce_free(tbl, entry + i, 1);
> +
> + if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + continue;
> +
> + page = pfn_to_page(oldtce >> PAGE_SHIFT);
> +
> + WARN_ON(!page);
> + if (!page)
> + continue;
> +
> + if (oldtce & TCE_PCI_WRITE)
> + SetPageDirty(page);
> +
> + if (!(oldtce & ~PAGE_MASK))
> + ++retpages;

I'm confused, it looks like you're trying to only increment the counter
for tce pages aligned at the start of a page, but don't we need to mask
out the read/write and valid bits?  Trickiness like this demands a
comment.
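
To make the concern concrete, here is a minimal userspace sketch, not the kernel code. All SK_* names and both helper functions are hypothetical; the sketch assumes 64K system pages and read/write permission bits in the two low bits of the TCE. Any valid bit would need the same treatment; it is left out because its position is not assumed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define SK_PAGE_SHIFT    16
#define SK_PAGE_SIZE     (UINT64_C(1) << SK_PAGE_SHIFT)
#define SK_PAGE_MASK     (~(SK_PAGE_SIZE - 1))
#define SK_TCE_PCI_READ  UINT64_C(0x1)
#define SK_TCE_PCI_WRITE UINT64_C(0x2)

/*
 * Same shape as the test in the patch: the permission bits are still in the
 * value, so any TCE with read or write set has non-zero low bits and never
 * counts as page-aligned.
 */
static bool starts_page_unmasked(uint64_t tce)
{
	return !(tce & ~SK_PAGE_MASK);
}

/* Strip the permission bits first, then test the alignment of the address. */
static bool starts_page_masked(uint64_t tce)
{
	uint64_t addr = tce & ~(SK_TCE_PCI_READ | SK_TCE_PCI_WRITE);

	return !(addr & ~SK_PAGE_MASK);
}
```

Under these assumptions the unmasked test can never succeed for an entry that already passed the read/write check earlier in the loop, so the counter would stay at zero.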

> +
> + put_page(page);
> + }
> +
> + return retpages;
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returned the number of released pages
> + */
> +long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int ret;
> + struct iommu_pool *pool = get_pool(tbl, entry);
> +
> + spin_lock(&(pool->lock));
> + ret = clear_tces_nolock(tbl, entry, pages);
> + tce_flush(tbl);
> + spin_unlock(&(pool->lock));
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_clear_tces);
> +
> +static int put_tce(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction)
> +{
> + int ret;
> + struct page *page = NULL;
> + unsigned long kva, offset;
> +
> + /* Map new TCE */
> + offset = (tce & IOMMU_PAGE_MASK) - (tce & PAGE_MASK);
> +
> + ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> + direction != DMA_TO_DEVICE, &page);
> +

[PATCH] kvm: Fix user memslot overlap check

2012-11-29 Thread Alex Williamson

Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries.  With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots.  Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.

Signed-off-by: Alex Williamson 
Cc: sta...@vger.kernel.org
---
 virt/kvm/kvm_main.c |   12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..cac294d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -710,7 +710,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
gfn_t base_gfn;
unsigned long npages;
unsigned long i;
-   struct kvm_memory_slot *memslot;
+   struct kvm_memory_slot *memslot, *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots, *old_memslots;
 
@@ -761,13 +761,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
/* Check for overlaps */
r = -EEXIST;
-   for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
-   struct kvm_memory_slot *s = &kvm->memslots->memslots[i];
-
-   if (s == memslot || !s->npages)
+   kvm_for_each_memslot(slot, kvm->memslots) {
+   if (slot->id >= KVM_MEMORY_SLOTS || slot == memslot)
continue;
-   if (!((base_gfn + npages <= s->base_gfn) ||
- (base_gfn >= s->base_gfn + s->npages)))
+   if (!((base_gfn + npages <= slot->base_gfn) ||
+ (base_gfn >= slot->base_gfn + slot->npages)))
goto out_free;
}
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
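
The overlap test in the patch above can be sketched in isolation. This is a minimal userspace sketch rather than the kernel code: struct slot_sketch, ranges_overlap() and overlaps_user_slot() are hypothetical names, and the user_slots parameter stands in for KVM_MEMORY_SLOTS.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct kvm_memory_slot. */
struct slot_sketch {
	int id;            /* ids >= user_slots are treated as private slots */
	uint64_t base_gfn; /* first guest frame number of the slot */
	uint64_t npages;   /* 0 marks an empty slot */
};

/* True if [base_a, base_a + n_a) and [base_b, base_b + n_b) intersect. */
static bool ranges_overlap(uint64_t base_a, uint64_t n_a,
			   uint64_t base_b, uint64_t n_b)
{
	return !((base_a + n_a <= base_b) || (base_a >= base_b + n_b));
}

/*
 * Walks the array until the first empty slot, skipping private slots and
 * the slot being modified (self), and reports whether the new range
 * collides with any remaining user slot.
 */
static bool overlaps_user_slot(const struct slot_sketch *slots, size_t nslots,
			       int user_slots, const struct slot_sketch *self,
			       uint64_t base_gfn, uint64_t npages)
{
	size_t i;

	for (i = 0; i < nslots && slots[i].npages; i++) {
		const struct slot_sketch *s = &slots[i];

		if (s->id >= user_slots || s == self)
			continue;
		if (ranges_overlap(base_gfn, npages, s->base_gfn, s->npages))
			return true;
	}
	return false;
}
```

Filtering on the slot id rather than on the array index is what keeps the check correct once the array is sorted, since a user slot can then sit at any position.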

[PATCH v2] kvm: Fix user memslot overlap check

2012-11-29 Thread Alex Williamson

Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries.  With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots.  Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.

Signed-off-by: Alex Williamson 
Cc: sta...@vger.kernel.org
---

v2: Remove unused variable i

 virt/kvm/kvm_main.c |   13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..6e8fa7e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -709,8 +709,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int r;
gfn_t base_gfn;
unsigned long npages;
-   unsigned long i;
-   struct kvm_memory_slot *memslot;
+   struct kvm_memory_slot *memslot, *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots, *old_memslots;
 
@@ -761,13 +760,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
/* Check for overlaps */
r = -EEXIST;
-   for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
-   struct kvm_memory_slot *s = &kvm->memslots->memslots[i];
-
-   if (s == memslot || !s->npages)
+   kvm_for_each_memslot(slot, kvm->memslots) {
+   if (slot->id >= KVM_MEMORY_SLOTS || slot == memslot)
continue;
-   if (!((base_gfn + npages <= s->base_gfn) ||
- (base_gfn >= s->base_gfn + s->npages)))
+   if (!((base_gfn + npages <= slot->base_gfn) ||
+ (base_gfn >= slot->base_gfn + slot->npages)))
goto out_free;
}
 


Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-17 Thread Alex Williamson

On Sun, 2013-03-17 at 08:33 -0600, Myron Stowe wrote:
> On Sun, 2013-03-17 at 07:38 -0600, Alex Williamson wrote:
> > On Sat, 2013-03-16 at 22:36 -0700, Greg KH wrote:
> > > On Sat, Mar 16, 2013 at 10:11:22PM -0600, Alex Williamson wrote:
> > > > On Sat, 2013-03-16 at 18:03 -0700, Greg KH wrote:
> > > > > On Sat, Mar 16, 2013 at 05:50:53PM -0600, Myron Stowe wrote:
> > > > > > On Sat, 2013-03-16 at 15:11 -0700, Greg KH wrote:
> > > > > > > On Sat, Mar 16, 2013 at 03:35:19PM -0600, Myron Stowe wrote:
> > > > > > > > Sysfs includes entries to memory that backs a PCI device's BARs, both I/O
> > > > > > > > Port space and MMIO.  These memory regions correspond to the device's
> > > > > > > > internal status and control registers used to drive the device.
> > > > > > > > 
> > > > > > > > Accessing these registers from userspace such as "udevadm info
> > > > > > > > --attribute-walk --path=/sys/devices/..." cannot be allowed as
> > > > > > > > such accesses outside of the driver, even just reading, can yield
> > > > > > > > catastrophic consequences.
> > > > > > > > 
> > > > > > > > Udevadm-info skips parsing a specific set of sysfs entries including
> > > > > > > > 'resource'.  This patch extends the set to include the additional
> > > > > > > > 'resource' entries that correspond to a PCI device's BARs.
> > > > > > > 
> > > > > > > Nice, are you also going to patch bash to prevent a user from reading
> > > > > > > these sysfs files as well?  :)
> > > > > > > 
> > > > > > > And pciutils?
> > > > > > > 
> > > > > > > You get my point here, right?  The root user just asked to read all of
> > > > > > > the data for this device, so why wouldn't you allow it?  Just like
> > > > > > > 'lspci' does.  Or bash does.
> > > > > > 
> > > > > > Yes :P , you raise a very good point, there are a lot of ways a user can
> > > > > > poke around in those BARs.  However, there is a difference between
> > > > > > shooting yourself in the foot and getting what you deserve versus
> > > > > > unknowingly executing a common command such as udevadm and having the
> > > > > > system hang.
> > > > > > > 
> > > > > > > If this hardware has a problem, then it needs to be fixed in the kernel,
> > > > > > > not have random band-aids added to various userspace programs to paper
> > > > > > > over the root problem here.  Please fix the kernel driver and all should
> > > > > > > be fine.  No need to change udevadm.
> > > > > > 
> > > > > > Xiangliang initially proposed a patch within the PCI core.  Ignoring the
> > > > > > specific issue with the proposal which I pointed out in the
> > > > > > https://lkml.org/lkml/2013/3/7/242 thread, that just doesn't seem like
> > > > > > the right place to effect a change either as PCI's core isn't concerned
> > > > > > with the contents or access limitations of those regions, those are
> > > > > > issues that the driver concerns itself with.
> > > > > > 
> > > > > > So things seem to be gravitating towards the driver.  I'm fairly
> > > > > > ignorant of this area but as Robert succinctly pointed out in the
> > > > > > originating thread - the AHCI driver only uses the device's MMIO region.
> > > > > > The I/O related regions are for legacy SFF-compatible ATA ports and are
> > > > > > not use

Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-18 Thread Alex Williamson

On Sun, 2013-03-17 at 15:00 +0100, Kay Sievers wrote:
> On Sun, Mar 17, 2013 at 2:38 PM, Alex Williamson
>  wrote:
> > I'm assuming that the device only breaks because udevadm is dumping the
> > full I/O port register space of the device and that if an actual driver
> > was interacting with it through this interface that it would work.  Who
> > knows how many devices will have read side-effects by udevadm blindly
> > dumping these files.  Thanks,
> 
> Sysfs is a too public interface to export things there which make
> devices/driver choke on a simple read() of an attribute.

That's why the default permissions for the file do not allow users to
read it.  I wish we could do something as clever as the MMIO resource
files, but I/O port spaces don't allow mmap for the predominant
architecture.  Eventually VFIO is meant to replace this access and does
move device register access behind ioctls, but for now legacy KVM device
assignment relies on these files and so might some UIO drivers.

> This is nothing specific to udevadm, any tool can do that. Udevadm
> will never read any of the files during normal operation. The admin
> explicitly asked udevadm with a specific command to dump all the stuff
> the device offers.

Isn't it possible udevadm could drop privileges or filter out non-world
readable files? 
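
A rough idea of what such a filter could look like, as a minimal sketch only: world_readable() is a hypothetical helper, not an existing udev function, and it relies on nothing beyond POSIX stat().

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Returns true only for regular files that carry the "other" read bit.
 * A tool walking sysfs attributes could skip anything for which this
 * returns false, regardless of the privileges it runs with.
 */
static bool world_readable(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false; /* missing or inaccessible: treat as not readable */

	return S_ISREG(st.st_mode) && (st.st_mode & S_IROTH) != 0;
}
```

This keeps the decision tied to the permissions the kernel already publishes for each attribute, so no per-file blacklist is needed in the tool.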

> The kernel driver needs to be fixed to allow that, in the worst case,
> the attributes not exported at all. People should take more care what
> they export in /sys, it's not a hidden and private ioctl what's
> exported there, stuff is very visible and will be looked at.

File permissions...

> Telling userspace not to use specific stuff in /sys I would not expect
> to work as a strategy; there is too much weird stuff out there that
> will always try to do that ...

I agree, the kernel needs to protect itself from malicious apps, but if
you run a malicious app with admin access, how much can/should we do?
If we're going to ignore file permissions, why limit ourselves to
read(), should we make everything safe against write() as well?  Thanks,

Alex


Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-18 Thread Alex Williamson

On Mon, 2013-03-18 at 10:50 -0400, Don Dutile wrote:
> On 03/17/2013 06:28 PM, Alex Williamson wrote:
> > On Sun, 2013-03-17 at 08:33 -0600, Myron Stowe wrote:
> >> On Sun, 2013-03-17 at 07:38 -0600, Alex Williamson wrote:
> >>> On Sat, 2013-03-16 at 22:36 -0700, Greg KH wrote:
> >>>> On Sat, Mar 16, 2013 at 10:11:22PM -0600, Alex Williamson wrote:
> >>>>> On Sat, 2013-03-16 at 18:03 -0700, Greg KH wrote:
> >>>>>> On Sat, Mar 16, 2013 at 05:50:53PM -0600, Myron Stowe wrote:
> >>>>>>> On Sat, 2013-03-16 at 15:11 -0700, Greg KH wrote:
> >>>>>>>> On Sat, Mar 16, 2013 at 03:35:19PM -0600, Myron Stowe wrote:
> >>>>>>>>> Sysfs includes entries to memory that backs a PCI device's BARs, both I/O
> >>>>>>>>> Port space and MMIO.  These memory regions correspond to the device's
> >>>>>>>>> internal status and control registers used to drive the device.
> >>>>>>>>>
> >>>>>>>>> Accessing these registers from userspace such as "udevadm info
> >>>>>>>>> --attribute-walk --path=/sys/devices/..." cannot be allowed as
> >>>>>>>>> such accesses outside of the driver, even just reading, can yield
> >>>>>>>>> catastrophic consequences.
> >>>>>>>>>
> >>>>>>>>> Udevadm-info skips parsing a specific set of sysfs entries including
> >>>>>>>>> 'resource'.  This patch extends the set to include the additional
> >>>>>>>>> 'resource' entries that correspond to a PCI device's BARs.
> >>>>>>>>
> >>>>>>>> Nice, are you also going to patch bash to prevent a user from reading
> >>>>>>>> these sysfs files as well?  :)
> >>>>>>>>
> >>>>>>>> And pciutils?
> >>>>>>>>
> >>>>>>>> You get my point here, right?  The root user just asked to read all of
> >>>>>>>> the data for this device, so why wouldn't you allow it?  Just like
> >>>>>>>> 'lspci' does.  Or bash does.
> >>>>>>>
> >>>>>>> Yes :P , you raise a very good point, there are a lot of ways a user can
> >>>>>>> poke around in those BARs.  However, there is a difference between
> >>>>>>> shooting yourself in the foot and getting what you deserve versus
> >>>>>>> unknowingly executing a common command such as udevadm and having the
> >>>>>>> system hang.
> >>>>>>>>
> >>>>>>>> If this hardware has a problem, then it needs to be fixed in the kernel,
> >>>>>>>> not have random band-aids added to various userspace programs to paper
> >>>>>>>> over the root problem here.  Please fix the kernel driver and all should
> >>>>>>>> be fine.  No need to change udevadm.
> >>>>>>>
> >>>>>>> Xiangliang initially proposed a patch within the PCI core.  Ignoring the
> >>>>>>> specific issue with the proposal which I pointed out in the
> >>>>>>> https://lkml.org/lkml/2013/3/7/242 thread, that just doesn't seem like
> >>>>>>> the right place to effect a change either as PCI's core isn't concerned
> >>>>>>> with the contents or access limitations of those regions, those are
> >>>>>>> issues that the driver concerns itself with.
> >>>>>>>
> >>>>>>> So things seem to be gravitating towards the driver.  I'm fairly
> >>>>>>> ignorant of this area but as Robert succinctly pointed out in the
> >>>>>>> originating thread - the AHCI driver only uses the device's MMIO region.
> >>>>>>> The I/O related regions are for legacy SFF-compatible ATA ports and are
> >>>>

Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-18 Thread Alex Williamson

On Mon, 2013-03-18 at 09:41 -0700, Greg KH wrote:
> On Mon, Mar 18, 2013 at 10:24:40AM -0600, Alex Williamson wrote:
> > On Sun, 2013-03-17 at 15:00 +0100, Kay Sievers wrote:
> > > On Sun, Mar 17, 2013 at 2:38 PM, Alex Williamson
> > >  wrote:
> > > > I'm assuming that the device only breaks because udevadm is dumping the
> > > > full I/O port register space of the device and that if an actual driver
> > > > was interacting with it through this interface that it would work.  Who
> > > > knows how many devices will have read side-effects by udevadm blindly
> > > > dumping these files.  Thanks,
> > > 
> > > Sysfs is a too public interface to export things there which make
> > > devices/driver choke on a simple read() of an attribute.
> > 
> > That's why the default permissions for the file do not allow users to
> > read it.  I wish we could do something as clever as the MMIO resource
> > files, but I/O port spaces don't allow mmap for the predominant
> > architecture.  Eventually VFIO is meant to replace this access and does
> > move device register access behind ioctls, but for now legacy KVM device
> > assignment relies on these files and so might some UIO drivers.
> > 
> > > This is nothing specific to udevadm, any tool can do that. Udevadm
> > > will never read any of the files during normal operation. The admin
> > > explicitly asked udevadm with a specific command to dump all the stuff
> > > the device offers.
> > 
> > Isn't it possible udevadm could drop privileges or filter out non-world
> > readable files? 
> 
> And you are going to do the same thing for bash?  All other shells?
> 
> Come on, the user specifically asked to read this file, as root, and
> udev did so.  Just like bash would.
> 
> Please fix the kernel if this is a real problem, you aren't going to be
> able to patch all userspace programs, that's not the proper solution
> here.

At least for KVM the kernel fix is the addition of the vfio driver which
gives us a non-sysfs way to do this.  If this problem was found a few
years later and we were ready to make the switch I'd support just
removing these resource files.  In the meantime we have userspace that
depends on this interface, so I'm open to suggestions how to fix it.

If we want to blacklist this specific device, that's fine, but as others
have pointed out it's really a class problem.  Perhaps we report 1 byte
extra for the file length where EOF-1 is an enable byte?  Is there
anything else in file ops that we could use to make it slightly more
complicated than open(), read() to access the device?  Thanks,

Alex


Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-18 Thread Alex Williamson

On Mon, 2013-03-18 at 18:20 +0100, Bjørn Mork wrote:
> Alex Williamson  writes:
> 
> > At least for KVM the kernel fix is the addition of the vfio driver which
> > gives us a non-sysfs way to do this.  If this problem was found a few
> > years later and we were ready to make the switch I'd support just
> > removing these resource files.  In the meantime we have userspace that
> > depends on this interface, so I'm open to suggestions how to fix it.
> 
> I am puzzled by a couple of things in this discussion:
> 
> 1) do you seriously mean that a userspace application (any, not just
>udevadm or qemu or whatever) should be able to read and write these
>registers while the device is owned by a driver?  How is that ever
>going to work?

The expectation is that the user doesn't mess with the device through
pci-sysfs while it's running.  This is really no different than config
space or MMIO space in that respect.  You can use setpci to break your
PCI card while it's used by the driver today.  The difference is that
MMIO spaces side-step the issue by only allowing mmap and config space
is known not to have read side-effects.

> 2) is it really so that a device can be so fundamentally screwed up by
>reading some registers, that a later driver probe cannot properly
>reinitialize it?

Never underestimate how broken hardware can be, though in this case
reading a device register seems to be causing a system hang/reset.

> I would have thought that the solution to all this was to return -EINVAL
> on any attempt to read or write these files while a driver is bound to
> the device.  If userspace is going to use the API, then the application
> better unbind any driver first.
> 
> Or? Am I missing something here?

That doesn't really solve anything though.  Let's pretend the resource
files only work while the device is bound to pci-stub.  Now what happens
when you run this udevadm command as admin while it's in use by the
userspace driver?  All we've done is limit the scope of the problem.

> > If we want to blacklist this specific device, that's fine, but as others
> > have pointed out it's really a class problem.  Perhaps we report 1 byte
> > extra for the file length where EOF-1 is an enable byte?  Is there
> > anything else in file ops that we could use to make it slightly more
> > complicated than open(), read() to access the device?  Thanks,
> 
> If there really are devices which cannot handle reading at all, and
> cannot be reset to a sane state by later driver initialization, then a
> blacklist could be added for those devices.  This should not be a common
> problem.

Yes, if these are dead registers, let's blacklist and move along.  I
suspect though that these registers probably work fine if you access
them according to the device programming model, so blacklisting just
prevents full use through something like KVM device assignment.  Thanks,

Alex


Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-18 Thread Alex Williamson

On Mon, 2013-03-18 at 19:25 +0100, Bjørn Mork wrote:
> Alex Williamson  wrote:
> 
> >On Mon, 2013-03-18 at 18:20 +0100, Bjørn Mork wrote:
> >> Alex Williamson  writes:
> >> 
> >> > At least for KVM the kernel fix is the addition of the vfio driver
> >which
> >> > gives us a non-sysfs way to do this.  If this problem was found a
> >few
> >> > years later and we were ready to make the switch I'd support just
> >> > removing these resource files.  In the meantime we have userspace
> >that
> >> > depends on this interface, so I'm open to suggestions how to fix
> >it.
> >> 
> >> I am puzzled by a couple of things in this discussion:
> >> 
> >> 1) do you seriously mean that a userspace application (any, not just
> >>udevadm or qemu or whatever) should be able to read and write
> >these
> >>registers while the device is owned by a driver?  How is that ever
> >>going to work?
> >
> >The expectation is that the user doesn't mess with the device through
> >pci-sysfs while it's running.  This is really no different than config
> >space or MMIO space in that respect. 
> 
> But it is.  That's the problem. As a user I expect to be able to run
> e.g "grep . /sys/devices/whatever/*" with no ill effects. This holds
> for config space or MMIO space. It does not for any reset-on-read
> register.

As a non-admin user you can

> > You can use setpci to break your
> >PCI card while it's used by the driver today.  The difference is that
> >MMIO spaces side-step the issue by only allowing mmap and config space
> >is known not to have read side-effects.
> 
> Yes. And that is why there is no problem exporting those. This
> difference is fundamental. 

So how do we side-step the problem with I/O port registers?  If we
remove them then KVM needs to run with iopl which is a pretty serious
security hole should QEMU be exploited.  We could activate the resource
files only when the device is bound to pci-assign, but that only limits
the scope and might break UIO drivers.  We could modify the file to have
an enable sequence, but we can't do this without breaking current
userspace.  As I mentioned, the VFIO driver is intended to replace KVM's
use of these files, but we're not ready to rip it out, perhaps not even
ready to declare it deprecated.
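
For readers following the trade-off above, here is a minimal sketch of what using the resource files looks like from userspace, as opposed to raising I/O privilege with iopl(). It assumes the I/O port BAR is exposed as a pci-sysfs resourceN file that accepts small reads at an offset within the BAR. The helper name ioport_read8() is hypothetical and is not taken from QEMU or any existing tool.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one byte at `offset` from an open file descriptor.  For a pci-sysfs
 * I/O port resource file the offset is relative to the start of the BAR.
 * Returns 0 on success, a negative errno on failure, and -EIO on a short
 * read (for example when the offset is past the end of the file).
 */
int ioport_read8(int fd, off_t offset, uint8_t *val)
{
	ssize_t n;

	assert(val != NULL);

	n = pread(fd, val, sizeof(*val), offset);
	if (n < 0)
		return -errno;
	return n == (ssize_t)sizeof(*val) ? 0 : -EIO;
}
```

A caller would open something like /sys/bus/pci/devices/<address>/resource4 (the address and BAR number depend on the device) and pass the descriptor in. The process needs only file permissions on that one node, not port access to the whole machine, which is the security argument made above. It also shows why a read-side-effect register is a problem: any process that can open the file can trigger the access.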

> >> 2) is it really so that a device can be so fundamentally screwed up
> >by
> >>reading some registers, that a later driver probe cannot properly
> >>reinitialize it?
> >
> >Never underestimate how broken hardware can be, 
> 
> True :)
> 
> > though in this case
> >reading a device register seems to be causing a system hang/reset.
> 
> I understand that it does so if the ahci driver is bound to the device
> while reading the registers, but does it also hang the system with no
> bound driver? How does it do that? By killing the bus?

I don't know, Myron?

> >> I would have thought that the solution to all this was to return
> >-EINVAL
> >> on any attempt to read or write these files while a driver is bound to
> >> the device.  If userspace is going to use the API, then the
> >application
> >> better unbind any driver first.
> >> 
> >> Or? Am I missing something here?
> >
> >That doesn't really solve anything though.  Let's pretend the resource
> >files only work while the device is bound to pci-stub.  Now what
> >happens
> >when you run this udevadm command as admin while it's in use by the
> >userspace driver?  All we've done is limit the scope of the problem.
> 
> Assuming that the system hangs without driver help and that this
> brokenness is widespread. I don't think any of those assumptions hold.
> Do they?

I thought it was true that for this device a system hang happened
regardless of the host driver, but haven't seen the original bug report.
As for widespread, this is the first I've heard of problems in the 2.5+
years that we've supported these I/O port resource files.  The rest is
probably just FUD about random userspace apps trolling through device
registers.

> >> > If we want to blacklist this specific device, that's fine, but as
> >others
> >> > have pointed out it's really a class problem.  Perhaps we report 1
> >byte
> >> > extra for the file length where EOF-1 is an enable byte?  Is there
> >> > anything else in file ops that we could use to make it slightly
> >more
> >> > complicated than open(), read() to access the device?  Thanks,

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-03-18 Thread Alex Williamson

On Mon, 2013-03-18 at 14:53 +1100, Alexey Kardashevskiy wrote:
> On 20/02/13 15:33, Alex Williamson wrote:
> > On Wed, 2013-02-20 at 15:20 +1100, Alexey Kardashevskiy wrote:
> >> On 20/02/13 14:47, Alex Williamson wrote:
> >>> On Wed, 2013-02-20 at 13:31 +1100, Alexey Kardashevskiy wrote:
> >>>> On 20/02/13 07:11, Alex Williamson wrote:
> >>>>> On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
> >>>>>> On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
> >>>>>>> On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 13/02/13 04:15, Alex Williamson wrote:
> >>>>>>>>> On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> On 12/02/13 16:07, Alex Williamson wrote:
> >>>>>>>>>>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>> Having this patch in a tree, adding new nodes in sysfs
> >>>>>>>>>>>> for IOMMU groups is going to be easier.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first candidate for this change is a "dma-window-size"
> >>>>>>>>>>>> property which tells a size of a DMA window of the specific
> >>>>>>>>>>>> IOMMU group which can be used later for locked pages accounting.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm still churning on this one; I'm nervous this would basically
> >>>>>>>>>>> create a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/
> >>>>>>>>>>> where any iommu driver can add random attributes.  That can get
> >>>>>>>>>>> ugly for userspace.
> >>>>>>>>>>
> >>>>>>>>>> Is not it exactly what sysfs is for (unlike /proc)? :)
> >>>>>>>>>
> >>>>>>>>> Um, I hope it's a little more thought out than /proc.
> >>>>>>>>>
> >>>>>>>>>>> On the other hand, for the application of userspace knowing how 
> >>>>>>>>>>> much
> >>>>>>>>>>> memory to lock for vfio use of a group, it's an appealing 
> >>>>>>>>>>> location to
> >>>>>>>>>>> get that information.  Something like libvirt would already be 
> >>>>>>>>>>> poking
> >>>>>>>>>>> around here to figure out which devices to bind.  Page limits 
> >>>>>>>>>>> need to be
> >>>>>>>>>>> setup prior to use through vfio, so sysfs is more convenient than
> >>>>>>>>>>> through vfio ioctls.
> >>>>>>>>>>
> >>>>>>>>>> True. DMA window properties do not change since boot so sysfs is 
> >>>>>>>>>> the right
> >>>>>>>>>> place to expose them.
> >>>>>>>>>>
> >>>>>>>>>>> But then is dma-window-size just a vfio requirement leaking over 
> >>>>>>>>>>> into
> >>>>>>>>>>> iommu groups?  Can we allow iommu driver based attributes without 
> >>>>>>>>>>> giving
> >>>>>>>>>>> up control of the namespace?  Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Who are you asking these questions? :)
> >>>>>>>>>
> >>>>>>>>> Anyone, including you.  Rather than dropping misc files in sysfs to
> >>>>>>>>> describe things about the group, I think the better solution in your
> >>>>>>>>> case might be a link from the group to an existing sysfs directory
> >>>>>>>>> describing the PE.  I believe your PE is rooted in a PCI bridge, so 
> >>>>>>>>> that
> >>>>>>>>> presumably already has a representation in sysfs.  Can the aperture 
> >>>>

[GIT PULL] vfio fix for 3.9-rc4

2013-03-19 Thread Alex Williamson

Hi Linus,

Please pull for the next rc.  Thanks!

The following changes since commit f6161aa153581da4a3867a2d1a7caf4be19b6ec9:

  Linux 3.9-rc2 (2013-03-10 16:54:19 -0700)

are available in the git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v3.9-rc4

for you to fetch changes up to 25e9789ddd9d14a8971f4a421d04f282719ab733:

  vfio: include <linux/slab.h> for kmalloc (2013-03-15 12:58:20 -0600)


vfio fix for v3.9-rc4


Arnd Bergmann (1):
  vfio: include <linux/slab.h> for kmalloc

 drivers/vfio/pci/vfio_pci_config.c | 1 +
 drivers/vfio/pci/vfio_pci_intrs.c  | 1 +
 2 files changed, 2 insertions(+)



Re: [PATCH 01/12] Security: Add CAP_COMPROMISE_KERNEL

2013-03-19 Thread Alex Williamson

On Tue, 2013-03-19 at 20:08 -0700, H. Peter Anvin wrote:
> On 03/19/2013 07:48 PM, H. Peter Anvin wrote:
> > On 03/19/2013 06:28 PM, Matthew Garrett wrote:
> >> Mm. The question is whether we can reliably determine the ranges a
> >> device should be able to access without having to trust userspace
> >> (and, ideally, without having to worry about whether iommu vendors
> >> have done their job). It's pretty important for PCI passthrough, so we
> >> do need to care.
> > 
> > It is actually very simple: the device should be able to DMA into/out of:
> > 
> > 1. pinned pages
> > 2. owned by the process controlling the device
> > 
> > ... and nothing else.
> > 
> 
> The "pinning" process needs to involve a call to the kernel to process
> the page for DMA (pinning the page and opening it in the iommu) and
> return a transaction address, of course.
> 
> I think we have the interface for that in vfio, but I haven't followed
> that work.

Yes, vfio does this and is meant to provide a secure-boot-friendly PCI
passthrough interface.  Thanks,

Alex
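
A rough sketch of the "pin and open in the iommu" step as it looks from userspace with the type1 backend, using the VFIO_IOMMU_MAP_DMA ioctl from <linux/vfio.h>. It assumes the container fd has already had a group attached and an IOMMU type selected. The two helper names are hypothetical. Note that with type1 the caller chooses the I/O virtual address rather than receiving one back from the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Fill in a type1 DMA map request for a read/write mapping. */
void vfio_fill_dma_map(struct vfio_iommu_type1_dma_map *map,
		       void *vaddr, uint64_t iova, uint64_t size)
{
	assert(map != NULL);

	memset(map, 0, sizeof(*map));
	map->argsz = sizeof(*map);
	map->flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map->vaddr = (uintptr_t)vaddr;	/* process virtual address */
	map->iova = iova;		/* address the device will use */
	map->size = size;		/* length in bytes */
}

/*
 * Ask the kernel to pin the pages backing `vaddr` and map them for the
 * device at `iova`.  Returns the ioctl() result: 0 on success, -1 with
 * errno set on failure.
 */
int vfio_map_buffer(int container_fd, void *vaddr,
		    uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map;

	vfio_fill_dma_map(&map, vaddr, iova, size);
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

The device can then DMA only to ranges that the owning process has mapped this way, which matches the two conditions listed in the quoted text: pinned pages, owned by the process controlling the device.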


Re: [PATCH 01/12] Security: Add CAP_COMPROMISE_KERNEL

2013-03-19 Thread Alex Williamson

On Tue, 2013-03-19 at 20:22 -0700, H. Peter Anvin wrote:
> On 03/19/2013 08:18 PM, Alex Williamson wrote:
> >>
> >> The "pinning" process needs to involve a call to the kernel to process
> >> the page for DMA (pinning the page and opening it in the iommu) and
> >> return a transaction address, of course.
> >>
> >> I think we have the interface for that in vfio, but I haven't followed
> >> that work.
> > 
> > Yes, vfio does this and is meant to provide a secure-boot-friendly PCI
> > passthrough interface.  Thanks,
> > 
> 
> Right, and presumably vfio does *not* require CAP_SYS_RAWIO, right?

Correct


Re: [PATCH] pciehp: Add pciehp_surprise module option

2013-03-20 Thread Alex Williamson

On Wed, 2013-03-20 at 15:02 +0100, Takashi Iwai wrote:
> We encountered a problem that on some HP machines the Realtek PCI-e
> card reader device appears only when you inserted a card before the
> cold boot.  While debugging, it turned out that the device is actually
> handled via PCI-e hotplug in some level.  The device sends a presence
> change notification, and pciehp receives it, but it's ignored because
> of lack of the hotplug surprise (PCI_EXP_SLTCAP_HPS) capability bit.
> Once when this check passes, everything starts working -- the device
> appears upon plugging the card properly.
> 
> There are a few other bug reports indicating the similar problems
> (e.g. on recent Dell laptops), and I guess the culprit is same.
> 
> This patch adds a new module option, pciehp_surprise, to pciehp as a
> workaround.  When pciehp_surprise=1 is given, pciehp handles the
> presence change as the device on/off as if PCI_EXP_SLTCAP_HPS is set.
> Unless it's set explicitly, there is no impact on the existing
> behavior.
> 
> Signed-off-by: Takashi Iwai 
> ---
>  drivers/pci/hotplug/pciehp.h  | 3 ++-
>  drivers/pci/hotplug/pciehp_core.c | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index 2c113de..314f3be 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -44,6 +44,7 @@ extern bool pciehp_poll_mode;
>  extern int pciehp_poll_time;
>  extern bool pciehp_debug;
>  extern bool pciehp_force;
> +extern bool pciehp_surprise;
>  
>  #define dbg(format, arg...)  \
>  do { \
> @@ -122,7 +123,7 @@ struct controller {
>  #define MRL_SENS(ctrl)   ((ctrl)->slot_cap & PCI_EXP_SLTCAP_MRLSP)
>  #define ATTN_LED(ctrl)   ((ctrl)->slot_cap & PCI_EXP_SLTCAP_AIP)
>  #define PWR_LED(ctrl)    ((ctrl)->slot_cap & PCI_EXP_SLTCAP_PIP)
> -#define HP_SUPR_RM(ctrl) ((ctrl)->slot_cap & PCI_EXP_SLTCAP_HPS)
> +#define HP_SUPR_RM(ctrl) (pciehp_surprise || ((ctrl)->slot_cap & PCI_EXP_SLTCAP_HPS))
>  #define EMI(ctrl)((ctrl)->slot_cap & PCI_EXP_SLTCAP_EIP)
>  #define NO_CMD_CMPL(ctrl)((ctrl)->slot_cap & PCI_EXP_SLTCAP_NCCS)
>  #define PSN(ctrl)((ctrl)->slot_cap >> 19)
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index 7d72c5e..c3a574e 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -42,6 +42,7 @@ bool pciehp_debug;
>  bool pciehp_poll_mode;
>  int pciehp_poll_time;
>  bool pciehp_force;
> +bool pciehp_surprise;
>  
>  #define DRIVER_VERSION   "0.4"
>  #define DRIVER_AUTHOR"Dan Zink , Greg 
> Kroah-Hartman , Dely Sy "
> @@ -55,10 +56,12 @@ module_param(pciehp_debug, bool, 0644);
>  module_param(pciehp_poll_mode, bool, 0644);
>  module_param(pciehp_poll_time, int, 0644);
>  module_param(pciehp_force, bool, 0644);
> +module_param(pciehp_surprise, bool, 0644);
>  MODULE_PARM_DESC(pciehp_debug, "Debugging mode enabled or not");
>  MODULE_PARM_DESC(pciehp_poll_mode, "Using polling mechanism for hot-plug events or not");
>  MODULE_PARM_DESC(pciehp_poll_time, "Polling mechanism frequency, in seconds");
>  MODULE_PARM_DESC(pciehp_force, "Force pciehp, even if OSHP is missing");
> +MODULE_PARM_DESC(pciehp_surprise, "Force to set hotplug-surprise capability");
>  
>  #define PCIE_MODULE_NAME "pciehp"
>  

Please no.  Can we quirk just the device with the problem?  My issue
with turning on surprise hotplug across the system is that secondary bus
resets will trigger a device hot unplug, hot re-plug.  I have a system
with an integrated broadcom NIC behind a root port that claims to
support surprise hotplug (even though this NIC is soldered to the
motherboard) and it's nearly impossible to use it with KVM device
assignment because the struct pci_dev goes away as we're trying to use
it.  This patch is only going to make that problem more widespread.
Thanks,

Alex
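
To make the "quirk just the device" suggestion concrete, here is a small userspace model of the decision, not kernel code. It keeps the capability-bit test from the HP_SUPR_RM() macro and adds a per-device table in place of a global module option. The struct, the function name, and the table contents are hypothetical. PCI_EXP_SLTCAP_HPS is redefined locally with the value used in the kernel's pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_EXP_SLTCAP_HPS 0x00000020	/* Hot-Plug Surprise capable */

/* One entry per device known to need surprise handling forced on. */
struct surprise_quirk {
	uint16_t vendor;
	uint16_t device;
};

/*
 * Treat a slot as surprise-capable if the hardware says so, or if the
 * device is listed in the quirk table.  Everything else keeps the
 * existing behavior.
 */
bool hp_surprise(uint32_t slot_cap, uint16_t vendor, uint16_t device,
		 const struct surprise_quirk *quirks, size_t nquirks)
{
	size_t i;

	assert(quirks != NULL || nquirks == 0);

	if (slot_cap & PCI_EXP_SLTCAP_HPS)
		return true;

	for (i = 0; i < nquirks; i++)
		if (quirks[i].vendor == vendor && quirks[i].device == device)
			return true;

	return false;
}
```

With this shape, a root port that does not advertise the bit is unaffected unless the device is listed, so ports like the one in front of the soldered-down NIC described above would not start seeing unplug/replug events on a secondary bus reset.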


Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

2013-03-20 Thread Alex Williamson

1] VFIO was originally an acronym for "Virtual Function I/O" in its
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 7cd5dec..b464687 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
>   depends on VFIO
>   default n
>  
> +config VFIO_IOMMU_SPAPR_TCE
> + tristate
> + depends on VFIO && SPAPR_TCE_IOMMU
> + default n
> +
>  menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
>   select VFIO_IOMMU_TYPE1 if X86
> + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
>   help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 2398d4a..72bfabc 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_VFIO) += vfio.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 12c264d..004033d 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1376,6 +1376,7 @@ static int __init vfio_init(void)
>* drivers.
>*/
>   request_module_nowait("vfio_iommu_type1");
> + request_module_nowait("vfio_iommu_spapr");
>  
>   return 0;
>  
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> new file mode 100644
> index 000..22ba0b5
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -0,0 +1,365 @@
> +/*
> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> + *
> + * Copyright (C) 2013 IBM Corp.  All rights reserved.
> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_iommu_type1.c:
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + * Author: Alex Williamson 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "a...@ozlabs.ru"
> +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
> +
> +static void tce_iommu_detach_group(void *iommu_data,
> + struct iommu_group *iommu_group);
> +
> +/*
> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> + *
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +/*
> + * The container descriptor supports only a single group per container.
> + * Required by the API as the container is not supplied with the IOMMU group
> + * at the moment of initialization.
> + */
> +struct tce_container {
> + struct mutex lock;
> + struct iommu_table *tbl;
> + bool enabled;
> +};
> +
> +static int tce_iommu_enable(struct tce_container *container)
> +{
> + int ret = 0;
> + unsigned long locked, lock_limit, npages;
> + struct iommu_table *tbl = container->tbl;
> +
> + if (!container->tbl)
> + return -ENXIO;
> +
> + if (!current->mm)
> + return -ESRCH; /* process exited */
> +
> + mutex_lock(&container->lock);
> + if (container->enabled) {
> + mutex_unlock(&container->lock);
> + return -EBUSY;
> + }
> +
> + /*
> +  * Accounting for locked pages.
> +  *
> +  * On sPAPR platform, IOMMU translation table contains
> +  * an entry per 4K page. Every map/unmap request is sent
> +  * by the guest via hypercall and supposed to be handled
> +  * quickly, ecpesially in real mode (if such option is

s/ecpesially/especially/

> +  * supported and enabled).
> +  * As handling the accounting in real time is rather
> +  * impossible, it may require switching to virtual mode
> +  * which is quite expensive operation.
> +  * As the whole window can be actuall mapped on high

s/actuall/actually/

> +  * performance guests and we do not want such guests
> +  * to fail, we do the accounting as a part of IOMMU
> +  * enablement process.
> +  */
> + down_write(&current->mm->mmap_sem);
> + npages = (tbl->it_size << IOMMU_P

Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

2013-03-20 Thread Alex Williamson

On Thu, 2013-03-21 at 11:57 +1100, Alexey Kardashevskiy wrote:
> On 21/03/13 07:45, Alex Williamson wrote:
> > On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
> >> VFIO implements platform independent stuff such as
> >> a PCI driver, BAR access (via read/write on a file descriptor
> >> or direct mapping when possible) and IRQ signaling.
> >>
> >> The platform dependent part includes IOMMU initialization
> >> and handling. This patch implements an IOMMU driver for VFIO
> >> which does mapping/unmapping pages for the guest IO and
> >> provides information about DMA window (required by a POWERPC
> >> guest).
> >>
> >> The counterpart in QEMU is required to support this functionality.
> >>
> >> Changelog:
> >> * documentation updated
> >> * container enable/disable ioctls added
> >> * request_module(spapr_iommu) added
> >> * various locks fixed
> >>
> >> Cc: David Gibson 
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >
> >
> > Looking pretty good.  There's one problem with the detach_group,
> > otherwise just some trivial comments below.  What's the status of the
> > tce code that this depends on?  Thanks,
> 
> 
> It is done, I am just waiting till other patches of the series will be 
> reviewed (by guys) and fixed (by me) and then I'll post everything again.
> 
> [skipped]
> 
> >> +static int tce_iommu_enable(struct tce_container *container)
> >> +{
> >> +  int ret = 0;
> >> +  unsigned long locked, lock_limit, npages;
> >> +  struct iommu_table *tbl = container->tbl;
> >> +
> >> +  if (!container->tbl)
> >> +  return -ENXIO;
> >> +
> >> +  if (!current->mm)
> >> +  return -ESRCH; /* process exited */
> >> +
> >> +  mutex_lock(&container->lock);
> >> +  if (container->enabled) {
> >> +  mutex_unlock(&container->lock);
> >> +  return -EBUSY;
> >> +  }
> >> +
> >> +  /*
> >> +   * Accounting for locked pages.
> >> +   *
> >> +   * On sPAPR platform, IOMMU translation table contains
> >> +   * an entry per 4K page. Every map/unmap request is sent
> >> +   * by the guest via hypercall and supposed to be handled
> >> +   * quickly, ecpesially in real mode (if such option is
> >
> > s/ecpesially/especially/
> 
> 
> I replaced the whole text by the one written by Ben and David.
> 
> [skipped]
> 
> >> +  }
> >> +
> >> +  return -ENOTTY;
> >> +}
> >> +
> >> +static int tce_iommu_attach_group(void *iommu_data,
> >> +  struct iommu_group *iommu_group)
> >> +{
> >> +  int ret;
> >> +  struct tce_container *container = iommu_data;
> >> +  struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> >> +
> >> +  BUG_ON(!tbl);
> >> +  mutex_lock(&container->lock);
> >> +  pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
> >> +  iommu_group_id(iommu_group), iommu_group);
> >> +  if (container->tbl) {
> >> +  pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
> >> +  iommu_group_id(container->tbl->it_group),
> >> +  iommu_group_id(iommu_group));
> >> +  mutex_unlock(&container->lock);
> >> +  return -EBUSY;
> >> +  }
> >> +
> >> +  ret = iommu_take_ownership(tbl);
> >> +  if (!ret)
> >> +  container->tbl = tbl;
> >> +
> >> +  mutex_unlock(&container->lock);
> >> +
> >> +  return ret;
> >> +}
> >> +
> >> +static void tce_iommu_detach_group(void *iommu_data,
> >> +  struct iommu_group *iommu_group)
> >> +{
> >> +  struct tce_container *container = iommu_data;
> >> +  struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> >> +
> >> +  BUG_ON(!tbl);
> >> +  mutex_lock(&container->lock);
> >> +  if (tbl != container->tbl) {
> >> +  pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> >> +  iommu_group_id(iommu_group),
> >> +  iommu_group_id(tbl->it_group));
> >> +  } e

Re: [PATCH] vfio powerpc: implement IOMMU driver for VFIO

2013-03-20 Thread Alex Williamson

On Thu, 2013-03-21 at 12:55 +1100, David Gibson wrote:
> On Wed, Mar 20, 2013 at 02:45:24PM -0600, Alex Williamson wrote:
> > On Tue, 2013-03-19 at 18:08 +1100, Alexey Kardashevskiy wrote:
> > > VFIO implements platform independent stuff such as
> > > a PCI driver, BAR access (via read/write on a file descriptor
> > > or direct mapping when possible) and IRQ signaling.
> > > 
> > > The platform dependent part includes IOMMU initialization
> > > and handling. This patch implements an IOMMU driver for VFIO
> > > which does mapping/unmapping pages for the guest IO and
> > > provides information about DMA window (required by a POWERPC
> > > guest).
> > > 
> > > The counterpart in QEMU is required to support this functionality.
> > > 
> > > Changelog:
> > > * documentation updated
> > > * container enable/disable ioctls added
> > > * request_module(spapr_iommu) added
> > > * various locks fixed
> > > 
> > > Cc: David Gibson 
> > > Signed-off-by: Alexey Kardashevskiy 
> > > ---
> > 
> > 
> > Looking pretty good.  There's one problem with the detach_group,
> > otherwise just some trivial comments below.  What's the status of the
> > tce code that this depends on?  Thanks,
> 
> [snip]
> > > +static void tce_iommu_detach_group(void *iommu_data,
> > > + struct iommu_group *iommu_group)
> > > +{
> > > + struct tce_container *container = iommu_data;
> > > + struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> > > +
> > > + BUG_ON(!tbl);
> > > + mutex_lock(&container->lock);
> > > + if (tbl != container->tbl) {
> > > + pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> > > + iommu_group_id(iommu_group),
> > > + iommu_group_id(tbl->it_group));
> > > + } else if (container->enabled) {
> > > + pr_err("tce_vfio: detaching group #%u from enabled container\n",
> > > + iommu_group_id(tbl->it_group));
> > 
> > Hmm, something more than a pr_err needs to happen here.  Wouldn't this
> > imply a disable and going back to an unprivileged container?
> 
> Uh, no.  I think the idea here is that we use the enable/disable
> semantic to address some other potential problems.  Specifically,
> sidestepping the problems of what happens if you change the
> container's capabilities by adding/removing groups while in the middle
> of using it.  So the point is that the detach fails when the group is
> enabled, rather than implicitly doing anything.

The function returns void.  We're not failing the detach, just getting
into a broken state.  This is only called to unwind attaching groups
when the iommu is set or if the user explicitly calls
GROUP_UNSET_CONTAINER.  The former won't have had a chance to call
enable but the latter would need to be fixed.  Thanks,

Alex
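
Since detach cannot report failure, one way to read the review comment is that detach should itself undo the enable. The following is a simplified userspace model of that idea, not the driver code: the struct and function names are hypothetical, locking is left out, and the locked_pages field merely stands in for whatever accounting enable performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the container state in the quoted patch. */
struct model_container {
	const void *tbl;	/* table owned by the attached group, or NULL */
	bool enabled;		/* set by the enable ioctl */
	long locked_pages;	/* pages accounted at enable time */
};

/* Undo what enable did; safe to call when already disabled. */
static void model_disable(struct model_container *c)
{
	if (!c->enabled)
		return;
	c->enabled = false;
	c->locked_pages = 0;
}

/*
 * Detach returns void, so it cannot refuse.  Instead of only logging,
 * it disables the container first and then drops the table.
 */
void model_detach_group(struct model_container *c, const void *tbl)
{
	assert(c != NULL && tbl != NULL);

	if (tbl != c->tbl)
		return;		/* not the attached group: nothing to undo */

	model_disable(c);	/* implicit disable */
	c->tbl = NULL;		/* release ownership of the table */
}
```

After such a detach the container is back in the same state as one that never had a group, rather than left enabled with no table behind it.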


Re: [PATCH] iommu: add a function to find an iommu group by id

2013-03-21 Thread Alex Williamson

On Thu, 2013-03-21 at 18:48 +1100, Alexey Kardashevskiy wrote:
> As IOMMU groups are exposed to the user space by their numbers,
> the user space can use them in various kernel APIs so the kernel
> might need an API to find a group by its ID.
> 
> As an example, QEMU VFIO on PPC64 platform needs it to associate
> a logical bus number (LIOBN) with a specific IOMMU group in order
> to support in-kernel handling of DMA map/unmap requests.
> 
> The patch adds the iommu_group_find(id) function which performs
> such search.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/iommu/iommu.c |   26 ++
>  include/linux/iommu.h |1 +
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b0afd3d..6340cac 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -205,6 +205,32 @@ printk("%s %u grp %d\n", __func__, __LINE__, iommu_group_id(group));
>  }
>  EXPORT_SYMBOL_GPL(iommu_group_alloc);
>  
> +struct iommu_group *iommu_group_find(int id)
> +{
> + struct kobject *group_kobj;
> + struct iommu_group *grp;
> + const char *name;
> +
> + if (!iommu_group_kset)
> + return NULL;
> +
> + name = kasprintf(GFP_KERNEL, "%d", id);
> + if (!name)
> + return NULL;
> +
> + group_kobj = kset_find_obj(iommu_group_kset, name);
> + kfree(name);
> +
> + if (!group_kobj)
> + return NULL;
> +
> + grp = container_of(group_kobj, struct iommu_group, kobj);
> + BUG_ON(grp->id != id);
> +
> + return grp;
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_find);

Don't you need to do some reference counting here?  Otherwise there's no
guarantee the returned pointer is still valid by the time it's used.
The interface should probably be iommu_group_get_by_id().  Thanks,

Alex
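
The "get" naming implies a contract: a successful lookup hands the caller a reference, and the caller must drop it when done. Below is a small userspace model of that contract. It is not the kernel implementation, and all names are hypothetical. In the kernel the references would be kobject references, and kset_find_obj() already returns the kobject with a reference held, so the interface needs to say who releases it.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a refcounted group; refcount 0 means "being destroyed". */
struct model_group {
	int id;
	int refcount;
};

/*
 * Look up a group by id and take a reference on it.  Returns NULL if no
 * live group has that id.  The caller owns one reference on success and
 * must release it with model_group_put().
 */
struct model_group *model_group_get_by_id(struct model_group *groups,
					  size_t ngroups, int id)
{
	size_t i;

	assert(groups != NULL || ngroups == 0);

	for (i = 0; i < ngroups; i++) {
		if (groups[i].id == id && groups[i].refcount > 0) {
			groups[i].refcount++;
			return &groups[i];
		}
	}
	return NULL;
}

/* Drop a reference taken by model_group_get_by_id(). */
void model_group_put(struct model_group *group)
{
	assert(group != NULL && group->refcount > 0);
	group->refcount--;
}
```

The point of the pairing is that the pointer stays valid between get and put, which a bare find() cannot promise.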

> +
>  /**
>   * iommu_group_get_iommudata - retrieve iommu_data registered for a group
>   * @group: the group
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index f3b99e1..20281d5 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -113,6 +113,7 @@ struct iommu_ops {
>  extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
>  extern bool iommu_present(struct bus_type *bus);
>  extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
> +extern struct iommu_group *iommu_group_find(int id);
>  extern void iommu_domain_free(struct iommu_domain *domain);
>  extern int iommu_attach_device(struct iommu_domain *domain,
>  struct device *dev);




Re: DMAR faults from unrelated device when vfio is used

2013-02-04 Thread Alex Williamson

On Mon, 2013-02-04 at 11:10 +0100, David Gstir wrote:
> Hi!
> 
> I get the following error messages over and over again when using vfio
> in qemu-kvm:
> 
> [ 1692.021403] dmar: DMAR:[DMA Read] Request device [00:02.0] fault addr 1a45aa9000
> [ 1692.021403] DMAR:[fault reason 12] non-zero reserved fields in PTE
> [ 1692.021416] dmar: DRHD: handling fault status reg 2
> 
> This pci device is the graphics card, which I did not assign to qemu!
> I did assign the following devices:
> 00:1a.0, 00:1b.0, 00:1c.0, 00:1c.6, 00:1d.0, 03:00.0.

Piecing together your logs:

iommu_group 5
00:1a.0 USB controller [0c03]: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 [8086:1c2d] (rev 04)
iommu_group 6
00:1b.0 Audio device [0403]: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller [8086:1c20] (rev 04)
iommu_group 7
00:1c.0 PCI bridge [0604]: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 [8086:1c10] (rev b4)
00:1c.6 PCI bridge [0604]: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 7 [8086:1c1c] (rev b4)
03:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev ff)
iommu_group 8
00:1d.0 USB controller [0c03]: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 [8086:1c26] (rev 04)

Can you clarify what you mean by assign?  Are you actually assigning the
root ports to the qemu guest (1c.0 & 1c.6)?  vfio will require they be
owned by vfio-pci to make use of 3:00.0, but assigning them to the guest
is not recommended.  Can you provide your qemu command line?  We need
to re-visit how to handle pcieport devices with vfio-pci, perhaps
white-listing it as a vfio "compatible" driver, but this still should
not interfere with devices external to the group.

The DMAR fault address looks pretty bogus unless you happen to have
100GB+ of ram in the system.
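
For anyone checking the arithmetic: the fault address in the log is 0x1a45aa9000, and shifting it right by 30 bits gives whole GiB, which works out to about 105 GiB. A tiny sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Round a physical or bus address down to whole GiB (1 GiB = 2^30 bytes). */
uint64_t addr_to_gib(uint64_t addr)
{
	return addr >> 30;
}
```

Calling addr_to_gib(0x1a45aa9000ULL) is what backs the "100GB+" remark: unless the machine has that much memory, the address cannot be a legitimate DMA target.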

> The error occurs at random and is not reproducible every time. It
> happens about every third reboot. 
> I'm running qemu-kvm 1.3.0 (kvm-1.3.0-187.3), kernel 3.8.0-rc5 and
> windows 7 as guest OS. The hardware uses an Intel IOMMU. See
> attachments for output of lspci, and details on iommu groups
> 
> I'm not sure if this problem originates from qemu, kvm, vfio or the
> GPU driver.
> Do you have any hints how to debug this further?

vfio makes use of the IOMMU API for programming DMA translations, so any
reserved fields would have to be programmed by intel-iommu itself.  We
could of course be passing some kind of bogus data that intel-iommu
isn't catching.  If you're assigning the root ports to the guest, I'd
start with that, don't do it.  Attach them to vfio, but don't give them
to the guest.  Maybe that'll give us a hint.  I also notice that your
USB 3 controller is dead:

03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev ff) (prog-if ff)
!!! Unknown header type 7f

We only see unknown header type 7f when the read from the device returns
-1.  This might have something to do with the root port above it (1c.6)
being in state D3.  Windows likes to put unused devices in D3, which
leads me to suspect you are giving it to the guest.  Let's see what
happens without that.  Thanks,

Alex
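
For illustration, a minimal sketch of why an all-ones config read shows up
as "header type 7f".  It assumes the header layout is the low seven bits of
the byte at config offset 0x0e, with bit 7 being the multi-function flag;
the helper names are made up for this example and are not lspci code.

```c
#include <stdint.h>
#include <stdio.h>

#define PCI_HEADER_TYPE      0x0e  /* config space offset of the header type byte */
#define PCI_HEADER_TYPE_MASK 0x7f  /* bit 7 is the multi-function flag */

/* Hypothetical helper: header layout from the first bytes of config space. */
static uint8_t header_layout(const uint8_t *config)
{
    return config[PCI_HEADER_TYPE] & PCI_HEADER_TYPE_MASK;
}

/* Layouts 0 (normal), 1 (PCI bridge) and 2 (CardBus bridge) are defined. */
static int header_layout_known(uint8_t layout)
{
    return layout <= 2;
}

/* Hypothetical helper: report the layout roughly the way lspci words it. */
static void report_header(const uint8_t *config)
{
    uint8_t layout = header_layout(config);

    if (header_layout_known(layout))
        printf("header type %02x\n", layout);
    else
        printf("!!! Unknown header type %02x\n", layout);
}
```

A device that does not respond returns 0xff for every byte, so the masked
value is 0x7f, which matches none of the defined layouts.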

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 50/62] vfio: convert to idr_alloc()

2013-02-04 Thread Alex Williamson

On Sat, 2013-02-02 at 17:20 -0800, Tejun Heo wrote:
> Convert to the much saner new idr interface.
> 
> Only compile tested.
> 
> Signed-off-by: Tejun Heo 
> Cc: Alex Williamson 
> Cc: k...@vger.kernel.org
> ---
> This patch depends on an earlier idr changes and I think it would be
> best to route these together through -mm.  Please holler if there's
> any objection.  Thanks.
> 
>  drivers/vfio/vfio.c | 18 +-
>  1 file changed, 1 insertion(+), 17 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 12c264d..0132846 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -139,23 +139,7 @@ EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
>   */
>  static int vfio_alloc_group_minor(struct vfio_group *group)
>  {
> - int ret, minor;
> -
> -again:
> - if (unlikely(idr_pre_get(&vfio.group_idr, GFP_KERNEL) == 0))
> - return -ENOMEM;
> -
> - /* index 0 is used by /dev/vfio/vfio */

I'd have preferred to keep this comment.  If you do a v2, please keep
it, otherwise I'll add it back later.

Acked-by: Alex Williamson 

> - ret = idr_get_new_above(&vfio.group_idr, group, 1, &minor);
> - if (ret == -EAGAIN)
> - goto again;
> - if (ret || minor > MINORMASK) {
> - if (minor > MINORMASK)
> - idr_remove(&vfio.group_idr, minor);
> - return -ENOSPC;
> - }
> -
> - return minor;
> + return idr_alloc(&vfio.group_idr, group, 1, MINORMASK + 1, GFP_KERNEL);
>  }
>  
>  static void vfio_free_group_minor(int minor)
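
A rough userspace sketch of the range semantics the one-line conversion
relies on: the lowest free id in [start, end) is returned, and -ENOSPC when
the range is full.  This is a toy stand-in, not the kernel idr; the toy_*
names and the tiny table size are assumptions made for the example.  It also
shows where the "index 0" comment would naturally live.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)

#define TOY_SLOTS 8  /* tiny table so exhaustion is easy to demonstrate */

struct toy_idr {
    void *slot[TOY_SLOTS];
};

/*
 * Stand-in for idr_alloc(): return the lowest free id in [start, end),
 * or -ENOSPC if the range is full.  The range is clamped to the toy table.
 */
static int toy_idr_alloc(struct toy_idr *idr, void *ptr, int start, int end)
{
    assert(ptr != NULL && start >= 0);

    if (end > TOY_SLOTS)
        end = TOY_SLOTS;

    for (int id = start; id < end; id++) {
        if (!idr->slot[id]) {
            idr->slot[id] = ptr;
            return id;
        }
    }
    return -ENOSPC;
}

static void toy_idr_remove(struct toy_idr *idr, int id)
{
    assert(id >= 0 && id < TOY_SLOTS);
    idr->slot[id] = NULL;
}

/* index 0 is used by /dev/vfio/vfio, so group minors start at 1 */
static int toy_alloc_group_minor(struct toy_idr *idr, void *group)
{
    return toy_idr_alloc(idr, group, 1, MINORMASK + 1);
}
```

Because the upper bound is exclusive, passing MINORMASK + 1 keeps every
returned minor within MINORMASK, which is what the removed open-coded check
enforced.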




Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-04 Thread Alex Williamson

On Sun, 2013-02-03 at 14:10 +, Pandarathil, Vijaymohan R wrote:
>   - Create eventfd per vfio device assigned to a guest and register an
>   event handler
> 
>   - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl
> 
>   - When the device encounters an error, the eventfd is signalled
>   and the qemu eventfd handler gets invoked.
> 
>   - In the handler decide what action to take. Current action taken
>   is to terminate the guest.
> 
> Signed-off-by: Vijay Mohan Pandarathil 
> ---
>  hw/vfio_pci.c  | 105 
> +
>  linux-headers/linux/vfio.h |   1 +
>  2 files changed, 106 insertions(+)
> 
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index c51ae67..4e2f768 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -130,6 +130,8 @@ typedef struct VFIODevice {
>  QLIST_ENTRY(VFIODevice) next;
>  struct VFIOGroup *group;
>  bool reset_works;
> +EventNotifier err_notifier;
> +bool pci_aer;

Re-order these for alignment please.  ie:

struct VFIOGroup *group;
EventNotifier err_notifier;
bool reset_works;
bool pci_aer;
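
A small sketch of what the re-ordering buys.  ToyNotifier is a stand-in for
EventNotifier and is assumed here to be a pair of ints; the struct names are
invented for the example.  The exact offsets depend on the ABI, so the
program prints them rather than assuming them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for QEMU's EventNotifier; assumed here to be a pair of ints. */
typedef struct {
    int rfd;
    int wfd;
} ToyNotifier;

struct as_posted {             /* member order from the patch */
    void *group;
    bool reset_works;
    ToyNotifier err_notifier;  /* int alignment forces padding before it */
    bool pci_aer;
};

struct reordered {             /* member order suggested in the review */
    void *group;
    ToyNotifier err_notifier;
    bool reset_works;
    bool pci_aer;              /* the two bools now sit side by side */
};

/* Print where each member lands so the padding holes are visible. */
static void print_layouts(void)
{
    printf("as posted: reset_works@%zu err_notifier@%zu pci_aer@%zu size=%zu\n",
           offsetof(struct as_posted, reset_works),
           offsetof(struct as_posted, err_notifier),
           offsetof(struct as_posted, pci_aer),
           sizeof(struct as_posted));
    printf("reordered: err_notifier@%zu reset_works@%zu pci_aer@%zu size=%zu\n",
           offsetof(struct reordered, err_notifier),
           offsetof(struct reordered, reset_works),
           offsetof(struct reordered, pci_aer),
           sizeof(struct reordered));
}
```

Calling print_layouts() on the build target shows whether the hole in the
middle of the posted layout goes away and whether the total size shrinks.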

>  } VFIODevice;
>  
>  typedef struct VFIOGroup {
> @@ -1922,6 +1924,106 @@ static void vfio_put_device(VFIODevice *vdev)
>  }
>  }
>  
> +static void vfio_err_notifier_handler(void *opaque)
> +{
> +VFIODevice *vdev = opaque;
> +
> +if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
> +return;
> +}
> +
> +/*
> + * TBD. Retrieve the error details and decide what action
> + * needs to be taken. One of the actions could be to pass
> + * the error to the guest and have the guest driver recover
> + * from the error. This requires that PCIe capabilities be
> + * exposed to the guest. At present, we just terminate the
> + * guest to contain the error.
> + */
> +
> +error_report("%s (%04x:%02x:%02x.%x)"
> +"Unrecoverable error detected... Terminating guest\n",
> +__func__, vdev->host.domain, vdev->host.bus,
> +vdev->host.slot, vdev->host.function);
> +
> +hw_error("(%04x:%02x:%02x.%x) Unrecoverable device error\n",
> +vdev->host.domain, vdev->host.bus,
> +vdev->host.slot, vdev->host.function);
> +
> +return;

As Blue Swirl mentions, these returns at the end of void functions are
unnecessary.

> +}
> +
> +static void vfio_register_err_notifier(VFIODevice *vdev)
> +{
> +int ret;
> +int argsz;
> +struct vfio_irq_set *irq_set;
> +int32_t *pfd;
> +
> +if (event_notifier_init(&vdev->err_notifier, 0)) {
> +error_report("vfio: Warning: Unable to init event notifier for error detection\n");
> +return;
> +}
> +
> +argsz = sizeof(*irq_set) + sizeof(*pfd);
> +
> +irq_set = g_malloc0(argsz);
> +irq_set->argsz = argsz;
> +irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> + VFIO_IRQ_SET_ACTION_TRIGGER;
> +irq_set->index = VFIO_PCI_ERR_IRQ_INDEX;
> +irq_set->start = 0;
> +irq_set->count = 1;
> +pfd = (int32_t *)&irq_set->data;
> +
> +*pfd = event_notifier_get_fd(&vdev->err_notifier);
> +qemu_set_fd_handler(*pfd, vfio_err_notifier_handler, NULL, vdev);
> +
> +ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +if (ret) {
> +DPRINTF("vfio: Error notification not supported for the device\n");

We should know this already though, right?  Where's our call to
VFIO_DEVICE_GET_IRQ_INFO for this index?  I'd expect that should happen
in vfio_get_device where it can set some flag true, then this function
would exit immediately if that flag isn't set.  Then by the time we're
here, it's a legitimate error_report if we think this should work and
doesn't.
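
Something along these lines is what I have in mind for the probe, as a
sketch only.  It uses VFIO_DEVICE_GET_IRQ_INFO and struct vfio_irq_info from
linux/vfio.h; the function name is hypothetical, and it assumes the header
in use already carries VFIO_PCI_ERR_IRQ_INDEX from this series.

```c
#include <linux/vfio.h>
#include <stdbool.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Ask the kernel whether an IRQ index exists on this device and can be
 * signalled through an eventfd.  Any ioctl failure is treated as "no".
 */
static bool vfio_irq_index_usable(int device_fd, unsigned int index)
{
    struct vfio_irq_info info;

    memset(&info, 0, sizeof(info));
    info.argsz = sizeof(info);
    info.index = index;

    if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info) < 0)
        return false;

    return info.count >= 1 && (info.flags & VFIO_IRQ_INFO_EVENTFD);
}
```

vfio_get_device could store the result for VFIO_PCI_ERR_IRQ_INDEX in a flag
such as pci_aer, and the register function would return early when it is
false, leaving a failed SET_IRQS as a real error.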

> +qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
> +event_notifier_cleanup(&vdev->err_notifier);
> +g_free(irq_set);
> +return;
> +}
> +g_free(irq_set);
> +vdev->pci_aer = 1;

bool, so set to true or false.

> +return;
> +}
> +static void vfio_unregister_err_notifier(VFIODevice *vdev)
> +{
> +int argsz;
> +struct vfio_irq_set *irq_set;
> +int32_t *pfd;
> +int ret;
> +
> +if (!vdev->pci_aer) {
> +return;
> +}
> +
> +argsz = sizeof(*irq_set) + sizeof(*pfd);
> +
> +irq_set = g_malloc0(argsz);
> +irq_set->argsz = argsz;
> +irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> + VFIO_IRQ_SET_ACTION_TRIGGER;
> +irq_set->index = VFIO_PCI_ERR_IRQ_INDEX;
> +irq_set->start = 0;
> +irq_set->count = 1;
> +pfd = (int32_t *)&irq_set->data;
> +*pfd = -1;
> +
> +ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +if (ret) {
> +DPRINTF("vfio: Failed to de-assign error fd: %d\n", ret);

This is also a legitimate error_report.  In general, if kernel vfio-pci
does or doesn't support a non-critical feature, that's a DPRINTF.  If
it's told us the feature is there and something d

Re: [PATCH v3 2/3] VFIO-AER: Vfio-pci driver changes for supporting AER

2013-02-04 Thread Alex Williamson

On Sun, 2013-02-03 at 14:10 +, Pandarathil, Vijaymohan R wrote:
>   - New VFIO_SET_IRQ ioctl option to pass the eventfd that is signaled when
>   an error occurs in the vfio_pci_device
> 
>   - Register pci_error_handler for the vfio_pci driver
> 
>   - When the device encounters an error, the error handler registered by
>   the vfio_pci driver gets invoked by the AER infrastructure
> 
>   - In the error handler, signal the eventfd registered for the device.
> 
>   - This results in the qemu eventfd handler getting invoked and
>   appropriate action taken for the guest.
> 
> Signed-off-by: Vijay Mohan Pandarathil 
> ---
>  drivers/vfio/pci/vfio_pci.c | 43 
> -
>  drivers/vfio/pci/vfio_pci_intrs.c   | 30 ++
>  drivers/vfio/pci/vfio_pci_private.h |  1 +
>  include/uapi/linux/vfio.h   |  1 +
>  4 files changed, 74 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index b28e66c..818b1ed 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -196,7 +196,9 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
>  
>   return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
>   }
> - }
> + } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX)
> + if (pci_is_pcie(vdev->pdev))
> + return 1;
>  
>   return 0;
>  }
> @@ -302,6 +304,16 @@ static long vfio_pci_ioctl(void *device_data,
>   if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
>   return -EINVAL;
>  
> + switch (info.index) {
> + case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
> + break;
> + case VFIO_PCI_ERR_IRQ_INDEX:
> + if (pci_is_pcie(vdev->pdev))
> + break;
> + default:
> + return -EINVAL;
> + }
> +
>   info.flags = VFIO_IRQ_INFO_EVENTFD;
>  
>   info.count = vfio_pci_get_irq_count(vdev, info.index);
> @@ -538,11 +550,40 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>   kfree(vdev);
>  }
>  
> +static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
> +   pci_channel_state_t state)
> +{
> + struct vfio_pci_device *vpdev;
> + void *vdev;

Can't vdev still be a struct vfio_device*?  We're not de-referencing it,
so I don't think you'll get a warning.

> +
> + vdev = vfio_device_get_from_dev(&pdev->dev);
> + if (vdev == NULL)
> + return PCI_ERS_RESULT_DISCONNECT;
> +
> + vpdev = vfio_device_data(vdev);
> + if (vpdev == NULL) {
> + vfio_device_put(vdev);
> + return PCI_ERS_RESULT_DISCONNECT;
> + }
> +
> + if (vpdev->err_trigger)
> + eventfd_signal(vpdev->err_trigger, 1);
> +
> + vfio_device_put(vdev);
> +
> + return PCI_ERS_RESULT_CAN_RECOVER;
> +}
> +
> +static struct pci_error_handlers vfio_err_handlers = {
> + .error_detected = vfio_pci_aer_err_detected,
> +};
> +
>  static struct pci_driver vfio_pci_driver = {
>   .name   = "vfio-pci",
>   .id_table   = NULL, /* only dynamic ids */
>   .probe  = vfio_pci_probe,
>   .remove = vfio_pci_remove,
> + .err_handler= &vfio_err_handlers,
>  };
>  
>  static void __exit vfio_pci_cleanup(void)
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3639371..83035b1 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -745,6 +745,29 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
>   return 0;
>  }
>  
> +static int vfio_pci_set_err_trigger(struct vfio_pci_device *vdev,
> + unsigned index, unsigned start,
> + unsigned count, uint32_t flags, void *data)
> +{
> + int32_t fd = *(int32_t *)data;
> +
> + if ((index != VFIO_PCI_ERR_IRQ_INDEX) ||
> + !(flags & VFIO_IRQ_SET_DATA_EVENTFD))
> + return -EINVAL;

Again, why no support for DATA_NONE or DATA_BOOL to enable loopback
support?
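
To make the request concrete, one possible shape for the dispatch, written
as a plain function so it can be tried in userspace.  The flag constants
come from linux/vfio.h; the enum, the function name and the "count must be
1" rule are assumptions for this sketch, not a description of the final
kernel code.

```c
#include <assert.h>
#include <linux/vfio.h>
#include <stddef.h>
#include <stdint.h>

enum err_action {
    ERR_ACT_INVALID,  /* reject the request with -EINVAL */
    ERR_ACT_FIRE,     /* loopback: signal the registered eventfd now */
    ERR_ACT_NOP,      /* DATA_BOOL with a false value: nothing to do */
    ERR_ACT_SET_FD,   /* install the eventfd, or remove it for fd == -1 */
};

/* Decide what a SET_IRQS call on the error index should do. */
static enum err_action classify_err_set(uint32_t flags, unsigned int count,
                                        const void *data)
{
    if (count != 1)
        return ERR_ACT_INVALID;

    switch (flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
    case VFIO_IRQ_SET_DATA_NONE:
        return ERR_ACT_FIRE;
    case VFIO_IRQ_SET_DATA_BOOL:
        assert(data != NULL);
        /* DATA_BOOL payload is an array of u8, one entry per IRQ */
        return *(const uint8_t *)data ? ERR_ACT_FIRE : ERR_ACT_NOP;
    case VFIO_IRQ_SET_DATA_EVENTFD:
        assert(data != NULL);
        return ERR_ACT_SET_FD;
    default:
        /* no data type, or more than one data type bit set */
        return ERR_ACT_INVALID;
    }
}
```

With DATA_NONE and DATA_BOOL handled, userspace can exercise its error
handler without needing real hardware to fault.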

> +
> + if (fd == -1) {
> + if (vdev->err_trigger)
> + eventfd_ctx_put(vdev->err_trigger);
> + vdev->err_trigger = NULL;
> + return 0;
> + } else if (fd >= 0) {
> + vdev->err_trigger = eventfd_ctx_fdget(fd);
> + if (IS_ERR(vdev->err_trigger))
> + return PTR_ERR(vdev->err_trigger);

There's a bug here that if the fdget fails we leave an ERR_PTR in
vdev->err_trigger which is non-NULL, so we'll use it to try to trigger
and attempt to 'put' it above.  You probably want to think about
ordering or locking to avoid races
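
A userspace sketch of the ordering fix for the ERR_PTR problem: fetch into a
local, check it, and only then publish it.  ERR_PTR/IS_ERR/PTR_ERR are
re-created here to mirror the kernel helpers, and the toy_* types and
functions are stand-ins invented for the example.  It does not attempt the
locking or the eventfd_ctx_put of a previously installed context.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copies of the kernel's error-pointer helpers, for illustration. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct toy_ctx { int fd; };
struct toy_dev { struct toy_ctx *err_trigger; };

static struct toy_ctx toy_ctx_pool[4];

/* Hypothetical stand-in for eventfd_ctx_fdget(): only fds 0..3 succeed. */
static struct toy_ctx *toy_ctx_fdget(int fd)
{
    if (fd < 0 || fd >= 4)
        return ERR_PTR(-EBADF);
    toy_ctx_pool[fd].fd = fd;
    return &toy_ctx_pool[fd];
}

static int toy_set_err_trigger(struct toy_dev *dev, int fd)
{
    struct toy_ctx *ctx;

    assert(dev != NULL);

    if (fd == -1) {
        dev->err_trigger = NULL;  /* the kernel would also put the ctx */
        return 0;
    }
    if (fd < 0)
        return -EINVAL;

    ctx = toy_ctx_fdget(fd);
    if (IS_ERR(ctx))
        return (int)PTR_ERR(ctx); /* dev->err_trigger is left untouched */

    dev->err_trigger = ctx;
    return 0;
}
```

The point is only that err_trigger is either NULL or a valid context at all
times, so the signal and put paths never see an error pointer.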

Re: [PATCH v3 1/3] VFIO: Wrapper for getting reference to vfio_device from device

2013-02-04 Thread Alex Williamson

On Sun, 2013-02-03 at 14:10 +, Pandarathil, Vijaymohan R wrote:
>   - Added vfio_device_get_from_dev() as wrapper to get
>   reference to vfio_device from struct device.
> 
>   - Added vfio_device_data() as a wrapper to get device_data from
>   vfio_device.
> 
> Signed-off-by: Vijay Mohan Pandarathil 
> ---
>  drivers/vfio/vfio.c  | 41 -
>  include/linux/vfio.h |  3 +++
>  2 files changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 12c264d..f0a78a2 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -407,12 +407,13 @@ static void vfio_device_release(struct kref *kref)
>  }
>  
>  /* Device reference always implies a group reference */
> -static void vfio_device_put(struct vfio_device *device)
> +void vfio_device_put(struct vfio_device *device)
>  {
>   struct vfio_group *group = device->group;
>   kref_put_mutex(&device->kref, vfio_device_release, &group->device_lock);
>   vfio_group_put(group);
>  }
> +EXPORT_SYMBOL_GPL(vfio_device_put);
>  
>  static void vfio_device_get(struct vfio_device *device)
>  {
> @@ -642,8 +643,12 @@ int vfio_add_group_dev(struct device *dev,
>  }
>  EXPORT_SYMBOL_GPL(vfio_add_group_dev);
>  
> -/* Test whether a struct device is present in our tracking */
> -static bool vfio_dev_present(struct device *dev)
> +/**
> + * This does a get on the vfio_device from device.
> + * Callers of this function will have to call vfio_put_device() to
> + * remove the reference.
> + */
> +struct vfio_device *vfio_device_get_from_dev(struct device *dev)
>  {
>   struct iommu_group *iommu_group;
>   struct vfio_group *group;
> @@ -651,25 +656,43 @@ static bool vfio_dev_present(struct device *dev)
>  
>   iommu_group = iommu_group_get(dev);
>   if (!iommu_group)
> - return false;
> + return NULL;
>  
>   group = vfio_group_get_from_iommu(iommu_group);
>   if (!group) {
>   iommu_group_put(iommu_group);
> - return false;
> + return NULL;
>   }
>  
>   device = vfio_group_get_device(group, dev);
>   if (!device) {
>   vfio_group_put(group);
>   iommu_group_put(iommu_group);
> - return false;
> + return NULL;

nit, this test isn't necessary, skipping the whole if block and falling
through to the return below is functionally the same.

>   }
> -
> - vfio_device_put(device);
>   vfio_group_put(group);
>   iommu_group_put(iommu_group);
> - return true;
> + return device;
> +}
> +EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
> +

Let's add a comment here that the caller must hold a reference to the
vfio_device.  Thanks,

Alex

> +void *vfio_device_data(struct vfio_device *device)
> +{
> + return device->device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_device_data);
> +
> +/* Test whether a struct device is present in our tracking */
> +static bool vfio_dev_present(struct device *dev)
> +{
> + struct vfio_device *device;
> +
> + device = vfio_device_get_from_dev(dev);
> + if (device) {
> + vfio_device_put(device);
> + return true;
> + } else
> + return false;
>  }
>  
>  /*
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ab9e862..ac8d488 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -45,6 +45,9 @@ extern int vfio_add_group_dev(struct device *dev,
> void *device_data);
>  
>  extern void *vfio_del_group_dev(struct device *dev);
> +extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
> +extern void vfio_device_put(struct vfio_device *device);
> +extern void *vfio_device_data(struct vfio_device *device);
>  
>  /**
>   * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks




Re: [PATCH v3 0/3] AER-KVM: Error containment of VFIO devices assigned to KVM guests

2013-02-04 Thread Alex Williamson

On Sun, 2013-02-03 at 14:10 +, Pandarathil, Vijaymohan R wrote:
> Add support for error containment when a VFIO device assigned to a KVM
> guest encounters an error. This is for PCIe devices/drivers that support AER
> functionality. When the host OS is notified of an error in a device either
> through the firmware first approach or through an interrupt handled by the AER
> root port driver, the error handler registered by the vfio-pci driver gets
> invoked. The qemu process is signaled through an eventfd registered per
> VFIO device by the qemu process. In the eventfd handler, qemu decides on
> what action to take. In this implementation, guest is brought down to
> contain the error.
> 
> 
> v3:
>  - Removed PCI_AER* flags from device info ioctl.
>  - Incorporated feedback

Hi Vijay,

It's getting much closer, just a few comments in each patch.  As Gleb
points out, please try to use git send-email (or even stg mail) so that
threading is maintained for these.  A side effect will be that you can't
send kernel & qemu patches in the same series, but that's not
necessarily a bad thing.  Thanks,

Alex

> v2:
>  - Rebased to latest upstream stable bits
>  - Changed the new ioctl to be part of VFIO_SET_IRQs ioctl
>  - Added a new patch to get/put reference to a vfio device from struct device
>  - Incorporated all other feedback.
> 
> ---
> 
> Vijay Mohan Pandarathil(3):
> 
> [PATCH 1/3] VFIO: Wrapper for getting reference to vfio_device from device 
> [PATCH 2/3] VFIO-AER: Vfio-pci driver changes for supporting AER
> [PATCH 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
> 
> Kernel files changed
> 
>  drivers/vfio/vfio.c  | 41 -
>  include/linux/vfio.h |  3 +++
>  2 files changed, 35 insertions(+), 9 deletions(-)
> 
>  drivers/vfio/pci/vfio_pci.c | 43 
> -
>  drivers/vfio/pci/vfio_pci_intrs.c   | 30 ++
>  drivers/vfio/pci/vfio_pci_private.h |  1 +
>  include/uapi/linux/vfio.h   |  1 +
>  4 files changed, 74 insertions(+), 1 deletion(-)
> 
> Qemu files changed
> 
>  hw/vfio_pci.c  | 105 
> +
>  linux-headers/linux/vfio.h |   1 +
>  2 files changed, 106 insertions(+)




Re: [PATCH] intel_irq_remapping: Clean up x2apic optout security warning mess

2013-02-04 Thread Alex Williamson

On Fri, 2013-02-01 at 14:57 -0800, Andy Lutomirski wrote:
> Current kernels print this on my Dell server:
> 
>[ cut here ]
>WARNING: at drivers/iommu/intel_irq_remapping.c:542
>intel_enable_irq_remapping+0x7b/0x27e()
>Hardware name: PowerEdge R620
>Your BIOS is broken and requested that x2apic be disabled
>This will leave your machine vulnerable to irq-injection attacks
>Use 'intremap=no_x2apic_optout' to override BIOS request
>[...]
>Enabled IRQ remapping in xapic mode
>x2apic not enabled, IRQ remapping is in xapic mode
> 
> This is inconsistent with itself -- interrupt remapping is *on*.
> 
> Fix the mess by making the warnings say what they mean and by making
> sure that compatibility format interrupts (the dangerous ones) are
> disabled if x2apic is present regardless of BIOS settings.
> 
> With this patch applied, the output is:
> 
>   Your BIOS is broken and requested that x2apic be disabled.
>   This will slightly decrease performance.
>   Use 'intremap=no_x2apic_optout' to override BIOS request.
>   Enabled IRQ remapping in xapic mode
>   x2apic not enabled, IRQ remapping is in xapic mode
> 
> This should make us as or more secure than we are now and replace
> a rather scary warning with a much less scary warning on silly
> but functional systems.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  drivers/iommu/intel_irq_remapping.c | 36 
>  1 file changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index af8904d..eca8832 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -425,11 +425,22 @@ static void iommu_set_irq_remapping(struct intel_iommu *iommu, int mode)
>  
>   /* Enable interrupt-remapping */
>   iommu->gcmd |= DMA_GCMD_IRE;
> + iommu->gcmd &= ~DMA_GCMD_CFI;  /* Block compatibility-format MSIs */
>   writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
>  
>   IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> readl, (sts & DMA_GSTS_IRES), sts);
>  
> + /*
> +  * With CFI clear in the Global Command register, we should be
> +  * protected from dangerous (i.e. compatibility) interrupts
> +  * regardless of x2apic status.  Check just to be sure.
> +  */
> + if (sts & DMA_GSTS_CFIS)
> + WARN(1, KERN_WARNING
> + "Compatibility-format IRQs enabled despite intr remapping;\n"
> + "you are vulnerable to IRQ injection.\n");
> +
>   raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
>  }
>  
> @@ -526,20 +537,24 @@ static int __init intel_irq_remapping_supported(void)
>  static int __init intel_enable_irq_remapping(void)
>  {
>   struct dmar_drhd_unit *drhd;
> + bool x2apic_present;
>   int setup = 0;
>   int eim = 0;
>  
> + x2apic_present = x2apic_supported();
> +
>   if (parse_ioapics_under_ir() != 1) {
>   printk(KERN_INFO "Not enable interrupt remapping\n");
> - return -1;
> + goto error;
>   }
>  
> - if (x2apic_supported()) {
> + if (x2apic_present) {
>   eim = !dmar_x2apic_optout();
> - WARN(!eim, KERN_WARNING
> -"Your BIOS is broken and requested that x2apic be disabled\n"
> -"This will leave your machine vulnerable to irq-injection attacks\n"
> -"Use 'intremap=no_x2apic_optout' to override BIOS request\n");
> + if (!eim)
> + printk(KERN_WARNING
> + "Your BIOS is broken and requested that x2apic be disabled.\n"
> + "This will slightly decrease performance.\n"
> + "Use 'intremap=no_x2apic_optout' to override BIOS request.\n");
>   }
>  
>   for_each_drhd_unit(drhd) {
> @@ -578,7 +593,7 @@ static int __init intel_enable_irq_remapping(void)
>   if (eim && !ecap_eim_support(iommu->ecap)) {
>   printk(KERN_INFO "DRHD %Lx: EIM not supported by DRHD, "
>  " ecap %Lx\n", drhd->reg_base_addr, iommu->ecap);
> - return -1;
> + goto error;
>   }
>   }
>  
> @@ -594,7 +609,7 @@ static int __init intel_enable_irq_remapping(void)
>   printk(KERN_ERR "DRHD %Lx: failed to enable queued, "
>  " invalidation, ecap %Lx, ret %d\n",
>  drhd->reg_base_addr, iommu->ecap, ret);
> - return -1;
> + goto error;
>   }
>   }
>  
> @@ -625,6 +640,11 @@ error:
>   /*
>* handle error condition gracefully here!
>*/
> +
> + if (x2apic_present)
> + WARN(1, KERN_WARNING
> + "Failed to e

Re: [PATCH] intel_irq_remapping: Clean up x2apic optout security warning mess

2013-02-04 Thread Alex Williamson

On Mon, 2013-02-04 at 11:19 -0800, Andy Lutomirski wrote:
> On Mon, Feb 4, 2013 at 11:04 AM, Alex Williamson
>  wrote:
> > On Fri, 2013-02-01 at 14:57 -0800, Andy Lutomirski wrote:
> >> Current kernels print this on my Dell server:
> >>
> >>[ cut here ]
> >>WARNING: at drivers/iommu/intel_irq_remapping.c:542
> >>intel_enable_irq_remapping+0x7b/0x27e()
> >>Hardware name: PowerEdge R620
> >>Your BIOS is broken and requested that x2apic be disabled
> >>This will leave your machine vulnerable to irq-injection attacks
> >>Use 'intremap=no_x2apic_optout' to override BIOS request
> >>[...]
> >>Enabled IRQ remapping in xapic mode
> >>x2apic not enabled, IRQ remapping is in xapic mode
> >>
> >> This is inconsistent with itself -- interrupt remapping is *on*.
> >>
> >> Fix the mess by making the warnings say what they mean and by making
> >> sure that compatibility format interrupts (the dangerous ones) are
> >> disabled if x2apic is present regardless of BIOS settings.
> >>
> >> With this patch applied, the output is:
> >>
> >>   Your BIOS is broken and requested that x2apic be disabled.
> >>   This will slightly decrease performance.
> >>   Use 'intremap=no_x2apic_optout' to override BIOS request.
> >>   Enabled IRQ remapping in xapic mode
> >>   x2apic not enabled, IRQ remapping is in xapic mode
> >>
> >> This should make us as or more secure than we are now and replace
> >> a rather scary warning with a much less scary warning on silly
> >> but functional systems.
> >>
> >> Signed-off-by: Andy Lutomirski 
> >> ---
> >>  drivers/iommu/intel_irq_remapping.c | 36 
> >> 
> >>  1 file changed, 28 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> >> index af8904d..eca8832 100644
> >> --- a/drivers/iommu/intel_irq_remapping.c
> >> +++ b/drivers/iommu/intel_irq_remapping.c
> >> @@ -425,11 +425,22 @@ static void iommu_set_irq_remapping(struct intel_iommu *iommu, int mode)
> >>
> >>   /* Enable interrupt-remapping */
> >>   iommu->gcmd |= DMA_GCMD_IRE;
> >> + iommu->gcmd &= ~DMA_GCMD_CFI;  /* Block compatibility-format MSIs */
> >>   writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
> >>
> >>   IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
> >> readl, (sts & DMA_GSTS_IRES), sts);
> >>
> >> + /*
> >> +  * With CFI clear in the Global Command register, we should be
> >> +  * protected from dangerous (i.e. compatibility) interrupts
> >> +  * regardless of x2apic status.  Check just to be sure.
> >> +  */
> >> + if (sts & DMA_GSTS_CFIS)
> >> + WARN(1, KERN_WARNING
> >> + "Compatibility-format IRQs enabled despite intr remapping;\n"
> >> + "you are vulnerable to IRQ injection.\n");
> >> +
> >>   raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> >>  }
> >>
> >> @@ -526,20 +537,24 @@ static int __init intel_irq_remapping_supported(void)
> >>  static int __init intel_enable_irq_remapping(void)
> >>  {
> >>   struct dmar_drhd_unit *drhd;
> >> + bool x2apic_present;
> >>   int setup = 0;
> >>   int eim = 0;
> >>
> >> + x2apic_present = x2apic_supported();
> >> +
> >>   if (parse_ioapics_under_ir() != 1) {
> >>   printk(KERN_INFO "Not enable interrupt remapping\n");
> >> - return -1;
> >> + goto error;
> >>   }
> >>
> >> - if (x2apic_supported()) {
> >> + if (x2apic_present) {
> >>   eim = !dmar_x2apic_optout();
> >> - WARN(!eim, KERN_WARNING
> >> -"Your BIOS is broken and requested that x2apic be disabled\n"
> >> -"This will leave your machine vulnerable to irq-injection attacks\n"
> >> -"Use 'intremap=no_x2apic_optout' to override BIOS request\n");
> >> + if (!

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote:
> 
> > -Original Message-
> > From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
> > ow...@vger.kernel.org] On Behalf Of Gleb Natapov
> > Sent: Tuesday, February 05, 2013 3:37 AM
> > To: Pandarathil, Vijaymohan R
> > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
> > PCI devices
> > 
> > On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > Sent: Tuesday, February 05, 2013 1:21 AM
> > > > To: Pandarathil, Vijaymohan R
> > > > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > > > k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org
> > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
> > VFIO-
> > > > PCI devices
> > > >
> > > > On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
> > wrote:
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > > > Sent: Tuesday, February 05, 2013 12:05 AM
> > > > > > To: Blue Swirl
> > > > > > Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
> > Ortiz,
> > > > Lance
> > > > > > E; k...@vger.kernel.org; qemu-de...@nongnu.org; linux-
> > > > p...@vger.kernel.org;
> > > > > > linux-kernel@vger.kernel.org
> > > > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER
> > for
> > > > VFIO-
> > > > > > PCI devices
> > > > > >
> > > > > > On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
> > > > > > > On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
> > > > > > >  wrote:
> > > > > > > > - Create eventfd per vfio device assigned to a guest
> > and
> > > > > > register an
> > > > > > > >   event handler
> > > > > > > >
> > > > > > > > - This fd is passed to the vfio_pci driver through the
> > > > SET_IRQ
> > > > > > ioctl
> > > > > > > >
> > > > > > > > - When the device encounters an error, the eventfd is
> > > > signalled
> > > > > > > >   and the qemu eventfd handler gets invoked.
> > > > > > > >
> > > > > > > > - In the handler decide what action to take. Current
> > action
> > > > > > taken
> > > > > > > >   is to terminate the guest.
> > > > > > >
> > > > > > > Usually this is not OK, but I guess this is not guest
> > triggerable.
> > > > > > >
> > > > > > Still not OK. Why not stop a guest with appropriate stop reason?
> > > > >
> > > > > The thinking was that since this is a hardware error, we would
> > > > > want to stop the guest at the earliest. The hw_error() routine
> > > > > which aborts the qemu process was suggested by Alex and that
> > > > > seemed appropriate. Earlier I was using
> > > > > qemu_system_shutdown_request().  Any suggestions ?
> > > > >
> > > > I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
> > > > involves sending IPIs to other cpus running guest's vcpus. Both exit()
> > > > and vm_stop() will do it, but former is implicitly in the kernel and
> > > > later is explicitly in QEMU.
> > >
> > > I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while
> > > testing, guest ended up in a hang rather than exiting. There seems to
> > > be some cleanup work which is being done as part of vm_stop. In our
> > > case, we wanted the guest to exit immediately. So use of hw_error()
> > > seemed appropriate.
> > >
> > What makes you think it hang? It stopped, precisely what it should do if
> > you call vm_stop(). Now it is possible for vm user to investigate what
> > happened and even salvage some data from guest memory.
> 
> That was ignorance on my part on the expected behavior of vm_stop().
> So what you are suggesting is to stop the guest displaying an appropriate
> error/next-steps message and have the users do any
> data-collection/investigation and then manually kill the guest, if they
> so desire. Right?
> 
> Sounds reasonable. As long as the guest is not touching the device, it
> should be okay.
> Alex, any comments?

What's the libvirt behavior when a guest goes to vm_stop?  My only
concern would be whether the user is going to be confused by a state
where the vm is still up, but not running.  I imagine they'll have to
manually stop it and restart it to continue.  Thanks,

Alex


Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 15:42 +0200, Gleb Natapov wrote:
> On Tue, Feb 05, 2013 at 06:37:35AM -0700, Alex Williamson wrote:
> > On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote:
> > > 
> > > > -Original Message-
> > > > From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
> > > > ow...@vger.kernel.org] On Behalf Of Gleb Natapov
> > > > Sent: Tuesday, February 05, 2013 3:37 AM
> > > > To: Pandarathil, Vijaymohan R
> > > > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > > > k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org
> > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for 
> > > > VFIO-
> > > > PCI devices
> > > > 
> > > > On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R 
> > > > wrote:
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > > > Sent: Tuesday, February 05, 2013 1:21 AM
> > > > > > To: Pandarathil, Vijaymohan R
> > > > > > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > > > > > k...@vger.kernel.org; qemu-de...@nongnu.org; 
> > > > > > linux-...@vger.kernel.org;
> > > > > > linux-kernel@vger.kernel.org
> > > > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER 
> > > > > > for
> > > > VFIO-
> > > > > > PCI devices
> > > > > >
> > > > > > On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
> > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -Original Message-
> > > > > > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > > > > > Sent: Tuesday, February 05, 2013 12:05 AM
> > > > > > > > To: Blue Swirl
> > > > > > > > Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
> > > > Ortiz,
> > > > > > Lance
> > > > > > > > E; k...@vger.kernel.org; qemu-de...@nongnu.org; linux-
> > > > > > p...@vger.kernel.org;
> > > > > > > > linux-kernel@vger.kernel.org
> > > > > > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support 
> > > > > > > > AER
> > > > for
> > > > > > VFIO-
> > > > > > > > PCI devices
> > > > > > > >
> > > > > > > > On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
> > > > > > > > > On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
> > > > > > > > >  wrote:
> > > > > > > > > > > - Create eventfd per vfio device assigned to a guest and
> > > > > > > > > > >   register an event handler
> > > > > > > > > > >
> > > > > > > > > > > - This fd is passed to the vfio_pci driver through the
> > > > > > > > > > >   SET_IRQ ioctl
> > > > > > > > > > >
> > > > > > > > > > > - When the device encounters an error, the eventfd is
> > > > > > > > > > >   signalled and the qemu eventfd handler gets invoked.
> > > > > > > > > > >
> > > > > > > > > > > - In the handler decide what action to take. Current action
> > > > > > > > > > >   taken is to terminate the guest.
> > > > > > > > > >
> > > > > > > > > > Usually this is not OK, but I guess this is not guest
> > > > > > > > > > triggerable.
> > > > > > > > >
> > > > > > > > Still not OK. Why not stop a guest with appropriate stop reason?
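The eventfd flow described in the quoted patch summary can be sketched in userspace as follows. This is a minimal illustration only: the helper names are hypothetical, the step that hands the fd to the vfio_pci driver through the SET_IRQ ioctl is left out, and the "signal" function merely stands in for the kernel side reporting a device error.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create the per-device error notification fd. */
int err_notifier_create(void)
{
	/* Non-blocking, so a poll-driven handler never stalls in read(). */
	return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Stand-in for the kernel side signalling an error on the device. */
int err_notifier_signal(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Handler side: returns how many times the fd was signalled since the
 * last drain, or 0 if nothing is pending. This is where the policy
 * decision (terminate the guest, stop it with a reason, ...) would go.
 */
uint64_t err_notifier_drain(int fd)
{
	uint64_t count = 0;

	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

Because an eventfd accumulates a counter, several errors raised before the handler runs are seen as one wakeup with a count greater than one.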
> > > > > > >
> > > > > > > The thinking was that since this is a hardware error, we would 
> > >

Re: DMAR faults from unrelated device when vfio is used

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 14:31 +0100, David Gstir wrote:
> On Monday, 04.02.2013, at 08:49 -0700, Alex Williamson wrote:
> 
> > Can you clarify what you mean by assign?  Are you actually assigning the
> > root ports to the qemu guest (1c.0 & 1c.6)?  vfio will require they be
> > owned by vfio-pci to make use of 3:00.0, but assigning them to the guest
> > is not recommended.  Can you provide your qemu command line?  
> 
> I did hand all of them to the guest OS. Removing 1c.0 & 1c.6 from the qemu 
> command line seems to have done the trick. Thanks!

Great, though I'm still not sure how we were generating those DMAR
faults.

> Here's my working qemu command line:
> qemu-kvm -no-reboot -enable-kvm -cpu host -smp 4 -m 6G \
>   -drive 
> file=/home/test/qemu/images/win7_base_updated.qcow2,if=virtio,cache=none,media=disk,format=qcow2,index=0
>  \
>   -full-screen -no-quit -no-frame -display sdl -vnc :1 -k de -usbdevice 
> tablet \
>   -vga std -global VGA.vgamem_mb=256 \
>   -netdev tap,id=guest0,ifname=tap0,script=no,downscript=no \
>   -net nic,netdev=guest0,model=virtio,macaddr=00:16:35:BE:EF:12  \
>   -rtc base=localtime \
>   -device vfio-pci,host=00:1b.0,id=audio \
>   -device vfio-pci,host=00:1a.0,id=ehci1 \
>   -device vfio-pci,host=00:1d.0,id=ehci2 \
>   -device vfio-pci,host=03:00.0,id=xhci1 \
>   -monitor tcp::,server,nowait
> 
> 
> > We need
> > to re-visit how to handle pcieport devices with vfio-pci, perhaps
> > white-listing it as a vfio "compatible" driver, but this still should
> > not interfere with devices external to the group.
> > 
> > The DMAR fault address looks pretty bogus unless you happen to have
> > 100GB+ of ram in the system.
> 
> Nope, definitely not. :)
> 
> > vfio makes use of the IOMMU API for programming DMA translations, so any
> > reserved fields would have to be programmed by intel-iommu itself.  We
> > could of course be passing some kind of bogus data that intel-iommu
> > isn't catching.  If you're assigning the root ports to the guest, I'd
> > start with that, don't do it.  Attach them to vfio, but don't give them
> > to the guest.  Maybe that'll give us a hint.  I also notice that your
> > USB 3 controller is dead:
> > 
> > 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
> > (rev ff) (prog-if ff)
> > !!! Unknown header type 7f
> > 
> > We only see unknown header type 7f when the read from the device returns
> > -1.  This might have something to do with the root port above it (1c.6)
> > being in state D3.  Windows likes to put unused devices in D3, which
> > leads me to suspect you are giving it to the guest.  
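A small sketch of why an unresponsive function shows up as "rev ff", "prog-if ff" and "header type 7f": a config read that returns -1 yields 0xff in every byte, and the header layout is the low 7 bits of the byte at offset 0x0e. The offsets are the standard PCI config-space ones; the function names are made up for this illustration.

```c
#include <stdint.h>

#define PCI_REVISION_ID  0x08  /* standard config-space offsets */
#define PCI_CLASS_PROG   0x09
#define PCI_HEADER_TYPE  0x0e

/* Low 7 bits select the header layout; bit 7 is the multi-function flag. */
uint8_t pci_header_layout(const uint8_t *cfg)
{
	return cfg[PCI_HEADER_TYPE] & 0x7f;
}

/* Heuristic: vendor ID 0xffff plus layout 0x7f means reads returned -1. */
int pci_config_looks_dead(const uint8_t *cfg)
{
	return cfg[0] == 0xff && cfg[1] == 0xff &&
	       pci_header_layout(cfg) == 0x7f;
}
```

Feeding it a buffer of all 0xff bytes gives layout 0x7f, which matches the lspci output quoted above; a live device reports layout 0 (endpoint) or 1 (bridge).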
> 
> The error no longer occurs. lspci now shows this:
> 
> -- snip --
> 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
> (rev 04) (prog-if 30 [XHCI])
>   Subsystem: Intel Corporation Device 2008
>   Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B- DisINTx+
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-  SERR-Interrupt: pin A routed to IRQ 18
>   Region 0: Memory at fe50 (64-bit, non-prefetchable) [disabled] 
> [size=8K]
>   Capabilities: [50] Power Management version 3
>   Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA 
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>   Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>   Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
>   Address:   Data: 
>   Capabilities: [90] MSI-X: Enable- Count=8 Masked-
>   Vector table: BAR=0 offset=1000
>   PBA: BAR=0 offset=1080
>   Capabilities: [a0] Express (v2) Endpoint, MSI 00
>   DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
> unlimited, L1 unlimited
>   ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>   DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
> Unsupported-
>   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
>   MaxPayload 128 bytes, MaxReadReq 128 bytes
>   DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ 
> TransPend-
>   LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 
> <4us, L1 unlimited
>   ClockPM+ Surprise- LLActRep- BwNot-
>   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>   ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>   LnkSta: Speed 2.5GT/s, Width x1,

Re: DMAR faults from unrelated device when vfio is used

2013-02-05 Thread Alex Williamson

On Tue, 2013-02-05 at 08:37 -0700, Alex Williamson wrote:
> On Tue, 2013-02-05 at 14:31 +0100, David Gstir wrote:
> > On Monday, 04.02.2013, at 08:49 -0700, Alex Williamson wrote:
> > 
> > > Can you clarify what you mean by assign?  Are you actually assigning the
> > > root ports to the qemu guest (1c.0 & 1c.6)?  vfio will require they be
> > > owned by vfio-pci to make use of 3:00.0, but assigning them to the guest
> > > is not recommended.  Can you provide your qemu command line?  
> > 
> > I did hand all of them to the guest OS. Removing 1c.0 & 1c.6 from the qemu 
> > command line seems to have done the trick. Thanks!
> 
> Great, though I'm still not sure how we were generating those DMAR
> faults.
> 
> > Here's my working qemu command line:
> > qemu-kvm -no-reboot -enable-kvm -cpu host -smp 4 -m 6G \
> >   -drive 
> > file=/home/test/qemu/images/win7_base_updated.qcow2,if=virtio,cache=none,media=disk,format=qcow2,index=0
> >  \
> >   -full-screen -no-quit -no-frame -display sdl -vnc :1 -k de -usbdevice 
> > tablet \
> >   -vga std -global VGA.vgamem_mb=256 \
> >   -netdev tap,id=guest0,ifname=tap0,script=no,downscript=no \
> >   -net nic,netdev=guest0,model=virtio,macaddr=00:16:35:BE:EF:12  \
> >   -rtc base=localtime \
> >   -device vfio-pci,host=00:1b.0,id=audio \
> >   -device vfio-pci,host=00:1a.0,id=ehci1 \
> >   -device vfio-pci,host=00:1d.0,id=ehci2 \
> >   -device vfio-pci,host=03:00.0,id=xhci1 \
> >   -monitor tcp::,server,nowait
> > 
> > 
> > > We need
> > > to re-visit how to handle pcieport devices with vfio-pci, perhaps
> > > white-listing it as a vfio "compatible" driver, but this still should
> > > not interfere with devices external to the group.
> > > 
> > > The DMAR fault address looks pretty bogus unless you happen to have
> > > 100GB+ of ram in the system.
> > 
> > Nope, definitely not. :)
> > 
> > > vfio makes use of the IOMMU API for programming DMA translations, so any
> > > reserved fields would have to be programmed by intel-iommu itself.  We
> > > could of course be passing some kind of bogus data that intel-iommu
> > > isn't catching.  If you're assigning the root ports to the guest, I'd
> > > start with that, don't do it.  Attach them to vfio, but don't give them
> > > to the guest.  Maybe that'll give us a hint.  I also notice that your
> > > USB 3 controller is dead:
> > > 
> > > 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
> > > (rev ff) (prog-if ff)
> > >   !!! Unknown header type 7f
> > > 
> > > We only see unknown header type 7f when the read from the device returns
> > > -1.  This might have something to do with the root port above it (1c.6)
> > > being in state D3.  Windows likes to put unused devices in D3, which
> > > leads me to suspect you are giving it to the guest.  
> > 
> > The error no longer occurs. lspci now shows this:
> > 
> > -- snip --
> > 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
> > (rev 04) (prog-if 30 [XHCI])
> > Subsystem: Intel Corporation Device 2008
> > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
> > Stepping- SERR+ FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-  > SERR-  > Interrupt: pin A routed to IRQ 18
> > Region 0: Memory at fe50 (64-bit, non-prefetchable) [disabled] 
> > [size=8K]
> > Capabilities: [50] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA 
> > PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
> > Address:   Data: 
> > Capabilities: [90] MSI-X: Enable- Count=8 Masked-
> > Vector table: BAR=0 offset=1000
> > PBA: BAR=0 offset=1080
> > Capabilities: [a0] Express (v2) Endpoint, MSI 00
> > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
> > unlimited, L1 unlimited
> > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
> > Unsupported-
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> > MaxPayload 128 bytes, MaxReadReq 128 bytes
> >

Re: DMAR faults from unrelated device when vfio is used

2013-02-06 Thread Alex Williamson

On Wed, 2013-02-06 at 19:09 +0100, Richard Weinberger wrote:
> Hi,
> 
> On Tue, 05 Feb 2013 13:36:53 -0700,
> Alex Williamson wrote:
> > > Ugh, the infamous and useless error 10.  It could be anything.
> > > I've got a system with onboard usb3, let me see what windows does
> > > with it here first.  Thanks,
> > 
> > Well, I've got an Etron USB3 HBA and (un)fortunately it works just
> > fine with a Win7 guest.  There's really nothing special about USB
> > controllers from a PCI device assignment perspective.  Have you tried
> > the latest upstream qemu bits?  Thanks,
> 
> USB3 also does not work within a Linux guest.
> xhci in debug mode gives a bit more info.

Does the card work with pci-assign or are both broken?
> 
> [1.157888] xhci_hcd :00:07.0: xHCI Host Controller
> [1.157899] xhci_hcd :00:07.0: new USB bus registered, assigned bus 
> number 4
> [1.157948] xhci_hcd :00:07.0: // Halt the HC
> [1.157957] xhci_hcd :00:07.0: Resetting HCD
> [1.157962] xhci_hcd :00:07.0: // Reset the HC
> [1.158111] usb 3-1: new full-speed USB device number 2 using uhci_hcd
> [1.158125] xhci_hcd :00:07.0: Wait for controller to be ready for 
> doorbell rings
> [1.158130] xhci_hcd :00:07.0: Reset complete
> [1.158133] xhci_hcd :00:07.0: Enabling 64-bit DMA addresses.
> [1.158135] xhci_hcd :00:07.0: Calling HCD init
> [1.158136] xhci_hcd :00:07.0: xhci_init
> [1.158137] xhci_hcd :00:07.0: xHCI doesn't need link TRB QUIRK
> [1.158640] xhci_hcd :00:07.0: Finished xhci_init
> [1.158642] xhci_hcd :00:07.0: Called HCD init
> [1.158698] xhci_hcd :00:07.0: irq 11, io mem 0xfebf4000
> [1.158699] xhci_hcd :00:07.0: xhci_run
> [1.159578] xhci_hcd :00:07.0: irq 40 for MSI/MSI-X
> [1.159697] xhci_hcd :00:07.0: irq 41 for MSI/MSI-X
> [1.159720] xhci_hcd :00:07.0: irq 42 for MSI/MSI-X
> [1.159736] xhci_hcd :00:07.0: irq 43 for MSI/MSI-X
> [1.159752] xhci_hcd :00:07.0: irq 44 for MSI/MSI-X
> [1.179682] xhci_hcd :00:07.0: Setting event ring polling timer
> [1.179686] xhci_hcd :00:07.0: Command ring memory map follows:
> [1.179693] xhci_hcd :00:07.0: ERST memory map follows:
> [1.179695] xhci_hcd :00:07.0: Event ring:
> [1.179702] xhci_hcd :00:07.0: ERST deq = 64'h36820400
> [1.179703] xhci_hcd :00:07.0: // Set the interrupt modulation register
> [1.179710] xhci_hcd :00:07.0: // Enable interrupts, cmd = 0x4.
> [1.179715] xhci_hcd :00:07.0: // Enabling event ring interrupter 
> c9e68620 by writing 0x2 to irq_pending
> [1.179737] xhci_hcd :00:07.0: Finished xhci_run for USB2 roothub
> [1.179752] usb usb4: New USB device found, idVendor=1d6b, idProduct=0002
> [1.179753] usb usb4: New USB device strings: Mfr=3, Product=2, 
> SerialNumber=1
> [1.179755] usb usb4: Product: xHCI Host Controller
> [1.179756] usb usb4: Manufacturer: Linux 3.8.0-rc6-2.10-desktop xhci_hcd
> [1.179757] usb usb4: SerialNumber: :00:07.0
> [1.179967] xHCI xhci_add_endpoint called for root hub
> [1.179971] xHCI xhci_check_bandwidth called for root hub
> [1.180081] hub 4-0:1.0: USB hub found
> [1.180094] hub 4-0:1.0: 2 ports detected
> [1.180200] xhci_hcd :00:07.0: xHCI Host Controller
> [1.180206] xhci_hcd :00:07.0: new USB bus registered, assigned bus 
> number 5
> [1.180214] xhci_hcd :00:07.0: Enabling 64-bit DMA addresses.
> [1.180219] xhci_hcd :00:07.0: // Turn on HC, cmd = 0x5.
> [1.245201] xhci_hcd :00:07.0: Host took too long to start, waited 
> 16000 microseconds.
> 
> This one looks interesting.

Yep, the register never got to the state it was looking for.
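For context, the kind of wait loop that produces a "took too long" message can be sketched like this. It is a simplified model, not the xhci driver's code: the register accessor is a hypothetical callback so the loop can run without hardware, and the poll budget stands in for the microsecond timeout.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical accessor; a driver would read an MMIO register here. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/* Poll until (reg & mask) == done, the device vanishes, or we give up. */
int poll_register(reg_read_fn read_reg, void *ctx,
		  uint32_t mask, uint32_t done, int max_polls)
{
	while (max_polls-- > 0) {
		uint32_t val = read_reg(ctx);

		if (val == UINT32_MAX)	/* all ones: device not responding */
			return -ENODEV;
		if ((val & mask) == done)
			return 0;
		/* a real driver would delay briefly between polls here */
	}
	return -ETIMEDOUT;
}

/* Simulated status register: stays "halted" for *ctx more polls. */
#define STS_HALTED 0x1u

uint32_t fake_status(void *ctx)
{
	int *polls_left = ctx;

	if (*polls_left > 0) {
		(*polls_left)--;
		return STS_HALTED;
	}
	return 0;
}
```

With a 16-poll budget, a register that clears after 3 polls returns 0, while one that stays set for 100 polls returns -ETIMEDOUT, which is the situation the log above appears to show.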

> [1.245414] xhci_hcd :00:07.0: // Halt the HC
> [1.245424] xhci_hcd :00:07.0: startup error -19
> [1.245551] xhci_hcd :00:07.0: USB bus 5 deregistered
> [1.245556] xhci_hcd :00:07.0: remove, state 1
> [1.245560] usb usb4: USB disconnect, device number 1
> [1.245608] xHCI xhci_drop_endpoint called for root hub
> [1.245609] xHCI xhci_check_bandwidth called for root hub
> [1.245684] xhci_hcd :00:07.0: // Halt the HC
> [1.245695] xhci_hcd :00:07.0: // Reset the HC
> [1.245741] xhci_hcd :00:07.0: Wait for controller to be ready for 
> doorbell rings
> [1.256413] xhci_hcd :00:07.0: // Disabling event ring interrupts
> [1.256427] xhci_hcd :00:07.0: cleaning up memory
> [1.256440] xhci_hcd :00:07.0: xhci_stop completed - status = 1
> [1.256446] xhci_hcd :00:07.0: USB bus 4 deregistered
>

Re: DMAR faults from unrelated device when vfio is used

2013-02-06 Thread Alex Williamson

On Wed, 2013-02-06 at 21:25 +0100, Richard Weinberger wrote:
> Hi,
> 
> On Wed, 06 Feb 2013 11:47:20 -0700,
> Alex Williamson wrote:
> > Does the card work with pci-assign or are both broken?
> 
> It works with pci-assign. :-\

When you tested this, did you detach the group from vfio or use it as
is?  In your previous message I see this:

03:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host 
Controller [1033:0194] (rev ff)

/sys/kernel/iommu_groups/7/devices:
total 0
lrwxrwxrwx 1 root root 0 Feb  4 10:29 :00:1c.0 -> 
../../../../devices/pci:00/:00:1c.0
lrwxrwxrwx 1 root root 0 Feb  4 10:29 :00:1c.6 -> 
../../../../devices/pci:00/:00:1c.6
lrwxrwxrwx 1 root root 0 Feb  4 10:29 :03:00.0 -> 
../../../../devices/pci:00/:00:1c.6/:03:00.0

This seemed like a good card to have in my test cache, so I went and got
one and it works fine for me... but I've been playing with pcieport
because I don't think we're handling them correctly in vfio.

Can you provide lspci -vvv -s 1c.6 while the guest is running?  I'm
going to bet that

Control: I/O+ Mem+ BusMaster+

is not set, which it would have been if pci-assign was tested without
the group bound to vfio.  I think the solution is going to be something
around white-listing pcieport, which you can easily test with a kernel
patch like this:

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 12c264d..48a97fb 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -442,7 +442,7 @@ static struct vfio_device *vfio_group_get_device(struct vfio
  * a device.  It's not always practical to leave a device within a group
  * driverless as it could get re-bound to something unsafe.
  */
-static const char * const vfio_driver_whitelist[] = { "pci-stub" };
+static const char * const vfio_driver_whitelist[] = { "pci-stub", "pcieport" };

 static bool vfio_whitelisted_driver(struct device_driver *drv)
 {

Then you won't need to bind 1c.0 or 1c.6 to vfio-pci and hopefully
things will work.  The other problem you might hit is that the pciehp
service driver may also be bound to these slots and somehow deletes the
pci device and re-adds it when a device reset happens.  This causes all
sorts of badness.  The solution here is to unbind the child device from
pciehp, ie:

echo :00:1c.0:pcie04 | sudo \
tee /sys/bus/pci_express/drivers/pciehp/unbind
echo :00:1c.6:pcie04 | sudo \
tee /sys/bus/pci_express/drivers/pciehp/unbind

Hopefully combined that will make things work, please let me know.
Another option is to move the device to a slot where it isn't grouped
with the root port above it, assuming it's a plugin card.  Also if we
could determine that these root ports support PCI ACS but just don't
report it, we could change the grouping and avoid root ports grouped
with devices.
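To make the effect of the whitelist change concrete, here is a userspace sketch of the name-matching idea. It is only a model under stated assumptions: the viability rule (every device in the group is driverless, bound to vfio-pci, or bound to a whitelisted driver) is my paraphrase, and group_viable() and the string-based interface are hypothetical, not the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Whitelist as it would look with the test patch applied. */
static const char * const driver_whitelist[] = { "pci-stub", "pcieport" };

bool whitelisted_driver(const char *drv_name)
{
	size_t i;

	for (i = 0; i < sizeof(driver_whitelist) / sizeof(driver_whitelist[0]); i++)
		if (strcmp(drv_name, driver_whitelist[i]) == 0)
			return true;
	return false;
}

/*
 * Hypothetical group check: bound[i] is the name of the driver bound to
 * device i, or NULL if the device has no driver.
 */
bool group_viable(const char * const *bound, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (bound[i] == NULL)
			continue;	/* driverless is acceptable */
		if (strcmp(bound[i], "vfio-pci") == 0)
			continue;	/* owned by vfio */
		if (!whitelisted_driver(bound[i]))
			return false;
	}
	return true;
}
```

Under this model, group 7 above with 1c.0 and 1c.6 left on pcieport and 03:00.0 on vfio-pci is viable, whereas any device bound to an ordinary host driver makes the group unusable.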

I'm still trying to formulate how to fix this long term, whether we
should whitelist pcieport and require userspace to do this kind of setup
(need a hotplug stub driver?) or if vfio-pci needs to gain some basic
pcieport functionality that can enable the device and bind service
drivers we want (aer) and avoid ones we don't (pciehp).  Thanks,

Alex


Re: DMAR faults from unrelated device when vfio is used

2013-02-07 Thread Alex Williamson

On Thu, 2013-02-07 at 23:23 +0100, Richard Weinberger wrote:
> Hi,
> 
> On Wed, 06 Feb 2013 15:45:37 -0700,
> Alex Williamson wrote:
> 
> > On Wed, 2013-02-06 at 21:25 +0100, Richard Weinberger wrote:
> > > Hi,
> > > 
> > > On Wed, 06 Feb 2013 11:47:20 -0700,
> > > Alex Williamson wrote:
> > > > Does the card work with pci-assign or are both broken?
> > > 
> > > It works with pci-assign. :-\
> > 
> > When you tested this, did you detach the group from vfio or use it as
> > is?  In your previous message I see this:
> 
> I've detached it.
> 
> > 03:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host
> > Controller [1033:0194] (rev ff)
> > 
> > /sys/kernel/iommu_groups/7/devices:
> > total 0
> > lrwxrwxrwx 1 root root 0 Feb  4 10:29 :00:1c.0
> > -> ../../../../devices/pci:00/:00:1c.0 lrwxrwxrwx 1 root root
> > 0 Feb  4 10:29 :00:1c.6
> > -> ../../../../devices/pci:00/:00:1c.6 lrwxrwxrwx 1 root root
> > 0 Feb  4 10:29 :03:00.0
> > -> ../../../../devices/pci:00/:00:1c.6/:03:00.0
> > 
> > This seemed like a good card to have in my test cache, so I went and
> > got one and it works fine for me... but I've been playing with
> > pcieport because I don't think we're handling them correctly in vfio.
> > 
> > Can you provide lspci -vvv -s 1c.6 while the guest is running?  I'm
> > going to bet that
> > 
> > Control: I/O+ Mem+ BusMaster+
> 
> Do you want "lspci -vvv -s 1c.6" after attaching 1c.6 to vfio and not
> using pci-assign?

I was looking for the output while 1c.6 is attached to vfio with the
guest running, after xhci has failed to attach to the device.  It's not
really necessary though; I'm pretty sure of the result given that it
works when the root port is left alone.
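For reference, the three flags in that lspci "Control:" line are the low bits of the PCI command register. The sketch below decodes them; the bit values are the standard ones, while bridge_forwards() is a simplified assumption of mine about why a root port with memory decode and bus mastering off would leave the device behind it unreachable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PCI_COMMAND_IO      0x1  /* respond to I/O space accesses */
#define PCI_COMMAND_MEMORY  0x2  /* respond to memory space accesses */
#define PCI_COMMAND_MASTER  0x4  /* allow the function to master the bus */

/* Format the first three flags the way lspci prints them. */
int format_control(uint16_t cmd, char *buf, size_t len)
{
	return snprintf(buf, len, "Control: I/O%c Mem%c BusMaster%c",
			(cmd & PCI_COMMAND_IO) ? '+' : '-',
			(cmd & PCI_COMMAND_MEMORY) ? '+' : '-',
			(cmd & PCI_COMMAND_MASTER) ? '+' : '-');
}

/*
 * Simplified assumption: a bridge needs memory decode to pass MMIO
 * downstream and bus mastering to pass DMA and MSI writes upstream.
 */
bool bridge_forwards(uint16_t cmd)
{
	return (cmd & PCI_COMMAND_MEMORY) && (cmd & PCI_COMMAND_MASTER);
}
```

A command value of 0x7 prints "Control: I/O+ Mem+ BusMaster+"; a value of 0 prints all three flags with "-", the state I expected to find on 1c.6.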


> > is not set, which it would have been if pci-assign was tested without
> > the group bound to vfio.  I think the solution is going to be
> > something around white-listing pcieport, which you can easily test
> > with a kernel patch like this:
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 12c264d..48a97fb 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -442,7 +442,7 @@ static struct vfio_device
> > *vfio_group_get_device(struct vfio
> >   * a device.  It's not always practical to leave a device within a
> > group
> >   * driverless as it could get re-bound to something unsafe.
> >   */
> > -static const char * const vfio_driver_whitelist[] = { "pci-stub" };
> > +static const char * const vfio_driver_whitelist[] = { "pci-stub",
> > "pcieport" }; 
> >  static bool vfio_whitelisted_driver(struct device_driver *drv)
> >  {
> 
> If I whitelist pcieport USB3 works within the guests. :-)
> Binding 1c.0 and 1c.6 is no longer needed.
> Next week I'll run some more tests with USB3 devices.

Great!  Thanks for the test.  I assume you didn't need to do anything
with unbinding pciehp?

Alex


Re: [PATCH 1/2] vfio powerpc: enabled on powernv platform

2013-02-11 Thread Alex Williamson

On Mon, 2013-02-11 at 22:54 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> The iommu_put_tce_user_mode() does only a single page mapping
> as an API for adding many mappings at once is going to be
> added later.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables. As the h_put_tce hypercall is received by the host
> kernel and processed by QEMU (which involves calling
> the host kernel again), performance is not the best:
> circa 220MB/s on a 10Gb ethernet network.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 

Yay, it's not dead! ;)

I'd love some kind of changelog to know what to look for in here,
especially given 2mo since the last version.

> ---
>  arch/powerpc/include/asm/iommu.h|   15 ++
>  arch/powerpc/kernel/iommu.c |  343 
> +++
>  arch/powerpc/platforms/powernv/pci-ioda.c   |1 +
>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |5 +-
>  arch/powerpc/platforms/powernv/pci.c|3 +
>  drivers/iommu/Kconfig   |8 +
>  6 files changed, 374 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index cbfe678..900294b 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -98,6 +101,8 @@ extern void iommu_free_table(struct iommu_table *tbl, 
> const char *node_name);
>   */
>  extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>   int nid);
> +extern void iommu_register_group(struct iommu_table * tbl,
> +  int domain_number, unsigned long pe_num);
>  
>  extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
>   struct scatterlist *sglist, int nelems,
> @@ -147,5 +152,15 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +/* The API to support IOMMU operations for VFIO */
> +extern long iommu_clear_tce_user_mode(struct iommu_table *tbl,
> + unsigned long ioba, unsigned long tce_value,
> + unsigned long npages);
> +extern long iommu_put_tce_user_mode(struct iommu_table *tbl,
> + unsigned long ioba, unsigned long tce);
> +
> +extern void iommu_flush_tce(struct iommu_table *tbl);
> +extern long iommu_lock_table(struct iommu_table *tbl, bool lock);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 7c309fe..b4fdabc 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -45,6 +46,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -707,11 +709,39 @@ struct iommu_table *iommu_init_table(struct iommu_table 
> *tbl, int nid)
>   return tbl;
>  }
>  
> +static void group_release(void *iommu_data)
> +{
> + struct iommu_table *tbl = iommu_data;
> + tbl->it_group = NULL;
> +}
> +
> +void iommu_register_group(struct iommu_table * tbl,
> + int domain_number, unsigned long pe_num)
> +{
> + struct iommu_group *grp;
> +
> + grp = iommu_group_alloc();
> + if (IS_ERR(grp)) {
> + pr_info("powerpc iommu api: cannot create new group, err=%ld\n",
> + PTR_ERR(grp));
> + return;
> + }
> + tbl->it_group = grp;
> + iommu_group_set_iommudata(grp, tbl, group_release);
> + iommu_group_set_name(grp, kasprintf(GFP_KERNEL, "domain%d-pe%lx",
> + domain_number, pe_num));
> +}
> +
>  void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>  {
>   unsigned long bitmap_sz;
>   unsigned int order;
>  
> + if (tbl && tbl->it_group) {
> + iommu_group_put(tbl->it_group);
> + BUG_ON(tbl->it_group);
> + }
> +
>   if (!tbl || !tbl->it_map) {
>   printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,

Re: [PATCH 2/2] vfio powerpc: implemented IOMMU driver for VFIO

2013-02-11 Thread Alex Williamson

On Mon, 2013-02-11 at 22:54 +1100, Alexey Kardashevskiy wrote:
> VFIO implements platform independent stuff such as
> a PCI driver, BAR access (via read/write on a file descriptor
> or direct mapping when possible) and IRQ signaling.
> 
> The platform dependent part includes IOMMU initialization
> and handling. This patch implements an IOMMU driver for VFIO
> which does mapping/unmapping pages for the guest IO and
> provides information about DMA window (required by a POWERPC
> guest).
> 
> The counterpart in QEMU is required to support this functionality.

Revision info would be great here too.

> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/Kconfig|6 +
>  drivers/vfio/Makefile   |1 +
>  drivers/vfio/vfio_iommu_spapr_tce.c |  269 
> +++
>  include/uapi/linux/vfio.h   |   31 
>  4 files changed, 307 insertions(+)
>  create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 7cd5dec..b464687 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
>   depends on VFIO
>   default n
>  
> +config VFIO_IOMMU_SPAPR_TCE
> + tristate
> + depends on VFIO && SPAPR_TCE_IOMMU
> + default n
> +
>  menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
>   select VFIO_IOMMU_TYPE1 if X86
> + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
>   help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 2398d4a..72bfabc 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_VFIO) += vfio.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> new file mode 100644
> index 000..9b3fa88
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -0,0 +1,269 @@
> +/*
> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> + *
> + * Copyright (C) 2012 IBM Corp.  All rights reserved.

2013 now

> + * Author: Alexey Kardashevskiy 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_iommu_type1.c:
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + * Author: Alex Williamson 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "a...@ozlabs.ru"
> +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
> +
> +static void tce_iommu_detach_group(void *iommu_data,
> + struct iommu_group *iommu_group);
> +
> +/*
> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> + *
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +/*
> + * The container descriptor supports only a single group per container.
> + * Required by the API as the container is not supplied with the IOMMU group
> + * at the moment of initialization.
> + */
> +struct tce_container {
> + struct mutex lock;
> + struct iommu_table *tbl;
> +};
> +
> +static void *tce_iommu_open(unsigned long arg)
> +{
> + struct tce_container *container;
> +
> + if (arg != VFIO_SPAPR_TCE_IOMMU) {
> + pr_err("tce_vfio: Wrong IOMMU type\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + container = kzalloc(sizeof(*container), GFP_KERNEL);
> + if (!container)
> + return ERR_PTR(-ENOMEM);
> +
> + mutex_init(&container->lock);
> +
> + return container;
> +}
> +
> +static void tce_iommu_release(void *iommu_data)
> +{
> + struct tce_container *container = iommu_data;
> +
> + WARN_ON(container->tbl && !container->tbl->it_group);
> + if (container->tbl && container->tbl->it_group)
> + tce_iommu_detach_group(iommu_data, container->tbl->it_group);
> +
> + mutex_destroy(&container->lock);
> +
> +

Re: [PATCH 1/2] vfio powerpc: enabled on powernv platform

2013-02-11 Thread Alex Williamson

On Tue, 2013-02-12 at 10:19 +1100, Alexey Kardashevskiy wrote:
> On 12/02/13 09:16, Alex Williamson wrote:
> > On Mon, 2013-02-11 at 22:54 +1100, Alexey Kardashevskiy wrote:
> >> @@ -707,11 +709,39 @@ struct iommu_table *iommu_init_table(struct 
> >> iommu_table *tbl, int nid)
> >>return tbl;
> >>   }
> >>
> >> +static void group_release(void *iommu_data)
> >> +{
> >> +  struct iommu_table *tbl = iommu_data;
> >> +  tbl->it_group = NULL;
> >> +}
> >> +
> >> +void iommu_register_group(struct iommu_table * tbl,
> >> +  int domain_number, unsigned long pe_num)
> >> +{
> >> +  struct iommu_group *grp;
> >> +
> >> +  grp = iommu_group_alloc();
> >> +  if (IS_ERR(grp)) {
> >> +  pr_info("powerpc iommu api: cannot create new group, err=%ld\n",
> >> +  PTR_ERR(grp));
> >> +  return;
> >> +  }
> >> +  tbl->it_group = grp;
> >> +  iommu_group_set_iommudata(grp, tbl, group_release);
> >> +  iommu_group_set_name(grp, kasprintf(GFP_KERNEL, "domain%d-pe%lx",
> >> +  domain_number, pe_num));
> >> +}
> >> +
> >>   void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> >>   {
> >>unsigned long bitmap_sz;
> >>unsigned int order;
> >>
> >> +  if (tbl && tbl->it_group) {
> >> +  iommu_group_put(tbl->it_group);
> >> +  BUG_ON(tbl->it_group);
> >> +  }
> >> +
> >>if (!tbl || !tbl->it_map) {
> >>printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
> >>node_name);
> >> @@ -876,4 +906,317 @@ void kvm_iommu_unmap_pages(struct kvm *kvm, struct 
> >> kvm_memory_slot *slot)
> >>   {
> >>   }
> >>
> >> +static enum dma_data_direction tce_direction(unsigned long tce)
> >> +{
> >> +  if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
> >> +  return DMA_BIDIRECTIONAL;
> >> +  else if (tce & TCE_PCI_READ)
> >> +  return DMA_TO_DEVICE;
> >> +  else if (tce & TCE_PCI_WRITE)
> >> +  return DMA_FROM_DEVICE;
> >> +  else
> >> +  return DMA_NONE;
> >> +}
> >> +
> >> +void iommu_flush_tce(struct iommu_table *tbl)
> >> +{
> >> +  /* Flush/invalidate TLB caches if necessary */
> >> +  if (ppc_md.tce_flush)
> >> +  ppc_md.tce_flush(tbl);
> >> +
> >> +  /* Make sure updates are seen by hardware */
> >> +  mb();
> >> +}
> >> +EXPORT_SYMBOL_GPL(iommu_flush_tce);
> >> +
> >> +static long tce_clear_param_check(struct iommu_table *tbl,
> >> +  unsigned long ioba, unsigned long tce_value,
> >> +  unsigned long npages)
> >> +{
> >> +  unsigned long size = npages << IOMMU_PAGE_SHIFT;
> >> +
> >> +  /* ppc_md.tce_free() does not support any value but 0 */
> >> +  if (tce_value)
> >> +  return -EINVAL;
> >> +
> >> +  if (ioba & ~IOMMU_PAGE_MASK)
> >> +  return -EINVAL;
> >> +
> >> +  if ((ioba + size) > ((tbl->it_offset + tbl->it_size)
> >> +  << IOMMU_PAGE_SHIFT))
> >> +  return -EINVAL;
> >> +
> >> +  if (ioba < (tbl->it_offset << IOMMU_PAGE_SHIFT))
> >> +  return -EINVAL;
> >> +
> >> +  return 0;
> >
> > Why do these all return long (vs int)?  Is this a POWER-ism?
> 
> No, not really but yeah, I picked it in powerpc code :) I tried to keep 
> them "long" but I noticed "int" below so what is the rule? Change all to int?

I'd say anything that's returning 0/-errno should probably be an int.
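
A minimal userspace sketch of that convention, reusing the clear-path
check from the quoted patch with an int return.  The page-size macros and
the trimmed-down table struct are stand-ins assumed here for illustration
(4K IOMMU pages); they are not the kernel definitions.

```c
#include <errno.h>

/* Assumed stand-ins for the kernel macros: 4K IOMMU pages. */
#define IOMMU_PAGE_SHIFT	12
#define IOMMU_PAGE_SIZE		(1UL << IOMMU_PAGE_SHIFT)
#define IOMMU_PAGE_MASK		(~(IOMMU_PAGE_SIZE - 1))

/* Hypothetical cut-down table: only the two fields the check reads. */
struct iommu_table_sketch {
	unsigned long it_offset;	/* first page of the DMA window */
	unsigned long it_size;		/* window size, in pages */
};

/* Returns 0 on success or -errno, hence int rather than long. */
static int tce_clear_param_check(const struct iommu_table_sketch *tbl,
				 unsigned long ioba, unsigned long tce_value,
				 unsigned long npages)
{
	unsigned long size = npages << IOMMU_PAGE_SHIFT;

	/* Only a zero TCE value is accepted when clearing. */
	if (tce_value)
		return -EINVAL;

	/* The bus address must be page aligned. */
	if (ioba & ~IOMMU_PAGE_MASK)
		return -EINVAL;

	/* The range must not run past the end of the window... */
	if ((ioba + size) > ((tbl->it_offset + tbl->it_size)
			<< IOMMU_PAGE_SHIFT))
		return -EINVAL;

	/* ...or start before its beginning. */
	if (ioba < (tbl->it_offset << IOMMU_PAGE_SHIFT))
		return -EINVAL;

	return 0;
}
```

Callers that only test for a nonzero result are unaffected by the
narrower type; long stays appropriate for values that are genuinely
long, such as ioctl return codes.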

> >> +}
> >> +
> >> +static long tce_put_param_check(struct iommu_table *tbl,
> >> +  unsigned long ioba, unsigned long tce)
> >> +{
> >> +  if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> >> +  return -EINVAL;
> >> +
> >> +  if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> >> +  return -EINVAL;
> >> +
> >> +  if (ioba & ~IOMMU_PAGE_MASK)
> >> +  return -EINVAL;
> >> +
> >> +  if ((ioba + IOMMU_PAGE_SIZE) > ((tbl->it_offset + tbl->it_siz

Re: [PATCH 2/2] vfio powerpc: implemented IOMMU driver for VFIO

2013-02-11 Thread Alex Williamson

On Tue, 2013-02-12 at 10:45 +1100, Alexey Kardashevskiy wrote:
> On 12/02/13 09:17, Alex Williamson wrote:
> > On Mon, 2013-02-11 at 22:54 +1100, Alexey Kardashevskiy wrote:
> >> VFIO implements platform independent stuff such as
> >> a PCI driver, BAR access (via read/write on a file descriptor
> >> or direct mapping when possible) and IRQ signaling.
> >>
> >> The platform dependent part includes IOMMU initialization
> >> and handling. This patch implements an IOMMU driver for VFIO
> >> which does mapping/unmapping pages for the guest IO and
> >> provides information about DMA window (required by a POWERPC
> >> guest).
> >>
> >> The counterpart in QEMU is required to support this functionality.
> >
> > Revision info would be great here too.
>  >
> >
> >> Cc: David Gibson 
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>   drivers/vfio/Kconfig|6 +
> >>   drivers/vfio/Makefile   |1 +
> >>   drivers/vfio/vfio_iommu_spapr_tce.c |  269 
> >> +++
> >>   include/uapi/linux/vfio.h   |   31 
> >>   4 files changed, 307 insertions(+)
> >>   create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> >>
> >> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> >> index 7cd5dec..b464687 100644
> >> --- a/drivers/vfio/Kconfig
> >> +++ b/drivers/vfio/Kconfig
> >> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
> >>depends on VFIO
> >>default n
> >>
> >> +config VFIO_IOMMU_SPAPR_TCE
> >> +  tristate
> >> +  depends on VFIO && SPAPR_TCE_IOMMU
> >> +  default n
> >> +
> >>   menuconfig VFIO
> >>tristate "VFIO Non-Privileged userspace driver framework"
> >>depends on IOMMU_API
> >>select VFIO_IOMMU_TYPE1 if X86
> >> +  select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
> >>help
> >>  VFIO provides a framework for secure userspace device drivers.
> >>  See Documentation/vfio.txt for more details.
> >> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> >> index 2398d4a..72bfabc 100644
> >> --- a/drivers/vfio/Makefile
> >> +++ b/drivers/vfio/Makefile
> >> @@ -1,3 +1,4 @@
> >>   obj-$(CONFIG_VFIO) += vfio.o
> >>   obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >>   obj-$(CONFIG_VFIO_PCI) += pci/
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> >> b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> new file mode 100644
> >> index 000..9b3fa88
> >> --- /dev/null
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -0,0 +1,269 @@
> >> +/*
> >> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> >> + *
> >> + * Copyright (C) 2012 IBM Corp.  All rights reserved.
> >
> > 2013 now
> >
> >> + * Author: Alexey Kardashevskiy 
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify
> >> + * it under the terms of the GNU General Public License version 2 as
> >> + * published by the Free Software Foundation.
> >> + *
> >> + * Derived from original vfio_iommu_type1.c:
> >> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> >> + * Author: Alex Williamson 
> >> + */
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#define DRIVER_VERSION  "0.1"
> >> +#define DRIVER_AUTHOR   "a...@ozlabs.ru"
> >> +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
> >> +
> >> +static void tce_iommu_detach_group(void *iommu_data,
> >> +  struct iommu_group *iommu_group);
> >> +
> >> +/*
> >> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> >> + *
> >> + * This code handles mapping and unmapping of user data buffers
> >> + * into DMA'ble space using the IOMMU
> >> + */
> >> +
> >> +/*
> >> + * The container descriptor supports only a single group per container.
> >> + * Required by the API as the container is not supplied with the IOMMU 
> >> group
> >> + * at the moment of initialization.
> >>

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-11 Thread Alex Williamson

On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> Having this patch in a tree, adding new nodes in sysfs
> for IOMMU groups is going to be easier.
> 
> The first candidate for this change is a "dma-window-size"
> property which tells a size of a DMA window of the specific
> IOMMU group which can be used later for locked pages accounting.

I'm still churning on this one; I'm nervous this would basically create
a /proc free-for-all under /sys/kernel/iommu_groups/$GROUP/ where any
iommu driver can add random attributes.  That can get ugly for
userspace.

On the other hand, for the application of userspace knowing how much
memory to lock for vfio use of a group, it's an appealing location to
get that information.  Something like libvirt would already be poking
around here to figure out which devices to bind.  Page limits need to be
set up prior to use through vfio, so sysfs is more convenient than
through vfio ioctls.
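
As a rough sketch of the userspace side, a management tool could read a
per-group attribute such as the proposed dma-window-size as a single
number before raising its locked-memory limit.  The helper name is
hypothetical, and the attribute's text format (one integer per file) is
an assumption; nothing here is an existing interface.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one unsigned integer (decimal, or hex with a 0x prefix) from a
 * sysfs-style text file.  Returns 0 on success or -errno.
 */
static int read_ulong_attr(const char *path, unsigned long long *val)
{
	char buf[64];
	char *end;
	FILE *f = fopen(path, "r");

	if (!f)
		return -errno;

	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	errno = 0;
	*val = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;

	return 0;
}
```

The value read this way would then feed whatever limit the tool sets
(for example RLIMIT_MEMLOCK) before the group is opened through vfio.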

But then is dma-window-size just a vfio requirement leaking over into
iommu groups?  Can we allow iommu driver based attributes without giving
up control of the namespace?  Thanks,

Alex

> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/iommu/iommu.c |   19 ++-
>  include/linux/iommu.h |   20 
>  2 files changed, 22 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b0afd3d..58cc298 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -52,22 +52,6 @@ struct iommu_device {
>   char *name;
>  };
>  
> -struct iommu_group_attribute {
> - struct attribute attr;
> - ssize_t (*show)(struct iommu_group *group, char *buf);
> - ssize_t (*store)(struct iommu_group *group,
> -  const char *buf, size_t count);
> -};
> -
> -#define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)\
> -struct iommu_group_attribute iommu_group_attr_##_name =  \
> - __ATTR(_name, _mode, _show, _store)
> -
> -#define to_iommu_group_attr(_attr)   \
> - container_of(_attr, struct iommu_group_attribute, attr)
> -#define to_iommu_group(_kobj)\
> - container_of(_kobj, struct iommu_group, kobj)
> -
>  static ssize_t iommu_group_attr_show(struct kobject *kobj,
>struct attribute *__attr, char *buf)
>  {
> @@ -98,11 +82,12 @@ static const struct sysfs_ops iommu_group_sysfs_ops = {
>   .store = iommu_group_attr_store,
>  };
>  
> -static int iommu_group_create_file(struct iommu_group *group,
> +int iommu_group_create_file(struct iommu_group *group,
>  struct iommu_group_attribute *attr)
>  {
>   return sysfs_create_file(&group->kobj, &attr->attr);
>  }
> +EXPORT_SYMBOL_GPL(iommu_group_create_file);
>  
>  static void iommu_group_remove_file(struct iommu_group *group,
>   struct iommu_group_attribute *attr)
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index f3b99e1..6d24ba7 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -21,6 +21,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define IOMMU_READ   (1)
>  #define IOMMU_WRITE  (2)
> @@ -197,6 +198,25 @@ static inline int report_iommu_fault(struct iommu_domain 
> *domain,
>   return ret;
>  }
>  
> +struct iommu_group_attribute {
> + struct attribute attr;
> + ssize_t (*show)(struct iommu_group *group, char *buf);
> + ssize_t (*store)(struct iommu_group *group,
> +  const char *buf, size_t count);
> +};
> +
> +#define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)\
> +struct iommu_group_attribute iommu_group_attr_##_name =  \
> + __ATTR(_name, _mode, _show, _store)
> +
> +#define to_iommu_group_attr(_attr)   \
> + container_of(_attr, struct iommu_group_attribute, attr)
> +#define to_iommu_group(_kobj)\
> + container_of(_kobj, struct iommu_group, kobj)
> +
> +extern int iommu_group_create_file(struct iommu_group *group,
> +struct iommu_group_attribute *attr);
> +
>  #else /* CONFIG_IOMMU_API */
>  
>  struct iommu_ops {};



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-12 Thread Alex Williamson

On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> On 12/02/13 16:07, Alex Williamson wrote:
> > On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >> Having this patch in a tree, adding new nodes in sysfs
> >> for IOMMU groups is going to be easier.
> >>
> >> The first candidate for this change is a "dma-window-size"
> >> property which tells a size of a DMA window of the specific
> >> IOMMU group which can be used later for locked pages accounting.
> >
> > > I'm still churning on this one; I'm nervous this would basically create
> > > a /proc free-for-all under /sys/kernel/iommu_groups/$GROUP/ where any
> > iommu driver can add random attributes.  That can get ugly for
> > userspace.
> 
> Is not it exactly what sysfs is for (unlike /proc)? :)

Um, I hope it's a little more thought out than /proc.

> > On the other hand, for the application of userspace knowing how much
> > memory to lock for vfio use of a group, it's an appealing location to
> > get that information.  Something like libvirt would already be poking
> > around here to figure out which devices to bind.  Page limits need to be
> > set up prior to use through vfio, so sysfs is more convenient than
> > through vfio ioctls.
> 
> True. DMA window properties do not change since boot so sysfs is the right 
> place to expose them.
>
> > But then is dma-window-size just a vfio requirement leaking over into
> > iommu groups?  Can we allow iommu driver based attributes without giving
> > up control of the namespace?  Thanks,
> 
> Who are you asking these questions? :)

Anyone, including you.  Rather than dropping misc files in sysfs to
describe things about the group, I think the better solution in your
case might be a link from the group to an existing sysfs directory
describing the PE.  I believe your PE is rooted in a PCI bridge, so that
presumably already has a representation in sysfs.  Can the aperture size
be determined from something in sysfs for that bridge already?  I'm just
not ready to create a grab bag of sysfs entries for a group yet.
Thanks,

Alex

> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>   drivers/iommu/iommu.c |   19 ++-
> >>   include/linux/iommu.h |   20 
> >>   2 files changed, 22 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >> index b0afd3d..58cc298 100644
> >> --- a/drivers/iommu/iommu.c
> >> +++ b/drivers/iommu/iommu.c
> >> @@ -52,22 +52,6 @@ struct iommu_device {
> >>char *name;
> >>   };
> >>
> >> -struct iommu_group_attribute {
> >> -  struct attribute attr;
> >> -  ssize_t (*show)(struct iommu_group *group, char *buf);
> >> -  ssize_t (*store)(struct iommu_group *group,
> >> -   const char *buf, size_t count);
> >> -};
> >> -
> >> -#define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \
> >> -struct iommu_group_attribute iommu_group_attr_##_name =   \
> >> -  __ATTR(_name, _mode, _show, _store)
> >> -
> >> -#define to_iommu_group_attr(_attr)\
> >> -  container_of(_attr, struct iommu_group_attribute, attr)
> >> -#define to_iommu_group(_kobj) \
> >> -  container_of(_kobj, struct iommu_group, kobj)
> >> -
> >>   static ssize_t iommu_group_attr_show(struct kobject *kobj,
> >> struct attribute *__attr, char *buf)
> >>   {
> >> @@ -98,11 +82,12 @@ static const struct sysfs_ops iommu_group_sysfs_ops = {
> >>.store = iommu_group_attr_store,
> >>   };
> >>
> >> -static int iommu_group_create_file(struct iommu_group *group,
> >> +int iommu_group_create_file(struct iommu_group *group,
> >>   struct iommu_group_attribute *attr)
> >>   {
> >>return sysfs_create_file(&group->kobj, &attr->attr);
> >>   }
> >> +EXPORT_SYMBOL_GPL(iommu_group_create_file);
> >>
> >>   static void iommu_group_remove_file(struct iommu_group *group,
> >>struct iommu_group_attribute *attr)
> >> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >> index f3b99e1..6d24ba7 100644
> >> --- a/include/linux/iommu.h
> >> +++ b/include/linux/iommu.h
> >> @@ -21,6 +21,7 @@
> >>
> >>   #include 
> >>   #include 
> >> +#include 
> >>
> >>

Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu implementation.

2013-04-03 Thread Alex Williamson

On Tue, 2013-04-02 at 18:18 +0200, Joerg Roedel wrote:
> Cc'ing Alex Williamson
> 
> Alex, can you please review the iommu-group part of this patch?

Sure, it looks pretty reasonable.  AIUI, all PCI devices are below some
kind of host bridge that is either new and supports partitioning or old
and doesn't.  I don't know if that's a visibility or isolation
requirement, perhaps PCI ACS-ish.  In the new host bridge case, each
device gets a group.  This seems not to have any quirks for
multifunction devices though.  On AMD and Intel IOMMUs we test
multifunction device ACS support to determine whether all the functions
should be in the same group.  Is there any reason to trust multifunction
devices on PAMU?

I also find it curious what happens to the iommu group of the host
bridge.  In the partitionable case the host bridge group is removed, in
the non-partitionable case the host bridge group becomes the group for
the children, removing the host bridge.  It's unique to PAMU so far that
these host bridges are even in an iommu group (x86 only adds pci
devices), but I don't see it as necessarily wrong leaving it in either
scenario.  Does it solve some problem to remove them from the groups?
Thanks,

Alex


Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu implementation.

2013-04-04 Thread Alex Williamson

On Thu, 2013-04-04 at 13:00 +, Sethi Varun-B16395 wrote:
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Wednesday, April 03, 2013 11:32 PM
> > To: Joerg Roedel
> > Cc: Sethi Varun-B16395; Yoder Stuart-B08248; Wood Scott-B07421;
> > io...@lists.linux-foundation.org; linuxppc-...@lists.ozlabs.org; linux-
> > ker...@vger.kernel.org; ga...@kernel.crashing.org;
> > b...@kernel.crashing.org
> > Subject: Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu
> > implementation.
> > 
> > On Tue, 2013-04-02 at 18:18 +0200, Joerg Roedel wrote:
> > > Cc'ing Alex Williamson
> > >
> > > Alex, can you please review the iommu-group part of this patch?
> > 
> > Sure, it looks pretty reasonable.  AIUI, all PCI devices are below some
> > kind of host bridge that is either new and supports partitioning or old
> > and doesn't.  I don't know if that's a visibility or isolation
> > requirement, perhaps PCI ACS-ish.  In the new host bridge case, each
> > device gets a group.  This seems not to have any quirks for multifunction
> > devices though.  On AMD and Intel IOMMUs we test multifunction device ACS
> > support to determine whether all the functions should be in the same
> > group.  Is there any reason to trust multifunction devices on PAMU?
> > 
> [Sethi Varun-B16395] In the case where we can partition endpoints we
> can distinguish transactions based on the bus,device,function number
> combination. This support is available in the PCIe controller (host
> bridge).

So can x86 IOMMUs, that's the visibility aspect of IOMMU groups.
Visibility alone doesn't necessarily imply that a device is isolated
though.  A multifunction PCI device that doesn't expose ACS support may
not isolate functions from each other.  For example a peer-to-peer DMA
between functions may not be translated by the upstream IOMMU.  IOMMU
groups should encompass both visibility and isolation.
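
The two conditions can be sketched as a small predicate.  This is only a
model of the reasoning above: the struct and both field names are
hypothetical, and the real grouping code walks the PCI topology rather
than looking at a single function.

```c
#include <stdbool.h>

/* Hypothetical per-function summary used only for this sketch. */
struct pci_fn_sketch {
	bool multifunction;	/* one function of a multifunction device */
	bool acs_isolates;	/* ACS present and blocking peer-to-peer */
};

/*
 * A function may get a group of its own only when it is both visible
 * (the IOMMU can tell its transactions apart by bus/device/function)
 * and isolated (its DMA cannot reach a sibling without passing through
 * the IOMMU).
 */
static bool fn_gets_own_group(bool iommu_sees_bdf,
			      const struct pci_fn_sketch *fn)
{
	/* Visibility: without it the IOMMU cannot distinguish the device. */
	if (!iommu_sees_bdf)
		return false;

	/* Isolation: untrusted multifunction siblings share one group. */
	if (fn->multifunction && !fn->acs_isolates)
		return false;

	return true;
}
```

Under this model a host bridge that decodes bus/device/function gives
visibility only; a multifunction device without ACS still lands in a
shared group.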

> > I also find it curious what happens to the iommu group of the host
> > bridge.  In the partitionable case the host bridge group is removed, in
> > the non-partitionable case the host bridge group becomes the group for
> > the children, removing the host bridge.  It's unique to PAMU so far that
> > these host bridges are even in an iommu group (x86 only adds pci
> > devices), but I don't see it as necessarily wrong leaving it in either
> > scenario.  Does it solve some problem to remove them from the groups?
> > Thanks,
> [Sethi Varun-B16395] The PCIe controller isn't a partitionable entity,
> it would always be owned by the host.

Ownership of a device shouldn't play into the group context.  An IOMMU
group should be defined by its visibility and isolation from other
devices.  Whether the PCIe controller is allowed to be handed to
userspace is a question for VFIO.  Thanks,

Alex


Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu implementation.

2013-04-04 Thread Alex Williamson

On Thu, 2013-04-04 at 16:35 +, Sethi Varun-B16395 wrote:
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, April 04, 2013 8:52 PM
> > To: Sethi Varun-B16395
> > Cc: Joerg Roedel; Yoder Stuart-B08248; Wood Scott-B07421;
> > io...@lists.linux-foundation.org; linuxppc-...@lists.ozlabs.org; linux-
> > ker...@vger.kernel.org; ga...@kernel.crashing.org;
> > b...@kernel.crashing.org
> > Subject: Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu
> > implementation.
> > 
> > On Thu, 2013-04-04 at 13:00 +, Sethi Varun-B16395 wrote:
> > >
> > > > -Original Message-
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Wednesday, April 03, 2013 11:32 PM
> > > > To: Joerg Roedel
> > > > Cc: Sethi Varun-B16395; Yoder Stuart-B08248; Wood Scott-B07421;
> > > > io...@lists.linux-foundation.org; linuxppc-...@lists.ozlabs.org;
> > > > linux- ker...@vger.kernel.org; ga...@kernel.crashing.org;
> > > > b...@kernel.crashing.org
> > > > Subject: Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and
> > > > iommu implementation.
> > > >
> > > > On Tue, 2013-04-02 at 18:18 +0200, Joerg Roedel wrote:
> > > > > Cc'ing Alex Williamson
> > > > >
> > > > > Alex, can you please review the iommu-group part of this patch?
> > > >
> > > > Sure, it looks pretty reasonable.  AIUI, all PCI devices are below
> > > > some kind of host bridge that is either new and supports
> > > > partitioning or old and doesn't.  I don't know if that's a
> > > > visibility or isolation requirement, perhaps PCI ACS-ish.  In the
> > > > new host bridge case, each device gets a group.  This seems not to
> > > > have any quirks for multifunction devices though.  On AMD and Intel
> > > > IOMMUs we test multifunction device ACS support to determine whether
> > > > all the functions should be in the same group.  Is there any reason
> > > > to trust multifunction devices on PAMU?
> > > >
> > > [Sethi Varun-B16395] In the case where we can partition endpoints we
> > > can distinguish transactions based on the bus,device,function number
> > > combination. This support is available in the PCIe controller (host
> > > bridge).
> > 
> > So can x86 IOMMUs, that's the visibility aspect of IOMMU groups.
> > Visibility alone doesn't necessarily imply that a device is isolated
> > though.  A multifunction PCI device that doesn't expose ACS support may
> > not isolate functions from each other.  For example a peer-to-peer DMA
> > between functions may not be translated by the upstream IOMMU.  IOMMU
> > groups should encompass both visibility and isolation.
> [Sethi Varun-B16395] We can isolate the DMA access to the host based
> on the PCI bus,device,function number.

The IOMMU can only isolate DMA that it can see.  A multifunction device
may never expose peer-to-peer DMA to the upstream device, it's
implementation specific.  The ACS flags allow that possibility to be
controlled and prevented.

> I thought that was enough to put devices in to separate iommu groups.
> This is a PCIe controller property which allows us to partition PCIe
> devices. But, what I can understand from your point is that we also
> need to consider isolation at PCIe device level as well. I will check
> for the case of multifunction devices.
> 
> > 
> > > > I also find it curious what happens to the iommu group of the host
> > > > bridge.  In the partitionable case the host bridge group is removed,
> > > > in the non-partitionable case the host bridge group becomes the
> > > > group for the children, removing the host bridge.  It's unique to
> > > > PAMU so far that these host bridges are even in an iommu group (x86
> > > > only adds pci devices), but I don't see it as necessarily wrong
> > > > leaving it in either scenario.  Does it solve some problem to remove
> > > > them from the groups?
> > > > Thanks,
> > > [Sethi Varun-B16395] The PCIe controller isn't a partitionable entity,
> > > it would always be owned by the host.
> > 
> > Ownership of a device shouldn't play into the group context.  An IOMMU
> > group should be defined by its visibility and isolation from other
> > devices.  Whether the PCIe controller is allowed to be handed to
> > userspace is a question for

Re: [PATCH v2 2/3] VFIO-AER: Vfio-pci driver changes for supporting AER

2013-01-29 Thread Alex Williamson

On Mon, 2013-01-28 at 12:31 -0700, Alex Williamson wrote:
> On Mon, 2013-01-28 at 09:54 +, Pandarathil, Vijaymohan R wrote:
> > - New VFIO_SET_IRQ ioctl option to pass the eventfd that is signalled 
> > when
> >   an error occurs in the vfio_pci_device
> > 
> > - Register pci_error_handler for the vfio_pci driver
> > 
> > - When the device encounters an error, the error handler registered by
> >   the vfio_pci driver gets invoked by the AER infrastructure
> > 
> > - In the error handler, signal the eventfd registered for the device.
> > 
> > - This results in the qemu eventfd handler getting invoked and
> >   appropriate action taken for the guest.
> > 
> > Signed-off-by: Vijay Mohan Pandarathil 
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 44 
> > -
> >  drivers/vfio/pci/vfio_pci_intrs.c   | 32 +++
> >  drivers/vfio/pci/vfio_pci_private.h |  1 +
> >  include/uapi/linux/vfio.h   |  3 +++
> >  4 files changed, 79 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index b28e66c..ff2a078 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -196,7 +196,9 @@ static int vfio_pci_get_irq_count(struct 
> > vfio_pci_device *vdev, int irq_type)
> >  
> > return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
> > }
> > -   }
> > +   } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX)
> > +   if (pci_is_pcie(vdev->pdev))
> > +   return 1;
> >  
> > return 0;
> >  }
> > @@ -223,9 +225,18 @@ static long vfio_pci_ioctl(void *device_data,
> > if (vdev->reset_works)
> > info.flags |= VFIO_DEVICE_FLAGS_RESET;
> >  
> > +   if (pci_is_pcie(vdev->pdev)) {
> > +   info.flags |= VFIO_DEVICE_FLAGS_PCI_AER;
> > +   info.flags |= VFIO_DEVICE_FLAGS_PCI_AER_NOTIFY;
> 
> Not sure this second flag should be AER specific or if it's even needed,
> see below for more comments on this.
> 
> > +   }
> > +
> > info.num_regions = VFIO_PCI_NUM_REGIONS;
> > info.num_irqs = VFIO_PCI_NUM_IRQS;
> >  
> > +   /* Expose only implemented IRQs */
> > +   if (!(info.flags & VFIO_DEVICE_FLAGS_PCI_AER_NOTIFY))
> > +   info.num_irqs--;
> 
> I'm having second thoughts on this, see further below.
> 
> > +
> > return copy_to_user((void __user *)arg, &info, minsz);
> >  
> > } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
> > @@ -302,6 +313,10 @@ static long vfio_pci_ioctl(void *device_data,
> > if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
> > return -EINVAL;
> >  
> > +   if ((info.index == VFIO_PCI_ERR_IRQ_INDEX) &&
> > +!pci_is_pcie(vdev->pdev))
> > +   return -EINVAL;
> > +
> 
> Perhaps we could incorporate the index test above this too?
> 
> switch (info.index) {
> case VFIO_PCI_INTX_IRQ_INDEX: ... VFIO_PCI_MSIX_IRQ_INDEX:
>   break;
> case VFIO_PCI_ERR_IRQ_INDEX:
>   if (pci_is_pcie(vdev->pdev))
>   break;
> default:
>   return -EINVAL;
> }
> 
> This is more similar to how I've re-written the same for the proposed
> VGA/legacy I/O support.
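
A self-contained version of the suggested switch, as a sketch.  The enum
is a local stand-in assumed for illustration (the ERR index comes from
the patch under review), and the case range is a GNU C extension
accepted by gcc and clang.

```c
#include <errno.h>
#include <stdbool.h>

/* Local stand-ins for the vfio-pci IRQ indexes; values are assumed. */
enum {
	VFIO_PCI_INTX_IRQ_INDEX,
	VFIO_PCI_MSI_IRQ_INDEX,
	VFIO_PCI_MSIX_IRQ_INDEX,
	VFIO_PCI_ERR_IRQ_INDEX,
	VFIO_PCI_NUM_IRQS
};

/* Returns 0 if the index is usable on this device, -EINVAL otherwise. */
static int irq_index_check(unsigned int index, bool is_pcie)
{
	switch (index) {
	case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
		break;
	case VFIO_PCI_ERR_IRQ_INDEX:
		if (is_pcie)
			break;
		/* fall through: the error IRQ is rejected on non-PCIe */
	default:
		return -EINVAL;
	}

	return 0;
}
```

The deliberate fall through lets the out-of-range case and the
unsupported-feature case share a single error path.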
> 
> > info.flags = VFIO_IRQ_INFO_EVENTFD;
> >  
> > info.count = vfio_pci_get_irq_count(vdev, info.index);
> > @@ -538,11 +553,38 @@ static void vfio_pci_remove(struct pci_dev *pdev)
> > kfree(vdev);
> >  }
> >  
> > +static pci_ers_result_t vfio_err_detected(struct pci_dev *pdev,
> > +   pci_channel_state_t state)
> 
> This is actually AER specific, right?  So perhaps it should be
> vfio_pci_aer_err_detected?
> 
> Also, please follow existing whitespace usage throughout, tabs followed
> by spaces to align function parameter wrap.
> 
> > +{
> > +   struct vfio_pci_device *vpdev;
> > +   void *vdev;
> 
> struct vfio_device *vdev;
> 
> > +
> > +   vdev = vfio_device_get_from_dev(&pdev->dev);
> > +   if (vdev == NULL)
> > +   return PCI_ERS_RESULT_DISCONNECT;
> > +
> > +   vpdev = vfio_device_data(vdev);
> > +   if (vpdev == NULL)
> > +   return PCI_ERS_RESULT_DISC

[PATCH v2] vfio-pci: Add support for VGA region access

2013-01-29 Thread Alex Williamson

PCI defines display class VGA regions at I/O port address 0x3b0, 0x3c0
and MMIO address 0xa.  As these are non-overlapping, we can ignore
the I/O port vs MMIO difference and expose them both in a single
region.  We make use of the VGA arbiter around each access to
configure chipset access as necessary.
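
To illustrate why one region is enough, the sketch below routes an offset
within that region to either port I/O or MMIO purely by its value.  The
exact range bounds (0x3b0-0x3bb, 0x3c0-0x3df, 0xa0000-0xbffff) are the
conventional legacy VGA ranges and are an assumption of this example,
as are the enum and function names.

```c
#include <stddef.h>

enum vga_kind { VGA_NONE, VGA_IOPORT, VGA_MMIO };

/*
 * Classify an access of 'count' bytes at 'off' within the VGA region.
 * An access must fall entirely inside one legacy range to be accepted.
 */
static enum vga_kind vga_classify(unsigned long off, size_t count)
{
	unsigned long end = off + count;	/* exclusive end */

	if (count == 0)
		return VGA_NONE;

	if (off >= 0x3b0 && end <= 0x3bc)	/* assumed 0x3b0-0x3bb */
		return VGA_IOPORT;
	if (off >= 0x3c0 && end <= 0x3e0)	/* assumed 0x3c0-0x3df */
		return VGA_IOPORT;
	if (off >= 0xa0000 && end <= 0xc0000)	/* assumed 0xa0000-0xbffff */
		return VGA_MMIO;

	return VGA_NONE;
}
```

Because the port and memory ranges never overlap numerically, the offset
alone identifies the address space and no extra flag is needed.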

Signed-off-by: Alex Williamson 
---

v2: Revise the region access based on VFIO AER discussion.  We don't
need to add flags for everything or over-engineer generic legacy
resources.  GET_REGION_INFO.num_regions should return the highest
index, as the documentation specifies, not a count.  Thus we always
return VFIO_PCI_NUM_REGIONS and userspace can probe for interesting
things within that.  The existence of VFIO_PCI_VGA_REGION_INDEX
therefore indicates the presence of PCI VGA support and due to the VGA
I/O port and MMIO being non-overlapping, we can expose them in a
single region.

VFIO PCI AER can follow this lead for an IRQ and region for describing
AER events and data with VFIO_PCI_AER_IRQ_INDEX and
VFIO_PCI_AER_REGION_INDEX.
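
A sketch of the userspace probing this implies: treat num_regions as an
upper bound and query each index of interest.  The callback stands in
for the region-info ioctl, and every name in this block is hypothetical;
demo_query only fakes a device for the sake of the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result of one region query. */
struct region_info_sketch {
	unsigned long long size;
};

/* Stand-in for the ioctl: returns 0 or -errno and fills *info. */
typedef int (*region_query_fn)(unsigned int index,
			       struct region_info_sketch *info);

/* A region is usable if it is in range, the query works, and size != 0. */
static bool region_present(unsigned int index, unsigned int num_regions,
			   region_query_fn query)
{
	struct region_info_sketch info = { 0 };

	if (index >= num_regions)
		return false;
	if (query(index, &info) != 0)
		return false;

	return info.size != 0;
}

/* Fake device: index 0 is a 4K BAR, index 2 an empty BAR, the rest fail. */
static int demo_query(unsigned int index, struct region_info_sketch *info)
{
	switch (index) {
	case 0:
		info->size = 0x1000;
		return 0;
	case 2:
		info->size = 0;
		return 0;
	default:
		return -EINVAL;
	}
}
```

With this shape, support for an optional region is discovered by
querying its index instead of testing a dedicated capability flag.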

 drivers/vfio/pci/Kconfig|   10 ++
 drivers/vfio/pci/vfio_pci.c |   18 ++
 drivers/vfio/pci/vfio_pci_private.h |4 ++
 drivers/vfio/pci/vfio_pci_rdwr.c|   61 +++
 include/uapi/linux/vfio.h   |9 +
 5 files changed, 102 insertions(+)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 5980758..e84300b 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -6,3 +6,13 @@ config VFIO_PCI
  use of PCI drivers using the VFIO framework.
 
  If you don't know what to do here, say N.
+
+config VFIO_PCI_VGA
+   bool "VFIO PCI support for VGA devices"
+   depends on VFIO_PCI && X86 && VGA_ARB && EXPERIMENTAL
+   help
+ Support for VGA extension to VFIO PCI.  This exposes an additional
+ region on VGA devices for accessing legacy VGA addresses used by
+ BIOS and generic video drivers.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index bb8c8c2..8189cb6 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -84,6 +84,11 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
} else
vdev->msix_bar = 0xFF;
 
+#ifdef CONFIG_VFIO_PCI_VGA
+   if ((pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
+   vdev->has_vga = true;
+#endif
+
return 0;
 }
 
@@ -285,6 +290,16 @@ static long vfio_pci_ioctl(void *device_data,
info.flags = VFIO_REGION_INFO_FLAG_READ;
break;
}
+   case VFIO_PCI_VGA_REGION_INDEX:
+   if (!vdev->has_vga)
+   return -EINVAL;
+
+   info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+   info.size = 0xc;
+   info.flags = VFIO_REGION_INFO_FLAG_READ |
+VFIO_REGION_INFO_FLAG_WRITE;
+
+   break;
default:
return -EINVAL;
}
@@ -386,6 +401,9 @@ static ssize_t vfio_pci_rw(void *device_data, char __user 
*buf,
 
case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+
+   case VFIO_PCI_VGA_REGION_INDEX:
+   return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
}
 
return -EINVAL;
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 00d19b9..d7e55d0 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -53,6 +53,7 @@ struct vfio_pci_device {
boolreset_works;
boolextended_caps;
boolbardirty;
+   boolhas_vga;
struct pci_saved_state  *pci_saved_state;
atomic_trefcnt;
 };
@@ -77,6 +78,9 @@ extern ssize_t vfio_pci_config_rw(struct vfio_pci_device 
*vdev,
 extern ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
   size_t count, loff_t *ppos, bool iswrite);
 
+extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
+  size_t count, loff_t *ppos, bool iswrite);
+
 extern int vfio_pci_init_perm_bits(void);
 extern void vfio_pci_uninit_perm_bits(void);
 
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index e9d78eb..210db24 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vfio_pci_private.h"
 
@@ -175,3 +176,63 @@ ssize_t vfio

Re: [PATCH] x86, x2apic: Only WARN on broken BIOSes inside a virtual guest

2013-01-31 Thread Alex Williamson

On Thu, 2013-01-31 at 22:00 +0200, Gleb Natapov wrote:
> On Thu, Jan 31, 2013 at 02:34:27PM -0500, Don Zickus wrote:
> > On Thu, Jan 31, 2013 at 08:52:00PM +0200, Gleb Natapov wrote:
> > > > http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
> > > > 
> > > > After talking with folks, the threat of irq injections on virtual guests
> > > > made sense.  However, when discussing if this was possible on bare metal
> > > > machines, we could not come up with a plausible scenario.
> > > > 
> > > The irq injections is something that a guest with assigned device does
> > > to attack a hypervisor it runs on. Interrupt remapping protects host
> > > from this attack. According to pdf above if x2apic is disabled in a
> > > hypervisor interrupt remapping can be bypassed and leave host vulnerable
> > > to guest attack. This means that situation is exactly opposite: warning
> > > has sense on a bare metal, but not in a guest. I am not sure that there is
> > > a hypervisor that emulates interrupt remapping device though and without
> > > it the warning cannot be triggered in a guest.
> > 
> > Ah, it makes sense.  Not sure how I got it backwards then.  So my patch is
> > pointless then?  I'll ask for it to be dropped.
> Yes, it is backwards.
> 
> > 
> > From my previous discussions with folks, is that KVM was protected from
> > this type of attack.  Is that still true?
> > 
> Copying Alex. He said that to use device assignment without interrupt
> remapping customer needs to opt-in explicitly. Not sure what happens
> with interrupt remapping but with x2apic disabled.

Per the paper above, compatibility format is only vulnerable if EIM
(Extended Interrupt Mode) is clear (x2APIC not enabled) and CFIS in the
global command register is set.  The latter is never set.
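
As a rough illustration of that condition, here is a minimal sketch.  The
struct and function names are invented for the example and are not taken
from the intel-iommu driver; only the two bits discussed above are modeled.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the two remapping-hardware bits in question. */
struct ir_state {
	bool eim;	/* Extended Interrupt Mode enabled (x2APIC in use) */
	bool cfis;	/* compatibility format interrupts permitted */
};

/*
 * Compatibility format interrupts are only a concern when EIM is clear
 * AND CFIS is set; either condition alone is not sufficient.
 */
bool compat_format_exposed(const struct ir_state *s)
{
	return !s->eim && s->cfis;
}
```

With CFIS never set, the predicate is false whether or not x2APIC is
enabled.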

> The problem is not limited to virtualization BTW. Any vfio user may
> attack kernel without interrupt remapping so vfio has the same opt-in.

Yep.  Thanks,

Alex


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/2] kvm: IOMMU read-only mapping support

2013-01-24 Thread Alex Williamson

A couple patches to make KVM IOMMU support honor read-only mappings.
This causes an un-map, re-map when the read-only flag changes and
makes use of it when setting IOMMU attributes.  Thanks,

Alex

---

Alex Williamson (2):
  kvm: Force IOMMU remapping on memory slot read-only flag changes
  kvm: Obey read-only mappings in iommu


 virt/kvm/iommu.c|4 +++-
 virt/kvm/kvm_main.c |   28 
 2 files changed, 27 insertions(+), 5 deletions(-)

[PATCH 1/2] kvm: Force IOMMU remapping on memory slot read-only flag changes

2013-01-24 Thread Alex Williamson

Memory slot flags can be altered without changing other parameters of
the slot.  The read-only attribute is the only one the IOMMU cares
about, so generate an un-map, re-map when this occurs.  This also
avoids unnecessarily re-mapping the slot when no IOMMU-visible changes
are made.

Signed-off-by: Alex Williamson 
---
 virt/kvm/kvm_main.c |   28 
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e709eb..3fec2cd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -731,6 +731,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_memory_slot *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots = NULL, *old_memslots;
+   bool old_iommu_mapped;
 
r = check_memory_region_flags(mem);
if (r)
@@ -772,6 +773,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.npages = npages;
new.flags = mem->flags;
 
+   old_iommu_mapped = old.npages;
+
/*
 * Disallow changing a memory slot's size or changing anything about
 * zero sized slots that doesn't involve making them non-zero.
@@ -835,6 +838,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
/* slot was deleted or moved, clear iommu mapping */
kvm_iommu_unmap_pages(kvm, &old);
+   old_iommu_mapped = false;
/* From this point no new shadow pages pointing to a deleted,
 * or moved, memslot will be created.
 *
@@ -863,11 +867,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out_free;
}
 
-   /* map new memory slot into the iommu */
+   /*
+* IOMMU mapping:  New slots need to be mapped.  Old slots need to be
+* un-mapped and re-mapped if their base changes or if flags that the
+* iommu cares about change (read-only).  Base change unmapping is
+* handled above with slot deletion, so we only unmap incompatible
+* flags here.  Anything else the iommu might care about for existing
+* slots (size changes, userspace addr changes) is disallowed above,
+* so any other attribute changes getting here can be skipped.
+*/
if (npages) {
-   r = kvm_iommu_map_pages(kvm, &new);
-   if (r)
-   goto out_slots;
+   if (old_iommu_mapped &&
+   ((new.flags ^ old.flags) & KVM_MEM_READONLY)) {
+   kvm_iommu_unmap_pages(kvm, &old);
+   old_iommu_mapped = false;
+   }
+
+   if (!old_iommu_mapped) {
+   r = kvm_iommu_map_pages(kvm, &new);
+   if (r)
+   goto out_slots;
+   }
}
 
/* actual memory is freed via old in kvm_free_physmem_slot below */
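
The decision made in the final hunk can also be read in isolation.  Below is
a minimal userspace sketch of it, not the kernel code: struct slot,
struct plan and iommu_remap_plan() are hypothetical stand-ins, and
KVM_MEM_READONLY is redefined locally so the example compiles on its own.

```c
#include <stdbool.h>

#define KVM_MEM_READONLY	(1UL << 1)	/* local copy for the sketch */

struct slot {				/* stand-in for struct kvm_memory_slot */
	unsigned long npages;
	unsigned long flags;
};

struct plan {				/* what the caller should do */
	bool unmap_old;
	bool map_new;
};

struct plan iommu_remap_plan(const struct slot *old_slot,
			     const struct slot *new_slot,
			     bool old_iommu_mapped)
{
	struct plan p = { false, false };

	/* Nothing to map for a zero-sized (deleted) slot. */
	if (!new_slot->npages)
		return p;

	/* Only a read-only flag change forces an un-map of an existing slot. */
	if (old_iommu_mapped &&
	    ((new_slot->flags ^ old_slot->flags) & KVM_MEM_READONLY)) {
		p.unmap_old = true;
		old_iommu_mapped = false;
	}

	/* Map when the slot is new or was just un-mapped. */
	p.map_new = !old_iommu_mapped;
	return p;
}
```

A flag change that the IOMMU does not care about (dirty logging, for
example) yields neither an un-map nor a map in this sketch.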


[PATCH 2/2] kvm: Obey read-only mappings in iommu

2013-01-24 Thread Alex Williamson

We've been ignoring read-only mappings and programming everything
into the iommu as read-write.  Fix this to only include the write
access flag when read-only is not set.

Signed-off-by: Alex Williamson 
---
 virt/kvm/iommu.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index 4a340cb..72a130b 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -76,7 +76,9 @@ int kvm_iommu_map_pages(struct kvm *kvm, struct 
kvm_memory_slot *slot)
gfn = slot->base_gfn;
end_gfn = gfn + slot->npages;
 
-   flags = IOMMU_READ | IOMMU_WRITE;
+   flags = IOMMU_READ;
+   if (!(slot->flags & KVM_MEM_READONLY))
+   flags |= IOMMU_WRITE;
if (kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY)
flags |= IOMMU_CACHE;
 


Re: [PATCH] kvm: Fix irqfd resampler list walk

2013-01-28 Thread Alex Williamson

On Mon, 2012-12-10 at 18:16 -0200, Marcelo Tosatti wrote:
> On Thu, Dec 06, 2012 at 02:44:59PM -0700, Alex Williamson wrote:
> > Typo for the next pointer means we're walking random data here.
> > 
> > Signed-off-by: Alex Williamson 
> > Cc: sta...@vger.kernel.org [3.7]
> > ---
> > 
> > Not sure if this will make 3.7, so preemptively adding the stable flag
> > 
> >  virt/kvm/eventfd.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Applied, thanks.
> 

Hi Marcelo,

This didn't seem to make it into 3.7.0 or any stable 3.7.  Can we
promote it for stable?  Thanks,

Alex


Re: [PATCH 0/2] kvm: IOMMU read-only mapping support

2013-01-28 Thread Alex Williamson

On Mon, 2013-01-28 at 21:25 +0900, Takuya Yoshikawa wrote:
> On Mon, 28 Jan 2013 12:59:03 +0200
> Gleb Natapov  wrote:
> 
> > > It sets spte based on the old value that means the readonly flag check
> > > is missed. We need to call kvm_arch_flush_shadow_all under this case.
> > Why not just disallow changing memory region KVM_MEM_READONLY flag
> > without deleting the region?
> 
> Sounds good.
> 
> If you prefer, I can fold the required change into my patch.

That would seem to make my patch 1/2 unnecessary.  Thanks,

Alex


Re: [PATCH v2 1/3] VFIO: Wrappers for getting/putting reference to vfio_device

2013-01-28 Thread Alex Williamson

On Mon, 2013-01-28 at 09:54 +, Pandarathil, Vijaymohan R wrote:
>   - Added vfio_device_get_from_vdev(), vfio_device_put_vdev()
>   as wrappers to get/put reference to vfio_device from struct device.
> 
>   - Added vfio_device_data() as a wrapper to get device_data from
>   vfio_device.
> 
> Signed-off-by: Vijay Mohan Pandarathil 
> ---
>  drivers/vfio/vfio.c  | 47 +--
>  include/linux/vfio.h |  3 +++
>  2 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 12c264d..c2ff1b2 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -642,8 +642,13 @@ int vfio_add_group_dev(struct device *dev,
>  }
>  EXPORT_SYMBOL_GPL(vfio_add_group_dev);
>  
> -/* Test whether a struct device is present in our tracking */
> -static bool vfio_dev_present(struct device *dev)
> +/**
> + * This does a get on the corresponding iommu_group,
> + * vfio_group and the vfio_device. Callers of this
> + * function will hae to call vfio_put_vdev() to

s/hae/have/

Note that your commit log and the patch don't agree on names.
vfio_device_get_from_dev vs vfio_device_get_from_vdev.  vfio_put_vdev vs
vfio_device_put_vdev.  Personally I think they should be:

vfio_device_get_from_dev
vfio_device_put

> + * remove the reference to all objects.

Why do we need to hold references to all the objects?  Holding a
reference to a vfio_device should implicitly hold the vfio_group, which
implicitly holds the iommu_group.  We have to get each to get to the
vfio_device, but once we hold that reference I believe we can let the
others go.  Is this not the case?
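
To illustrate the chain I have in mind, here is a toy model.  It is only a
sketch: the toy_* types and plain integer counters are invented for the
example and assume each object holds a long-lived reference on its parent
for as long as it exists.

```c
/* Toy objects: each one pins its parent while it is alive. */
struct toy_iommu_group { int refs; };
struct toy_group  { int refs; struct toy_iommu_group *iommu; };
struct toy_device { int refs; struct toy_group *group; };

/*
 * Lookup: take each reference on the way down to the device, then drop
 * the intermediate ones.  The device reference alone keeps the rest alive.
 */
struct toy_device *toy_device_lookup(struct toy_device *dev)
{
	struct toy_group *group = dev->group;
	struct toy_iommu_group *iommu = group->iommu;

	iommu->refs++;		/* models iommu_group_get() */
	group->refs++;		/* models vfio_group_get_from_iommu() */
	dev->refs++;		/* models vfio_group_get_device() */

	group->refs--;		/* lookup references are no longer needed; */
	iommu->refs--;		/* the long-lived ones still pin both objects */
	return dev;
}

/* The caller's only obligation is to drop the device reference. */
void toy_device_put(struct toy_device *dev)
{
	dev->refs--;
}
```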

> + */
> +void *vfio_device_get_from_dev(struct device *dev)

Return should be struct vfio_device*.  It doesn't have to be void to be
opaque.
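
A minimal sketch of what I mean by opaque but typed follows.  The demo_*
names are made up for the example; in a real split the first three
declarations would sit in the header and the struct definition would stay
private to the .c file.

```c
/* "Header" part: users only ever see an incomplete type. */
struct demo_device;
struct demo_device *demo_device_get(void *device_data);
void *demo_device_data(struct demo_device *device);

/* "Implementation" part: the layout is private to this file. */
struct demo_device {
	void *device_data;
};

static struct demo_device demo_dev;	/* single instance, enough for a sketch */

struct demo_device *demo_device_get(void *device_data)
{
	demo_dev.device_data = device_data;
	return &demo_dev;
}

/* Passing anything but a struct demo_device * draws a compiler diagnostic. */
void *demo_device_data(struct demo_device *device)
{
	return device->device_data;
}
```

Users cannot dereference the pointer, yet the compiler still catches a
wrong pointer type, which a void * interface would silently accept.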

>  {
>   struct iommu_group *iommu_group;
>   struct vfio_group *group;
> @@ -651,25 +656,55 @@ static bool vfio_dev_present(struct device *dev)
>  
>   iommu_group = iommu_group_get(dev);
>   if (!iommu_group)
> - return false;
> + return NULL;
>  
>   group = vfio_group_get_from_iommu(iommu_group);
>   if (!group) {
>   iommu_group_put(iommu_group);
> - return false;
> + return NULL;
>   }
>  
>   device = vfio_group_get_device(group, dev);
>   if (!device) {
>   vfio_group_put(group);
>   iommu_group_put(iommu_group);
> - return false;
> + return NULL;
>   }
> + return device;
> +}
> +EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
> +
> +void *vfio_device_data(void *data)
> +{

Why wouldn't this take struct vfio_device*?  We're ignoring free type
checking errors even though the user should be treating the device as
opaque.

> + struct vfio_device *device = data;
> + return device->device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_device_data);
> +
> +void vfio_device_put_vdev(void *data)
> +{

This also should take a struct vfio_device* and be called
vfio_device_put().  If we fix the above extra reference holding then
it's just the existing vfio_device_put(), which just needs to be exported.

> + struct vfio_device *device = data;
> + struct vfio_group *group = device->group;
> + struct iommu_group *iommu_group = group->iommu_group;
>  
>   vfio_device_put(device);
>   vfio_group_put(group);
>   iommu_group_put(iommu_group);
> - return true;
> + return;

Unnecessary explicit return.  Thanks,

Alex

> +}
> +EXPORT_SYMBOL_GPL(vfio_device_put_vdev);
> +
> +/* Test whether a struct device is present in our tracking */
> +static bool vfio_dev_present(struct device *dev)
> +{
> + struct vfio_device *device;
> +
> + device = vfio_device_get_from_dev(dev);
> + if (device) {
> + vfio_device_put_vdev(device);
> + return true;
> + } else
> + return false;
>  }
>  
>  /*
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ab9e862..e550c09 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -45,6 +45,9 @@ extern int vfio_add_group_dev(struct device *dev,
> void *device_data);
>  
>  extern void *vfio_del_group_dev(struct device *dev);
> +extern void *vfio_device_get_from_dev(struct device *dev);
> +extern void vfio_device_put_vdev(void *device);
> +extern void *vfio_device_data(void *device);
>  
>  /**
>   * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks




Re: [PATCH v2 2/3] VFIO-AER: Vfio-pci driver changes for supporting AER

2013-01-28 Thread Alex Williamson

On Mon, 2013-01-28 at 09:54 +, Pandarathil, Vijaymohan R wrote:
>   - New VFIO_SET_IRQ ioctl option to pass the eventfd that is signalled 
> when
>   an error occurs in the vfio_pci_device
> 
>   - Register pci_error_handler for the vfio_pci driver
> 
>   - When the device encounters an error, the error handler registered by
>   the vfio_pci driver gets invoked by the AER infrastructure
> 
>   - In the error handler, signal the eventfd registered for the device.
> 
>   - This results in the qemu eventfd handler getting invoked and
>   appropriate action taken for the guest.
> 
> Signed-off-by: Vijay Mohan Pandarathil 
> ---
>  drivers/vfio/pci/vfio_pci.c | 44 
> -
>  drivers/vfio/pci/vfio_pci_intrs.c   | 32 +++
>  drivers/vfio/pci/vfio_pci_private.h |  1 +
>  include/uapi/linux/vfio.h   |  3 +++
>  4 files changed, 79 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index b28e66c..ff2a078 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -196,7 +196,9 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device 
> *vdev, int irq_type)
>  
>   return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
>   }
> - }
> + } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX)
> + if (pci_is_pcie(vdev->pdev))
> + return 1;
>  
>   return 0;
>  }
> @@ -223,9 +225,18 @@ static long vfio_pci_ioctl(void *device_data,
>   if (vdev->reset_works)
>   info.flags |= VFIO_DEVICE_FLAGS_RESET;
>  
> + if (pci_is_pcie(vdev->pdev)) {
> + info.flags |= VFIO_DEVICE_FLAGS_PCI_AER;
> + info.flags |= VFIO_DEVICE_FLAGS_PCI_AER_NOTIFY;

Not sure this second flag should be AER specific or if it's even needed,
see below for more comments on this.

> + }
> +
>   info.num_regions = VFIO_PCI_NUM_REGIONS;
>   info.num_irqs = VFIO_PCI_NUM_IRQS;
>  
> + /* Expose only implemented IRQs */
> + if (!(info.flags & VFIO_DEVICE_FLAGS_PCI_AER_NOTIFY))
> + info.num_irqs--;

I'm having second thoughts on this, see further below.

> +
>   return copy_to_user((void __user *)arg, &info, minsz);
>  
>   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
> @@ -302,6 +313,10 @@ static long vfio_pci_ioctl(void *device_data,
>   if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
>   return -EINVAL;
>  
> + if ((info.index == VFIO_PCI_ERR_IRQ_INDEX) &&
> +  !pci_is_pcie(vdev->pdev))
> + return -EINVAL;
> +

Perhaps we could incorporate the index test above this too?

	switch (info.index) {
	case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
		break;
	case VFIO_PCI_ERR_IRQ_INDEX:
		if (pci_is_pcie(vdev->pdev))
			break;
		/* fall through: ERR index is invalid on non-PCIe devices */
	default:
		return -EINVAL;
	}

This is more similar to how I've re-written the same for the proposed
VGA/legacy I/O support.

>   info.flags = VFIO_IRQ_INFO_EVENTFD;
>  
>   info.count = vfio_pci_get_irq_count(vdev, info.index);
> @@ -538,11 +553,38 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>   kfree(vdev);
>  }
>  
> +static pci_ers_result_t vfio_err_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)

This is actually AER specific, right?  So perhaps it should be
vfio_pci_aer_err_detected?

Also, please follow existing whitespace usage throughout, tabs followed
by spaces to align function parameter wrap.

> +{
> + struct vfio_pci_device *vpdev;
> + void *vdev;

struct vfio_device *vdev;

> +
> + vdev = vfio_device_get_from_dev(&pdev->dev);
> + if (vdev == NULL)
> + return PCI_ERS_RESULT_DISCONNECT;
> +
> + vpdev = vfio_device_data(vdev);
> + if (vpdev == NULL)
> + return PCI_ERS_RESULT_DISCONNECT;
> +
> + if (vpdev->err_trigger)
> + eventfd_signal(vpdev->err_trigger, 1);
> +
> + vfio_device_put_vdev(vdev);
> +
> + return PCI_ERS_RESULT_CAN_RECOVER;
> +}
> +
> +static const struct pci_error_handlers vfio_err_handlers = {
> + .error_detected = vfio_err_detected,
> +};
> +
>  static struct pci_driver vfio_pci_driver = {
>   .name   = "vfio-pci",
>   .id_table   = NULL, /* only dynamic ids */
>   .probe  = vfio_pci_probe,
>   .remove = vfio_pci_remove,
> + .err_handler= &vfio_err_handlers,
>  };
>  
>  static void __exit vfio_pci_cleanup(void)
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3639371..f003e08 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -745,6 +745,31 @@ static int vfio_pci_se

Re: [PATCH] drivers/vfio: remove depends on CONFIG_EXPERIMENTAL

2013-02-24 Thread Alex Williamson

On Fri, 2013-02-22 at 23:36 -0800, Kees Cook wrote:
> The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
> while now and is almost always enabled by default. As agreed during the
> Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
> 
> Signed-off-by: Kees Cook 
> Cc: Alex Williamson 
> ---
>  drivers/vfio/pci/Kconfig |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index e84300b..c41b01e 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -9,7 +9,7 @@ config VFIO_PCI
>  
>  config VFIO_PCI_VGA
>   bool "VFIO PCI support for VGA devices"
> - depends on VFIO_PCI && X86 && VGA_ARB && EXPERIMENTAL
> + depends on VFIO_PCI && X86 && VGA_ARB
>   help
> Support for VGA extension to VFIO PCI.  This exposes an additional
> region on VGA devices for accessing legacy VGA addresses used by

Applied.  Thanks,

Alex


igb: NULL pointer dereference

2013-02-24 Thread Alex Williamson

On Linus' current tree I get the oops below.  I realize I'm still
using the deprecated max_vfs= module option, but this isn't a very
compatible or friendly migration path.  I'm using an 82576 PF as an
interface for connecting an iscsi root disk and I also use the VF off
this interface for misc virtual machine assignment.  Suggestions welcome
on migrating to sysfs based sr-iov enabling from an initramfs without
bouncing the PF interface, but I think the below needs to be fixed
regardless.  git bisected to:

commit fa44f2f185f7f9da19d331929bb1b56c1ccd1d93
Author: Greg Rose 
Date:   Thu Jan 17 01:03:06 2013 -0800

igb: Enable SR-IOV configuration via PCI sysfs interface

Implement callback in the driver for the new PCI bus driver
interface that allows the user to enable/disable SR-IOV
virtual functions in a device via the sysfs interface.

Signed-off-by: Greg Rose 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 

Thanks,

Alex


[4.831907] igb: Intel(R) Gigabit Ethernet Network Driver - version 4.1.2-k
[4.838868] igb: Copyright (c) 2007-2012 Intel Corporation.
[4.844710] igb :01:00.0: Maximum of 7 VFs per PF, using max
[4.850768] igb :01:00.0: Enabling SR-IOV VFs using the module parameter 
is deprecated - please use the pci sysfs interface.
[4.872310] systemd[1]: Starting dracut initqueue hook...
 Startin[4.877937] BUG: unable to handle kernel NULL pointer 
dereference at 0048
[4.887143] IP: [] igb_reset+0xcd/0x460 [igb]
[4.893175] PGD 0 
[4.895267] Oops: 0002 [#1] SMP 
[4.898666] Modules linked in: igb(+) ptp usb_storage pps_core iscsi_tcp 
be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi libiscsi_tcp 
qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi
[4.918125] CPU 2 
[4.919959] Pid: 207, comm: systemd-udevd Not tainted 3.8.0-rc3+ #278 LENOVO 
4157CTO/LENOVO
[4.928541] RIP: 0010:[]  [] 
igb_reset+0xcd/0x460 [igb]
[4.937039] RSP: 0018:880367f63b38  EFLAGS: 00010246
[4.942409] RAX: 00ff RBX: 880367020800 RCX: 0001
[4.949601] RDX:  RSI: 0202 RDI: 880367020800
[4.956803] RBP: 880367f63b58 R08: 88036814d408 R09: 0007
[4.964004] R10: 813eea13 R11: 000f R12: 880367020d68
[4.971205] R13: 880371267000 R14: 0040 R15: 88036702
[4.978398] FS:  7fdebfc3b840() GS:88037fc8() 
knlGS:
[4.986549] CS:  0010 DS:  ES:  CR0: 80050033
[4.992369] CR2: 0048 CR3: 000367df3000 CR4: 07e0
[4.999649] DR0:  DR1:  DR2: 
[5.006893] DR3:  DR6: 0ff0 DR7: 0400
[5.014136] Process systemd-udevd (pid: 207, threadinfo 880367f62000, 
task 880367fa)
[5.023108] Stack:
[5.025280]  880371267000 880371267098 880371267000 
0001
[5.033327]  880367f63bd8 a01593e5 880371267098 
880367020d68
[5.041400]  0002 880367020800 0004 
88037136
[5.049451] Call Trace:
[5.052050]  [] igb_probe+0x855/0xde0 [igb]
[5.058023]  [] local_pci_probe+0x4b/0x80
[5.063925]  [] pci_device_probe+0x111/0x120
[5.070082]  [] driver_probe_device+0x8b/0x390
[5.076375]  [] __driver_attach+0xab/0xb0
[5.082303]  [] ? driver_probe_device+0x390/0x390
[5.088736]  [] bus_for_each_dev+0x55/0x90
[5.094458]  [] driver_attach+0x1e/0x20
[5.099921]  [] bus_add_driver+0x1a0/0x290
[5.105691]  [] ? 0xa017
[5.110937]  [] ? 0xa017
[5.116305]  [] driver_register+0x77/0x170
[5.122287]  [] ? 0xa017
[5.127786]  [] __pci_register_driver+0x4b/0x50
[5.134004]  [] igb_init_module+0x4f/0x1000 [igb]
[5.140565]  [] do_one_initcall+0x12a/0x180
[5.146321]  [] load_module+0x1a8a/0x20c0
[5.152020]  [] ? ddebug_proc_open+0xc0/0xc0
[5.157914]  [] sys_init_module+0xd7/0x120
[5.163639]  [] system_call_fastpath+0x16/0x1b
[5.169702] Code: 83 e9 10 45 85 c9 89 8b 14 08 00 00 74 4d 31 c9 66 0f 1f 
44 00 00 48 63 d1 83 c1 01 48 8d 14 52 48 c1 e2 05 48 03 93 98 0e 00 00 <83> 62 
48 08 3b 8b 94 0e 00 00 72 df 48 89 df e8 2f da ff ff 48 
[5.192637] RIP  [] igb_reset+0xcd/0x460 [igb]
g dracut initque[5.198783]  RSP 
[5.203660] CR2: 0048
ue hook...
[5.207169] ---[ end trace 494789df673e4a4c ]---



[GIT PULL] VFIO updates for v3.9-rc1

2013-02-24 Thread Alex Williamson

Hi Linus,

Please pull for v3.9-rc1.  Thanks,

Alex

The following changes since commit 323a72d83c9b2963bd1e46c8e6963e468d4658d7:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2013-02-13 
12:21:07 -0800)

are available in the git repository at:


  git://github.com/awilliam/linux-vfio.git tags/vfio-v3.9-rc1

for you to fetch changes up to d65530fbc799e4036d4d3da4ab6e9fa6d8c4a447:

  drivers/vfio: remove depends on CONFIG_EXPERIMENTAL (2013-02-24 09:59:44 
-0700)


VFIO for 3.9-rc1

- Fixes PCIe v1 extended capability support
- Cleans up read/write access functions
- Fix Removal test to properly wait until devices are unused
- Enable pcieport driver usage for non-accessible devices w/in groups
- Extensions for PCI VGA support


Alex Williamson (7):
  vfio-pci: Enable PCIe extended capabilities on v1
  vfio-pci: Cleanup read/write functions
  vfio-pci: Cleanup BAR access
  vfio: Protect vfio_dev_present against device_del
  vfio: whitelist pcieport
  vfio-pci: Manage user power state transitions
  vfio-pci: Add support for VGA region access

Kees Cook (1):
  drivers/vfio: remove depends on CONFIG_EXPERIMENTAL

 drivers/vfio/pci/Kconfig|  10 ++
 drivers/vfio/pci/vfio_pci.c |  75 ++
 drivers/vfio/pci/vfio_pci_config.c  |  52 +--
 drivers/vfio/pci/vfio_pci_private.h |  19 +--
 drivers/vfio/pci/vfio_pci_rdwr.c| 281 
 drivers/vfio/vfio.c |  35 ++---
 include/uapi/linux/vfio.h   |   9 ++
 7 files changed, 254 insertions(+), 227 deletions(-)



Re: [PATCH v5 1/3] VFIO: Wrapper for getting reference to vfio_device from device

2013-02-25 Thread Alex Williamson

On Sat, 2013-02-23 at 23:25 -0600, Vijay Mohan Pandarathil wrote:
>   - Added vfio_device_get_from_dev() as wrapper to get
>   reference to vfio_device from struct device.
> 
>   - Added vfio_device_data() as a wrapper to get device_data from
>   vfio_device.
> 
> Signed-off-by: Vijay Mohan Pandarathil 
> ---
>  drivers/vfio/vfio.c  | 47 ++-
>  include/linux/vfio.h |  3 +++
>  2 files changed, 37 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 12c264d..863d1d3 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -407,12 +407,13 @@ static void vfio_device_release(struct kref *kref)
>  }
>  
>  /* Device reference always implies a group reference */
> -static void vfio_device_put(struct vfio_device *device)
> +void vfio_device_put(struct vfio_device *device)
>  {
>   struct vfio_group *group = device->group;
>   kref_put_mutex(&device->kref, vfio_device_release, &group->device_lock);
>   vfio_group_put(group);
>  }
> +EXPORT_SYMBOL_GPL(vfio_device_put);
>  
>  static void vfio_device_get(struct vfio_device *device)
>  {
> @@ -642,8 +643,12 @@ int vfio_add_group_dev(struct device *dev,
>  }
>  EXPORT_SYMBOL_GPL(vfio_add_group_dev);
>  
> -/* Test whether a struct device is present in our tracking */
> -static bool vfio_dev_present(struct device *dev)
> +/**
> + * This does a get on the vfio_device from device.
> + * Callers of this function will have to call vfio_put_device() to
> + * remove the reference.
> + */
> +struct vfio_device *vfio_device_get_from_dev(struct device *dev)
>  {

I cc'd you on a patch that changes vfio_dev_present to fix a bug where
the device to iommu group lookup has been shutdown.  Using it for this
purpose probably has the same bug.  That code is in my next branch and
in the most recent pull request to Linus.

I think instead the code becomes much more simple.  We're registering a
callback for a struct device for which vfio-pci is the driver.  During
driver release we disable that callback for the device.  Thus during the
callback, we know the device is owned by vfio-pci, which means that the
drvdata has what we need.  So I think vfio_device_get_from_dev() simply
becomes:

struct vfio_device *vfio_get_device_from_dev(struct device *dev)
{
   struct vfio_device *device = dev_get_drvdata(dev);

   vfio_device_get(device);

   return device;
}
EXPORT_SYMBOL_GPL(vfio_get_device_from_dev);

Thanks,
Alex

>   struct iommu_group *iommu_group;
>   struct vfio_group *group;
> @@ -651,25 +656,41 @@ static bool vfio_dev_present(struct device *dev)
>  
>   iommu_group = iommu_group_get(dev);
>   if (!iommu_group)
> - return false;
> + return NULL;
>  
>   group = vfio_group_get_from_iommu(iommu_group);
>   if (!group) {
>   iommu_group_put(iommu_group);
> - return false;
> + return NULL;
>   }
>  
>   device = vfio_group_get_device(group, dev);
> - if (!device) {
> - vfio_group_put(group);
> - iommu_group_put(iommu_group);
> - return false;
> - }
> -
> - vfio_device_put(device);
>   vfio_group_put(group);
>   iommu_group_put(iommu_group);
> - return true;
> + return device;
> +}
> +EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
> +
> +/*
> + * Caller must hold a reference to the vfio_device
> + */
> +void *vfio_device_data(struct vfio_device *device)
> +{
> + return device->device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_device_data);
> +
> +/* Test whether a struct device is present in our tracking */
> +static bool vfio_dev_present(struct device *dev)
> +{
> + struct vfio_device *device;
> +
> + device = vfio_device_get_from_dev(dev);
> + if (device) {
> + vfio_device_put(device);
> + return true;
> + } else
> + return false;
>  }
>  
>  /*
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ab9e862..ac8d488 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -45,6 +45,9 @@ extern int vfio_add_group_dev(struct device *dev,
> void *device_data);
>  
>  extern void *vfio_del_group_dev(struct device *dev);
> +extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
> +extern void vfio_device_put(struct vfio_device *device);
> +extern void *vfio_device_data(struct vfio_device *device);
>  
>  /**
>   * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks




Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-18 Thread Alex Williamson

On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> On 13/02/13 04:15, Alex Williamson wrote:
> > On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> >> On 12/02/13 16:07, Alex Williamson wrote:
> >>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >>>> Having this patch in a tree, adding new nodes in sysfs
> >>>> for IOMMU groups is going to be easier.
> >>>>
> >>>> The first candidate for this change is a "dma-window-size"
> >>>> property which tells a size of a DMA window of the specific
> >>>> IOMMU group which can be used later for locked pages accounting.
> >>>
> >>> I'm still churning on this one; I'm nervous this would basically create
> >>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
> >>> iommu driver can add random attributes.  That can get ugly for
> >>> userspace.
> >>
> >> Is not it exactly what sysfs is for (unlike /proc)? :)
> >
> > Um, I hope it's a little more thought out than /proc.
> >
> >>> On the other hand, for the application of userspace knowing how much
> >>> memory to lock for vfio use of a group, it's an appealing location to
> >>> get that information.  Something like libvirt would already be poking
> >>> around here to figure out which devices to bind.  Page limits need to be
> >>> setup prior to use through vfio, so sysfs is more convenient than
> >>> through vfio ioctls.
> >>
> >> True. DMA window properties do not change since boot so sysfs is the right
> >> place to expose them.
> >>
> >>> But then is dma-window-size just a vfio requirement leaking over into
> >>> iommu groups?  Can we allow iommu driver based attributes without giving
> >>> up control of the namespace?  Thanks,
> >>
> >> Who are you asking these questions? :)
> >
> > Anyone, including you.  Rather than dropping misc files in sysfs to
> > describe things about the group, I think the better solution in your
> > case might be a link from the group to an existing sysfs directory
> > describing the PE.  I believe your PE is rooted in a PCI bridge, so that
> > presumably already has a representation in sysfs.  Can the aperture size
> > be determined from something in sysfs for that bridge already?  I'm just
> > not ready to create a grab bag of sysfs entries for a group yet.
> > Thanks,
> 
> 
> At the moment there is no information neither in sysfs nor 
> /proc/device-tree about the dma-window. And adding a sysfs entry per PE 
> (powerpc partitionable end-point which is often a PHB but not always) just 
> for VFIO is quite heavy.

How do you learn the window size and PE extents in the host kernel?

> We could add a ppc64 subfolder under /sys/kernel/iommu/xxx/ and put the 
> "dma-window" property there. And replace it with a symlink when and if we 
> add something for PE later. Would work?

To be clear, you're suggesting /sys/kernel/iommu_groups/$GROUP/xxx/,
right?  A subfolder really only limits the scope of the mess, so it's
not much improvement.  What does the interface look like to make those
subfolders?

The problem we're trying to solve is this call flow:

containerfd = open("/dev/vfio/vfio");
ioctl(containerfd, VFIO_GET_API_VERSION);
ioctl(containerfd, VFIO_CHECK_EXTENSION, ...);
groupfd = open("/dev/vfio/$GROUP");
ioctl(groupfd, VFIO_GROUP_GET_STATUS);
ioctl(groupfd, VFIO_GROUP_SET_CONTAINER, &containerfd);

You wanted to lock all the memory for the DMA window here, before we can
call VFIO_IOMMU_GET_INFO, but does it need to happen there?  We still
have a MAP_DMA hook.  We could do it all on the first mapping.  It also
has a flags field that could augment the behavior to trigger page
locking.  Adding the window size to sysfs seems more readily convenient,
but is it so hard for userspace to open the files and call a couple
ioctls to get far enough to call IOMMU_GET_INFO?  I'm unconvinced the
clutter in sysfs is more than just a quick fix.  Thanks,

Alex
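
For reference, a minimal sketch of how far userspace has to go to reach
VFIO_IOMMU_GET_INFO, continuing the call flow quoted above.  It assumes the
type1 IOMMU backend (VFIO_TYPE1_IOMMU and struct vfio_iommu_type1_info), which
is not the POWER backend under discussion; the function name and the paths
passed to it are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: walks the container/group setup and queries the
 * IOMMU info.  Returns 0 and fills *info on success, -1 on any failure.
 * Assumes the type1 backend; a different backend would need its own
 * extension check and info structure.
 */
int vfio_query_iommu_info(const char *container_path, const char *group_path,
                          struct vfio_iommu_type1_info *info)
{
    struct vfio_group_status status;
    int containerfd, groupfd, ret = -1;

    containerfd = open(container_path, O_RDWR);
    if (containerfd < 0)
        return -1;

    /* The container must speak the API version we were built against. */
    if (ioctl(containerfd, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
        goto out_container;
    /* CHECK_EXTENSION returns > 0 when the backend is supported. */
    if (ioctl(containerfd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) <= 0)
        goto out_container;

    groupfd = open(group_path, O_RDWR);
    if (groupfd < 0)
        goto out_container;

    /* The group is only usable if all of its devices are bound to vfio. */
    memset(&status, 0, sizeof(status));
    status.argsz = sizeof(status);
    if (ioctl(groupfd, VFIO_GROUP_GET_STATUS, &status) < 0 ||
        !(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        goto out_group;

    if (ioctl(groupfd, VFIO_GROUP_SET_CONTAINER, &containerfd) < 0)
        goto out_group;
    /* An IOMMU model must be selected before GET_INFO is available. */
    if (ioctl(containerfd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
        goto out_group;

    memset(info, 0, sizeof(*info));
    info->argsz = sizeof(*info);
    if (ioctl(containerfd, VFIO_IOMMU_GET_INFO, info) == 0)
        ret = 0;

out_group:
    close(groupfd);
out_container:
    close(containerfd);
    return ret;
}
```

A caller would pass "/dev/vfio/vfio" and "/dev/vfio/$GROUP"; both descriptors
are closed before returning, so this only suits a one-shot query.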

> >>>> Signed-off-by: Alexey Kardashevskiy 
> >>>> ---
> >>>>drivers/iommu/iommu.c |   19 ++-
> >>>>include/linux/iommu.h |   20 
> >>>>2 files changed, 22 insertions(+), 17 deletions(-)
> >>>>
> >>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >>>> index b0afd3d..58cc298 100644
> >>>> --- a/drivers/iommu/iommu.c
> >>>> +++ b/drivers/iommu/i

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-19 Thread Alex Williamson

On Tue, 2013-02-19 at 16:48 +1100, Alexey Kardashevskiy wrote:
> On 19/02/13 16:24, Alex Williamson wrote:
> > On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> >> On 13/02/13 04:15, Alex Williamson wrote:
> >>> On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> >>>> On 12/02/13 16:07, Alex Williamson wrote:
> >>>>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >>>>>> Having this patch in a tree, adding new nodes in sysfs
> >>>>>> for IOMMU groups is going to be easier.
> >>>>>>
> >>>>>> The first candidate for this change is a "dma-window-size"
> >>>>>> property which tells a size of a DMA window of the specific
> >>>>>> IOMMU group which can be used later for locked pages accounting.
> >>>>>
> >>>>> I'm still churning on this one; I'm nervous this would basically create
> >>>>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
> >>>>> iommu driver can add random attributes.  That can get ugly for
> >>>>> userspace.
> >>>>
> >>>> Is not it exactly what sysfs is for (unlike /proc)? :)
> >>>
> >>> Um, I hope it's a little more thought out than /proc.
> >>>
> >>>>> On the other hand, for the application of userspace knowing how much
> >>>>> memory to lock for vfio use of a group, it's an appealing location to
> >>>>> get that information.  Something like libvirt would already be poking
> >>>>> around here to figure out which devices to bind.  Page limits need to be
> >>>>> setup prior to use through vfio, so sysfs is more convenient than
> >>>>> through vfio ioctls.
> >>>>
> >>>> True. DMA window properties do not change since boot so sysfs is the 
> >>>> right
> >>>> place to expose them.
> >>>>
> >>>>> But then is dma-window-size just a vfio requirement leaking over into
> >>>>> iommu groups?  Can we allow iommu driver based attributes without giving
> >>>>> up control of the namespace?  Thanks,
> >>>>
> >>>> Who are you asking these questions? :)
> >>>
> >>> Anyone, including you.  Rather than dropping misc files in sysfs to
> >>> describe things about the group, I think the better solution in your
> >>> case might be a link from the group to an existing sysfs directory
> >>> describing the PE.  I believe your PE is rooted in a PCI bridge, so that
> >>> presumably already has a representation in sysfs.  Can the aperture size
> >>> be determined from something in sysfs for that bridge already?  I'm just
> >>> not ready to create a grab bag of sysfs entries for a group yet.
> >>> Thanks,
> >>
> >>
> >> At the moment there is no information in either sysfs or
> >> /proc/device-tree about the dma-window. And adding a sysfs entry per PE
> >> (powerpc partitionable end-point which is often a PHB but not always) just
> >> for VFIO is quite heavy.
> >
> > How do you learn the window size and PE extents in the host kernel?
> 
> 
> When the ppc64 code does PCI scan, it creates iommu_table structs per PE. 
> When new PE (i.e. new iommu_table) is discovered I call iommu_group_alloc() 
> and iommu_group_set_iommudata() to link iommu_group with iommu_table. These 
> iommu_table structs have DMA window properties.
> 
> 
> >> We could add a ppc64 subfolder under /sys/kernel/iommu/xxx/ and put the
> >> "dma-window" property there. And replace it with a symlink when and if we
> >> add something for PE later. Would work?
> >
> > To be clear, you're suggesting /sys/kernel/iommu_groups/$GROUP/xxx/,
> > right?  A subfolder really only limits the scope of the mess, so it's
> > not much improvement.
> 
> You suggested some symlink to some ppc64 or pci tree in sysfs, it is not 
> that different.
> 
> > What does the interface look like to make those
> > subfolders?
> 
> int iommu_group_create_platform_file(struct iommu_group *group,
>   struct iommu_group_attribute *attr)
> 
> and that's it.
> 
> > The problem we're trying to solve is this call flow:
> >
> > containerfd = open("/dev/vfio/vfio");
> > ioctl(containerfd, VFIO_GET_API_VERSION);
> > ioctl(co

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-19 Thread Alex Williamson

On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
> On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
> > On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> > > On 13/02/13 04:15, Alex Williamson wrote:
> > > > On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> > > >> On 12/02/13 16:07, Alex Williamson wrote:
> > > >>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> > > >>>> Having this patch in a tree, adding new nodes in sysfs
> > > >>>> for IOMMU groups is going to be easier.
> > > >>>>
> > > >>>> The first candidate for this change is a "dma-window-size"
> > > >>>> property which tells a size of a DMA window of the specific
> > > >>>> IOMMU group which can be used later for locked pages accounting.
> > > >>>
> > > >>> I'm still churning on this one; I'm nervous this would basically create
> > > >>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
> > > >>> iommu driver can add random attributes.  That can get ugly for
> > > >>> userspace.
> > > >>
> > > >> Is not it exactly what sysfs is for (unlike /proc)? :)
> > > >
> > > > Um, I hope it's a little more thought out than /proc.
> > > >
> > > >>> On the other hand, for the application of userspace knowing how much
> > > >>> memory to lock for vfio use of a group, it's an appealing location to
> > > >>> get that information.  Something like libvirt would already be poking
> > > >>> around here to figure out which devices to bind.  Page limits need to 
> > > >>> be
> > > >>> setup prior to use through vfio, so sysfs is more convenient than
> > > >>> through vfio ioctls.
> > > >>
> > > >> True. DMA window properties do not change since boot so sysfs is the 
> > > >> right
> > > >> place to expose them.
> > > >>
> > > >>> But then is dma-window-size just a vfio requirement leaking over into
> > > >>> iommu groups?  Can we allow iommu driver based attributes without 
> > > >>> giving
> > > >>> up control of the namespace?  Thanks,
> > > >>
> > > >> Who are you asking these questions? :)
> > > >
> > > > Anyone, including you.  Rather than dropping misc files in sysfs to
> > > > describe things about the group, I think the better solution in your
> > > > case might be a link from the group to an existing sysfs directory
> > > > describing the PE.  I believe your PE is rooted in a PCI bridge, so that
> > > > presumably already has a representation in sysfs.  Can the aperture size
> > > > be determined from something in sysfs for that bridge already?  I'm just
> > > > not ready to create a grab bag of sysfs entries for a group yet.
> > > > Thanks,
> > > 
> > > 
> > > At the moment there is no information in either sysfs or
> > > /proc/device-tree about the dma-window. And adding a sysfs entry per PE 
> > > (powerpc partitionable end-point which is often a PHB but not always) 
> > > just 
> > > for VFIO is quite heavy.
> > 
> > How do you learn the window size and PE extents in the host kernel?
> > 
> > > We could add a ppc64 subfolder under /sys/kernel/iommu/xxx/ and put the 
> > > "dma-window" property there. And replace it with a symlink when and if we 
> > > add something for PE later. Would work?
> 
> Fwiw, I'd suggest a subfolder named for the type of IOMMU, rather than
> "ppc64".
> 
> > To be clear, you're suggesting /sys/kernel/iommu_groups/$GROUP/xxx/,
> > right?  A subfolder really only limits the scope of the mess, so it's
> > not much improvement.  What does the interface look like to make those
> > subfolders?
> > 
> > The problem we're trying to solve is this call flow:
> > 
> > containerfd = open("/dev/vfio/vfio");
> > ioctl(containerfd, VFIO_GET_API_VERSION);
> > ioctl(containerfd, VFIO_CHECK_EXTENSION, ...);
> > groupfd = open("/dev/vfio/$GROUP");
> > ioctl(groupfd, VFIO_GROUP_GET_STATUS);
> > ioctl(groupfd, VFIO_GROUP_SET_CONTAINER, &containerfd);
> > 
> > You wanted to lock all the memory for the DMA window here

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-19 Thread Alex Williamson

On Wed, 2013-02-20 at 13:31 +1100, Alexey Kardashevskiy wrote:
> On 20/02/13 07:11, Alex Williamson wrote:
> > On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
> >> On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
> >>> On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> >>>> On 13/02/13 04:15, Alex Williamson wrote:
> >>>>> On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> >>>>>> On 12/02/13 16:07, Alex Williamson wrote:
> >>>>>>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> Having this patch in a tree, adding new nodes in sysfs
> >>>>>>>> for IOMMU groups is going to be easier.
> >>>>>>>>
> >>>>>>>> The first candidate for this change is a "dma-window-size"
> >>>>>>>> property which tells a size of a DMA window of the specific
> >>>>>>>> IOMMU group which can be used later for locked pages accounting.
> >>>>>>>
> >>>>>>> I'm still churning on this one; I'm nervous this would basically create
> >>>>>>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
> >>>>>>> iommu driver can add random attributes.  That can get ugly for
> >>>>>>> userspace.
> >>>>>>
> >>>>>> Is not it exactly what sysfs is for (unlike /proc)? :)
> >>>>>
> >>>>> Um, I hope it's a little more thought out than /proc.
> >>>>>
> >>>>>>> On the other hand, for the application of userspace knowing how much
> >>>>>>> memory to lock for vfio use of a group, it's an appealing location to
> >>>>>>> get that information.  Something like libvirt would already be poking
> >>>>>>> around here to figure out which devices to bind.  Page limits need to 
> >>>>>>> be
> >>>>>>> setup prior to use through vfio, so sysfs is more convenient than
> >>>>>>> through vfio ioctls.
> >>>>>>
> >>>>>> True. DMA window properties do not change since boot so sysfs is the 
> >>>>>> right
> >>>>>> place to expose them.
> >>>>>>
> >>>>>>> But then is dma-window-size just a vfio requirement leaking over into
> >>>>>>> iommu groups?  Can we allow iommu driver based attributes without 
> >>>>>>> giving
> >>>>>>> up control of the namespace?  Thanks,
> >>>>>>
> >>>>>> Who are you asking these questions? :)
> >>>>>
> >>>>> Anyone, including you.  Rather than dropping misc files in sysfs to
> >>>>> describe things about the group, I think the better solution in your
> >>>>> case might be a link from the group to an existing sysfs directory
> >>>>> describing the PE.  I believe your PE is rooted in a PCI bridge, so that
> >>>>> presumably already has a representation in sysfs.  Can the aperture size
> >>>>> be determined from something in sysfs for that bridge already?  I'm just
> >>>>> not ready to create a grab bag of sysfs entries for a group yet.
> >>>>> Thanks,
> >>>>
> >>>>
> >>>> At the moment there is no information in either sysfs or
> >>>> /proc/device-tree about the dma-window. And adding a sysfs entry per PE
> >>>> (powerpc partitionable end-point which is often a PHB but not always) 
> >>>> just
> >>>> for VFIO is quite heavy.
> >>>
> >>> How do you learn the window size and PE extents in the host kernel?
> >>>
> >>>> We could add a ppc64 subfolder under /sys/kernel/iommu/xxx/ and put the
> >>>> "dma-window" property there. And replace it with a symlink when and if we
> >>>> add something for PE later. Would work?
> >>
> >> Fwiw, I'd suggest a subfolder named for the type of IOMMU, rather than
> >> "ppc64".
> >>
> >>> To be clear, you're suggesting /sys/kernel/iommu_groups/$GROUP/xxx/,
> >>> right?  A subfolder really only limits the scope of the mess, so it's
> >>> not much improvement.  What does the inte

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-19 Thread Alex Williamson

On Wed, 2013-02-20 at 15:20 +1100, Alexey Kardashevskiy wrote:
> On 20/02/13 14:47, Alex Williamson wrote:
> > On Wed, 2013-02-20 at 13:31 +1100, Alexey Kardashevskiy wrote:
> >> On 20/02/13 07:11, Alex Williamson wrote:
> >>> On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
> >>>> On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
> >>>>> On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> >>>>>> On 13/02/13 04:15, Alex Williamson wrote:
> >>>>>>> On Wed, 2013-02-13 at 01:42 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 12/02/13 16:07, Alex Williamson wrote:
> >>>>>>>>> On Tue, 2013-02-12 at 15:06 +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> Having this patch in a tree, adding new nodes in sysfs
> >>>>>>>>>> for IOMMU groups is going to be easier.
> >>>>>>>>>>
> >>>>>>>>>> The first candidate for this change is a "dma-window-size"
> >>>>>>>>>> property which tells a size of a DMA window of the specific
> >>>>>>>>>> IOMMU group which can be used later for locked pages accounting.
> >>>>>>>>>
> >>>>>>>>> I'm still churning on this one; I'm nervous this would basically 
> >>>>>>>>> create
> >>>>>>>>> a /proc free-for-all under /sys/kernel/iommu_group/$GROUP/ where any
> >>>>>>>>> iommu driver can add random attributes.  That can get ugly for
> >>>>>>>>> userspace.
> >>>>>>>>
> >>>>>>>> Is not it exactly what sysfs is for (unlike /proc)? :)
> >>>>>>>
> >>>>>>> Um, I hope it's a little more thought out than /proc.
> >>>>>>>
> >>>>>>>>> On the other hand, for the application of userspace knowing how much
> >>>>>>>>> memory to lock for vfio use of a group, it's an appealing location 
> >>>>>>>>> to
> >>>>>>>>> get that information.  Something like libvirt would already be 
> >>>>>>>>> poking
> >>>>>>>>> around here to figure out which devices to bind.  Page limits need 
> >>>>>>>>> to be
> >>>>>>>>> setup prior to use through vfio, so sysfs is more convenient than
> >>>>>>>>> through vfio ioctls.
> >>>>>>>>
> >>>>>>>> True. DMA window properties do not change since boot so sysfs is the 
> >>>>>>>> right
> >>>>>>>> place to expose them.
> >>>>>>>>
> >>>>>>>>> But then is dma-window-size just a vfio requirement leaking over 
> >>>>>>>>> into
> >>>>>>>>> iommu groups?  Can we allow iommu driver based attributes without 
> >>>>>>>>> giving
> >>>>>>>>> up control of the namespace?  Thanks,
> >>>>>>>>
> >>>>>>>> Who are you asking these questions? :)
> >>>>>>>
> >>>>>>> Anyone, including you.  Rather than dropping misc files in sysfs to
> >>>>>>> describe things about the group, I think the better solution in your
> >>>>>>> case might be a link from the group to an existing sysfs directory
> >>>>>>> describing the PE.  I believe your PE is rooted in a PCI bridge, so 
> >>>>>>> that
> >>>>>>> presumably already has a representation in sysfs.  Can the aperture 
> >>>>>>> size
> >>>>>>> be determined from something in sysfs for that bridge already?  I'm 
> >>>>>>> just
> >>>>>>> not ready to create a grab bag of sysfs entries for a group yet.
> >>>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> At the moment there is no information in either sysfs or
> >>>>>> /proc/device-tree about the dma-window. And adding a sysfs entry per PE
> >>>>>> (powerpc partitionable end-point which is often a PHB but not always) 

Re: [PATCH] iommu: making IOMMU sysfs nodes API public

2013-02-21 Thread Alex Williamson

On Fri, 2013-02-22 at 11:04 +1100, David Gibson wrote:
> On Tue, Feb 19, 2013 at 01:11:51PM -0700, Alex Williamson wrote:
> > On Tue, 2013-02-19 at 18:38 +1100, David Gibson wrote:
> > > On Mon, Feb 18, 2013 at 10:24:00PM -0700, Alex Williamson wrote:
> > > > On Mon, 2013-02-18 at 17:15 +1100, Alexey Kardashevskiy wrote:
> [snip]
> > > >  Adding the window size to sysfs seems more readily convenient,
> > > > but is it so hard for userspace to open the files and call a couple
> > > > ioctls to get far enough to call IOMMU_GET_INFO?  I'm unconvinced the
> > > > clutter in sysfs is more than just a quick fix.  Thanks,
> > > 
> > > And finally, as Alexey points out, isn't the point here so we know how
> > > much rlimit to give qemu?  Using ioctls we'd need a special tool just
> > > to check the dma window sizes, which seems a bit hideous.
> > 
> > Is it more hideous than using iommu groups to report a vfio imposed
> > restriction?  Are a couple open files and a handful of ioctls worse than
> > code to parse directory entries and the future maintenance of an
> > unrestricted grab bag of sysfs entries?
> 
> The fact that the memory is locked is a vfio restriction, but the
> actual dma window size is, genuinely, a property of the group.

A group is an association of devices based on isolation and visibility.
The dma window happens to be associated with a group on your platform,
but that's not always the case.  This is why I was hoping something in
sysfs already reported the dma window so that we could point to it
rather than creating an interface where it doesn't really belong.
Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] igb: Fix null pointer dereference

2013-03-12 Thread Alex Williamson

The max_vfs= option has always been self limiting to the number of VFs
supported by the device.  fa44f2f1 added SR-IOV configuration via
sysfs, but in the process broke this self correction factor.  The
failing path is:

igb_probe
  igb_sw_init
    if (max_vfs > 7) {
        adapter->vfs_allocated_count = 7;
    ...
    igb_probe_vfs
    igb_enable_sriov(, max_vfs)
      if (num_vfs > 7) {
        err = -EPERM;
        ...

This leaves vfs_allocated_count = 7 and vf_data = NULL, so we bomb out
when igb_probe finally calls igb_reset.  It seems like a really bad
idea, and somewhat pointless, to set vfs_allocated_count separate from
vf_data, but limiting max_vfs is enough to avoid the null pointer.

Signed-off-by: Alex Williamson 
---
 drivers/net/ethernet/intel/igb/igb_main.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index ed79a1c..d5b8289 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2656,7 +2656,7 @@ static int igb_sw_init(struct igb_adapter *adapter)
if (max_vfs > 7) {
dev_warn(&pdev->dev,
 "Maximum of 7 VFs per PF, using max\n");
-   adapter->vfs_allocated_count = 7;
+   max_vfs = adapter->vfs_allocated_count = 7;
} else
adapter->vfs_allocated_count = max_vfs;
if (adapter->vfs_allocated_count)
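
A small userspace model may make the failing path easier to follow.  This is
a sketch, not driver code: the struct, the function names and the clamp_param
switch are all hypothetical, and only the "reject above 7" and "clamp to 7"
behaviour is taken from the description above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAX_VFS 7	/* per-PF limit quoted in the commit message */

/* Hypothetical stand-in for the two adapter fields that must agree. */
struct model_adapter {
	unsigned int vfs_allocated_count;
	int *vf_data;		/* stands in for the per-VF state array */
};

/* Models the check described for igb_enable_sriov(): reject, don't clamp. */
int model_enable_sriov(struct model_adapter *a, unsigned int num_vfs)
{
	if (num_vfs > MODEL_MAX_VFS)
		return -EPERM;
	if (num_vfs == 0)
		return 0;
	a->vf_data = calloc(num_vfs, sizeof(*a->vf_data));
	return a->vf_data ? 0 : -ENOMEM;
}

/*
 * Models igb_sw_init(): clamp_param == 0 is the pre-patch behaviour
 * (only the count is clamped), clamp_param != 0 is the post-patch
 * behaviour (the module parameter is clamped as well).
 */
int model_sw_init(struct model_adapter *a, unsigned int *max_vfs,
		  int clamp_param)
{
	if (*max_vfs > MODEL_MAX_VFS) {
		a->vfs_allocated_count = MODEL_MAX_VFS;
		if (clamp_param)
			*max_vfs = MODEL_MAX_VFS;
	} else {
		a->vfs_allocated_count = *max_vfs;
	}
	return model_enable_sriov(a, *max_vfs);
}

/* The inconsistent state that a later reset would dereference. */
int model_state_is_broken(const struct model_adapter *a)
{
	return a->vfs_allocated_count > 0 && a->vf_data == NULL;
}
```

With a request of 32 VFs, the pre-patch path returns -EPERM and leaves a
non-zero count with a NULL array; the post-patch path clamps the request to 7
and allocates the array.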


Re: [PATCH v7 0/3] AER-KVM: Error containment of VFIO devices assigned to KVM guests

2013-03-12 Thread Alex Williamson

On Sat, 2013-03-09 at 01:52 -0600, Vijay Mohan Pandarathil wrote:
> Add support for error containment when a VFIO device assigned to a KVM
> guest encounters an error. This is for PCIe devices/drivers that support AER
> functionality. When the host OS is notified of an error in a device either
> through the firmware first approach or through an interrupt handled by the AER
> root port driver, the error handler registered by the vfio-pci driver gets
> invoked. The qemu process is signaled through an eventfd registered per
> VFIO device by the qemu process. In the eventfd handler, qemu decides on
> what action to take. In this implementation, guest is brought down to
> contain the error.
> 
> 
> v7:
>  - Rebased to latest upstream
>  - Used device_lock() for synchronising err_trigger access
> v6:
>  - Rebased to latest upstream
>  - Resolved merge conflict with vfio_dev_present()
> v5:
>  - Rebased to latest upstream stable bits
>  - Incorporated v4 feedback
> v4:
>  - Stop the guest instead of terminating
>  - Remove unwanted returns from functions
>  - Incorporate other feedback
> v3:
>  - Removed PCI_AER* flags from device info ioctl.
>  - Incorporated feedback
> v2:
>  - Rebased to latest upstream stable bits
>  - Changed the new ioctl to be part of VFIO_SET_IRQs ioctl
>  - Added a new patch to get/put reference to a vfio device from struct device
>  - Incorporated all other feedback.
> 
> ---
> 
> Vijay Mohan Pandarathil(3):
> 
> [PATCH 1/3] VFIO: Wrapper to get reference to vfio_device from device 
> [PATCH 2/3] VFIO-AER: Vfio-pci driver changes for supporting AER
> [PATCH 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
> 
> Kernel files changed
> 
>  drivers/vfio/vfio.c  | 30 +-
>  include/linux/vfio.h |  3 +++
>  2 files changed, 32 insertions(+), 1 deletion(-)
> 
>  drivers/vfio/pci/vfio_pci.c | 44 -
>  drivers/vfio/pci/vfio_pci_intrs.c   | 64 +
>  drivers/vfio/pci/vfio_pci_private.h |  1 +
>  include/uapi/linux/vfio.h   |  1 +
>  4 files changed, 109 insertions(+), 1 deletion(-)

Applied 1 & 2 to my next tree, patch 3 is qemu and depends on the vfio.h
change getting into mainline so will need to be re-sent once that
happens.  Thanks,

Alex

> Qemu files changed
> 
>  hw/vfio_pci.c  | 123 +
>  linux-headers/linux/vfio.h |   1 +
>  2 files changed, 124 insertions(+)




[PATCH] igb: SR-IOV init reordering

2013-03-12 Thread Alex Williamson

igb is ineffective at setting a lower total VFs because:

int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
{
...
/* Shouldn't change if VFs already enabled */
if (dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE)
return -EBUSY;

Swap init ordering.

Signed-off-by: Alex Williamson 
---
 drivers/net/ethernet/intel/igb/igb_main.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index d5b8289..ac65c39 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2546,8 +2546,8 @@ static void igb_probe_vfs(struct igb_adapter *adapter)
if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211))
return;
 
-   igb_enable_sriov(pdev, max_vfs);
pci_sriov_set_totalvfs(pdev, 7);
+   igb_enable_sriov(pdev, max_vfs);
 
 #endif /* CONFIG_PCI_IOV */
 }
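
The ordering problem can be shown with a toy model.  This is a sketch under
stated assumptions: the struct and the toy_* functions are hypothetical
stand-ins, and only the -EBUSY check is taken from the
pci_sriov_set_totalvfs() excerpt above.

```c
#include <errno.h>

#define TOY_CTRL_VFE 0x01	/* stands in for PCI_SRIOV_CTRL_VFE */

/* Hypothetical stand-in for the SR-IOV state of a physical function. */
struct toy_sriov {
	unsigned int ctrl;		/* control bits, VFE = VFs enabled */
	unsigned int total_vfs;		/* limit advertised by the hardware */
	unsigned int driver_max;	/* driver-imposed cap, 0 = none set */
};

/* Models the quoted check: the cap cannot change once VFs are enabled. */
int toy_set_totalvfs(struct toy_sriov *s, unsigned int numvfs)
{
	if (numvfs > s->total_vfs)
		return -EINVAL;
	if (s->ctrl & TOY_CTRL_VFE)
		return -EBUSY;
	s->driver_max = numvfs;
	return 0;
}

/* Models enabling VFs: a non-zero request sets the VFE bit. */
int toy_enable_sriov(struct toy_sriov *s, unsigned int nr_vfs)
{
	if (nr_vfs == 0)
		return 0;
	s->ctrl |= TOY_CTRL_VFE;
	return 0;
}
```

Enabling first and capping second gets -EBUSY and leaves the cap unset;
capping first and enabling second records the cap of 7.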


Re: [PATCH] vfio: include <linux/slab.h> for kmalloc

2013-03-15 Thread Alex Williamson

On Thu, 2013-03-14 at 22:56 +0100, Arnd Bergmann wrote:
> The vfio drivers call kmalloc or kzalloc, but do not
> include <linux/slab.h>, which causes build errors on
> ARM.
> 
> Signed-off-by: Arnd Bergmann 
> Cc: Alex Williamson 
> Cc: k...@vger.kernel.org
> ---
> Please apply for 3.9

Applied, Thanks,

Alex

>  drivers/vfio/pci/vfio_pci_config.c | 1 +
>  drivers/vfio/pci/vfio_pci_intrs.c  | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 964ff22..aeb00fc 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/slab.h>
>  
>  #include "vfio_pci_private.h"
>  
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3639371..a965091 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/slab.h>
>  
>  #include "vfio_pci_private.h"
>  




Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-16 Thread Alex Williamson

On Sat, 2013-03-16 at 18:03 -0700, Greg KH wrote:
> On Sat, Mar 16, 2013 at 05:50:53PM -0600, Myron Stowe wrote:
> > On Sat, 2013-03-16 at 15:11 -0700, Greg KH wrote:
> > > On Sat, Mar 16, 2013 at 03:35:19PM -0600, Myron Stowe wrote:
> > > > Sysfs includes entries to memory that backs a PCI device's BARs, both
> > > > I/O Port space and MMIO.  These memory regions correspond to the device's
> > > > internal status and control registers used to drive the device.
> > > > 
> > > > Accessing these registers from userspace, such as with "udevadm info
> > > > --attribute-walk --path=/sys/devices/...", cannot be allowed as
> > > > such accesses outside of the driver, even just reading, can yield
> > > > catastrophic consequences.
> > > > 
> > > > Udevadm-info skips parsing a specific set of sysfs entries including
> > > > 'resource'.  This patch extends the set to include the additional
> > > > 'resource' entries that correspond to a PCI device's BARs.
> > > 
> > > Nice, are you also going to patch bash to prevent a user from reading
> > > these sysfs files as well?  :)
> > > 
> > > And pciutils?
> > > 
> > > You get my point here, right?  The root user just asked to read all of
> > > the data for this device, so why wouldn't you allow it?  Just like
> > > 'lspci' does.  Or bash does.
> > 
> > Yes :P , you raise a very good point, there are a lot of ways a user can
> > poke around in those BARs.  However, there is a difference between
> > shooting yourself in the foot and getting what you deserve versus
> > unknowingly executing a common command such as udevadm and having the
> > system hang.
> > > 
> > > If this hardware has a problem, then it needs to be fixed in the kernel,
> > > not have random band-aids added to various userspace programs to paper
> > > over the root problem here.  Please fix the kernel driver and all should
> > > be fine.  No need to change udevadm.
> > 
> > Xiangliang initially proposed a patch within the PCI core.  Ignoring the
> > specific issue with the proposal which I pointed out in the
> > https://lkml.org/lkml/2013/3/7/242 thread, that just doesn't seem like
> > the right place to effect a change either as PCI's core isn't concerned
> > with the contents or access limitations of those regions, those are
> > issues that the driver concerns itself with.
> > 
> > So things seem to be gravitating towards the driver.  I'm fairly
> > ignorant of this area but as Robert succinctly pointed out in the
> > originating thread - the AHCI driver only uses the device's MMIO region.
> > The I/O related regions are for legacy SFF-compatible ATA ports and are
> > not used to drive the device.  This, coupled with the observation that
> > userspace accesses such as udevadm, and others like you additionally
> > point out, do not filter through the device's driver, seems to
> > suggest that changes to the driver will not help here either.
> 
> A PCI quirk should handle this properly, right?  Why not do that?  Worst
> case, the quirk could just not expose these sysfs files for this
> device, which would solve all userspace program issues, right?

Not exactly.  I/O port access through pci-sysfs was added for userspace
programs, specifically qemu-kvm device assignment.  We use the I/O port
resource# files to access device owned I/O port registers using file
permissions rather than global permissions such as iopl/ioperm.  File
permissions also prevent random users from accessing device registers
through these files, but of course can't stop a privileged app that
chooses to ignore the purpose of these files.  A quirk would therefore
remove a file that actually has a useful purpose for one app just so
another app that has no particular reason for dumping the contents can
run unabated.  Thanks,

Alex
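
As an illustration of the access model described above, here is a minimal
sketch of reading one byte of an I/O port BAR through its pci-sysfs resourceN
file, gated only by the file's permissions.  The helper name is hypothetical,
and the sketch assumes the file offset is relative to the start of the BAR.
As the thread makes clear, reading device registers this way can hang a
machine, so it should only be pointed at a device the caller owns.

```c
#define _XOPEN_SOURCE 700	/* for pread() under strict standard modes */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one byte at `offset` within an I/O port BAR
 * via its sysfs resourceN file.  Access is controlled by the permissions
 * on that file rather than by iopl()/ioperm().
 * Returns 0 on success, -1 on failure.
 */
int sysfs_ioport_read8(const char *resource_path, off_t offset, uint8_t *val)
{
	ssize_t n;
	int fd;

	fd = open(resource_path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* A single-byte pread is treated here as one port read. */
	n = pread(fd, val, sizeof(*val), offset);
	close(fd);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

A caller would pass something like /sys/bus/pci/devices/<address>/resource4,
where both the device address and the BAR number are placeholders.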


Re: [PATCH] udevadm-info: Don't access sysfs 'resource' files

2013-03-17 Thread Alex Williamson

On Sat, 2013-03-16 at 22:36 -0700, Greg KH wrote:
> On Sat, Mar 16, 2013 at 10:11:22PM -0600, Alex Williamson wrote:
> > On Sat, 2013-03-16 at 18:03 -0700, Greg KH wrote:
> > > On Sat, Mar 16, 2013 at 05:50:53PM -0600, Myron Stowe wrote:
> > > > On Sat, 2013-03-16 at 15:11 -0700, Greg KH wrote:
> > > > > On Sat, Mar 16, 2013 at 03:35:19PM -0600, Myron Stowe wrote:
> > > > > > Sysfs includes entries to memory that backs a PCI device's BARs,
> > > > > > both I/O Port space and MMIO.  These memory regions correspond to the
> > > > > > device's internal status and control registers used to drive the device.
> > > > > > 
> > > > > > Accessing these registers from userspace, such as with "udevadm info
> > > > > > --attribute-walk --path=/sys/devices/...", cannot be allowed as
> > > > > > such accesses outside of the driver, even just reading, can yield
> > > > > > catastrophic consequences.
> > > > > > 
> > > > > > Udevadm-info skips parsing a specific set of sysfs entries including
> > > > > > 'resource'.  This patch extends the set to include the additional
> > > > > > 'resource' entries that correspond to a PCI device's BARs.
> > > > > 
> > > > > Nice, are you also going to patch bash to prevent a user from reading
> > > > > these sysfs files as well?  :)
> > > > > 
> > > > > And pciutils?
> > > > > 
> > > > > You get my point here, right?  The root user just asked to read all of
> > > > > the data for this device, so why wouldn't you allow it?  Just like
> > > > > 'lspci' does.  Or bash does.
> > > > 
> > > > Yes :P , you raise a very good point, there are a lot of ways a user can
> > > > poke around in those BARs.  However, there is a difference between
> > > > shooting yourself in the foot and getting what you deserve versus
> > > > unknowingly executing a common command such as udevadm and having the
> > > > system hang.
> > > > > 
> > > > > > If this hardware has a problem, then it needs to be fixed in the kernel,
> > > > > > not have random band-aids added to various userspace programs to paper
> > > > > > over the root problem here.  Please fix the kernel driver and all should
> > > > > > be fine.  No need to change udevadm.
> > > > 
> > > > Xiangliang initially proposed a patch within the PCI core.  Ignoring the
> > > > specific issue with the proposal which I pointed out in the
> > > > https://lkml.org/lkml/2013/3/7/242 thread, that just doesn't seem like
> > > > the right place to effect a change either as PCI's core isn't concerned
> > > > with the contents or access limitations of those regions, those are
> > > > issues that the driver concerns itself with.
> > > > 
> > > > So things seem to be gravitating towards the driver.  I'm fairly
> > > > ignorant of this area but as Robert succinctly pointed out in the
> > > > originating thread - the AHCI driver only uses the device's MMIO region.
> > > > The I/O related regions are for legacy SFF-compatible ATA ports and are
> > > > not used to drive the device.  This, coupled with the observation that
> > > > userspace accesses such as udevadm, and others like you additionally
> > > > point out, do not filter through the device's driver, seems to
> > > > suggest that changes to the driver will not help here either.
> > > 
> > > A PCI quirk should handle this properly, right?  Why not do that?  Worst
> > > case, the quirk could just not expose these sysfs files for this
> > > device, which would solve all userspace program issues, right?
> > 
> > Not exactly.  I/O port access through pci-sysfs was added for userspace
> > programs, specifically qemu-kvm device assignment.  We use the I/O port
> > resource# files to access device owned I/O port registers using file
> > permissions rather than global permissions such as iopl/ioperm.  File
> > permissions also prevent random users from accessing device registers
> > through these files, but of course can't stop a privileged app that
> > chooses to ignore the purpose of these files.  A quirk would therefore
> > remove a file that actually has a useful purpose for one app just so
> > another app that has no particular reason for dumping the contents can
> > run unabated.  Thanks,
> 
> The quirk would only be for this one specific device, which obviously
> can't handle this type of access, so why would you want the sysfs files
> even present for it at all?

I'm assuming that the device only breaks because udevadm is dumping the
full I/O port register space of the device and that if an actual driver
was interacting with it through this interface that it would work.  Who
knows how many devices will have read side-effects by udevadm blindly
dumping these files.  Thanks,

Alex
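
For illustration, the intended use of the resource# files described above is a program that owns the device reading one register at a known offset, rather than dumping the file from start to end.  What follows is only a sketch under stated assumptions: read_bar_reg() is a hypothetical helper, and it assumes the file accepts 1, 2 or 4 byte accesses at an offset relative to the start of the BAR and returns the value in host byte order.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one register of `width` bytes (1, 2 or 4)
 * at `offset` within an already-open resource# file descriptor.
 * Returns 0 on success and -1 on a bad width or a failed/short read.
 */
static int read_bar_reg(int fd, off_t offset, size_t width, uint32_t *val)
{
	unsigned char buf[4];
	uint16_t w;

	if (width != 1 && width != 2 && width != 4)
		return -1;

	/* A single positioned read, so only this register is touched. */
	if (pread(fd, buf, width, offset) != (ssize_t)width)
		return -1;

	if (width == 1) {
		*val = buf[0];
	} else if (width == 2) {
		memcpy(&w, buf, sizeof(w));	/* host byte order (assumed) */
		*val = w;
	} else {
		memcpy(val, buf, sizeof(*val));
	}
	return 0;
}
```

A caller would open the file read-only (the exact path under /sys depends on the device and BAR, so none is given here) and then ask for exactly the register it needs, for example read_bar_reg(fd, 0x4, 1, &val).  The contrast with a blind dump is that nothing outside the requested register is ever read.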


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/2] VFIO PCI INTx fixes

2012-10-04 Thread Alex Williamson

These patches are now available in my next tree to fix a couple
issues with PCI INTx.  Thanks,

Alex

---

Alex Williamson (2):
  vfio: Fix PCI INTx disable consistency
  vfio: Move PCI INTx eventfd setting earlier


 drivers/vfio/pci/vfio_pci_intrs.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

[PATCH 1/2] vfio: Move PCI INTx eventfd setting earlier

2012-10-04 Thread Alex Williamson

We need to be ready to receive an interrupt as soon as we call
request_irq, so our eventfd context setting needs to be moved
earlier.  Without this, an interrupt from our device or one
sharing the interrupt line can pass a NULL into eventfd_signal
and oops.

Cc: sta...@vger.kernel.org
Signed-off-by: Alex Williamson 
---

 drivers/vfio/pci/vfio_pci_intrs.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index d8dedc7..c8139a5 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -400,19 +400,20 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
 		return PTR_ERR(trigger);
 	}
 
+	vdev->ctx[0].trigger = trigger;
+
 	if (!vdev->pci_2_3)
 		irqflags = 0;
 
 	ret = request_irq(pdev->irq, vfio_intx_handler,
 			  irqflags, vdev->ctx[0].name, vdev);
 	if (ret) {
+		vdev->ctx[0].trigger = NULL;
 		kfree(vdev->ctx[0].name);
 		eventfd_ctx_put(trigger);
 		return ret;
 	}
 
-	vdev->ctx[0].trigger = trigger;
-
 	/*
 	 * INTx disable will stick across the new irq setup,
 	 * disable_irq won't.
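
The ordering problem this patch addresses can be modelled in plain userspace C.  The following is a simplified, single-threaded sketch and not the driver code: every type and function name in it is a hypothetical stand-in, and model_request_irq() merely imitates the fact that a handler on a shared line may run before request_irq() returns.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct eventfd_ctx and vdev->ctx[0]. */
struct trigger_model { unsigned long signals; };
struct intx_ctx_model { struct trigger_model *trigger; };

/* Models a shared interrupt line that may already have an interrupt pending. */
struct irq_line_model { int pending; int saw_null_trigger; };

/* Models the interrupt handler: signal whatever trigger is published. */
static void model_handler(struct irq_line_model *irq, struct intx_ctx_model *ctx)
{
	if (!ctx->trigger) {
		irq->saw_null_trigger = 1;	/* the kernel would oops here */
		return;
	}
	ctx->trigger->signals++;
}

/* Models request_irq(): the handler can run before this returns. */
static int model_request_irq(struct irq_line_model *irq,
			     struct intx_ctx_model *ctx, int fail)
{
	if (fail)
		return -1;
	if (irq->pending)
		model_handler(irq, ctx);
	return 0;
}

/* Ordering before the patch: trigger published after registration. */
static int set_signal_old(struct irq_line_model *irq, struct intx_ctx_model *ctx,
			  struct trigger_model *trigger, int fail)
{
	if (model_request_irq(irq, ctx, fail))
		return -1;
	ctx->trigger = trigger;
	return 0;
}

/* Ordering after the patch: publish first, roll back on failure. */
static int set_signal_new(struct irq_line_model *irq, struct intx_ctx_model *ctx,
			  struct trigger_model *trigger, int fail)
{
	ctx->trigger = trigger;
	if (model_request_irq(irq, ctx, fail)) {
		ctx->trigger = NULL;
		return -1;
	}
	return 0;
}
```

With an interrupt already pending, set_signal_old() lets the handler observe a NULL trigger, while set_signal_new() does not, and the failure path leaves the trigger cleared again.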


[PATCH 2/2] vfio: Fix PCI INTx disable consistency

2012-10-04 Thread Alex Williamson

The virq_disabled flag tracks the userspace view of INTx masking
across interrupt mode changes, but we're not consistently applying
this to the interrupt and masking handler notion of the device.
Currently if the user sets DisINTx while in MSI or MSIX mode, then
returns to INTx mode (e.g. rebooting a qemu guest), the hardware has
DisINTx+, but the management of INTx thinks it's enabled, making it
impossible to actually clear DisINTx.  Fix this by updating the
handler state when INTx is re-enabled.

Cc: sta...@vger.kernel.org
Signed-off-by: Alex Williamson 
---

 drivers/vfio/pci/vfio_pci_intrs.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index c8139a5..3639371 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -366,6 +366,17 @@ static int vfio_intx_enable(struct vfio_pci_device *vdev)
 		return -ENOMEM;
 
 	vdev->num_ctx = 1;
+
+	/*
+	 * If the virtual interrupt is masked, restore it.  Devices
+	 * supporting DisINTx can be masked at the hardware level
+	 * here, non-PCI-2.3 devices will have to wait until the
+	 * interrupt is enabled.
+	 */
+	vdev->ctx[0].masked = vdev->virq_disabled;
+	if (vdev->pci_2_3)
+		pci_intx(vdev->pdev, !vdev->ctx[0].masked);
+
 	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
 
 	return 0;
@@ -419,7 +430,7 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
 	 * disable_irq won't.
 	 */
 	spin_lock_irqsave(&vdev->irqlock, flags);
-	if (!vdev->pci_2_3 && (vdev->ctx[0].masked || vdev->virq_disabled))
+	if (!vdev->pci_2_3 && vdev->ctx[0].masked)
 		disable_irq_nosync(pdev->irq);
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
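
The state mismatch described in the commit message can be sketched as a small userspace model.  This is a deliberate simplification with hypothetical names, not the driver logic: it assumes that clearing DisINTx only reaches the hardware when the handler state believes the interrupt is masked, which is the condition that made the bit impossible to clear.

```c
#include <stdbool.h>

/* Hypothetical model of the state the patch keeps consistent. */
struct intx_model {
	bool virq_disabled;	/* userspace view of INTx masking */
	bool masked;		/* handler view, like ctx[0].masked */
	bool pci_2_3;		/* device supports DisINTx */
	bool hw_disintx;	/* DisINTx bit as set in hardware */
};

/* Before the patch: a fresh context always starts out unmasked. */
static void intx_enable_old(struct intx_model *m)
{
	m->masked = false;
}

/* After the patch: handler state is restored from the userspace view. */
static void intx_enable_new(struct intx_model *m)
{
	m->masked = m->virq_disabled;
	if (m->pci_2_3)
		m->hw_disintx = m->masked;	/* models pci_intx(pdev, !masked) */
}

/* User clears DisINTx: hardware is only touched if we think it is masked. */
static void user_clears_disintx(struct intx_model *m)
{
	m->virq_disabled = false;
	if (m->masked) {
		if (m->pci_2_3)
			m->hw_disintx = false;
		m->masked = false;
	}
}
```

Starting from virq_disabled and hw_disintx both set, the old enable path leaves hw_disintx stuck after user_clears_disintx(), while the new path clears it.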
 


[GIT PULL (PATCH 0/4) v2] VFIO driver for v3.6

2012-07-26 Thread Alex Williamson

On Wed, 2012-07-25 at 08:53 -0600, Alex Williamson wrote:
> Hi Linus,
> 
> This series includes the VFIO userspace driver interface for the
> 3.6 kernel merge window.  This driver is intended to provide a
> secure interface for device access using IOMMU protection for
> applications like assignment of physical devices to virtual
> machines.  Qemu will be the first user of this interface, enabling
> assignment of PCI devices to Qemu guests.  This interface is
> intended to eventually replace the x86-specific assignment mechanism
> currently available in KVM.  This interface has the advantage of
> being more secure, by working with IOMMU groups to ensure device
> isolation and providing its own filtered resource access mechanism,
> and also more flexible, in not being x86 or KVM specific (extensions
> to enable POWER are already working).
> 
> As a new driver, I'm including both the individual patches in email,
> as well as a branch to pull from:
> 
> git://github.com/awilliam/linux-vfio.git for-linus
> 
> This driver is originally the work of Tom Lyon, but has since been
> handed over to me and gone through a complete overhaul thanks to the
> input from David Gibson, Ben Herrenschmidt, Chris Wright, Joerg
> Roedel, and others.  This driver has been available in linux-next for
> the last month.  Thanks,

randconfig testing in next found a dependency issue that I've fixed in the
for-linus branch above.  Change from v1 is:

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index cc7db62..5980758 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 config VFIO_PCI
 	tristate "VFIO support for PCI devices"
-	depends on VFIO && PCI
+	depends on VFIO && PCI && EVENTFD
 	help
 	  Support for the PCI VFIO bus driver.  This is required to make
 	  use of PCI drivers using the VFIO framework.

If anyone wants a full resend of v2 to the list with this change, please
let me know.  Thanks,

Alex

> ---
> 
> Alex Williamson (4):
>   vfio: Add PCI device driver
>   vfio: Type1 IOMMU implementation
>   vfio: Add documentation
>   vfio: VFIO core
> 
> 
>  Documentation/ioctl/ioctl-number.txt |    1
>  Documentation/vfio.txt               |  314 +++
>  MAINTAINERS                          |    8
>  drivers/Kconfig                      |    2
>  drivers/Makefile                     |    1
>  drivers/vfio/Kconfig                 |   16
>  drivers/vfio/Makefile                |    3
>  drivers/vfio/pci/Kconfig             |    8
>  drivers/vfio/pci/Makefile            |    4
>  drivers/vfio/pci/vfio_pci.c          |  579 +
>  drivers/vfio/pci/vfio_pci_config.c   | 1540 ++
>  drivers/vfio/pci/vfio_pci_intrs.c    |  740
>  drivers/vfio/pci/vfio_pci_private.h  |   91 ++
>  drivers/vfio/pci/vfio_pci_rdwr.c     |  269 ++
>  drivers/vfio/vfio.c                  | 1420 +++
>  drivers/vfio/vfio_iommu_type1.c      |  753 +
>  include/linux/vfio.h                 |  445 ++
>  17 files changed, 6194 insertions(+)
>  create mode 100644 Documentation/vfio.txt
>  create mode 100644 drivers/vfio/Kconfig
>  create mode 100644 drivers/vfio/Makefile
>  create mode 100644 drivers/vfio/pci/Kconfig
>  create mode 100644 drivers/vfio/pci/Makefile
>  create mode 100644 drivers/vfio/pci/vfio_pci.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_config.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_private.h
>  create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
>  create mode 100644 drivers/vfio/vfio.c
>  create mode 100644 drivers/vfio/vfio_iommu_type1.c
>  create mode 100644 include/linux/vfio.h




Re: linux-next: Tree for July 26 (vfio)

2012-07-26 Thread Alex Williamson

On Thu, 2012-07-26 at 08:43 -0700, Randy Dunlap wrote:
> On 07/25/2012 10:04 PM, Stephen Rothwell wrote:
> 
> > Hi all,
> > 
> > 
> > Changes since 20120725:
> > 
> > 
> 
> 
> on x86_64:
> 
>   CC [M]  drivers/vfio/pci/vfio_pci_intrs.o
> drivers/vfio/pci/vfio_pci_intrs.c: In function 'virqfd_enable':
> drivers/vfio/pci/vfio_pci_intrs.c:142:2: error: implicit declaration of function 'eventfd_fget'
> drivers/vfio/pci/vfio_pci_intrs.c:142:7: warning: assignment makes pointer from integer without a cast
> drivers/vfio/pci/vfio_pci_intrs.c:148:2: error: implicit declaration of function 'eventfd_ctx_fileget'
> drivers/vfio/pci/vfio_pci_intrs.c:148:6: warning: assignment makes pointer from integer without a cast
> make[4]: *** [drivers/vfio/pci/vfio_pci_intrs.o] Error 1

Thanks!  vfio-pci is useless without CONFIG_EVENTFD so I've added that
to the Kconfig depends.  Should be fixed in tomorrow's tree.  Thanks,

Alex



Re: more interrupts (lower performance) in bare-metal compared with running VM

2012-07-27 Thread Alex Williamson

On Fri, 2012-07-27 at 22:09 -0500, sheng qiu wrote:
> Hi all,
> 
> i am comparing network throughput performance under bare-metal case
> with that running VM with assigned-device (assigned NIC). i have two
> physical machines (each has a 10Gbit NIC), one is used as remote
> server (run netserver) and the other is used as the target tested one
> (run netperf with different send message size, TCP_STREAM test). the
> remote NIC is connected directly with the tested NIC, both are 10Gbit.
> for bare-metal case, i enable 1 cpu core, for VM i also configure 1
> vcpu (the memory is sufficient for both bare-metal and VM case).  i
> run netperf for 120 seconds and got the following results:
> 
>                    send message   interrupts   throughput (mbit/s)
> bare-metal         256            10696290     1114.84
>                    512            10106786     1391.92
>                    1024           10071032     1508.09
>                    2048           4560857      3434.65
>                    4096           3292200      4762.26
>                    8192           3169801      4733.89
>                    16384          2780529      4892.6
> 
> VM(assigned NIC)   256            3817904      2249.35
>                    512            3599007      4342.81
>                    1024           3005601      4134.69
>                    2048           2952122      4484
>                    4096           2682874      4566.34
>                    8192           2786719      4734.39
>                    16384          2603835      4540.47
> 
> as shown, the interrupts for bare-metal case is much more than the VM
> case for some message size. we also see the throughput for those
> situations is lower than VM case. it's strange that the bare-metal has
> lower performance than the VM case. Does anyone have comments on this?
> i am very confused.

Assigned devices have more latency in the interrupt path since the
interrupt goes through both the host and the guest interrupt stack.  My
guess is that you're approaching the interrupt rate we can handle due to
that added latency.  That's the bad news.  The good news is that the
device must be queuing up packets, so more are processed on each
interrupt.  Once we switch to non-threaded interrupt handling in the
host, that peak interrupt rate should get a significant increase.
TCP_RR is probably a better way to get a feel for interrupt latency.
That's my theory, any others?  Thanks

Alex
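
The point that more data is handled per interrupt in the assigned-device case can be checked against the quoted results with a little arithmetic.  The helper below is a hypothetical sketch that assumes 1 mbit is 10^6 bits, that the interrupt counts cover the same 120 second run, and that all counted interrupts belong to the NIC under test.

```c
/*
 * Hypothetical helper: average payload bytes moved per interrupt, given
 * throughput in mbit/s, run length in seconds and the interrupt count.
 */
static double bytes_per_interrupt(double mbit_per_s, double seconds,
				  double interrupts)
{
	double total_bytes = mbit_per_s * 1e6 / 8.0 * seconds;

	return total_bytes / interrupts;
}
```

For the 1024-byte rows, bytes_per_interrupt(1508.09, 120, 10071032) gives roughly 2.2 KB per interrupt on bare metal, while bytes_per_interrupt(4134.69, 120, 3005601) gives roughly 20.6 KB per interrupt in the VM, which is consistent with packets queuing up between interrupts.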


Re: [PATCH v7 1/2] kvm: Extend irqfd to support level interrupts

2012-07-30 Thread Alex Williamson

On Sun, 2012-07-29 at 18:01 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 24, 2012 at 02:43:14PM -0600, Alex Williamson wrote:
> > In order to inject a level interrupt from an external source using an
> > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > assert and (later) de-assert an interrupt line independently from
> > users of KVM_IRQ_LINE and avoid lost interrupts.
> > 
> > We also add what may appear like a bit of excessive infrastructure
> > around an object for storing this irq_source_id.  However, notice
> > that we only provide a way to assert the interrupt here.  A follow-on
> > interface will make use of the same irq_source_id to allow de-assert.
> > 
> > Signed-off-by: Alex Williamson 
> 
> I think this tracking of source ids is the root of all the problems
> you see with this patchset.
> 
> A source ID is required for an irqfd to be created.
> But if source ID exists after irqfd is destroyed then
> the next create will fail.

Only if there are no available source IDs.

> So the only sane thing to do is to make irqfd manage this resource,
> clean it up completely when irqfd is gone.
> 
> Not to mention, the patch will be smaller :)

The only sane way to do that is to pull the eoifd into KVM_IRQFD and set
them up together.  That's actually what v1 of this endeavor did.  My
intention with splitting eoifd from irqfd is that I think EOI
notification is potentially useful outside of this usage with irqfds and
I wanted an interface that could be used independently.  Someday, an
irqfd may not be the only way to generate a key.  Userspace may also
wish to register to receive notification-only for the existing user
source ID.

I do not think it's sane to have an eoifd configured using KVM_EOIFD and
destroyed using KVM_IRQFD.  As for smaller patch, I'm not convinced.  We
still have to watch for POLLHUP, which pulls in the bulk of the code.
And using the above approach of pulling eoifd setup into irqfd we have
to address what happens to the combined set when either eventfd is
closed.  By your argument closing the irqfd closes the eoifd, but does
closing the eoifd necessarily close the irqfd?  If not then we end up
with the question of how can an eoifd be added to an existing irqfd.
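
The lifetime argument in this exchange can be illustrated with a small reference-counting model.  It is a sketch only: the pool size, the bitmap and every function name are hypothetical, and it is not the KVM implementation.  It shows a source ID shared by two holders (think irqfd and eoifd) being returned to the pool when the last holder drops it, regardless of release order, and allocation failing only once the pool is exhausted.

```c
#include <stdlib.h>

#define MAX_SOURCE_IDS 4	/* hypothetical, deliberately small pool */

static unsigned int id_bitmap;	/* bit n set means ID n is in use */

/* Hypothetical refcounted wrapper around one source ID. */
struct irq_source {
	int id;
	unsigned int refs;
};

/* Allocate a free ID with one reference, or NULL if none is available. */
static struct irq_source *source_alloc(void)
{
	for (int id = 0; id < MAX_SOURCE_IDS; id++) {
		if (!(id_bitmap & (1u << id))) {
			struct irq_source *s = malloc(sizeof(*s));

			if (!s)
				return NULL;
			id_bitmap |= 1u << id;
			s->id = id;
			s->refs = 1;
			return s;
		}
	}
	return NULL;
}

/* A second user of the same ID takes a reference. */
static void source_get(struct irq_source *s)
{
	s->refs++;
}

/* The last put, from whichever holder, returns the ID to the pool. */
static void source_put(struct irq_source *s)
{
	if (--s->refs == 0) {
		id_bitmap &= ~(1u << s->id);
		free(s);
	}
}
```

In this model the ID has a finite lifecycle: it outlives the first holder only while the second still holds a reference, and a later allocation fails only if every ID is in use at that moment.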


Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Alex Williamson

On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > This new ioctl enables an eventfd to be triggered when an EOI is
> > written for a specified irqchip pin.  The first user of this will
> > be external device assignment through VFIO, using a level irqfd
> > for asserting a PCI INTx interrupt and this interface for de-assert
> > and notification once the interrupt is serviced.
> > 
> > Here we make use of the reference counting of the _irq_source
> > object allowing us to share it with an irqfd and cleanup regardless
> > of the release order.
> > 
> > Signed-off-by: Alex Williamson 
> 
> > ---
> > 
> >  Documentation/virtual/kvm/api.txt |   21 ++
> >  arch/x86/kvm/x86.c                |    2
> >  include/linux/kvm.h               |   15 ++
> >  include/linux/kvm_host.h          |   13 +
> >  virt/kvm/eventfd.c                |  336 +
> >  virt/kvm/kvm_main.c               |   11 +
> >  6 files changed, 398 insertions(+)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 3911e62..8cd6b36 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter.  (If the guest is using
> >  the virtualized real-mode area (VRMA) facility, the kernel will
> >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> >  
> > +4.77 KVM_EOIFD
> > +
> > +Capability: KVM_CAP_EOIFD
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_eoifd (in)
> > +Returns: 0 on success, < 0 on error
> > +
> > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > +through an eventfd.
> 
> I thought about it some more, and I think it should be renamed to an
> interrupt ack notification rather than an eoi notification.
> For example, consider userspace that uses threaded interrupts.
> Currently what will happen is each interrupt will be injected
> twice, since on eoi device is still asserting it.

I don't follow, why is userspace writing an eoi to the ioapic if it
hasn't handled the interrupt and why wouldn't the same happen on bare
metal?

> One fix would be to delay event until interrupt is re-enabled.
> Now I am not asking you to fix this immediately,
> but I think we should make the interface generic by
> saying we report an ack to userspace and not specifically EOI.

Using the word "delay" in the context of interrupt delivery raises all
sorts of red flags for me, but I really don't understand your argument.

> >  kvm_eoifd.fd specifies the eventfd used for
> > +notification.  KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > +once assigned.  KVM_EOIFD also requires additional bits set in
> > +kvm_eoifd.flags to bind to the proper interrupt line.  The
> > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > +and is a key from a level triggered interrupt (configured from
> > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL).  The EOI notification is bound
> > +to the same GSI and irqchip input as the irqfd.  Both kvm_eoifd.key
> > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> > +de-assignment of KVM_EOIFD.  A level irqfd may only be bound to a
> > +single eoifd.  KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
> > +KVM_EOIFD_FLAG_LEVEL_IRQFD.
> >  
> 
> Hmm returning the key means we'll need to keep refcounting for source
> IDs around forever. I liked passing the fd better: make implementation
> match interface and not the other way around.

False, a source ID has a finite lifecycle.  The fd approach was broken.
Holding the irqfd context imposed too many dependencies between eoifd
and irqfd necessitating things like one interface disabling another.  I
thoroughly disagree with that approach.

> >  5. The kvm_run structure
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9ded39d..8f3164e 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > case KVM_CAP_PCI_2_3:
> > case KVM_CAP_KVMCLOCK_CTRL:
> > case KVM_CAP_IRQFD_LEVEL:
> > +   case KVM_CAP_EOIFD:
> > +   case KVM_CAP_EOIFD_LEVEL_IRQFD:
> > r = 1;
> > break;
> > case KVM_CAP_COALESCED_MMIO:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> &

Re: [GIT PULL (PATCH 0/4)] VFIO driver for v3.6

2012-07-30 Thread Alex Williamson

On Fri, 2012-07-27 at 15:32 +1000, Paul Mackerras wrote:
> On Wed, Jul 25, 2012 at 08:53:06AM -0600, Alex Williamson wrote:
> > Hi Linus,
> > 
> > This series includes the VFIO userspace driver interface for the
> > 3.6 kernel merge window.  This driver is intended to provide a
> > secure interface for device access using IOMMU protection for
> > applications like assignment of physical devices to virtual
> > machines.  Qemu will be the first user of this interface, enabling
> > assignment of PCI devices to Qemu guests.  This interface is
> > intended to eventually replace the x86-specific assignment mechanism
> > currently available in KVM.  This interface has the advantage of
> > being more secure, by working with IOMMU groups to ensure device
> > isolation and providing its own filtered resource access mechanism,
> > and also more flexible, in not being x86 or KVM specific (extensions
> > to enable POWER are already working).
> > 
> > As a new driver, I'm including both the individual patches in email,
> > as well as a branch to pull from:
> > 
> > git://github.com/awilliam/linux-vfio.git for-linus
> > 
> > This driver is originally the work of Tom Lyon, but has since been
> > handed over to me and gone through a complete overhaul thanks to the
> > input from David Gibson, Ben Herrenschmidt, Chris Wright, Joerg
> > Roedel, and others.  This driver has been available in linux-next for
> > the last month.  Thanks,
> 
> Linus,
> 
> Are you thinking of pulling this driver in for 3.6?  I would be glad
> to see it go in since we want to use it with KVM on PowerPC.  If
> possible we'd like the PowerPC bits for it to go in as well.

I'm pretty anxious to find out as well.  Linus, ping, any thoughts on
including this in 3.6?  Thanks,

Alex


Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Alex Williamson

On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote:
> > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > written for a specified irqchip pin.  The first user of this will
> > > > be external device assignment through VFIO, using a level irqfd
> > > > for asserting a PCI INTx interrupt and this interface for de-assert
> > > > and notification once the interrupt is serviced.
> > > > 
> > > > Here we make use of the reference counting of the _irq_source
> > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > of the release order.
> > > > 
> > > > Signed-off-by: Alex Williamson 
> > > 
> > > > ---
> > > > 
> > > >  Documentation/virtual/kvm/api.txt |   21 ++
> > > >  arch/x86/kvm/x86.c                |    2
> > > >  include/linux/kvm.h               |   15 ++
> > > >  include/linux/kvm_host.h          |   13 +
> > > >  virt/kvm/eventfd.c                |  336 +
> > > >  virt/kvm/kvm_main.c               |   11 +
> > > >  6 files changed, 398 insertions(+)
> > > > 
> > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > index 3911e62..8cd6b36 100644
> > > > --- a/Documentation/virtual/kvm/api.txt
> > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter.  (If the guest is using
> > > >  the virtualized real-mode area (VRMA) facility, the kernel will
> > > >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > > >  
> > > > +4.77 KVM_EOIFD
> > > > +
> > > > +Capability: KVM_CAP_EOIFD
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_eoifd (in)
> > > > +Returns: 0 on success, < 0 on error
> > > > +
> > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > > +through an eventfd.
> > > 
> > > I thought about it some more, and I think it should be renamed to an
> > > interrupt ack notification rather than an eoi notification.
> > > For example, consider userspace that uses threaded interrupts.
> > > Currently what will happen is each interrupt will be injected
> > > twice, since on eoi device is still asserting it.
> > 
> > I don't follow, why is userspace writing an eoi to the ioapic if it
> > hasn't handled the interrupt
> 
> It has handled it - it disabled the hardware interrupt.

So it's not injected twice, it's held pending at the ioapic the second
time, just like hardware.  Maybe there's a future optimization there,
but I don't think it's appropriate at this time.

> > and why wouldn't the same happen on bare
> > metal?
> 
> on bare metal level does not matter as long as interrupt
> is disabled.
> 
> > > One fix would be to delay event until interrupt is re-enabled.
> > > Now I am not asking you to fix this immediately,
> > > but I think we should make the interface generic by
> > > saying we report an ack to userspace and not specifically EOI.
> > 
> > Using the word "delay" in the context of interrupt delivery raises all
> > sorts of red flags for me, but I really don't understand your argument.
> 
> I am saying it's an "ack" of interrupt userspace cares about.
> The fact it is done by EOI is an implementation detail.

The implementation is how an EOI is generated on an ioapic, not that an
EOI exists.  How do I read a hardware spec and figure out what "ack of
interrupt" means?

> > > >  kvm_eoifd.fd specifies the eventfd used for
> > > > +notification.  KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > > > +once assigned.  KVM_EOIFD also requires additional bits set in
> > > > +kvm_eoifd.flags to bind to the proper interrupt line.  The
> > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > > > +and is a key from a level triggered interrupt (configured from
> > > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL).  The EOI notification is bound
>

Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-07-30 Thread Alex Williamson

On Tue, 2012-07-31 at 03:36 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2012 at 06:26:31PM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote:
> > > > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > > > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > > > written for a specified irqchip pin.  The first user of this will
> > > > > > be external device assignment through VFIO, using a level irqfd
> > > > > > for asserting a PCI INTx interrupt and this interface for de-assert
> > > > > > and notification once the interrupt is serviced.
> > > > > > 
> > > > > > Here we make use of the reference counting of the _irq_source
> > > > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > > > of the release order.
> > > > > > 
> > > > > > Signed-off-by: Alex Williamson 
> > > > > 
> > > > > > ---
> > > > > > 
> > > > > >  Documentation/virtual/kvm/api.txt |   21 ++
> > > > > >  arch/x86/kvm/x86.c                |    2
> > > > > >  include/linux/kvm.h               |   15 ++
> > > > > >  include/linux/kvm_host.h          |   13 +
> > > > > >  virt/kvm/eventfd.c                |  336 +
> > > > > >  virt/kvm/kvm_main.c               |   11 +
> > > > > >  6 files changed, 398 insertions(+)
> > > > > > 
> > > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > > index 3911e62..8cd6b36 100644
> > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter.  (If the guest is using
> > > > > >  the virtualized real-mode area (VRMA) facility, the kernel will
> > > > > >  re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > > > > >  
> > > > > > +4.77 KVM_EOIFD
> > > > > > +
> > > > > > +Capability: KVM_CAP_EOIFD
> > > > > > +Architectures: x86
> > > > > > +Type: vm ioctl
> > > > > > +Parameters: struct kvm_eoifd (in)
> > > > > > +Returns: 0 on success, < 0 on error
> > > > > > +
> > > > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > > > > +through an eventfd.
> > > > > 
> > > > > I thought about it some more, and I think it should be renamed to an
> > > > > interrupt ack notification rather than an eoi notification.
> > > > > For example, consider userspace that uses threaded interrupts.
> > > > > Currently what will happen is each interrupt will be injected
> > > > > twice, since on eoi device is still asserting it.
> > > > 
> > > > I don't follow, why is userspace writing an eoi to the ioapic if it
> > > > hasn't handled the interrupt
> > > 
> > > It has handled it - it disabled the hardware interrupt.
> > 
> > So it's not injected twice, it's held pending at the ioapic the second
> > time, just like hardware.
> 
> It is not like hardware at all.  In hardware there is no overhead;
> here you interrupt the guest to run the handler in the host.

Obviously we have some overhead, we're emulating the guest hardware.
That doesn't make the behavior unlike hardware.

> >  Maybe there's a future optimization there,
> > but I don't think it's appropriate at this time.
> 
> Yes. But to make it *possible* in future we must remove
> the requirement to signal fd immediately on EOI.
> So rename it ackfd.

How does the name make that possible?  We can easily add a flag
EOIFD_FLAG_EOI_ON_REENABLE, or whatever.

> > > > and why wouldn't the same happen on bare
> > > > metal?
> > > 
> > > on bare metal level does not matter as long as interrupt
> > > is disabled.
> > >

Re: [GIT PULL (PATCH 0/4)] VFIO driver for v3.6

2012-07-31 Thread Alex Williamson

On Mon, 2012-07-30 at 22:11 -0700, Linus Torvalds wrote:
> On Mon, Jul 30, 2012 at 4:17 PM, Alex Williamson
>  wrote:
> >
> > I'm pretty anxious to find out as well.  Linus, ping, any thoughts on
> > including this in 3.6?  Thanks,
> 
> I just pulled it, but then I unpulled again when I realized it's not a
> signed tag and it's on github.
> 
> Please, people. Do tagged releases with proper signatures if you're
> not using kernel.org or other controlled servers. In fact, I prefer
> signed tags even if you *do* use kernel.org etc.

Sorry about that, Linus.  I think this is a properly signed tag, please
let me know if I'm still screwing up.  Thanks,

Alex

The following changes since commit 2e3ee613480563a6d5c01b57d342e65cc58c06df:

  Merge tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux (2012-07-30 22:14:04 -0700)

are available in the git repository at:


  g...@github.com:awilliam/linux-vfio.git tags/vfio-for-v3.6

for you to fetch changes up to 89e1f7d4c66d85f42c3d52ea3866eb10cadf6153:

  vfio: Add PCI device driver (2012-07-31 08:16:24 -0600)


VFIO for v3.6


Alex Williamson (4):
  vfio: VFIO core
  vfio: Add documentation
  vfio: Type1 IOMMU implementation
  vfio: Add PCI device driver

 Documentation/ioctl/ioctl-number.txt |1 +
 Documentation/vfio.txt   |  314 +++
 MAINTAINERS  |8 +
 drivers/Kconfig  |2 +
 drivers/Makefile |1 +
 drivers/vfio/Kconfig |   16 +
 drivers/vfio/Makefile|3 +
 drivers/vfio/pci/Kconfig |8 +
 drivers/vfio/pci/Makefile|4 +
 drivers/vfio/pci/vfio_pci.c  |  579 +
 drivers/vfio/pci/vfio_pci_config.c   | 1540 ++
 drivers/vfio/pci/vfio_pci_intrs.c|  740 
 drivers/vfio/pci/vfio_pci_private.h  |   91 ++
 drivers/vfio/pci/vfio_pci_rdwr.c |  269 ++
 drivers/vfio/vfio.c  | 1420 +++
 drivers/vfio/vfio_iommu_type1.c  |  753 +
 include/linux/vfio.h |  445 ++
 17 files changed, 6194 insertions(+)
 create mode 100644 Documentation/vfio.txt
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/pci/Kconfig
 create mode 100644 drivers/vfio/pci/Makefile
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_config.c
 create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
 create mode 100644 drivers/vfio/pci/vfio_pci_private.h
 create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
 create mode 100644 drivers/vfio/vfio.c
 create mode 100644 drivers/vfio/vfio_iommu_type1.c
 create mode 100644 include/linux/vfio.h



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [GIT PULL (PATCH 0/4)] VFIO driver for v3.6

2012-07-31 Thread Alex Williamson

On Tue, 2012-07-31 at 08:53 -0600, Alex Williamson wrote:
> On Mon, 2012-07-30 at 22:11 -0700, Linus Torvalds wrote:
> > On Mon, Jul 30, 2012 at 4:17 PM, Alex Williamson
> >  wrote:
> > >
> > > I'm pretty anxious to find out as well.  Linus, ping, any thoughts on
> > > including this in 3.6?  Thanks,
> > 
> > I just pulled it, but then I unpulled again when I realized it's not a
> > signed tag and it's on github.
> > 
> > Please, people. Do tagged releases with proper signatures if you're
> > not using kernel.org or other controlled servers. In fact, I prefer
> > signed tags even if you *do* use kernel.org etc.
> 
> Sorry about that, Linus.  I think this is a properly signed tag, please
> let me know if I'm still screwing up.  Thanks,
> 
> Alex
> 
> The following changes since commit 2e3ee613480563a6d5c01b57d342e65cc58c06df:
> 
>   Merge tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux (2012-07-30 22:14:04 -0700)
> 
> are available in the git repository at:
> 
> 
>   g...@github.com:awilliam/linux-vfio.git tags/vfio-for-v3.6

Ack, git pull-request snuck this by me, obviously this should be:

git://github.com/awilliam/linux-vfio.git tags/vfio-for-v3.6

Thanks,
Alex

> for you to fetch changes up to 89e1f7d4c66d85f42c3d52ea3866eb10cadf6153:
> 
>   vfio: Add PCI device driver (2012-07-31 08:16:24 -0600)
> 
> 
> VFIO for v3.6
> 
> 
> Alex Williamson (4):
>   vfio: VFIO core
>   vfio: Add documentation
>   vfio: Type1 IOMMU implementation
>   vfio: Add PCI device driver
> 
>  Documentation/ioctl/ioctl-number.txt |1 +
>  Documentation/vfio.txt   |  314 +++
>  MAINTAINERS  |8 +
>  drivers/Kconfig  |2 +
>  drivers/Makefile |1 +
>  drivers/vfio/Kconfig |   16 +
>  drivers/vfio/Makefile|3 +
>  drivers/vfio/pci/Kconfig |8 +
>  drivers/vfio/pci/Makefile|4 +
>  drivers/vfio/pci/vfio_pci.c  |  579 +
>  drivers/vfio/pci/vfio_pci_config.c   | 1540 ++
>  drivers/vfio/pci/vfio_pci_intrs.c|  740 
>  drivers/vfio/pci/vfio_pci_private.h  |   91 ++
>  drivers/vfio/pci/vfio_pci_rdwr.c |  269 ++
>  drivers/vfio/vfio.c  | 1420 +++
>  drivers/vfio/vfio_iommu_type1.c  |  753 +
>  include/linux/vfio.h |  445 ++
>  17 files changed, 6194 insertions(+)
>  create mode 100644 Documentation/vfio.txt
>  create mode 100644 drivers/vfio/Kconfig
>  create mode 100644 drivers/vfio/Makefile
>  create mode 100644 drivers/vfio/pci/Kconfig
>  create mode 100644 drivers/vfio/pci/Makefile
>  create mode 100644 drivers/vfio/pci/vfio_pci.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_config.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_private.h
>  create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
>  create mode 100644 drivers/vfio/vfio.c
>  create mode 100644 drivers/vfio/vfio_iommu_type1.c
>  create mode 100644 include/linux/vfio.h
> 
> 




[PATCH] vfio: Fix virqfd release race

2012-09-21 Thread Alex Williamson

vfio-pci supports a mechanism like KVM's irqfd for unmasking an
interrupt through an eventfd.  There are two ways to shut down this
interface: 1) close the eventfd, 2) ioctl (such as disabling the
interrupt).  Both of these do the release through a workqueue,
which can result in a segfault if two jobs get queued for the same
virqfd.

Fix this by protecting the pointer to these virqfds by a spinlock.
The vfio pci device will therefore no longer have a reference to it
once the release job is queued under lock.  On the ioctl side, we
still flush the workqueue to ensure that any outstanding releases
are completed.

Signed-off-by: Alex Williamson 
---

 drivers/vfio/pci/vfio_pci_intrs.c |   76 +++--
 1 file changed, 56 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 211a492..d8dedc7 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -76,9 +76,24 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
schedule_work(&virqfd->inject);
}
 
-   if (flags & POLLHUP)
-   /* The eventfd is closing, detach from VFIO */
-   virqfd_deactivate(virqfd);
+   if (flags & POLLHUP) {
+   unsigned long flags;
+   spin_lock_irqsave(&virqfd->vdev->irqlock, flags);
+
+   /*
+* The eventfd is closing, if the virqfd has not yet been
+* queued for release, as determined by testing whether the
+* vdev pointer to it is still valid, queue it now.  As
+* with kvm irqfds, we know we won't race against the virqfd
+* going away because we hold wqh->lock to get here.
+*/
+   if (*(virqfd->pvirqfd) == virqfd) {
+   *(virqfd->pvirqfd) = NULL;
+   virqfd_deactivate(virqfd);
+   }
+
+   spin_unlock_irqrestore(&virqfd->vdev->irqlock, flags);
+   }
 
return 0;
 }
@@ -93,7 +108,6 @@ static void virqfd_ptable_queue_proc(struct file *file,
 static void virqfd_shutdown(struct work_struct *work)
 {
struct virqfd *virqfd = container_of(work, struct virqfd, shutdown);
-   struct virqfd **pvirqfd = virqfd->pvirqfd;
u64 cnt;
 
eventfd_ctx_remove_wait_queue(virqfd->eventfd, &virqfd->wait, &cnt);
@@ -101,7 +115,6 @@ static void virqfd_shutdown(struct work_struct *work)
eventfd_ctx_put(virqfd->eventfd);
 
kfree(virqfd);
-   *pvirqfd = NULL;
 }
 
 static void virqfd_inject(struct work_struct *work)
@@ -122,15 +135,11 @@ static int virqfd_enable(struct vfio_pci_device *vdev,
int ret = 0;
unsigned int events;
 
-   if (*pvirqfd)
-   return -EBUSY;
-
virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
if (!virqfd)
return -ENOMEM;
 
virqfd->pvirqfd = pvirqfd;
-   *pvirqfd = virqfd;
virqfd->vdev = vdev;
virqfd->handler = handler;
virqfd->thread = thread;
@@ -154,6 +163,23 @@ static int virqfd_enable(struct vfio_pci_device *vdev,
virqfd->eventfd = ctx;
 
/*
+* virqfds can be released by closing the eventfd or directly
+* through ioctl.  These are both done through a workqueue, so
+* we update the pointer to the virqfd under lock to avoid
+* pushing multiple jobs to release the same virqfd.
+*/
+   spin_lock_irq(&vdev->irqlock);
+
+   if (*pvirqfd) {
+   spin_unlock_irq(&vdev->irqlock);
+   ret = -EBUSY;
+   goto fail;
+   }
+   *pvirqfd = virqfd;
+
+   spin_unlock_irq(&vdev->irqlock);
+
+   /*
 * Install our own custom wake-up handling so we are notified via
 * a callback whenever someone signals the underlying eventfd.
 */
@@ -187,19 +213,29 @@ fail:
fput(file);
 
kfree(virqfd);
-   *pvirqfd = NULL;
 
return ret;
 }
 
-static void virqfd_disable(struct virqfd *virqfd)
+static void virqfd_disable(struct vfio_pci_device *vdev,
+  struct virqfd **pvirqfd)
 {
-   if (!virqfd)
-   return;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vdev->irqlock, flags);
+
+   if (*pvirqfd) {
+   virqfd_deactivate(*pvirqfd);
+   *pvirqfd = NULL;
+   }
 
-   virqfd_deactivate(virqfd);
+   spin_unlock_irqrestore(&vdev->irqlock, flags);
 
-   /* Block until we know all outstanding shutdown jobs have completed. */
+   /*
+* Block until we know all outstanding shutdown jobs have completed.
+* Even if we don't queue the job, flush the wq to be sure it's
+* been released.
+
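
The locking rule described in the commit message above can be sketched in user space. This is only an illustrative model, not the vfio code: `struct handle`, `struct device`, `slot_install()`, `release_on_close()`, `release_on_disable()` and `queue_release()` are hypothetical names, a pthread mutex stands in for the `irqlock` spinlock, and a counter stands in for the workqueue job. The point it tries to show is that whichever path clears the pointer under the lock is the only one that queues the release.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct virqfd. */
struct handle {
	int release_jobs;	/* how many release jobs were queued for it */
};

/* Hypothetical stand-in for the vfio pci device. */
struct device {
	pthread_mutex_t lock;	/* plays the role of vdev->irqlock */
	struct handle *slot;	/* plays the role of *pvirqfd */
};

/* Stand-in for virqfd_deactivate(): only counts the queued job. */
static void queue_release(struct handle *h)
{
	h->release_jobs++;
}

/* Install h in the slot; returns 0 on success, -1 if the slot is busy. */
static int slot_install(struct device *dev, struct handle *h)
{
	int ret = -1;

	assert(h != NULL);
	pthread_mutex_lock(&dev->lock);
	if (!dev->slot) {
		dev->slot = h;
		ret = 0;
	}
	pthread_mutex_unlock(&dev->lock);
	return ret;
}

/*
 * Close path: queue the release only if the slot still points at h,
 * and clear the slot inside the same critical section.
 */
static void release_on_close(struct device *dev, struct handle *h)
{
	pthread_mutex_lock(&dev->lock);
	if (dev->slot == h) {
		dev->slot = NULL;
		queue_release(h);
	}
	pthread_mutex_unlock(&dev->lock);
}

/* ioctl path: queue the release for whatever is still installed. */
static void release_on_disable(struct device *dev)
{
	pthread_mutex_lock(&dev->lock);
	if (dev->slot) {
		queue_release(dev->slot);
		dev->slot = NULL;
	}
	pthread_mutex_unlock(&dev->lock);
}
```

If both paths run for the same handle, the second one finds the slot already cleared and queues nothing, so the handle is released once.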

[PATCH v11] kvm: Add resampling irqfds for level triggered interrupts

2012-09-21 Thread Alex Williamson

To emulate level triggered interrupts, add a resample option to
KVM_IRQFD.  When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM.  This may, for
instance, indicate an EOI.  Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt.  On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.

All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.

Signed-off-by: Alex Williamson 
---

Taking Michael's idea to create our own lock for resampling irqfds
simplifies things significantly.  We can now avoid any nesting with
the irqfds.lock spinlock so we can call everything we need to
directly.

 Documentation/virtual/kvm/api.txt |   13 +++
 arch/x86/kvm/x86.c|4 +
 include/linux/kvm.h   |   12 +++
 include/linux/kvm_host.h  |5 +
 virt/kvm/eventfd.c|  150 -
 virt/kvm/irq_comm.c   |6 +
 6 files changed, 184 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index bf33aaa..6240366 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1946,6 +1946,19 @@ the guest using the specified gsi pin.  The irqfd is removed using
 the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
 and kvm_irqfd.gsi.
 
+With KVM_CAP_IRQFD_RESAMPLE, KVM_IRQFD supports a de-assert and notify
+mechanism allowing emulation of level-triggered, irqfd-based
+interrupts.  When KVM_IRQFD_FLAG_RESAMPLE is set the user must pass an
+additional eventfd in the kvm_irqfd.resamplefd field.  When operating
+in resample mode, posting of an interrupt through kvm_irqfd.fd asserts
+the specified gsi in the irqchip.  When the irqchip is resampled, such
+as from an EOI, the gsi is de-asserted and the user is notified via
+kvm_irqfd.resamplefd.  It is the user's responsibility to re-queue
+the interrupt if the device making use of it still requires service.
+Note that closing the resamplefd is not sufficient to disable the
+irqfd.  The KVM_IRQFD_FLAG_RESAMPLE is only necessary on assignment
+and need not be specified with KVM_IRQFD_FLAG_DEASSIGN.
+
 4.76 KVM_PPC_ALLOCATE_HTAB
 
 Capability: KVM_CAP_PPC_ALLOC_HTAB
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2966c84..56f9002 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2177,6 +2177,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_GET_TSC_KHZ:
case KVM_CAP_PCI_2_3:
case KVM_CAP_KVMCLOCK_CTRL:
+   case KVM_CAP_IRQFD_RESAMPLE:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -6268,6 +6269,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
+   /* Reserve bit 1 of irq_sources_bitmap for irqfd-resampler */
+   set_bit(KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
+   &kvm->arch.irq_sources_bitmap);
 
raw_spin_lock_init(&kvm->arch.tsc_write_lock);
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..a01a3d5 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_IRQFD_RESAMPLE 81
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -683,12 +684,21 @@ struct kvm_xen_hvm_config {
 #endif
 
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/*
+ * Available with KVM_CAP_IRQFD_RESAMPLE
+ *
+ * KVM_IRQFD_FLAG_RESAMPLE indicates resamplefd is valid and specifies
+ * the irqfd to operate in resampling mode for level triggered interrupt
+ * emulation.  See Documentation/virtual/kvm/api.txt.
+ */
+#define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)
 
 struct kvm_irqfd {
__u32 fd;
__u32 gsi;
__u32 flags;
-   __u8  pad[20];
+   __u32 resamplefd;
+   __u8  pad[16];
 };
 
 struct kvm_clock_data {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..6966ce2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -70,7 +70,8 @@
 #define KVM_REQ_PMU   16
 #define KVM_REQ_PMI   17
 
-#define KVM_USERSPACE_IRQ_SOURCE_ID0
+#define KVM_USERSPACE_IRQ_SOURCE_ID0
+#define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID   1
 
 struct kvm;
 struct kvm_vcpu;
@@ -283,6 +284,8 @@ struct kvm {
struct {
spinlock_tlock;
struct list_head  items;
+   struct list_head  resampler_list;
+   struct mutex  resam
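
The assert, de-assert and notify cycle described in the patch description above can be modelled in a few lines of plain C. This is a sketch of the semantics only, under the assumption that one flag is enough to represent the gsi level. `struct resample_irq`, `irq_post()`, `irq_resample()` and `user_on_resample()` are hypothetical names and do not correspond to KVM functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one resampling irqfd; not a KVM structure. */
struct resample_irq {
	bool line_asserted;	/* level of the gsi as seen by the irqchip */
	int resample_events;	/* notifications delivered on the resamplefd */
};

/* Signalling the irqfd only asserts the line; it never de-asserts it. */
static void irq_post(struct resample_irq *irq)
{
	assert(irq != NULL);
	irq->line_asserted = true;
}

/* On resample (for instance a guest EOI): de-assert first, then notify. */
static void irq_resample(struct resample_irq *irq)
{
	irq->line_asserted = false;
	irq->resample_events++;
}

/*
 * User side of the notification: the user must re-post the interrupt
 * if the device still requires service.
 */
static void user_on_resample(struct resample_irq *irq, bool device_pending)
{
	if (device_pending)
		irq_post(irq);
}
```

In this model a device that is still pending after the EOI ends up with the line asserted again, while an idle device leaves it low, which is the behaviour a level-triggered line is expected to show.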

[GIT PULL] VFIO fixes for 3.6

2012-09-24 Thread Alex Williamson

Hi Linus,

The following changes since commit c46de2263f42fb4bbde411b9126f471e9343cb22:

  Merge branch 'for-linus' of git://git.kernel.dk/linux-block (2012-09-19 11:04:34 -0700)

are available in the git repository at:


  git://github.com/awilliam/linux-vfio.git tags/vfio-for-linus

for you to fetch changes up to b68e7fa879cd3b1126a7c455d9da1b70299efc0d:

  vfio: Fix virqfd release race (2012-09-21 10:48:28 -0600)


VFIO doc update and virqfd race fix

----
Alex Williamson (2):
  vfio: Trivial Documentation correction
  vfio: Fix virqfd release race

 Documentation/vfio.txt|  2 +-
 drivers/vfio/pci/vfio_pci_intrs.c | 76 ---
 2 files changed, 57 insertions(+), 21 deletions(-)



Re: linux-next: Tree for Sept 24 (iommu)

2012-09-24 Thread Alex Williamson

On Mon, 2012-09-24 at 15:02 -0700, Randy Dunlap wrote:
> On 09/24/2012 07:53 AM, Stephen Rothwell wrote:
> 
> > Hi all,
> > 
> > Today was a train wreck, with lots of new conflicts across several trees
> > and a few build failures as well.
> > 
> > Changes since 201209021:
> > 
> 
> 
> 
> on i386:
> 
> drivers/built-in.o: In function `iommu_group_remove_device':
> (.text+0x74cb10): multiple definition of `iommu_group_remove_device'
> arch/x86/built-in.o:(.text+0x140d0): first defined here
...


Here's a patch to get it past this.  It still doesn't fully build, but
the rest isn't iommu related.  Thanks,

Alex


commit 6955e1f06cb20ddd25665984c330b945443cce36
Author: Alex Williamson 
Date:   Mon Sep 24 21:13:52 2012 -0600

iommu: inline iommu group stub functions

Signed-off-by: Alex Williamson 

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7e83370..f3b99e1 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -256,72 +256,78 @@ static inline void iommu_set_fault_handler(struct iommu_domain *domain,
 {
 }
 
-int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+static inline int iommu_attach_group(struct iommu_domain *domain,
+struct iommu_group *group)
 {
return -ENODEV;
 }
 
-void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+static inline void iommu_detach_group(struct iommu_domain *domain,
+ struct iommu_group *group)
 {
 }
 
-struct iommu_group *iommu_group_alloc(void)
+static inline struct iommu_group *iommu_group_alloc(void)
 {
return ERR_PTR(-ENODEV);
 }
 
-void *iommu_group_get_iommudata(struct iommu_group *group)
+static inline void *iommu_group_get_iommudata(struct iommu_group *group)
 {
return NULL;
 }
 
-void iommu_group_set_iommudata(struct iommu_group *group, void *iommu_data,
-  void (*release)(void *iommu_data))
+static inline void iommu_group_set_iommudata(struct iommu_group *group,
+void *iommu_data,
+void (*release)(void *iommu_data))
 {
 }
 
-int iommu_group_set_name(struct iommu_group *group, const char *name)
+static inline int iommu_group_set_name(struct iommu_group *group,
+  const char *name)
 {
return -ENODEV;
 }
 
-int iommu_group_add_device(struct iommu_group *group, struct device *dev)
+static inline int iommu_group_add_device(struct iommu_group *group,
+struct device *dev)
 {
return -ENODEV;
 }
 
-void iommu_group_remove_device(struct device *dev)
+static inline void iommu_group_remove_device(struct device *dev)
 {
 }
 
-int iommu_group_for_each_dev(struct iommu_group *group, void *data,
-int (*fn)(struct device *, void *))
+static inline int iommu_group_for_each_dev(struct iommu_group *group,
+  void *data,
+  int (*fn)(struct device *, void *))
 {
return -ENODEV;
 }
 
-struct iommu_group *iommu_group_get(struct device *dev)
+static inline struct iommu_group *iommu_group_get(struct device *dev)
 {
return NULL;
 }
 
-void iommu_group_put(struct iommu_group *group)
+static inline void iommu_group_put(struct iommu_group *group)
 {
 }
 
-int iommu_group_register_notifier(struct iommu_group *group,
- struct notifier_block *nb)
+static inline int iommu_group_register_notifier(struct iommu_group *group,
+   struct notifier_block *nb)
 {
return -ENODEV;
 }
 
-int iommu_group_unregister_notifier(struct iommu_group *group,
-   struct notifier_block *nb)
+static inline int iommu_group_unregister_notifier(struct iommu_group *group,
+ struct notifier_block *nb)
 {
return 0;
 }
 
-int iommu_group_id(struct iommu_group *group)
+static inline int iommu_group_id(struct iommu_group *group)
 {
return -ENODEV;
 }
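
As a brief illustration of why the patch above helps: a function defined in a header without `static inline` gets an external definition in every object file that includes the header, which is what produces the "multiple definition" link error quoted earlier. The sketch below shows the stub pattern with hypothetical names (`example_group_add_device()`, `example_group_get()`). It is not the kernel header, only the same idea in a self-contained form.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Opaque types; the stubs never look inside them. */
struct device;
struct iommu_group;

/*
 * Stub used when the feature is compiled out.  "static inline" gives
 * each including file its own private copy, so no external symbol is
 * emitted and the linker never sees duplicate definitions.
 */
static inline int example_group_add_device(struct iommu_group *group,
					   struct device *dev)
{
	(void)group;
	(void)dev;
	return -ENODEV;
}

/* Stub that reports "no group" for any device. */
static inline struct iommu_group *example_group_get(struct device *dev)
{
	(void)dev;
	return NULL;
}
```

Callers can then use the same function names whether or not the feature is configured, and simply see `-ENODEV` or `NULL` when it is absent.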



Re: 3.6-rc7 boot crash + bisection

2012-09-25 Thread Alex Williamson

On Mon, 2012-09-24 at 21:03 +0200, Florian Dazinger wrote:
> Hi,
> I think I've found a regression, which causes an early boot crash, I
> appended the kernel output via jpg file, since I do not have a serial
> console or sth.
> 
> after bisection, it boils down to this commit:
> 
> 9dcd61303af862c279df86aa97fde7ce371be774 is the first bad commit
> commit 9dcd61303af862c279df86aa97fde7ce371be774
> Author: Alex Williamson 
> Date:   Wed May 30 14:19:07 2012 -0600
> 
> amd_iommu: Support IOMMU groups
> 
> Add IOMMU group support to AMD-Vi device init and uninit code.
> Existing notifiers make sure this gets called for each device.
> 
> Signed-off-by: Alex Williamson 
> Signed-off-by: Joerg Roedel 
> 
> :04 04 2f6b1b8e104d6dfec0abaa9646750f9b5a4f4060
> 837ae95e84f6d3553457c4df595a9caa56843c03 M  drivers

[switching back to mailing list thread]

I asked Florian for dmesg w/ amd_iommu_dump, here's the relevant lines:

[1.485645] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
[1.485683] AMD-Vi:mmio-addr: feb2
[1.485901] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:00.0 flags: 00
[1.485935] AMD-Vi:   DEV_RANGE_END   devid: 00:00.2
[1.485969] AMD-Vi:   DEV_SELECT  devid: 00:02.0 flags: 00
[1.486002] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 01:00.0 flags: 00
[1.486036] AMD-Vi:   DEV_RANGE_END   devid: 01:00.1
[1.486070] AMD-Vi:   DEV_SELECT  devid: 00:04.0 flags: 00
[1.486103] AMD-Vi:   DEV_SELECT  devid: 02:00.0 flags: 00
[1.486137] AMD-Vi:   DEV_SELECT  devid: 00:05.0 flags: 00
[1.486170] AMD-Vi:   DEV_SELECT  devid: 03:00.0 flags: 00
[1.486204] AMD-Vi:   DEV_SELECT  devid: 00:06.0 flags: 00
[1.486238] AMD-Vi:   DEV_SELECT  devid: 04:00.0 flags: 00
[1.486271] AMD-Vi:   DEV_SELECT  devid: 00:07.0 flags: 00
[1.486305] AMD-Vi:   DEV_SELECT  devid: 05:00.0 flags: 00
[1.486338] AMD-Vi:   DEV_SELECT  devid: 00:09.0 flags: 00
[1.486372] AMD-Vi:   DEV_SELECT  devid: 06:00.0 flags: 00
[1.486406] AMD-Vi:   DEV_SELECT  devid: 00:0b.0 flags: 00
[1.486439] AMD-Vi:   DEV_SELECT  devid: 07:00.0 flags: 00
[1.486473] AMD-Vi:   DEV_ALIAS_RANGE devid: 08:01.0 flags: 00 devid_to: 08:00.0
[1.486510] AMD-Vi:   DEV_RANGE_END   devid: 08:1f.7
[1.486548] AMD-Vi:   DEV_SELECT  devid: 00:11.0 flags: 00
[1.486581] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:12.0 flags: 00
[1.486620] AMD-Vi:   DEV_RANGE_END   devid: 00:12.2
[1.486654] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:13.0 flags: 00
[1.486688] AMD-Vi:   DEV_RANGE_END   devid: 00:13.2
[1.486721] AMD-Vi:   DEV_SELECT  devid: 00:14.0 flags: d7
[1.486755] AMD-Vi:   DEV_SELECT  devid: 00:14.3 flags: 00
[1.486788] AMD-Vi:   DEV_SELECT  devid: 00:14.4 flags: 00
[1.486822] AMD-Vi:   DEV_ALIAS_RANGE devid: 09:00.0 flags: 00 devid_to: 00:14.4
[1.486859] AMD-Vi:   DEV_RANGE_END   devid: 09:1f.7
[1.486897] AMD-Vi:   DEV_SELECT  devid: 00:14.5 flags: 00
[1.486931] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:16.0 flags: 00
[1.486965] AMD-Vi:   DEV_RANGE_END   devid: 00:16.2
[1.487055] AMD-Vi: Enabling IOMMU at :00:00.2 cap 0x40


> lspci:
> 00:00.0 Host bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (external gfx0 port B) (rev 02)
> 00:00.2 IOMMU: Advanced Micro Devices [AMD] nee ATI RD990 I/O Memory Management Unit (IOMMU)
> 00:02.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (PCI express gpp port B)
> 00:04.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (PCI express gpp port D)
> 00:05.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (PCI express gpp port E)
> 00:06.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (PCI express gpp port F)
> 00:07.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (PCI express gpp port G)
> 00:09.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (PCI express gpp port H)
> 00:0b.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI bridge (NB-SB link)
> 00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40)
> 00:12.0 USB controller: Advanced Micro Devices [AMD]

Re: 3.6-rc7 boot crash + bisection

2012-09-25 Thread Alex Williamson

On Tue, 2012-09-25 at 12:32 -0600, Alex Williamson wrote:
> On Mon, 2012-09-24 at 21:03 +0200, Florian Dazinger wrote:
> > Hi,
> > I think I've found a regression, which causes an early boot crash, I
> > appended the kernel output via jpg file, since I do not have a serial
> > console or sth.
> > 
> > after bisection, it boils down to this commit:
> > 
> > 9dcd61303af862c279df86aa97fde7ce371be774 is the first bad commit
> > commit 9dcd61303af862c279df86aa97fde7ce371be774
> > Author: Alex Williamson 
> > Date:   Wed May 30 14:19:07 2012 -0600
> > 
> > amd_iommu: Support IOMMU groups
> > 
> > Add IOMMU group support to AMD-Vi device init and uninit code.
> > Existing notifiers make sure this gets called for each device.
> > 
> > Signed-off-by: Alex Williamson 
> > Signed-off-by: Joerg Roedel 
> > 
> > :04 04 2f6b1b8e104d6dfec0abaa9646750f9b5a4f4060
> > 837ae95e84f6d3553457c4df595a9caa56843c03 M  drivers
> 
> [switching back to mailing list thread]
> 
> I asked Florian for dmesg w/ amd_iommu_dump, here's the relevant lines:
> 
> [1.485645] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
> [1.485683] AMD-Vi:mmio-addr: feb2
> [1.485901] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:00.0 flags: 00
> [1.485935] AMD-Vi:   DEV_RANGE_END   devid: 00:00.2
> [1.485969] AMD-Vi:   DEV_SELECT  devid: 00:02.0 
> flags: 00
> [1.486002] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 01:00.0 flags: 00
> [1.486036] AMD-Vi:   DEV_RANGE_END   devid: 01:00.1
> [1.486070] AMD-Vi:   DEV_SELECT  devid: 00:04.0 
> flags: 00
> [1.486103] AMD-Vi:   DEV_SELECT  devid: 02:00.0 
> flags: 00
> [1.486137] AMD-Vi:   DEV_SELECT  devid: 00:05.0 
> flags: 00
> [1.486170] AMD-Vi:   DEV_SELECT  devid: 03:00.0 
> flags: 00
> [1.486204] AMD-Vi:   DEV_SELECT  devid: 00:06.0 
> flags: 00
> [1.486238] AMD-Vi:   DEV_SELECT  devid: 04:00.0 
> flags: 00
> [1.486271] AMD-Vi:   DEV_SELECT  devid: 00:07.0 
> flags: 00
> [1.486305] AMD-Vi:   DEV_SELECT  devid: 05:00.0 
> flags: 00
> [1.486338] AMD-Vi:   DEV_SELECT  devid: 00:09.0 
> flags: 00
> [1.486372] AMD-Vi:   DEV_SELECT  devid: 06:00.0 
> flags: 00
> [1.486406] AMD-Vi:   DEV_SELECT  devid: 00:0b.0 
> flags: 00
> [1.486439] AMD-Vi:   DEV_SELECT  devid: 07:00.0 
> flags: 00
> [1.486473] AMD-Vi:   DEV_ALIAS_RANGE devid: 08:01.0 
> flags: 00 devid_to: 08:00.0
> [1.486510] AMD-Vi:   DEV_RANGE_END   devid: 08:1f.7
> [1.486548] AMD-Vi:   DEV_SELECT  devid: 00:11.0 
> flags: 00
> [1.486581] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:12.0 flags: 00
> [1.486620] AMD-Vi:   DEV_RANGE_END   devid: 00:12.2
> [1.486654] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:13.0 flags: 00
> [1.486688] AMD-Vi:   DEV_RANGE_END   devid: 00:13.2
> [1.486721] AMD-Vi:   DEV_SELECT  devid: 00:14.0 
> flags: d7
> [1.486755] AMD-Vi:   DEV_SELECT  devid: 00:14.3 
> flags: 00
> [1.486788] AMD-Vi:   DEV_SELECT  devid: 00:14.4 
> flags: 00
> [1.486822] AMD-Vi:   DEV_ALIAS_RANGE devid: 09:00.0 
> flags: 00 devid_to: 00:14.4
> [1.486859] AMD-Vi:   DEV_RANGE_END   devid: 09:1f.7
> [1.486897] AMD-Vi:   DEV_SELECT  devid: 00:14.5 
> flags: 00
> [1.486931] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:16.0 flags: 00
> [1.486965] AMD-Vi:   DEV_RANGE_END   devid: 00:16.2
> [1.487055] AMD-Vi: Enabling IOMMU at :00:00.2 cap 0x40
> 
> 
> > lspci:
> > 00:00.0 Host bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI 
> > bridge (external gfx0 port B) (rev 02)
> > 00:00.2 IOMMU: Advanced Micro Devices [AMD] nee ATI RD990 I/O Memory 
> > Management Unit (IOMMU)
> > 00:02.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI 
> > bridge (PCI express gpp port B)
> > 00:04.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI 
> > bridge (PCI express gpp port D)
> > 00:05.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI 
> > bridge (PCI express gpp port E)
> > 00:06.0 PCI bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to PCI 
> > bridge (PCI expre

Re: 3.6-rc7 boot crash + bisection

2012-09-25 Thread Alex Williamson

On Tue, 2012-09-25 at 20:54 +0200, Florian Dazinger wrote:
> Am Tue, 25 Sep 2012 12:32:50 -0600
> schrieb Alex Williamson :
> 
> > On Mon, 2012-09-24 at 21:03 +0200, Florian Dazinger wrote:
> > > Hi,
> > > I think I've found a regression, which causes an early boot crash, I
> > > appended the kernel output via jpg file, since I do not have a serial
> > > console or sth.
> > > 
> > > after bisection, it boils down to this commit:
> > > 
> > > 9dcd61303af862c279df86aa97fde7ce371be774 is the first bad commit
> > > commit 9dcd61303af862c279df86aa97fde7ce371be774
> > > Author: Alex Williamson 
> > > Date:   Wed May 30 14:19:07 2012 -0600
> > > 
> > > amd_iommu: Support IOMMU groups
> > > 
> > > Add IOMMU group support to AMD-Vi device init and uninit code.
> > > Existing notifiers make sure this gets called for each device.
> > > 
> > > Signed-off-by: Alex Williamson 
> > > Signed-off-by: Joerg Roedel 
> > > 
> > > :04 04 2f6b1b8e104d6dfec0abaa9646750f9b5a4f4060
> > > 837ae95e84f6d3553457c4df595a9caa56843c03 M  drivers
> > 
> > [switching back to mailing list thread]
> > 
> > I asked Florian for dmesg w/ amd_iommu_dump, here's the relevant lines:
> > 
> > [1.485645] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
> > [1.485683] AMD-Vi:mmio-addr: feb2
> > [1.485901] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:00.0 flags: 00
> > [1.485935] AMD-Vi:   DEV_RANGE_END   devid: 00:00.2
> > [1.485969] AMD-Vi:   DEV_SELECT  devid: 00:02.0 
> > flags: 00
> > [1.486002] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 01:00.0 flags: 00
> > [1.486036] AMD-Vi:   DEV_RANGE_END   devid: 01:00.1
> > [1.486070] AMD-Vi:   DEV_SELECT  devid: 00:04.0 
> > flags: 00
> > [1.486103] AMD-Vi:   DEV_SELECT  devid: 02:00.0 
> > flags: 00
> > [1.486137] AMD-Vi:   DEV_SELECT  devid: 00:05.0 
> > flags: 00
> > [1.486170] AMD-Vi:   DEV_SELECT  devid: 03:00.0 
> > flags: 00
> > [1.486204] AMD-Vi:   DEV_SELECT  devid: 00:06.0 
> > flags: 00
> > [1.486238] AMD-Vi:   DEV_SELECT  devid: 04:00.0 
> > flags: 00
> > [1.486271] AMD-Vi:   DEV_SELECT  devid: 00:07.0 
> > flags: 00
> > [1.486305] AMD-Vi:   DEV_SELECT  devid: 05:00.0 
> > flags: 00
> > [1.486338] AMD-Vi:   DEV_SELECT  devid: 00:09.0 
> > flags: 00
> > [1.486372] AMD-Vi:   DEV_SELECT  devid: 06:00.0 
> > flags: 00
> > [1.486406] AMD-Vi:   DEV_SELECT  devid: 00:0b.0 
> > flags: 00
> > [1.486439] AMD-Vi:   DEV_SELECT  devid: 07:00.0 
> > flags: 00
> > [1.486473] AMD-Vi:   DEV_ALIAS_RANGE devid: 08:01.0 
> > flags: 00 devid_to: 08:00.0
> > [1.486510] AMD-Vi:   DEV_RANGE_END   devid: 08:1f.7
> > [1.486548] AMD-Vi:   DEV_SELECT  devid: 00:11.0 
> > flags: 00
> > [1.486581] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:12.0 flags: 00
> > [1.486620] AMD-Vi:   DEV_RANGE_END   devid: 00:12.2
> > [1.486654] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:13.0 flags: 00
> > [1.486688] AMD-Vi:   DEV_RANGE_END   devid: 00:13.2
> > [1.486721] AMD-Vi:   DEV_SELECT  devid: 00:14.0 
> > flags: d7
> > [1.486755] AMD-Vi:   DEV_SELECT  devid: 00:14.3 
> > flags: 00
> > [1.486788] AMD-Vi:   DEV_SELECT  devid: 00:14.4 
> > flags: 00
> > [1.486822] AMD-Vi:   DEV_ALIAS_RANGE devid: 09:00.0 
> > flags: 00 devid_to: 00:14.4
> > [1.486859] AMD-Vi:   DEV_RANGE_END   devid: 09:1f.7
> > [1.486897] AMD-Vi:   DEV_SELECT  devid: 00:14.5 
> > flags: 00
> > [1.486931] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:16.0 flags: 00
> > [1.486965] AMD-Vi:   DEV_RANGE_END   devid: 00:16.2
> > [1.487055] AMD-Vi: Enabling IOMMU at :00:00.2 cap 0x40
> > 
> > 
> > > lspci:
> > > 00:00.0 Host bridge: Advanced Micro Devices [AMD] nee ATI RD890 PCI to 
> > > PCI bridge (external gfx0 port B) (rev 02)
> > > 00:00.2 IOMMU: Advanced Micro Devices [AMD] nee ATI RD990 I/O Memory 

Re: 3.6-rc7 boot crash + bisection

2012-09-25 Thread Alex Williamson

On Wed, 2012-09-26 at 01:01 +0200, Florian Dazinger wrote:
> Am Tue, 25 Sep 2012 13:43:46 -0600
> schrieb Alex Williamson :
> 
> > On Tue, 2012-09-25 at 20:54 +0200, Florian Dazinger wrote:
> > > Am Tue, 25 Sep 2012 12:32:50 -0600
> > > schrieb Alex Williamson :
> > > 
> > > > On Mon, 2012-09-24 at 21:03 +0200, Florian Dazinger wrote:
> > > > > Hi,
> > > > > I think I've found a regression, which causes an early boot crash, I
> > > > > appended the kernel output via jpg file, since I do not have a serial
> > > > > console or sth.
> > > > > 
> > > > > after bisection, it boils down to this commit:
> > > > > 
> > > > > 9dcd61303af862c279df86aa97fde7ce371be774 is the first bad commit
> > > > > commit 9dcd61303af862c279df86aa97fde7ce371be774
> > > > > Author: Alex Williamson 
> > > > > Date:   Wed May 30 14:19:07 2012 -0600
> > > > > 
> > > > > amd_iommu: Support IOMMU groups
> > > > > 
> > > > > Add IOMMU group support to AMD-Vi device init and uninit code.
> > > > > Existing notifiers make sure this gets called for each device.
> > > > > 
> > > > > Signed-off-by: Alex Williamson 
> > > > > Signed-off-by: Joerg Roedel 
> > > > > 
> > > > > :04 04 2f6b1b8e104d6dfec0abaa9646750f9b5a4f4060
> > > > > 837ae95e84f6d3553457c4df595a9caa56843c03 M  drivers
> > > > 
> > > > [switching back to mailing list thread]
> > > > 
> > > > I asked Florian for dmesg w/ amd_iommu_dump, here's the relevant lines:
> > > > 
> > > > [1.485645] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
> > > > [1.485683] AMD-Vi:mmio-addr: feb2
> > > > [1.485901] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:00.0 flags: 00
> > > > [1.485935] AMD-Vi:   DEV_RANGE_END   devid: 00:00.2
> > > > [1.485969] AMD-Vi:   DEV_SELECT  devid: 00:02.0 flags: 00
> > > > [1.486002] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 01:00.0 flags: 00
> > > > [1.486036] AMD-Vi:   DEV_RANGE_END   devid: 01:00.1
> > > > [1.486070] AMD-Vi:   DEV_SELECT  devid: 00:04.0 flags: 00
> > > > [1.486103] AMD-Vi:   DEV_SELECT  devid: 02:00.0 flags: 00
> > > > [1.486137] AMD-Vi:   DEV_SELECT  devid: 00:05.0 flags: 00
> > > > [1.486170] AMD-Vi:   DEV_SELECT  devid: 03:00.0 flags: 00
> > > > [1.486204] AMD-Vi:   DEV_SELECT  devid: 00:06.0 flags: 00
> > > > [1.486238] AMD-Vi:   DEV_SELECT  devid: 04:00.0 flags: 00
> > > > [1.486271] AMD-Vi:   DEV_SELECT  devid: 00:07.0 flags: 00
> > > > [1.486305] AMD-Vi:   DEV_SELECT  devid: 05:00.0 flags: 00
> > > > [1.486338] AMD-Vi:   DEV_SELECT  devid: 00:09.0 flags: 00
> > > > [1.486372] AMD-Vi:   DEV_SELECT  devid: 06:00.0 flags: 00
> > > > [1.486406] AMD-Vi:   DEV_SELECT  devid: 00:0b.0 flags: 00
> > > > [1.486439] AMD-Vi:   DEV_SELECT  devid: 07:00.0 flags: 00
> > > > [1.486473] AMD-Vi:   DEV_ALIAS_RANGE devid: 08:01.0 flags: 00 devid_to: 08:00.0
> > > > [1.486510] AMD-Vi:   DEV_RANGE_END   devid: 08:1f.7
> > > > [1.486548] AMD-Vi:   DEV_SELECT  devid: 00:11.0 flags: 00
> > > > [1.486581] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:12.0 flags: 00
> > > > [1.486620] AMD-Vi:   DEV_RANGE_END   devid: 00:12.2
> > > > [1.486654] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:13.0 flags: 00
> > > > [1.486688] AMD-Vi:   DEV_RANGE_END   devid: 00:13.2
> > > > [1.486721] AMD-Vi:   DEV_SELECT  devid: 00:14.0 flags: d7
> > > >

Re: 3.6-rc7 boot crash + bisection

2012-09-26 Thread Alex Williamson

On Wed, 2012-09-26 at 15:20 +0200, Roedel, Joerg wrote:
> On Tue, Sep 25, 2012 at 01:43:46PM -0600, Alex Williamson wrote:
> > Joerg, any thoughts on a quirk for this?  Unfortunately we can't just
> > skip IOMMU groups when an alias is broken because it puts the other
> > IOMMU groups at risk that might not actually be isolated from this
> > device.  It looks like we parse the alias info before PCI is probed, so
> > maybe we'd need to call the quirk from iommu_init_device itself.
> 
> I fear that the BIOS does everything right and device 08:04.0 is indeed
> using 08:00.0 as request-id. There are a couple of devices where this
> happens, usually when the vendor just took the old 32bit PCI chip, added
> a transparent PCIe-to-PCI bridge to the device and sold it as PCIe.
> 
> So the assumption that every request-id has a corresponding pci_dev
> structure does not hold. I also had made that assumption in the
> AMD IOMMU driver but had to add code which removes that assumption. We
> should look for a way to remove that assumption from the group-code too.

Hmm, that throws a kink in iommu groups.  So perhaps we need to make an
alias interface to iommu groups.  Seems like this could just be an extra
parameter to iommu_group_get and iommu_group_add_device (empty in the
typical case).  Then we have the problem of what's the type for an
alias?  For AMD-Vi, it's a u16, but we need to be more generic than
that.  Maybe iommu groups should just treat it as a void* so iommus can
use a pointer to some structure or a fixed value like a u16 bus:slot.
Thoughts?  Thanks,

Alex
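
[Illustration, not part of the original message.]  The alias idea above
can be sketched in plain userspace C.  This is only a toy model under
stated assumptions: toy_group_get, toy_group_add_device, toy_registry
and devid are hypothetical names made up here, not the kernel's
iommu_group_* API, and the sketch assumes the usual 16-bit bus/slot/
function packing for the AMD-Vi device id.  The alias is an opaque
pointer that only a driver-supplied comparison callback interprets, and
a NULL alias stands for the typical no-alias case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names are the kernel's API. */

#define MAX_GROUPS 16

/* Assumed 16-bit device id layout: bus[15:8] slot[7:3] function[2:0]. */
static inline uint16_t devid(uint8_t bus, uint8_t slot, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((slot & 0x1f) << 3) | (fn & 0x7));
}

struct toy_group {
    int id;
    const void *alias;  /* opaque key; storage is owned by the caller */
    int ndevs;
};

struct toy_registry {
    struct toy_group groups[MAX_GROUPS];
    int ngroups;
    /* Driver-supplied comparison, since the core cannot read the key. */
    int (*alias_eq)(const void *a, const void *b);
};

/*
 * Return the group already registered for this alias, or create a new
 * one.  A NULL alias never matches, so it always gets a fresh group.
 * Returns NULL when the table is full.
 */
static struct toy_group *toy_group_get(struct toy_registry *r,
                                       const void *alias)
{
    assert(r != NULL);
    if (alias != NULL) {
        assert(r->alias_eq != NULL);
        for (int i = 0; i < r->ngroups; i++) {
            const void *known = r->groups[i].alias;
            if (known != NULL && r->alias_eq(known, alias))
                return &r->groups[i];
        }
    }
    if (r->ngroups == MAX_GROUPS)
        return NULL;

    struct toy_group *g = &r->groups[r->ngroups];
    g->id = r->ngroups++;
    g->alias = alias;
    g->ndevs = 0;
    return g;
}

/* Add one device to the group selected by alias; returns the group id. */
static int toy_group_add_device(struct toy_registry *r, const void *alias)
{
    struct toy_group *g = toy_group_get(r, alias);
    if (g == NULL)
        return -1;
    g->ndevs++;
    return g->id;
}

/* Comparison a driver with u16 aliases might supply. */
static int u16_alias_eq(const void *a, const void *b)
{
    return *(const uint16_t *)a == *(const uint16_t *)b;
}
```

In this model, two devices that both pass an alias equal to
devid(0x08, 0x00, 0) land in the same group even if no device structure
exists for 08:00.0 itself, while devices added with a NULL alias each
get their own group.  Keeping the key opaque and delegating comparison
to the driver is one way to stay generic across IOMMUs; a fixed-width
integer key would be simpler but less general.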

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
