Re: [Xen-devel] [PATCH 42/44] powerpc/cell: use the dma_supported method for ops switching

2017-06-17 Thread Benjamin Herrenschmidt
On Fri, 2017-06-16 at 20:10 +0200, Christoph Hellwig wrote:
> Besides removing the last instance of the set_dma_mask method this also
> reduces the code duplication.

What is your rationale here ? (I have missed patch 0 it seems).

dma_supported() was supposed to be pretty much a "const" function
simply informing whether a given setup is possible. Having it perform
an actual switch of ops seems to be pushing it...

What if a driver wants to test various dma masks and then pick one ?

Where does the API document that if a driver calls dma_supported() it
then *must* set the corresponding mask and use that ?

I don't like a function that is a "boolean query" like this one to have
such a major side effect.

From an API standpoint, dma_set_mask() is when the mask is established,
and thus when the ops switch should happen.
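
To illustrate with the usual probe-time pattern (a sketch, not taken
from any particular driver):

	/* Try 64-bit DMA first, fall back to 32-bit. Each dma_set_mask()
	 * call is where the mask, and thus the ops choice, should be
	 * established; not in the dma_supported() query itself. */
	if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)) &&
	    dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)))
		return -EIO;	/* no usable mask */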

Ben.

> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/platforms/cell/iommu.c | 25 +
>  1 file changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
> index 497bfbdbd967..29d4f96ed33e 100644
> --- a/arch/powerpc/platforms/cell/iommu.c
> +++ b/arch/powerpc/platforms/cell/iommu.c
> @@ -644,20 +644,14 @@ static void dma_fixed_unmap_sg(struct device *dev, struct scatterlist *sg,
>  direction, attrs);
>  }
>  
> -static int dma_fixed_dma_supported(struct device *dev, u64 mask)
> -{
> - return mask == DMA_BIT_MASK(64);
> -}
> -
> -static int dma_set_mask_and_switch(struct device *dev, u64 dma_mask);
> +static int dma_supported_and_switch(struct device *dev, u64 dma_mask);
>  
>  static const struct dma_map_ops dma_iommu_fixed_ops = {
>   .alloc  = dma_fixed_alloc_coherent,
>   .free   = dma_fixed_free_coherent,
>   .map_sg = dma_fixed_map_sg,
>   .unmap_sg   = dma_fixed_unmap_sg,
> - .dma_supported  = dma_fixed_dma_supported,
> - .set_dma_mask   = dma_set_mask_and_switch,
> + .dma_supported  = dma_supported_and_switch,
>   .map_page   = dma_fixed_map_page,
>   .unmap_page = dma_fixed_unmap_page,
>   .mapping_error  = dma_iommu_mapping_error,
> @@ -952,11 +946,8 @@ static u64 cell_iommu_get_fixed_address(struct device *dev)
>   return dev_addr;
>  }
>  
> -static int dma_set_mask_and_switch(struct device *dev, u64 dma_mask)
> +static int dma_supported_and_switch(struct device *dev, u64 dma_mask)
>  {
> - if (!dev->dma_mask || !dma_supported(dev, dma_mask))
> - return -EIO;
> -
>   if (dma_mask == DMA_BIT_MASK(64) &&
>   cell_iommu_get_fixed_address(dev) != OF_BAD_ADDR) {
>   u64 addr = cell_iommu_get_fixed_address(dev) +
> @@ -965,14 +956,16 @@ static int dma_set_mask_and_switch(struct device *dev, u64 dma_mask)
>   dev_dbg(dev, "iommu: fixed addr = %llx\n", addr);
>   set_dma_ops(dev, &dma_iommu_fixed_ops);
>   set_dma_offset(dev, addr);
> - } else {
> + return 1;
> + }
> +
> + if (dma_iommu_dma_supported(dev, dma_mask)) {
>   dev_dbg(dev, "iommu: not 64-bit, using default ops\n");
>   set_dma_ops(dev, get_pci_dma_ops());
>   cell_dma_dev_setup(dev);
> + return 1;
>   }
>  
> - *dev->dma_mask = dma_mask;
> -
>   return 0;
>  }
>  
> @@ -1127,7 +1120,7 @@ static int __init cell_iommu_fixed_mapping_init(void)
>   cell_iommu_setup_window(iommu, np, dbase, dsize, 0);
>   }
>  
> - dma_iommu_ops.set_dma_mask = dma_set_mask_and_switch;
> + dma_iommu_ops.dma_supported = dma_supported_and_switch;
>   set_pci_dma_ops(&dma_iommu_ops);
>  
>   return 0;



Re: [Xen-devel] [PATCH 42/44] powerpc/cell: use the dma_supported method for ops switching

2017-06-18 Thread Benjamin Herrenschmidt
On Sun, 2017-06-18 at 00:13 -0700, Christoph Hellwig wrote:
> On Sun, Jun 18, 2017 at 06:50:27AM +1000, Benjamin Herrenschmidt wrote:
> > What is your rationale here ? (I have missed patch 0 it seems).
> 
> Less code duplication, more modular dma_map_ops instances.
> 
> > dma_supported() was supposed to be pretty much a "const" function
> > simply informing whether a given setup is possible. Having it perform
> > an actual switch of ops seems to be pushing it...
> 
> dma_supported() is already gone from the public DMA API as it doesn't
> make sense to be called separately from set_dma_mask.  It will be
> entirely gone in the next series after this one.

Ah ok, in that case it makes much more sense, we can rename it then.

> > What if a driver wants to test various dma masks and then pick one ?
> > 
> > Where does the API documents that if a driver calls dma_supported() it
> > then *must* set the corresponding mask and use that ?
> 
> Where is the API document for _any_ of the dma routines? (A: work in
> progress by me, but I need to clean up the mess of arch hooks before
> it can make any sense)

Heh fair enough.

> > I don't like a function that is a "boolean query" like this one to have
> > such a major side effect.
> > 
> > From an API standpoint, dma_set_mask() is when the mask is established,
> > and thus when the ops switch should happen.
> 
> And that's exactly what happens at the driver API level.  It just turns
> out the dma_capable method is the way better place to actually
> implement it, as the ->set_dma_mask method requires lots of code
> duplication while not offering any actual benefit over ->dma_capable.
> And because of that it's gone after this series.
> 
> In theory we could rename ->dma_capable now, but it would require a
> _lot_ of churn.  Give me another merge window or two and we should
> be down to about two handfuls of dma_map_ops instances, at which point
> we could do all this gratuitous renaming a lot more easily :)

Sure, I get it now, as long as it's not publicly exposed to drivers
we are fine.
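
For the archives: the generic path this converges on then looks roughly
like the sketch below, with dma_set_mask() as the only public entry
point and ->dma_supported free to switch ops underneath it (a sketch of
the direction, not the exact tree):

	int dma_set_mask(struct device *dev, u64 mask)
	{
		if (!dev->dma_mask || !dma_supported(dev, mask))
			return -EIO;	/* ->dma_supported may switch ops here */
		*dev->dma_mask = mask;
		return 0;
	}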

Cheers,
Ben.




Re: [Xen-devel] clean up and modularize arch dma_mapping interface V2

2017-06-24 Thread Benjamin Herrenschmidt
On Sat, 2017-06-24 at 09:18 +0200, Christoph Hellwig wrote:
> On Wed, Jun 21, 2017 at 12:24:28PM -0700, tndave wrote:
> > Thanks for doing this.
> > So archs can still have their own definition for dma_set_mask() if 
> > HAVE_ARCH_DMA_SET_MASK is y?
> > (and similarly for dma_set_coherent_mask() when 
> > CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK is y)
> > Any plan to change these?
> 
> Yes, those should go away, but I'm not entirely sure how yet.  We'll
> need some hook for switching between an IOMMU and a direct mapping
> (I guess that's what you want to do for sparc as well?), and I need
> to find the best way to do that.  Reimplementing all of dma_set_mask
> and dma_set_coherent_mask is something that I want to move away from.

I think we still need to do it. For example we have a bunch of new "funky"
cases.

We already have the case where we mix the direct and iommu mappings,
on some powerpc platforms that translates into an iommu mapping down at
0 for the 32-bit space and a direct mapping high up in the PCI address
space (which crops the top bits and thus hits memory at 0 onwards).

This type of hybrid layout is needed by some adapters, typically
storage, which want to keep the "coherent" mask at 32-bit but support
64-bit for streaming masks.
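
Concretely, that layout looks something like this (addresses purely
illustrative):

	bus address                      translation
	0x0000_0000..0xffff_ffff         32-bit iommu window (translated)
	0x0800_0000_0000_0000 and up     direct window: top bits cropped,
	                                 hits system memory at 0 onwards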

Another one we are trying to deal better with at the moment is devices
with DMA addressing limitations. Some GPUs typically (but not only)
have limits that go all across the gamut; typically I've seen 40 bits,
44 bits and 47 bits... And of course those are "high performance"
adapters so we don't want to limit them to the comparatively small
iommu mapping with extra overhead.

At the moment, we're looking at a dma_set_mask() hook that will, for
these guys, re-configure the iommu mapping to create a "compressed"
linear mapping of system memory (ie, skipping the holes we have between
chips on P9 for example) using the largest possible iommu page size
(256M on P8, 1G on P9).
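
In sketch form (the helper name below is made up, this is just the
shape of the hook):

	static int ppc_dma_set_mask(struct device *dev, u64 dma_mask)
	{
		/* Limited 64-bit device (40/44/47 bits...): rebuild the
		 * iommu table as a linear map of RAM using huge iommu
		 * pages instead of the small 32-bit window. */
		if (dma_mask >= DMA_BIT_MASK(40) && dma_mask < DMA_BIT_MASK(64))
			iommu_setup_linear_mapping(dev, dma_mask); /* hypothetical */
		*dev->dma_mask = dma_mask;
		return 0;
	}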

This is made tricky of course because several devices can potentially
share a DMA domain based on various platform specific reasons. And of
course we have no way to figure out what's the "common denominator" of
all those devices before they start doing DMA. A driver can start
before the neighbour is probed and a driver can start doing DMAs using
the standard 32-bit mapping without doing dma_set_mask().

So heuristics ... ugh. Better ideas welcome :-) All that to say that we
are far from being able to get rid of dma_set_mask() custom
implementations (and coherent mask too).

I was tempted at some point to retire the 32-bit iommu mapping
completely, just doing that "linear" thing I mentioned above and
swiotlb for the rest, along with introducing ZONE_DMA32 on powerpc
(with the real 64-bit bypass still around for non-limited devices but
that's then just extra speed by bypassing the iommu xlate & cache).

But I worry of the impact on those silly adapters that set the coherent
mask to 32-bits to keep their microcode & descriptor ring down in 32-
bit space. I'm not sure how well ZONE_DMA32 behaves in those cases.

Cheers,
Ben.




Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-23 Thread Benjamin Herrenschmidt
On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" 
> 
> PCI BARs tell us whether prefetching is safe, but they don't say anything
> about write combining (WC).  WC changes ordering rules and allows writes to
> be collapsed, so it's not safe in general to use it on a prefetchable
> region.

Well, the PCIe spec at least specifies that a prefetchable BAR also
tolerates write merging... 

> Add pci_iomap_wc() and pci_iomap_wc_range() so drivers can take advantage
> of write combining when they know it's safe.
> 
> On architectures that don't fully support WC, e.g., x86 without PAT,
> drivers for legacy framebuffers may get some of the benefit by using
> arch_phys_wc_add() in addition to pci_iomap_wc().  But arch_phys_wc_add()
> is unreliable and should be avoided in general.  On x86, it uses MTRRs,
> which are limited in number and size, so the results will vary based on
> driver loading order.
> 
> The goals of adding pci_iomap_wc() are to:
> 
> - Give drivers an architecture-independent way to use WC so they can stop
>   using interfaces like mtrr_add() (on x86, pci_iomap_wc() uses
>   PAT when available)
> 
> - Move toward using _PAGE_CACHE_MODE_UC, not _PAGE_CACHE_MODE_UC_MINUS,
>   on x86 on ioremap_nocache() (see de33c442ed2a ("x86 PAT: fix
>   performance drop for glx, use UC minus for ioremap(), ioremap_nocache()
>   and pci_mmap_page_range()")
> 
> Link: http://lkml.kernel.org/r/1426893517-2511-6-git-send-email-mcg...@do-not-panic.com
> Original-posting: http://lkml.kernel.org/r/1432163293-20965-1-git-send-email-mcg...@do-not-panic.com
> Cc: Toshi Kani 
> Cc: Andy Lutomirski 
> Cc: Suresh Siddha 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Cc: Juergen Gross 
> Cc: Daniel Vetter 
> Cc: Dave Airlie 
> Cc: Bjorn Helgaas 
> Cc: Antonino Daplas 
> Cc: Jean-Christophe Plagniol-Villard 
> Cc: Tomi Valkeinen 
> Cc: Dave Hansen 
> Cc: Arnd Bergmann 
> Cc: Michael S. Tsirkin 
> Cc: venkatesh.pallip...@intel.com
> Cc: Stefan Bader 
> Cc: Ville Syrjälä 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: Borislav Petkov 
> Cc: Davidlohr Bueso 
> Cc: konrad.w...@oracle.com
> Cc: ville.syrj...@linux.intel.com
> Cc: david.vra...@citrix.com
> Cc: jbeul...@suse.com
> Cc: Roger Pau Monné 
> Cc: linux-fb...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: xen-de...@lists.xensource.com
> Signed-off-by: Luis R. Rodriguez 
> ---
>  include/asm-generic/pci_iomap.h | 14 ++
>  lib/pci_iomap.c | 61 +
>  2 files changed, 75 insertions(+)
> 
> diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
> index 7389c87..b1e17fc 100644
> --- a/include/asm-generic/pci_iomap.h
> +++ b/include/asm-generic/pci_iomap.h
> @@ -15,9 +15,13 @@ struct pci_dev;
>  #ifdef CONFIG_PCI
>  /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
>  extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
> +extern void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max);
>  extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
>unsigned long offset,
>unsigned long maxlen);
> +extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> + unsigned long offset,
> + unsigned long maxlen);
>  /* Create a virtual mapping cookie for a port on a given PCI device.
>   * Do not call this directly, it exists to make it easier for architectures
>   * to override */
> @@ -34,12 +38,22 @@ static inline void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long
>   return NULL;
>  }
>  
> +static inline void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max)
> +{
> + return NULL;
> +}
>  static inline void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
>   unsigned long offset,
>   unsigned long maxlen)
>  {
>   return NULL;
>  }
> +static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> +unsigned long offset,
> +unsigned long maxlen)
> +{
> + return NULL;
> +}
>  #endif
>  
>  #endif /* __ASM_GENERIC_IO_H */
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index bcce5f1..9604dcb 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -52,6 +52,46 @@ void __iomem *pci_iomap_range(struct pci_dev *dev,
>  EXPORT_SYMBOL(pci_iomap_range);
>  
>  /**
> + * pci_iomap_wc_range - create a virtual WC mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @offset: map memory at the given offset in BAR
> + * @maxlen: max length of the memory to map
> + *
> + * Using this function you will get a __iomem address to your device

Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-24 Thread Benjamin Herrenschmidt
On Wed, 2015-06-24 at 18:38 +0200, Luis R. Rodriguez wrote:
> On Wed, Jun 24, 2015 at 08:42:23AM +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote:
> > > From: "Luis R. Rodriguez" 
> > > 
> > > PCI BARs tell us whether prefetching is safe, but they don't say anything
> > > about write combining (WC).  WC changes ordering rules and allows writes 
> > > to
> > > be collapsed, so it's not safe in general to use it on a prefetchable
> > > region.
> > 
> > Well, the PCIe spec at least specifies that a prefetchable BAR also
> > tolerates write merging... 
> 
> How can that be determined and can that be used as a full bullet proof hint
> to enable wc ? And are you sure? :) 

Well, I'm sure the spec says that ;-) But it could be new to PCIe, I
haven't checked legacy PCI.

> Reason all this was stated was to be
> apologetic over why we can't automate this behind the scenes. Otherwise
> we could amend what you stated into the commit log to elaborate on our
> technical apology. Let me know!

At least on powerpc, for mmap of resource to userspace, we take off the
guarded bit in the PTE for prefetchable BARs. This has the effect
architecturally of enabling both prefetch and write combine (ie. side
effect) though afaik, the implementations probably don't actually
prefetch. We've done that for years.

In fact we don't have a way to split the notions, it's either G or no G,
which carries both meanings.
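
In code terms it boils down to something like this (a simplified sketch
of the mmap path; "prefetchable" stands in for the real BAR check):

	prot = pgprot_noncached(vma->vm_page_prot);		/* I=1 G=1 */
	if (prefetchable)
		prot = pgprot_noncached_wc(vma->vm_page_prot);	/* I=1 G=0 */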

Do you have example/case of a device having problems ?

Cheers,
Ben.





Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-24 Thread Benjamin Herrenschmidt
On Wed, 2015-06-24 at 15:29 -0700, Luis R. Rodriguez wrote:

> Nope but at least what made me squint at this being a possible
> "feature" was that in practice when reviewing all of the kernels
> pending device drivers using MTRR (potential write-combine candidates)
> I encountered a slew of them which had the architecturally unfortunate
> practice of combining PCI bars for MMIO and their respective
> write-combined desirable area (framebuffer for video, PIO buffers for
> infiniband, etc). Now, to me that read more as a practice for old
> school devices when such things were likely still being evaluated,
> more modern devices seem to adhere to sticking a full PCI bar with
> write-combining or not. Did you not encounter such mismatch splits on
> powerpc ? Was such possibility addressed?

Yes, I remember we dealt with some networking (or maybe IB) stuff back
in the day. We dealt with it by using a WC mapping and explicit barriers
to prevent combine when not wanted.

It is to be noted that on powerpc at least, writel() and co will never
combine due to the memory barriers in them. Only "normal" stores (or
__raw_writel) will.
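
Roughly, and simplified from what arch/powerpc/include/asm/io.h does
(this is a sketch, not the literal source):

	static inline void writel(u32 val, volatile void __iomem *addr)
	{
		__asm__ __volatile__("sync");	/* heavy barrier: defeats WC */
		__raw_writel((__force u32)cpu_to_le32(val), addr);
	}

__raw_writel() is just the bare store, so it can still combine.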

On Intel things are different I assume...

The problem I see is that architectures can provide widely different
mechanisms and semantics in those areas and it's hard to define a
generic driver interface.

> If what you are implying here is applicable to the x86 world I'm all
> for enabling this as we'd have less code to maintain but I'll note
> that getting a clarification alone on that prefetchable !=
> write-combining was in and of itself hard, I'd be surprised if we
> could get full architectural buy-in to this as an immediate automatic
> feature.

I'm happy not to make it automatic for kernel space. As for user
mappings, maybe the right thing to do is to let us do what we do by
default with a quirk that can set a flag in pci_dev to disable that
behaviour (maybe on a per BAR basis ?). I think the common case is
that WC works.

>  Because of this and because PAT did have some errata as well,
> I would not be surprised if some PCI bridges / devices would end up
> finding corner cases, as such if we can really do what you're saying
> and unless we can get some super sane certainty over it across the
> board, I'd be inclined to leave such things as a part of a new API.
> Maybe have some folks test using the new API for all calls and after
> some sanity of testing / releases consider a full switch.
> 
> That is, unless of course you're sure all this is sane and would wager
> all-in on it from the get-go.

Cheers,
Ben.





Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-24 Thread Benjamin Herrenschmidt
On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote:
> 
> OK thanks I'll proceed with these patches then.
> 
> > As for user mappings,
> 
> Which APIs were you considering in this regard BTW?

mmap of the generic /sys/bus/pci/.../resource*

> > maybe the right thing to do is to let us do what we do by
> > default with a quirk that can set a flag in pci_dev to disable that
> > behaviour (maybe on a per BAR basis ?).
> 
> That might mean it could restrict userspace WC to require devices
> to have WC parts on a full PCI BAR. Although this is restrictive
> having reviewed most WC uses in the kernel I'd think this would be
> a fair compromise to make, but again, if things are still murky
> perhaps best we kiss this idea good bye for now and hope for it
> to come in on future buses or amendments (if that's even possible?).
> 
> > I think the common case is that WC works.
> 
> If WC does not, I will note one hack which might be worth mentioning --
> just for
> the record, this was devised as a shortcoming of a device where they
> failed to
> split things properly and that *without* WC performance suffered quite
> a bit so
> they made one full PCI BAR WC and as a work around this:
> 
> http://lkml.kernel.org/r/20150416041837.GA5712@hykim-PC
> 
> That is for registers that needed it:
> 
> write; wmb;
> 
> Then if they wanted to wait till the NIC has seen the write, they did:
> 
> write; wmb; read;
> 

Right, and as I mentioned, on some archs like powerpc (and possibly
more), writel() and co contain an implicit mb()
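
Spelled out, the quoted workaround looks like this (wc_base and
DOORBELL are illustrative):

	__raw_writel(cmd, wc_base + DOORBELL);	/* bare store into WC BAR */
	wmb();					/* flush the combine buffer */
	(void)__raw_readl(wc_base + DOORBELL);	/* read back: NIC saw it */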

>   Luis--





Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-24 Thread Benjamin Herrenschmidt
On Wed, 2015-06-24 at 17:58 -0700, Luis R. Rodriguez wrote:
> On Wed, Jun 24, 2015 at 5:52 PM, Benjamin Herrenschmidt
>  wrote:
> > On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote:
> >>
> >> OK thanks I'll proceed with these patches then.
> >>
> >> > As for user mappings,
> >>
> >> Which APIs were you considering in this regard BTW?
> >
> > mmap of the generic /sys/bus/pci/.../resource*
> 
> Like? Got a demo patch in mind ? :)

Nope. I was just thinking out loud. Today I have yet to see a problem
with what we do so ...

Cheers,
Ben.





Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-25 Thread Benjamin Herrenschmidt
On Thu, 2015-06-25 at 21:40 +, Casey Leedom wrote:
> Hhmmm, so what do PowerPC Drivers do when they want to take
> advantage of Write Combining?  Do their own Endian Swizzling
> with the __raw_*() APIs?

Yeah, either that, or we need to fix our relaxed implementation (patch
welcome :-)

Cheers,
Ben.





Re: [Xen-devel] [PATCH v8 0/9] pci: add pci_iomap_wc() and pci_ioremap_wc_bar()

2015-06-25 Thread Benjamin Herrenschmidt
On Wed, 2015-06-24 at 18:22 -0700, Luis R. Rodriguez wrote:
> Although I had test compiled this before just to be safe I went ahead and
> successfully test-compiled this set with allmodconfig, especially since I've 
> now
> removed the exports for the devres routines.  Please let me know if these 
> might
> be able to go through you or if there are any questions. I will note the 
> recent
> discussion with Benjamin over the v7 series concluded that the ideas we both
> were alluding to, on automating instead the WC effects for devices seems a bit
> too idealistic for PCI / PCIE for now, but perhaps we should at least consider
> this in the future for userspace mmap() calls [4].

So I've been trying to figure out how to make this practically work
for us (powerpc).

writel() will never write combine for us, it uses too heavy barriers.

writel_relaxed() today is identical to writel() but we can change it.

The problem is that switching to G=0 mappings (which is what provides
us with write combining) also architecturally enables prefetch and
speculative loads... and again architecturally (the implementations may
differ), kills the effect of the lightweight io barrier eieio which we
would have to use in readl_relaxed() and writel_relaxed() to provide
their normal semantics.

So it boils down to: can we modify the documentation of readl_relaxed()
and writel_relaxed() to define them as being even further relaxed when
using a "wc" mapping ?

Otherwise, the only way out I see for us on powerpc is to bias
writel_relaxed() massively against readl_relaxed() by putting heavy
barriers around the load in the latter, so we can keep them completely
out of the former and still enable wc.
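
In other words something like this (a sketch of the bias, not actual
kernel code):

	/* writes stay bare so a G=0 mapping can combine them... */
	#define writel_relaxed(v, a) \
		__raw_writel((__force u32)cpu_to_le32(v), (a))

	/* ...and reads pay for the ordering instead */
	static inline u32 readl_relaxed(const volatile void __iomem *addr)
	{
		u32 val;

		mb();	/* order against preceding relaxed stores */
		val = le32_to_cpu((__force __le32)__raw_readl(addr));
		mb();	/* and against whatever follows */
		return val;
	}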

Ben.




Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-25 Thread Benjamin Herrenschmidt
On Thu, 2015-06-25 at 21:40 +, Casey Leedom wrote:
> 
> Ah, thanks.  I see now that the __raw_*() APIs don't do any of the
> Endian Swizzling.  Unfortunately the *_relaxed() APIs on PowerPC
> are just defined as the normal *() routines.  From
> arch/powerpc/include/asm/io.h:
> 
> /*
>  * We don't do relaxed operations yet, at least not with this
> semantic
>  */

Yes so I was looking at this but there are some difficulties.
Architecturally, even with I=1 G=1 mappings (normal ioremap), we have no
guarantee of ordering of load vs. store unless I misunderstood
something. I think all current implementations provide some of that but
without barriers in the accessors, we aren't architecturally correct.

However, having those barriers will cause issues with G=0 (write
combine). It's unclear whether eieio() will provide the required
ordering for I=1 G=0 mappings and it will probably break write combine.

I'm looking into it with our HW guys and will try to come up with a
solution for power, but it doesn't help that our memory model conflates
write combining with other relaxations and that all our barriers also
prevent write combine.

Maybe we can bias the relaxed accessors toward write, by having no
barriers in it, and putting extra ones in reads.

Cheers,
Ben.




Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-26 Thread Benjamin Herrenschmidt
On Fri, 2015-06-26 at 16:24 +, Casey Leedom wrote:
>   Thanks for looking into this Ben.  As it stands now, it seems as
> if Write Combined mappings simply aren't supported and/or all
> driver writers trying to utilize Write Combined mappings have to
> "hand roll" their own solutions which really isn't supportable.
> 
>   One thing that might be considered is simply to treat the desire
> to utilize the Write Combining hardware as a separate issue and
> develop writel_wc(), writeq_wc(), etc.  Those could be defined
> as legal only for Write Combined Mappings and would "do the
> right thing" for each architecture.  

The question then is what is "the right thing". In the powerpc case,
we'll have a non-guarded mapping, which means we also get no ordering
between load and stores.

> The initial implementations
> could simply be writel(), writeq(), etc. till each architecture's
> issues were investigated and understood.
> 
>   Additionally, it would be very useful to have some sort of
> predicate which could be used to determine if an architecture
> supported Write Combining.  Our drivers would use this to
> eliminate a wasteful attempt to use write combining on those
> architectures where it didn't work.
> 
> Casey
> 
> 
> From: Benjamin Herrenschmidt [b...@kernel.crashing.org]
> Sent: Thursday, June 25, 2015 7:02 PM
> To: Casey Leedom
> Cc: Arnd Bergmann; Luis R. Rodriguez; Michael S. Tsirkin; Bjorn Helgaas; 
> Toshi Kani; Andy Lutomirski; Juergen Gross; Tomi Valkeinen; 
> linux-...@vger.kernel.org; linux-ker...@vger.kernel.org; 
> xen-de...@lists.xensource.com; linux-fbdev; Suresh Siddha; Ingo Molnar; 
> Thomas Gleixner; Daniel Vetter; Dave Airlie; Antonino Daplas; Jean-Christophe 
> Plagniol-Villard; Dave Hansen; venkatesh.pallip...@intel.com; Stefan Bader; 
> ville.syrj...@linux.intel.com; David Vrabel; Jan Beulich; Roger Pau Monné
> Subject: Re: [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants
> 
> On Thu, 2015-06-25 at 21:40 +, Casey Leedom wrote:
> >
> > Ah, thanks.  I see now that the __raw_*() APIs don't do any of the
> > Endian Swizzling.  Unfortunately the *_relaxed() APIs on PowerPC
> > are just defined as the normal *() routines.  From
> > arch/powerpc/include/asm/io.h:
> >
> > /*
> >  * We don't do relaxed operations yet, at least not with this
> > semantic
> >  */
> 
> Yes so I was looking at this but there are some difficulties.
> Architecturally, even with I=1 G=1 mappings (normal ioremap), we have no
> guarantee of ordering of load vs. store unless I misunderstood
> something. I think all current implementations provide some of that but
> without barriers in the accessors, we aren't architecturally correct.
> 
> However, having those barriers will cause issues with G=0 (write
> combine). It's unclear whether eieio() will provide the required
> ordering for I=1 G=0 mappings and it will probably break write combine.
> 
> I'm looking into it with our HW guys and will try to come up with a
> solution for power, but it doesn't help that our memory model conflates
> write combining with other relaxations and that all our barriers also
> prevent write combine.
> 
> Maybe we can bias the relaxed accessors toward write, by having no
> barriers in it, and putting extra ones in reads.
> 
> Cheers,
> Ben.
> 





Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-26 Thread Benjamin Herrenschmidt
On Fri, 2015-06-26 at 21:31 +0200, Luis R. Rodriguez wrote:
> > Yeah either, we need to fix our relaxed implementation (patch
> > welcome :-)
> 
> Yikes, so although powerpc has useful heuristics to automatically
> enable WC the default write ops have been nullifying its effects?

The heuristic is for userspace mapping which don't use the kernel
accessors. To cut a long story short, this was all done back in the day
to speed up X.org which doesn't use fancy accessors with barriers to
access the framebuffer (neither does the kernel fbdev layer).

> Shouldn't this have been blatantly obvious in performance benchmarks
> and expectations? 

> If not, a change should give considerable performance
> gains... If the WC was nullified on powerpc then the experience and
> value of the heuristics is diminished, and even if they are going
> to be only considered for userspace mmap() it should be taken with
> a grain of salt it seems.

It wasn't nullified for the main user at the time, the fb. And I
mentioned an IB adapter or two for which the code had been hand tuned.

Cheers,
Ben.




Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-06-26 Thread Benjamin Herrenschmidt
On Fri, 2015-06-26 at 15:41 -0700, Luis R. Rodriguez wrote:

> > It wasn't nullified for the main user at the time, the fb. And I
> > mentioned an IB adapter or two for which the code had been hand
> tuned.
> 
> This still means there could be some affected drivers when used on
> powerpc, no?

Yes. In fact what about things like ARM, which also have barriers in their
writel() ? Won't they also break WC ?

I'm trying to work with the architect and designers here to figure out
exactly where we stand and what we can do. As spelled out by our
architecture, things don't look great, because basically we only have
one attribute bit (guarded) which when not set implies both WC and out of
order (& prefetch), and unclear barrier semantics in that case as well.

I *think* we might be able to settle with something along the lines of
"writel_relaxed() will allow combine on a WC mapping" but how I'm going
to get there is TBD.

It would be interesting to clarify the semantics of using the relaxed
accessors in combination with WC anyway. I wouldn't mind if the
definition involved also relaxing general ordering :-) It would
definitely make my life easier.

Cheers,
Ben.





Re: [Xen-devel] [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants

2015-07-02 Thread Benjamin Herrenschmidt
On Thu, 2015-07-02 at 20:49 +0200, Luis R. Rodriguez wrote:
> > The question then is what is "the right thing". In the powerpc case,
> > we'll have a non-garded mapping, which means we also get no ordering
> > between load and stores.
> 
> I don't follow, you need *ordering* between loads and stores for WC? We should
> not need that for WC, it's why WC is used for only very specific things
> such as framebuffers and PIO (which BTW I still don't quite get all this
> use case for infiniband to be honest, and I will note I do see some
> proprietary hardware extensions like bursts but nothing covering all
> this in a general doc; I think it all just has to do with this being
> a hardware hack in reality, which we sell as a feature).

Well, that's the problem, the semantics that we provide to drivers
aren't well defined.

The words "write combine" themselves only specify that writes to subsequent
addresses can be combined into larger transactions. That's in itself is already
quite vague (are their boundaries, limits ? some can depend on bus type, etc...)
though in practice is probably sufficient.

However, overloading a _wc mapping with additional memory model differences
such as loss of ordering between load and stores, etc... is not an obvious
thing to do. I agree it would make *my* life easier if we did it since this
is precisely the semantics provided by a "G=0" mapping on ppc, but we need
to agree and *document* it, otherwise bad things will happen eventually.

We also need to document in that case which barriers can be used to explicitly
enforce the ordering on such a mapping and which barriers can be used to break
write combine (they don't necessarily have to be the same).

We also need to document which accessors will actually provide the write combine
"feature" of a _wc mapping. For example while writel() will do it on Intel, it
will not on ppc and I wouldn't be surprised if a bunch of other archs fall in
the same bucket as ppc (basically anything that has barriers in their writel
implementation to order vs. DMA etc...).

So we might need to explicitly document that writel_relaxed() needs to be used.
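
The documented usage could then end up looking like this from a driver
(fb, regs, n, pix and DOORBELL are all illustrative):

	void __iomem *fb = pci_iomap_wc(pdev, 1, 0);	/* WC mapping of BAR 1 */
	int i;

	for (i = 0; i < n; i++)
		writel_relaxed(pix[i], fb + 4 * i);	/* stores may combine */
	wmb();						/* break the combine */
	writel(1, regs + DOORBELL);			/* ordered kick, UC BAR */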

Finally what are the precise guaranteed semantics of writeX/readX,
writeX_relaxed/readX_relaxed and __raw_ (everything else) on a _wc mapping, what
do we mandate and document, what do we leave to be implementation dependent ?

Cheers,
Ben.





Re: [Xen-devel] [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

2015-07-28 Thread Benjamin Herrenschmidt
On Tue, 2015-07-28 at 10:16 +0200, Paolo Bonzini wrote:
> 
> On 28/07/2015 03:08, Andy Lutomirski wrote:
> > On Mon, Sep 1, 2014 at 10:39 AM, Andy Lutomirski  
> > wrote:
> >> This fixes virtio on Xen guests as well as on any other platform
> >> that uses virtio_pci on which physical addresses don't match bus
> >> addresses.
> >>
> >> This can be tested with:
> >>
> >> virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
> >>
> >> using virtme from here:
> >>
> >> https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
> >>
> >> Without these patches, the guest hangs forever.  With these patches,
> >> everything works.
> >>
> > 
> > Dusting off an ancient thread.
> > 
> > Now that the dust has accumulated^Wsettled, is it worth pursuing this?
> >  I think the situation is considerably worse than it was when I
> > originally wrote these patches: I think that QEMU now supports a nasty
> > mode in which the guest's PCI bus appears to be behind an IOMMU but
> > the virtio devices on that bus punch straight through that IOMMU.
> 
> That is an experimental feature (it's x-iommu), so it can change.
> 
> The plan was:
> 
> - for PPC, virtio never honors IOMMU
> 
> - for non-PPC, either have virtio always honor IOMMU, or enforce that
> virtio is not under IOMMU.
> 

I dislike having PPC special cased.

In fact, today x86 guests also assume that virtio bypasses IOMMU I
believe. In fact *all* guests do.

I would much prefer if the information as to whether it honors or not
gets passed to the guest somehow. My preference goes for passing it via
the virtio config space but there were objections that it should be a
bus property (which is tricky to do with PCI and doesn't properly
reflect the fact that in qemu you can mix & match IOMMU-honoring
devices and IOMMU-bypassing virtio on the same bus).
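
On the guest side such a flag would be consumed along these lines (this
is essentially the shape of what later became VIRTIO_F_IOMMU_PLATFORM;
shown here only as a sketch):

	/* device honors the platform IOMMU iff it advertised the flag */
	bool use_dma_api = virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM);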

Ben.

> Paolo
> 
> > I have a half-hearted port to modern kernels here:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_ring_xen
> > 
> > I didn't implement DMA API access for virtio_pci_modern, and I have no
> > idea what to do about detecting whether a given virtio device honors
> > its IOMMU or not.
> > 
> > --Andy
> > 





Re: [Xen-devel] [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

2015-07-28 Thread Benjamin Herrenschmidt
On Tue, 2015-07-28 at 15:43 -0700, Andy Lutomirski wrote:
> Let me try to summarize a proposal:
> 
> Add a feature flag that indicates IOMMU support.
> 
> New kernels acknowledge that flag on any device that advertises it.
> 
> New kernels always respect the IOMMU (except on PowerPC).

Why ? I disagree, the flag should be honored when set in any
architecture. PowerPC is no different than any other platform in that
regard.

>   New kernels
> optionally refuse to talk to devices that don't have that feature flag
> if the device appears to be behind an IOMMU.  (This presumably
> includes any device whatsoever on an x86 platform with an IOMMU,
> including Xen's fake IOMMU.)
> 
> New QEMU always respects the IOMMU, if any, except on PPC.

This is just a matter of what is the default of the flag, ie we
should have a machine flag that indicates what the default is for
new virtio devices, otherwise, it should be specified per device
as an attribute of the device instance.

I would argue that we should default to "bypass IOMMU" on *all*
architecture due to the performance impact, and to essentially
default to the same behaviour as today. With things like DDW even
powerpc might be able to mostly alleviate the performance impact
so we might want to change that in the long term, but I tend to prefer
more incremental approaches.

>   New QEMU
> always advertises this feature flag.  If iommu=on, QEMU's virtio
> devices refuse to work unless the driver acknowledges the flag.

This should be configurable.

> On PPC, new QEMU will not respect the IOMMU and will not set the flag.
> New kernels will not talk to devices that set the flag.  If someone
> wants to fix that, then they get to figure out how.

I disagree with the kernel bit and I disagree with special casing PPC in
any shape or form in the code. The only difference should be a default
value for the iommu mode of virtio in qemu set per machine.
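
Reduced to its logic it is simply this (names made up, just to pin the
proposal down):

	/* a per-device attribute overrides the machine-wide default */
	static bool virtio_bypasses_iommu(bool dev_override_set,
					  bool dev_bypass,
					  bool machine_default_bypass)
	{
		return dev_override_set ? dev_bypass : machine_default_bypass;
	}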

You can then feel free to change that default (in a separate patch for
bisectability) on x86 for the sake of Xen.

Ben.

> This results in:
> 
> New kernels work fine with old QEMU unless iommu=on.
> 
> New kernels work with new devices (QEMU and physical devices that set
> the flag) under all circumstances, except on PPC where physical
> devices are and remain broken.
> 
> Xen works work new QEMU and cleanly refuses to interoperate with old
> QEMU.  (This is worse than with just my patches, but it's better than
> the status quo in which the Xen guest corrupts itself and possibly
> corrupts the Xen hypervisor.)
> 
> New kernels with old QEMU with iommu=on optionally refuses to interoperate.
> 
> Old kernels are oblivious.  They work exactly the same as they do
> today except that they fail cleanly with new QEMU with iommu=on.  Old
> kernels continue to fail with physical virtio devices if they're
> behind an iommu.
> 
> Old physical virtio devices that don't advertise the flag fail cleanly
> if the host uses an iommu.  The driver could optionally whitelist such
> devices.
> 
> PPC works as well as it currently does.
> 
> I'm unsure about the arm64 situation.
> 
> 
> Did I get this right?
> 
> --Andy





Re: [Xen-devel] [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

2015-07-28 Thread Benjamin Herrenschmidt
On Tue, 2015-07-28 at 16:33 -0700, Andy Lutomirski wrote:
> On Tue, Jul 28, 2015 at 4:21 PM, Benjamin Herrenschmidt
>  wrote:
> > On Tue, 2015-07-28 at 15:43 -0700, Andy Lutomirski wrote:
> >> Let me try to summarize a proposal:
> >>
> >> Add a feature flag that indicates IOMMU support.
> >>
> >> New kernels acknowledge that flag on any device that advertises it.
> >>
> >> New kernels always respect the IOMMU (except on PowerPC).
> >
> > Why ? I disagree, the flag should be honored when set in any
> > architecture. PowerPC is no different than any other platform in that
> > regard.
> 
> Perhaps I should have said instead "someone more familiar with PPC
> than I am should figure out what PPC should do".  For the non-PPC
> case, there is only one instance that I know of in which ignoring the
> IOMMU is beneficial, and that case is the experimental Q35 thing.

"ppc" is many fairly different platforms, some with iommu, some without,
some benefiting from bypass, some less etc... I think ARM will soon be
in a similar basket.

> If new kernels ignore the IOMMU for devices that don't set the flag
> and there are physical devices that already exist and don't set the
> flag, then those devices won't work reliably on most modern
> non-virtual platforms, PPC included.

Are there many virtio physical devices out there ? We are talking about
a virtio flag right ? Or have you been considering something else ?

> >>   New kernels
> >> optionally refuse to talk to devices that don't have that feature flag
> >> if the device appears to be behind an IOMMU.  (This presumably
> >> includes any device whatsoever on an x86 platform with an IOMMU,
> >> including Xen's fake IOMMU.)
> >>
> >> New QEMU always respects the IOMMU, if any, except on PPC.
> >
> > This is just a matter of what is the default of the flag, ie we
> > should have a machine flag that indicates what the default is for
> > new virtio devices, otherwise, it should be specified per device
> > as an attribute of the device instance.
> 
> On x86, I think that even super-peformance-critical virtio devices
> should always honor the iommu, but that the iommu in question should
> be a 1:1 iommu.  I *think* that x86 supports that.  IOW x86 would
> always set the feature flag.

Ok.

> > I would argue that we should default to "bypass IOMMU" on *all*
> > architecture due to the performance impact, and to essentially
> > default to the same behaviour as today. With things like DDW even
> > powerpc might be able to mostly alleviate the performance impact
> > so we might to change in the long term, but I tend to prefer
> > more incremental approaches.
> 
> As above, there's a difference between "bypass IOMMU" and "there is no
> IOMMU".  x86 and, I think, most other platforms are capable of the
> latter.  I'm not sure PPC is.

Depends on the platform. "pseries" isn't since it's already a
paravirtualized platform, but there are other ppc platforms out there
which behave differently. That's why I think:

 - The kernel should just honor what qemu says, ie, whether the qemu
device honors or bypasses the iommu.

 - Qemu default behaviour should be set via a machine attribute which
can be overriden both globally (the machine one) or per-device. 

> I think that, in an ideal world, there would be no feature flag and
> all virtio devices would always respect the IOMMU.  Unfortunately we
> have existing practice in the form of PPC and Q35 iommu=on that
> conflict with that.

And possibly more as in this is how the qemu virtio devices are written
today, they do not use the proper DMA accessors, they always bypass,
whatever the platform is (so sparc would be in the same boat for
example).
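
Concretely, "bypass" on the qemu side means the device model touches
guest memory through the CPU view rather than the device's DMA address
space, i.e. something like the first call below where an IOMMU-honoring
device would use the second (a sketch; vdev_dma_as is illustrative):

	/* today: ignores any IOMMU sitting in front of the device */
	cpu_physical_memory_read(desc_addr, &desc, sizeof(desc));

	/* what an IOMMU-honoring device model would do instead */
	dma_memory_read(vdev_dma_as, desc_addr, &desc, sizeof(desc));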

> >>   New QEMU
> >> always advertises this feature flag.  If iommu=on, QEMU's virtio
> >> devices refuse to work unless the driver acknowledges the flag.
> >
> > This should be configurable.
> 
> Would any non-PPC user ever configure it differently?  I suppose if
> you want to support old kernels on new QEMU, you'd flip the switch.

Possibly, have we looked at what ia64, sparc, arm, ... do ? At least
sparc has iommus as well.

Let's try to not make it an architecture issue. As I said above, we have
a kernel that just reacts appropriately based on what qemu says it's
doing, and what qemu does is a per-machine flag to set the default.

> >> On PPC, new QEMU will not respect the IOMMU and will not set the flag.
> >> New kernels will not talk to devices that set the flag.  If someone
> >> wants to fix that

Re: [Xen-devel] [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

2015-07-28 Thread Benjamin Herrenschmidt
On Tue, 2015-07-28 at 17:47 -0700, Andy Lutomirski wrote:

> Yes, virtio flag.  I dislike having a virtio flag at all, but so far
> no one has come up with any better ideas.  If there was a reliable,
> cross-platform mechanism for per-device PCI bus properties, I'd be all
> for using that instead.

There isn't that I know of, so I think it's the best approach we have.

 .../...

> >  - The kernel should just honor what qemu says, ie, whether the qemu
> > device honors or bypasses the iommu.
> 
> Except for vfio, which maybe just needs a special case: vfio checks if
> the device claims to be virtio and doesn't set the flag, in which case
> vfio just refuses to bind the device.

Right but passing virtio through isn't the highest priority on the
radar, but yes, indeed, it should identify them and reject them.

> >  - Qemu default behaviour should be set via a machine attribute which
> > can be overriden both globally (the machine one) or per-device.
> >
> >> I think that, in an ideal world, there would be no feature flag and
> >> all virtio devices would always respect the IOMMU.  Unfortunately we
> >> have existing practice in the form of PPC and Q35 iommu=on that
> >> conflict with that.
> >
> > And possibly more as in this is how the qemu virtio devices are written
> > today, they do not use the proper DMA accessors, they always bypass,
> > whatever the platform is (so sparc would be in the same boat for
> > example).
> 
> Except that AFAIK Q35 is the only QEMU platform that supports a
> nontrivial IOMMU in the first place.  Are there pseries hosts that
> have a working IOMMU?  Maybe I've just misunderstood.

You may well be correct, I remember that we actually created the iommu
infrastructure to a large extent in qemu for ppc/pseries, then it got
extended when q35 came in.

> >> >>   New QEMU
> >> >> always advertises this feature flag.  If iommu=on, QEMU's virtio
> >> >> devices refuse to work unless the driver acknowledges the flag.
> >> >
> >> > This should be configurable.
> >>
> >> Would any non-PPC user ever configure it differently?  I suppose if
> >> you want to support old kernels on new QEMU, you'd flip the switch.
> >
> > Possibly, have we looked at what ia64, sparc, arm, ... do ? At least
> > sparc has iommus as well.
> 
> I think (I hope!) that ia64 is irrelevant, and last I checked ARM
> didn't have a QEMU-emulated IOMMU.  Maybe things have changed.

Not yet...

 .../...
> >
> > On new machine types, we shouldn't change the behaviour of an existing
> > machine type, and we should keep the default to 0 on ppc/pseries because
> > of backward compatibility issue. But that should be the only place that
> > is "ppc specific", ie, a default value in a machine def structure.
> 
> Fair enough, except I still think we should change the default to be
> "respect IOMMU" on machine types that don't have an IOMMU in the first
> place. 

Ok, but do it in a separate patch because it *is* a behaviour change to
some extent.

>  That way Xen works with old machine types, and I don't think
> we lose anything.
> 
> >
> >> That's the setting that will work in all cases on new guest + new
> >> host, and it's the setting that's safest.  vfio will probably always
> >> malfunction if given a device that looks like it's behind an IOMMU but
> >> doesn't respect it.  For people who need the last bit of performance,
> >> they should use bus-level controls where available (they should be
> >> available everywhere except PPC and maybe arm64) and, ideally, someone
> >> would teach PPC how to exclude devices from the IOMMU cleanly if
> >> possible.  If that can't be done, then there can be an option to
> >> bypass the IOMMU the way it's currently done and no one except PPC
> >> would do it.
> >>
> >> PPC really is different from everything except x86 Q35 iommu=on, and
> >> the latter is experimental.  AFAIK in all other cases, the IOMMU is
> >> respected by virtio, but there is no non-1:1 IOMMU.
> >
> > What about sparc ? I though it was pretty similar to PPC in that
> > regard...
> 
> No clue, honestly.  I could be wrong about the set of existing QEMU
> machine types.

Ok.

Cheers,
Ben.





Re: [Xen-devel] [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

2015-07-29 Thread Benjamin Herrenschmidt
On Wed, 2015-07-29 at 10:17 +0200, Paolo Bonzini wrote:
> 
> On 29/07/2015 02:47, Andy Lutomirski wrote:
> > > > If new kernels ignore the IOMMU for devices that don't set the flag
> > > > and there are physical devices that already exist and don't set the
> > > > flag, then those devices won't work reliably on most modern
> > > > non-virtual platforms, PPC included.
> > >
> > > Are there many virtio physical devices out there ? We are talking about
> > > a virtio flag right ? Or have you been considering something else ?
> >
> > Yes, virtio flag.  I dislike having a virtio flag at all, but so far
> > no one has come up with any better ideas.  If there was a reliable,
> > cross-platform mechanism for per-device PCI bus properties, I'd be all
> > for using that instead.
> 
> No, a virtio flag doesn't make sense.

It wouldn't if we were creating virtio from scratch.

However we have to be realistic here, we are contending with existing
practices and implementation. The fact is qemu *does* bypass any iommu
and has been doing so for a long time, *and* the guest drivers are
written today *also* bypassing all DMA mapping mechanisms and just
passing everything across.

So if it's a bug, it's a bug on both sides of the fence. We are no
longer in "bug fixing" territory here, it's a fundamental change of ABI.
The ABI might not be what was intended (but that's arguable, see below),
but it is that way.

Arguably it was even known and considered a *feature* by some (including
myself) at the time. It somewhat improved performance on archs where
otherwise every page would have to be mapped/unmapped in guest iommu. In
fact, it also makes vhost a lot easier.

So I disagree, it's de-facto a feature (even if unintended) of the
existing virtio implementations and changing that would be a major
interface change, and thus should be exposed as such.

> Blindly using system memory is a bug in QEMU; it has to be fixed to use
> the right address space, and then whatever the system provides to
> describe "the right address space" can be used (like the DMAR table on x86).

Except that it's not so easy.

For example, on PPC PAPR guests, there is no such thing as a "no IOMMU"
space, the concept doesn't exist. So we have at least three things to
deal with:

 - Existing guests, so we must preserve the existing behaviour for
backward compatibility.

 - vhost is made more complex because it now needs to be informed of the
guest iommu updates

 - New guests, whose "new driver" knows how to map and unmap, would
take a performance loss unless some mechanism to create a "no iommu"
space exists, which for us would need to be added. Either that or we
rely on DDW, which is a way for a guest to create a permanent mapping
of its entire address space in an IOMMU, but that incurs a significant
waste of host kernel memory.

> On PPC I suppose you could use the host bridge's device tree?  If you
> need a hook, you can add a

No because we can mix and match virtio and other devices on the same
host bridge. Unless we put a property that only applies to virtio
children of the host bridge.

>   bool virtio_should_bypass_iommu(void)
>   {
>   /* lookup something in the device tree?!? */
>   }
>   EXPORT_SYMBOL_GPL(virtio_should_bypass_iommu);
> 
> in some pseries.c file, and in the driver:
> 
>   static bool virtio_bypass_iommu(void)
>   {
>   bool (*fn)(void);
>   
>   fn = symbol_get(virtio_should_bypass_iommu);
>   return fn && fn();
>   }
> 
> Awful, but that's what this thing is.

Ben.

> Paolo


