Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-03-24 Thread David Gibson
On Sat, Mar 23, 2019 at 05:01:35PM -0400, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2019 at 09:05:04PM -0300, Thiago Jung Bauermann wrote:
> > Michael S. Tsirkin  writes:
[snip]
> > >> > Is there any justification to doing that beyond someone putting
> > >> > out slow code in the past?
> > >>
> > >> The definition of the ACCESS_PLATFORM flag is generic and captures the
> > >> notion of memory access restrictions for the device. Unfortunately, on
> > >> powerpc pSeries guests it also implies that the IOMMU is turned on
> > >
> > > IIUC that's really because on pSeries IOMMU is *always* turned on.
> > > Platform has no way to say what you want it to say
> > > which is bypass the iommu for the specific device.
> > 
> > Yes, that's correct. pSeries guests running on KVM are in a gray area
> > where theoretically they use an IOMMU but in practice KVM ignores it.
> > It's unfortunate but it's the reality on the ground today. :-/

Um.. I'm not sure what you mean by this.  As far as I'm concerned
there is always a guest-visible (paravirtualized) IOMMU, and that will
be backed onto the host IOMMU when necessary.

[Actually there is an IOMMU bypass hack that's used by the guest
 firmware, but I don't think we want to expose that]

> Well it's not just the reality, virt setups need something that
> emulated IOMMUs don't provide. That is not uncommon, e.g.
> intel's VTD has a "cache mode" field which AFAIK is only used for virt.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h

2020-02-20 Thread David Gibson
On Thu, Feb 20, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 05:23:20PM +0100, Christian Borntraeger wrote:
> > From a user's perspective it makes absolutely perfect sense to use the
> > bounce buffers when they are NEEDED.
> > Forcing the user to specify iommu_platform just because you need bounce
> > buffers really feels wrong. And obviously we have a severe performance
> > issue because of the indirections.
> 
> The point is that the user should not have to specify iommu_platform.
> We need to make sure any new hypervisor (especially one that might require
> bounce buffering) always sets it,

So, I have draft qemu patches which enable iommu_platform by default.
But that's really because of other problems with !iommu_platform, not
anything to do with bounce buffering or secure VMs.

The thing is that the hypervisor *doesn't* require bounce buffering.
In the POWER (and maybe s390 as well) models for Secure VMs, it's the
*guest*'s choice to enter secure mode, so the hypervisor has no reason
to know whether the guest needs bounce buffering.  As far as the
hypervisor and qemu are concerned that's a guest internal detail, it
just expects to get addresses it can access whether those are GPAs
(iommu_platform=off) or IOVAs (iommu_platform=on).

> as was a rather bogus legacy hack

It was certainly a bad idea, but it was a bad idea that went into a
public spec and has been widely deployed for many years.  We can't
just pretend it didn't happen and move on.

Turning iommu_platform=on by default breaks old guests, some of which
we still care about.  We can't (automatically) do it only for guests
that need bounce buffering, because the hypervisor doesn't know that
ahead of time.

> that isn't extensible for cases that for example require bounce buffering.

In fact bounce buffering isn't really the issue from the hypervisor
(or spec's) point of view.  It's the fact that not all of guest memory
is accessible to the hypervisor.  Bounce buffering is just one way the
guest might deal with that.
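
To make that concrete: from the guest driver's side the whole thing reduces
to ordinary DMA API usage; whether the platform then bounces through shared
SWIOTLB pages, marks pages decrypted, or programs an IOMMU is invisible at
this level.  Illustrative sketch only, not from any patch in this thread
(example_map_for_device() is an invented name; dma_map_single() and
dma_mapping_error() are the standard calls):

#include <linux/dma-mapping.h>

/* Illustrative only: map a buffer for device DMA.  If guest memory is
 * protected, the platform's dma_ops can bounce the data through shared
 * (unencrypted) SWIOTLB pages; the driver neither knows nor cares. */
static int example_map_for_device(struct device *dev, void *buf, size_t len,
                                  dma_addr_t *handle)
{
        *handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, *handle))
                return -ENOMEM;
        /* ... later: dma_unmap_single(dev, *handle, len, DMA_TO_DEVICE); */
        return 0;
}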

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected

2020-02-20 Thread David Gibson
On Thu, Feb 20, 2020 at 05:13:09PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote:
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 867c7ebd3f10..fafc8f924955 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -243,6 +243,9 @@ static bool vring_use_dma_api(struct virtio_device 
> > *vdev)
> > if (!virtio_has_iommu_quirk(vdev))
> > return true;
> >  
> > +   if (force_dma_unencrypted(&vdev->dev))
> > +   return true;
> 
> Hell no.  This is a detail of the platform DMA direct implementation.
> Drivers have no business looking at this flag, and virtio finally needs
> to be fixed to use the DMA API properly for everything but legacy devices.

So, this patch definitely isn't right as it stands, but I'm struggling
to understand what it is you're saying is the right way.

By "legacy devices" I assume you mean pre-virtio-1.0 devices, that
lack the F_VERSION_1 feature flag.  Is that right?  Because I don't
see how being a legacy device or not relates to use of the DMA API.

I *think* what you are suggesting here is that virtio devices that
have !F_IOMMU_PLATFORM should have their dma_ops set up so that the
DMA API treats IOVA==PA, which will satisfy what the device expects.
Then the virtio driver can use the DMA API the same way for both
F_IOMMU_PLATFORM and !F_IOMMU_PLATFORM devices.

But if that works for !F_IOMMU_PLATFORM + F_VERSION_1 devices,
then AFAICT it will work equally well for legacy devices.

Using the DMA API for *everything* in virtio, legacy or not, seems
like a reasonable approach to me.  But, AFAICT, that does require the
DMA layer to have some kind of explicit call to turn on this
behaviour, which the virtio driver would call during initialization.
I don't think we can do it 100% within the DMA layer, because only the
driver can reasonably know when a device has this weird non-standard
DMA behaviour.
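
To illustrate the sort of thing I mean (a rough sketch only, not from any
posted patch; the virtio_direct_* names are invented, while struct
dma_map_ops and its .map_page/.unmap_page hooks are the normal kernel
interface):

#include <linux/dma-mapping.h>

/* Sketch: dma_ops that make IOVA == PA, so a !F_ACCESS_PLATFORM virtio
 * device can be driven through the DMA API and still see plain GPAs.
 * Error handling and sync ops omitted. */
static dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page,
                                         unsigned long offset, size_t size,
                                         enum dma_data_direction dir,
                                         unsigned long attrs)
{
        return page_to_phys(page) + offset;     /* the device expects a GPA */
}

static void virtio_direct_unmap_page(struct device *dev, dma_addr_t addr,
                                     size_t size, enum dma_data_direction dir,
                                     unsigned long attrs)
{
        /* nothing to undo for an identity mapping */
}

static const struct dma_map_ops virtio_direct_dma_ops = {
        .map_page       = virtio_direct_map_page,
        .unmap_page     = virtio_direct_unmap_page,
};

The point being that the driver code above it would then no longer need to
branch on the feature bit at all.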

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected

2020-02-20 Thread David Gibson
On Thu, Feb 20, 2020 at 05:17:48PM -0800, Ram Pai wrote:
> On Thu, Feb 20, 2020 at 03:55:14PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote:
> > > Currently the advanced guest memory protection technologies (AMD SEV,
> > > powerpc secure guest technology and s390 Protected VMs) abuse the
> > > VIRTIO_F_IOMMU_PLATFORM flag to make virtio core use the DMA API, which
> > > is in turn necessary to make IO work with guest memory protection.
> > > 
> > > But VIRTIO_F_IOMMU_PLATFORM a.k.a. VIRTIO_F_ACCESS_PLATFORM is really a
> > > different beast: with virtio devices whose implementation runs on an SMP
> > > CPU we are still fine with doing all the usual optimizations, it is just
> > > that we need to make sure that the memory protection mechanism does not
> > > get in the way. The VIRTIO_F_ACCESS_PLATFORM mandates more work on the
> > > side of the guest (and possibly the host side as well) than we actually
> > > need.
> > > 
> > > An additional benefit of teaching the guest to make the right decision
> > > (and use DMA API) on its own is: removing the need to mandate special
> > > VM configuration for guests that may run with protection. This is
> > > especially interesting for s390 as VIRTIO_F_IOMMU_PLATFORM pushes all
> > > the virtio control structures into the first 2G of guest memory:
> > > something we don't necessarily want to do per-default.
> > > 
> > > Signed-off-by: Halil Pasic 
> > > Tested-by: Ram Pai 
> > > Tested-by: Michael Mueller 
> > 
> > This might work for you but it's fragile, since without
> > VIRTIO_F_ACCESS_PLATFORM hypervisor assumes it gets
> > GPA's, not DMA addresses.
> > 
> > 
> > 
> > IOW this looks like another iteration of:
> > 
> > virtio: Support encrypted memory on powerpc secure guests
> > 
> > which I was under the impression was abandoned as unnecessary.
> 
> It has been abandoned on powerpc. We enabled the VIRTIO_F_ACCESS_PLATFORM
> flag by default on powerpc.

Uh... we haven't yet, though we're working on it.

> We would like to enable secure guests on powerpc without this flag
> enabled as well, but past experience has taught us that it's not an easy
> path.  However, if Halil makes some inroads on this path for s390, we
> would like to support him.
> 
> 
> RP
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected

2020-02-23 Thread David Gibson
On Fri, Feb 21, 2020 at 09:39:38AM -0600, Tom Lendacky wrote:
> On 2/21/20 7:12 AM, Halil Pasic wrote:
> > On Thu, 20 Feb 2020 15:55:14 -0500
> > "Michael S. Tsirkin"  wrote:
> > 
> >> On Thu, Feb 20, 2020 at 05:06:06PM +0100, Halil Pasic wrote:
> >>> Currently the advanced guest memory protection technologies (AMD SEV,
> >>> powerpc secure guest technology and s390 Protected VMs) abuse the
> >>> VIRTIO_F_IOMMU_PLATFORM flag to make virtio core use the DMA API, which
> >>> is in turn necessary to make IO work with guest memory protection.
> >>>
> >>> But VIRTIO_F_IOMMU_PLATFORM a.k.a. VIRTIO_F_ACCESS_PLATFORM is really a
> >>> different beast: with virtio devices whose implementation runs on an SMP
> >>> CPU we are still fine with doing all the usual optimizations, it is just
> >>> that we need to make sure that the memory protection mechanism does not
> >>> get in the way. The VIRTIO_F_ACCESS_PLATFORM mandates more work on the
> >>> side of the guest (and possibly the host side as well) than we actually
> >>> need.
> >>>
> >>> An additional benefit of teaching the guest to make the right decision
> >>> (and use DMA API) on its own is: removing the need to mandate special
> >>> VM configuration for guests that may run with protection. This is
> >>> especially interesting for s390 as VIRTIO_F_IOMMU_PLATFORM pushes all
> >>> the virtio control structures into the first 2G of guest memory:
> >>> something we don't necessarily want to do per-default.
> >>>
> >>> Signed-off-by: Halil Pasic 
> >>> Tested-by: Ram Pai 
> >>> Tested-by: Michael Mueller 
> >>
> >> This might work for you but it's fragile, since without
> >> VIRTIO_F_ACCESS_PLATFORM hypervisor assumes it gets
> >> GPA's, not DMA addresses.
> >>
> > 
> > Thanks for your constructive approach. I do want the hypervisor to
> > assume it gets GPA's. My train of thought was that the guys that need
> > to use IOVA's that are not GPA's when force_dma_unencrypted() will have
> > to specify VIRTIO_F_ACCESS_PLATFORM (at the device) anyway, because
> > otherwise it won't work. But I see your point: in case of a
> > mis-configuration and provided the DMA API returns IOVA's one could end
> > up trying to touch wrong memory locations. But this should be similar to
> > what would happen if DMA ops are not used, and memory is not made 
> > accessible.
> > 
> >>
> >>
> >> IOW this looks like another iteration of:
> >>
> >>virtio: Support encrypted memory on powerpc secure guests
> >>
> >> which I was under the impression was abandoned as unnecessary.
> > 
> > Unnecessary for powerpc because they do normal PCI. In the context of
> > CCW there are only guest physical addresses (CCW I/O has no concept of
> > IOMMU or IOVAs).
> > 
> >>
> >>
> >> To summarize, the necessary conditions for a hack along these lines
> >> (using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that:
> >>
> >>   - secure guest mode is enabled - so we know that since we don't share
> >> most memory regular virtio code won't
> >> work, even though the buggy hypervisor didn't set 
> >> VIRTIO_F_ACCESS_PLATFORM
> > 
> > force_dma_unencrypted(&vdev->dev) is IMHO exactly about this.
> > 
> >>   - DMA API is giving us addresses that are actually also physical
> >> addresses
> > 
> > In case of s390 this is given. I talked with the power people before
> > posting this, and they assured me they are willing to deal with
> > this. I was hoping to talk about this with the AMD SEV people here (hence
> > the cc).
> 
> Yes, physical addresses are fine for SEV - the key is that the DMA API is
> used so that an address for unencrypted, or shared, memory is returned.
> E.g. for a dma_alloc_coherent() call this is an allocation that has had
> set_memory_decrypted() called or for a dma_map_page() call this is an
> address from SWIOTLB, which was mapped shared during boot, where the data
> will be bounce-buffered.
> 
> We don't currently support an emulated IOMMU in our SEV guest because that
> would require a lot of support in the driver to make IOMMU data available
> to the hypervisor (I/O page tables, etc.). We would need hardware support
> to really make this work easily in the guest.

A tangent here: not

Re: [PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h

2020-02-23 Thread David Gibson
On Fri, Feb 21, 2020 at 07:07:02PM +0100, Halil Pasic wrote:
> On Fri, 21 Feb 2020 10:48:15 -0500
> "Michael S. Tsirkin"  wrote:
> 
> > On Fri, Feb 21, 2020 at 02:06:39PM +0100, Halil Pasic wrote:
> > > On Fri, 21 Feb 2020 14:27:27 +1100
> > > David Gibson  wrote:
> > > 
> > > > On Thu, Feb 20, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote:
> > > > > On Thu, Feb 20, 2020 at 05:23:20PM +0100, Christian Borntraeger wrote:
> > > > > > From a user's perspective it makes absolutely perfect sense to use
> > > > > > the bounce buffers when they are NEEDED.
> > > > > > Forcing the user to specify iommu_platform just because you need
> > > > > > bounce buffers really feels wrong. And obviously we have a severe
> > > > > > performance issue because of the indirections.
> > > > > 
> > > > > The point is that the user should not have to specify iommu_platform.
> > > > > We need to make sure any new hypervisor (especially one that might 
> > > > > require
> > > > > bounce buffering) always sets it,
> > > > 
> > > > So, I have draft qemu patches which enable iommu_platform by default.
> > > > But that's really because of other problems with !iommu_platform, not
> > > > anything to do with bounce buffering or secure VMs.
> > > > 
> > > > The thing is that the hypervisor *doesn't* require bounce buffering.
> > > > In the POWER (and maybe s390 as well) models for Secure VMs, it's the
> > > > *guest*'s choice to enter secure mode, so the hypervisor has no reason
> > > > to know whether the guest needs bounce buffering.  As far as the
> > > > hypervisor and qemu are concerned that's a guest internal detail, it
> > > > just expects to get addresses it can access whether those are GPAs
> > > > (iommu_platform=off) or IOVAs (iommu_platform=on).
> > > 
> > > I very much agree!
> > > 
> > > > 
> > > > > as was a rather bogus legacy hack
> > > > 
> > > > It was certainly a bad idea, but it was a bad idea that went into a
> > > > public spec and has been widely deployed for many years.  We can't
> > > > just pretend it didn't happen and move on.
> > > > 
> > > > Turning iommu_platform=on by default breaks old guests, some of which
> > > > we still care about.  We can't (automatically) do it only for guests
> > > > that need bounce buffering, because the hypervisor doesn't know that
> > > > ahead of time.
> > > 
> > > Turning iommu_platform=on for virtio-ccw makes no sense whatsoever,
> > > because for CCW I/O there is no such thing as IOMMU and the addresses
> > > are always physical addresses.
> > 
> > Fix the name then. The spec calls it ACCESS_PLATFORM now, which
> > makes much more sense.
> 
> I don't quite get it. Sorry. Maybe I will revisit this later.

Halil, I think I can clarify this.

The "iommu_platform" flag doesn't necessarily have anything to do with
an iommu, although it often will.  Basically it means "access guest
memory via the bus's normal DMA mechanism" rather than "access guest
memory using GPA, because you're the hypervisor and you can do that".

For the case of ccw, both mechanisms end up being the same thing,
since CCW's normal DMA *is* untranslated GPA access.
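
In guest terms the two mechanisms differ only in what address ends up in a
descriptor, roughly as in the simplified sketch below (addr_for_descriptor()
is an invented name; virtio_has_feature() and VIRTIO_F_IOMMU_PLATFORM are
the existing Linux names for the bit; error handling is omitted):

#include <linux/virtio_config.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

/* Simplified sketch: the address a guest puts in a vring descriptor.
 * On ccw the bus's "normal DMA mechanism" is untranslated guest-physical
 * access, so (leaving protected memory aside) both branches produce the
 * same value. */
static dma_addr_t addr_for_descriptor(struct virtio_device *vdev,
                                      void *buf, size_t len)
{
        if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
                /* "access via the bus's normal DMA mechanism" */
                return dma_map_single(vdev->dev.parent, buf, len,
                                      DMA_BIDIRECTIONAL);
        /* "access by plain GPA, because you're the hypervisor" */
        return virt_to_phys(buf);
}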

For this reason, the flag in the spec was renamed to ACCESS_PLATFORM,
but the flag in qemu still has the old name.

AIUI, Michael is saying you could trivially change the name in qemu
(obviously you'd need to alias the old name to the new one for
compatibility).


Actually, the fact that ccw has no translation makes things easier for
you: you don't really have any impediment to turning ACCESS_PLATFORM
on by default, since it doesn't make any real change to how things
work.

The remaining difficulty is that the virtio driver - since it can sit
on multiple buses - won't know this, and will reject the
ACCESS_PLATFORM flag, even though it could just do what it normally
does on ccw and it would work.

For that case, we could consider a hack in qemu where for virtio-ccw
devices *only* we allow the guest to nack the ACCESS_PLATFORM flag and
carry on anyway.  Normally we insist that the guest accept the
ACCESS_PLATFORM flag if offered, because on most platforms they
*don't* amount to the same thing.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 0/2] virtio: decouple protected guest RAM form VIRTIO_F_IOMMU_PLATFORM

2020-02-23 Thread David Gibson
On Fri, Feb 21, 2020 at 03:56:02PM +0100, Halil Pasic wrote:
> On Fri, 21 Feb 2020 14:22:26 +0800
> Jason Wang  wrote:
> 
> > 
> > On 2020/2/21 12:06 AM, Halil Pasic wrote:
> > > Currently if one intends to run a memory protection enabled VM with
> > > virtio devices and linux as the guest OS, one needs to specify the
> > > VIRTIO_F_IOMMU_PLATFORM flag for each virtio device to make the guest
> > > linux use the DMA API, which in turn handles the memory
> > > encryption/protection stuff if the guest decides to turn itself into
> > > a protected one. This however makes no sense due to multiple reasons:
> > > * The device is not changed by the fact that the guest RAM is
> > > protected. The so-called IOMMU bypass quirk is not affected.
> > > * This usage is not congruent with the standardised semantics of
> > > VIRTIO_F_IOMMU_PLATFORM. Guest memory protection is an orthogonal reason
> > > for using DMA API in virtio (orthogonal with respect to what is
> > > expressed by VIRTIO_F_IOMMU_PLATFORM).
> > >
> > > This series aims to decouple 'have to use DMA API because my (guest) RAM
> > > is protected' and 'have to use DMA API because the device told me
> > > VIRTIO_F_IOMMU_PLATFORM'.
> > >
> > > Please find more detailed explanations about the conceptual aspects in
> > > the individual patches. There is however also a very practical problem
> > > that is addressed by this series.
> > >
> > > For vhost-net the feature VIRTIO_F_IOMMU_PLATFORM has the following side
> > > effect: the vhost code assumes the addresses on the virtio descriptor
> > > ring are not guest physical addresses but IOVAs, and insists on doing a
> > > translation of these regardless of what transport is used (e.g. whether
> > > we emulate a PCI or a CCW device). (For details see commit 6b1e6cc7855b
> > > "vhost: new device IOTLB API".) On s390 this results in severe
> > > performance degradation (ca. a factor of 10).
> > 
> > 
> > Do you see a consistent degradation in performance, or does it only
> > happen during the beginning of the test?
> > 
> 
> AFAIK the degradation is consistent.
> 
> > 
> > > BTW with ccw I/O there is
> > > (architecturally) no IOMMU, so the whole address translation makes no
> > > sense in the context of virtio-ccw.
> > 
> > 
> > I suspect we can do optimization in qemu side.
> > 
> > E.g send memtable entry via IOTLB API when vIOMMU is not enabled.
> > 
> > If this makes sense, I can draft patch to see if there's any difference.
> 
> Frankly I would prefer to avoid IOVAs on the descriptor ring (and the
> then necessary translation) for virtio-ccw altogether. But Michael
> voiced his opinion that we should mandate F_IOMMU_PLATFORM for devices
> that could be used with guests running in protected mode. I don't share
> his opinion, but that's an ongoing discussion.

I'm a bit confused by this.  For the ccw specific case,
F_ACCESS_PLATFORM shouldn't have any impact: for you, IOVA == GPA so
everything is easy.

> Should we end up having to do translation from IOVA in vhost, we are
> very interested in that translation being fast and efficient.
> 
> In that sense we would be very happy to test any optimization that aims
> in that direction.
> 
> Thank you very much for your input!
> 
> Regards,
> Halil
> 
> > 
> > Thanks
> > 
> > 
> > >
> > > Halil Pasic (2):
> > >mm: move force_dma_unencrypted() to mem_encrypt.h
> > >virtio: let virtio use DMA API when guest RAM is protected
> > >
> > >   drivers/virtio/virtio_ring.c |  3 +++
> > >   include/linux/dma-direct.h   |  9 -
> > >   include/linux/mem_encrypt.h  | 10 ++
> > >   3 files changed, 13 insertions(+), 9 deletions(-)
> > >
> > >
> > > base-commit: ca7e1fd1026c5af6a533b4b5447e1d2f153e28f2
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 0/2] virtio: decouple protected guest RAM form VIRTIO_F_IOMMU_PLATFORM

2020-02-23 Thread David Gibson
On Fri, Feb 21, 2020 at 05:41:51PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 04:33:35PM -0500, Michael S. Tsirkin wrote:
> > So it sounds like a host issue: the emulation of s390 is unnecessarily
> > complicated.
> > Working around it by the guest looks wrong ...
> 
> Yes.  If your host (and I don't care if you split hypervisor,
> ultravisor and megavisor out in your implementation) wants to
> support a VM architecture where the host can't access all guest
> memory you need to ensure the DMA API is used.  Extra points for
> simply always setting the flag and thus future proofing the scheme.

Moving towards F_ACCESS_PLATFORM everywhere is a good idea (for other
reasons), but that doesn't make the problem as it exists right now go
away.

But, "you need to ensure the DMA API is used" makes no sense from the
host point of view.  The existence of the DMA API is an entirely guest-side,
Linux-specific detail; the host can't make decisions based on that.

For POWER - possibly s390 as well - the hypervisor has no way of
knowing at machine construction time whether it will be an old kernel
(or non Linux OS) which can't support F_ACCESS_PLATFORM, or a guest
which will enter secure mode and therefore requires F_ACCESS_PLATFORM
(according to you).  That's the fundamental problem here.

The normal virtio model of features that the guest can optionally
accept would work nicely here - except that that wouldn't work for the
case of hardware virtio devices, where the access limitations come
from "host" (platform) side and therefore can't be disabled by that
host.

We really do have two cases here: 1) access restrictions originating
with the host/platform (e.g. hardware virtio) and 2) access
restrictions originating with the guest (e.g. secure VMs).  What we
need to do to deal with them is basically the same at the driver
level, but it has subtle and important differences at the platform
level.
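
At the driver level the two cases do collapse into essentially the same
check; schematically it looks like the sketch below, which just restates the
shape of the hunk quoted earlier in the thread (need_dma_api() is an invented
name; force_dma_unencrypted() lives in mem_encrypt.h as proposed in patch
1/2):

#include <linux/virtio_config.h>
#include <linux/mem_encrypt.h>

/* Schematic only: the two origins of access restrictions as they meet
 * in the guest's virtio core. */
static bool need_dma_api(struct virtio_device *vdev)
{
        /* 1) restriction originating with the host/platform */
        if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
                return true;

        /* 2) restriction originating with the guest (protected/encrypted RAM) */
        if (force_dma_unencrypted(&vdev->dev))
                return true;

        return false;
}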

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 2/2] virtio: let virtio use DMA API when guest RAM is protected

2020-02-23 Thread David Gibson
On Fri, Feb 21, 2020 at 05:36:45PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 01:59:15PM +1100, David Gibson wrote:
> > > Hell no.  This is a detail of the platform DMA direct implementation.
> > > Drivers have no business looking at this flag, and virtio finally needs
> > > to be fixed to use the DMA API properly for everything but legacy devices.
> > 
> > So, this patch definitely isn't right as it stands, but I'm struggling
> > to understand what it is you're saying is the right way.
> > 
> > By "legacy devices" I assume you mean pre-virtio-1.0 devices, that
> > lack the F_VERSION_1 feature flag.  Is that right?  Because I don't
> > see how being a legacy device or not relates to use of the DMA API.
> 
> No.   "legacy" is anything that does not set F_ACCESS_PLATFORM.

Hm, I see.

The trouble is I think we can only reasonably call things "legacy"
when essentially all currently in use OSes have support for the new,
better way of doing things.  That is, alas, not really the case for
F_ACCESS_PLATFORM.

> > I *think* what you are suggesting here is that virtio devices that
> > have !F_IOMMU_PLATFORM should have their dma_ops set up so that the
> > DMA API treats IOVA==PA, which will satisfy what the device expects.
> > Then the virtio driver can use the DMA API the same way for both
> > F_IOMMU_PLATFORM and !F_IOMMU_PLATFORM devices.
> 
> No.  Those should just keep using the existing legacy non-dma ops
> case and be done with it.  No changes to that and most certainly
> no new features.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH 1/2] mm: move force_dma_unencrypted() to mem_encrypt.h

2020-02-27 Thread David Gibson
On Tue, Feb 25, 2020 at 07:08:02PM +0100, Cornelia Huck wrote:
> On Mon, 24 Feb 2020 19:49:53 +0100
> Halil Pasic  wrote:
> 
> > On Mon, 24 Feb 2020 14:33:14 +1100
> > David Gibson  wrote:
> > 
> > > On Fri, Feb 21, 2020 at 07:07:02PM +0100, Halil Pasic wrote:  
> > > > On Fri, 21 Feb 2020 10:48:15 -0500
> > > > "Michael S. Tsirkin"  wrote:
> > > >   
> > > > > On Fri, Feb 21, 2020 at 02:06:39PM +0100, Halil Pasic wrote:  
> > > > > > On Fri, 21 Feb 2020 14:27:27 +1100
> > > > > > David Gibson  wrote:
> > > > > >   
> > > > > > > On Thu, Feb 20, 2020 at 05:31:35PM +0100, Christoph Hellwig 
> > > > > > > wrote:  
> > > > > > > > On Thu, Feb 20, 2020 at 05:23:20PM +0100, Christian Borntraeger 
> > > > > > > > wrote:  
> > > > > > > > > From a user's perspective it makes absolutely perfect sense
> > > > > > > > > to use the bounce buffers when they are NEEDED.
> > > > > > > > > Forcing the user to specify iommu_platform just because you
> > > > > > > > > need bounce buffers really feels wrong. And obviously we have
> > > > > > > > > a severe performance issue because of the indirections.
> > > > > > > > 
> > > > > > > > The point is that the user should not have to specify 
> > > > > > > > iommu_platform.
> > > > > > > > We need to make sure any new hypervisor (especially one that 
> > > > > > > > might require
> > > > > > > > bounce buffering) always sets it,  
> > > > > > > 
> > > > > > > So, I have draft qemu patches which enable iommu_platform by 
> > > > > > > default.
> > > > > > > But that's really because of other problems with !iommu_platform, 
> > > > > > > not
> > > > > > > anything to do with bounce buffering or secure VMs.
> > > > > > > 
> > > > > > > The thing is that the hypervisor *doesn't* require bounce 
> > > > > > > buffering.
> > > > > > > In the POWER (and maybe s390 as well) models for Secure VMs, it's 
> > > > > > > the
> > > > > > > *guest*'s choice to enter secure mode, so the hypervisor has no 
> > > > > > > reason
> > > > > > > to know whether the guest needs bounce buffering.  As far as the
> > > > > > > hypervisor and qemu are concerned that's a guest internal detail, 
> > > > > > > it
> > > > > > > just expects to get addresses it can access whether those are GPAs
> > > > > > > (iommu_platform=off) or IOVAs (iommu_platform=on).  
> > > > > > 
> > > > > > I very much agree!
> > > > > >   
> > > > > > >   
> > > > > > > > as was a rather bogus legacy hack  
> > > > > > > 
> > > > > > > It was certainly a bad idea, but it was a bad idea that went into 
> > > > > > > a
> > > > > > > public spec and has been widely deployed for many years.  We can't
> > > > > > > just pretend it didn't happen and move on.
> > > > > > > 
> > > > > > > Turning iommu_platform=on by default breaks old guests, some of 
> > > > > > > which
> > > > > > > we still care about.  We can't (automatically) do it only for 
> > > > > > > guests
> > > > > > > that need bounce buffering, because the hypervisor doesn't know 
> > > > > > > that
> > > > > > > ahead of time.  
> 
> We could default to iommu_platform=on on s390 when the host has active
> support for protected virtualization... but that's just another kind of
> horrible, so let's just pretend I didn't suggest it.

Yeah, that would break migration between hosts with the feature and
hosts without - for everything, not just protected guests.  In general
any kind of guest visible configuration change based on host
properties is incompatible with the qemu/KVM migration model.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-24 Thread David Gibson
On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
> On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
> > On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> > > On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> > > > Well.. that's not where it is in Alex's code either.  The iommu layer
> > > > (to the extent that there is such a "layer") supplies the group info,
> > > > but the group management is in vfio, not the iommu layer.  With mine
> > > > it is in the driver core because the struct device seemed the logical
> > > > place for the group id.
> > > 
> > > Okay, seems we have different ideas of what the 'grouping code' is. I
> > > talked about the group enumeration code only. But group handling code is
> > > certainly important to some degree too. But before we argue about the
> > > right place of the code we should agree on the semantics such code
> > > should provide.
> > > 
> > > For me it is fine when the code is in VFIO for now, since VFIO is the
> > > only user at the moment. When more users pop up we can easily move it
> > > out to somewhere else. But the semantics influence the interface to
> > > user-space too, so it is more important now. It splits up into a number
> > > of sub problems:
> > > 
> > >   1) How userspace detects the device<->group relationship?
> > >   2) Do we want group-binding/unbinding to device drivers?
> > >   3) Group attach/detach to iommu-domains?
> > >   4) What to do with hot-added devices?
> > > 
> > > For 1) I think the current solution with the iommu_group file is fine.
> > > It is somewhat expensive for user-space to figure out the per-group
> > > device-sets, but that is a one-time effort so it doesn't really matter.
> > > Probably we can rename 'iommu_group' to 'isolation_group' or
> > > something.
> > 
> > Hrm.  Alex's group code also provides no in-kernel way to enumerate a
> > group, short of walking every device in the system.  And it provides
> > no way to attach information to a group.  It just seems foolish to me
> > to have this concept without some kind of in-kernel handle on it, and
> 
> Who else needs to enumerate groups right now?  Who else needs to attach
> data to a group.  We seem to be caught in this loop of arguing that we
> need driver core based group management, but we don't have any plans to
> make use of it, so it just bloats the kernel for most of the users that
> don't care about it.

So, Ben and I discussed this with David Woodhouse during
linux.conf.au.  He does want to update the core iommu_ops and dma_ops
handling to be group aware, and is willing to do the work for that.
So I will be doing another spin of my isolation code as a basis for
that work of his.  So there's our other user.

> > if you're managing the in-kernel representation you might as well
> > expose it to userspace there as well.
> 
> Unfortunately this is just the start of peeling back layers of the onion.
> We manage groups in the driver core, so the driver core should expose
> them to userspace.  The driver core exposes them to userspace, so now it
> needs to manage permissions for userspace.

That doesn't necessarily follow.  The current draft has the core group
code export character devices on which permissions are managed, but
I'm also considering options where it only exports sysfs and something
else does the character device and permissions.

>  Then we add permissions and
> now we need to provide group access, then we need a channel to an actual
> userspace device driver, zing! we add a whole API there,

And the other option I was looking at had the core providing the char
device but having its fops get passed straight through to the binder.

> then we need
> group driver binding, then we need group device driver binding, blam!
> another API, then we need...  I don't see a clear end marker that
> doesn't continue to bloat the core and add functionality that nobody
> else needs and we don't even have plans of integrating more pervasively.
> This appears to end with 80 to 90% of the vfio core code moving into the
> driver core.

I don't agree.  Working out the right boundary isn't totally obvious,
certainly, but that doesn't mean a reasonable boundary can't be found.

> 
> > > Regarding 2), I think providing user-space a way to unbind groups of
> > > devices from their drivers is a horrible idea.
> > 
> > Well, I'm not wed to unbinding all the drivers at once. 

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-30 Thread David Gibson
On Wed, Jan 25, 2012 at 04:44:53PM -0700, Alex Williamson wrote:
> On Wed, 2012-01-25 at 14:13 +1100, David Gibson wrote:
> > On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
> > > On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
> > > > On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> > > > > On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> > > > > > Well.. that's not where it is in Alex's code either.  The iommu 
> > > > > > layer
> > > > > > (to the extent that there is such a "layer") supplies the group 
> > > > > > info,
> > > > > > but the group management is in vfio, not the iommu layer.  With mine
> > > > > > it is in the driver core because the struct device seemed the 
> > > > > > logical
> > > > > > place for the group id.
> > > > > 
> > > > > Okay, seems we have different ideas of what the 'grouping code' is. I
> > > > > talked about the group enumeration code only. But group handling code 
> > > > > is
> > > > > certainly important to some degree too. But before we argue about the
> > > > > right place of the code we should agree on the semantics such code
> > > > > should provide.
> > > > > 
> > > > > For me it is fine when the code is in VFIO for now, since VFIO is the
> > > > > only user at the moment. When more users pop up we can easily move it
> > > > > out to somewhere else. But the semantics influence the interface to
> > > > > user-space too, so it is more important now. It splits up into a 
> > > > > number
> > > > > of sub problems:
> > > > > 
> > > > >   1) How userspace detects the device<->group relationship?
> > > > >   2) Do we want group-binding/unbinding to device drivers?
> > > > >   3) Group attach/detach to iommu-domains?
> > > > >   4) What to do with hot-added devices?
> > > > > 
> > > > > For 1) I think the current solution with the iommu_group file is fine.
> > > > > It is somewhat expensive for user-space to figure out the per-group
> > > > > device-sets, but that is a one-time effort so it doesn't really 
> > > > > matter.
> > > > > Probably we can rename 'iommu_group' to 'isolation_group' or
> > > > > something.
> > > > 
> > > > Hrm.  Alex's group code also provides no in-kernel way to enumerate a
> > > > group, short of walking every device in the system.  And it provides
> > > > no way to attach information to a group.  It just seems foolish to me
> > > > to have this concept without some kind of in-kernel handle on it, and
> > > 
> > > Who else needs to enumerate groups right now?  Who else needs to attach
> > > data to a group.  We seem to be caught in this loop of arguing that we
> > > need driver core based group management, but we don't have any plans to
> > > make use of it, so it just bloats the kernel for most of the users that
> > > don't care about it.
> > 
> > So, Ben and I discussed this with David Woodhouse during
> > linux.conf.au.  He does want to update the core iommu_ops and dma_ops
> > handling to be group aware, and is willing to do the work for that.
> > So I will be doing another spin of my isolation code as a basis for
> > that work of his.  So there's our other user.
> 
> Hmm...
> 
> > > > if you're managing the in-kernel representation you might as well
> > > > expose it to userspace there as well.
> > > 
> > > Unfortunately this is just the start of peeling back layers of the onion.
> > > We manage groups in the driver core, so the driver core should expose
> > > them to userspace.  The driver core exposes them to userspace, so now it
> > > needs to manage permissions for userspace.
> > 
> > That doesn't necessarily follow.  The current draft has the core group
> > code export character devices on which permissions are managed, but
> > I'm also considering options where it only exports sysfs and something
> > else does the character device and permissions.
> 
> Let's start to define it then.  A big problem I have with the isolation
> infrastructure you're proposing is that it doesn't have well defined
> bounds or interfaces.  It p

RFC: Device isolation groups

2012-01-31 Thread David Gibson
This patch series introduces a new infrastructure to the driver core
for representing "device isolation groups".  That is, groups of
devices which can be "isolated" in such a way that the rest of the
system can be protected from them, even in the presence of userspace
or a guest OS directly driving the devices.

Isolation will typically be due to an IOMMU which can safely remap DMA
and interrupts coming from these devices.  We need to represent whole
groups, rather than individual devices, because there are a number of
cases where the group can be isolated as a whole, but devices within
it cannot be safely isolated from each other - this usually occurs
because the IOMMU cannot reliably distinguish which device in the
group initiated a transaction.  In other words, isolation groups
represent the minimum safe granularity for passthrough to guests or
userspace.

This series provides the core infrastructure for tracking isolation
groups, and example implementations initializing the groups
appropriately for two PCI bridges (which include IOMMUs) found on IBM
POWER systems.

Actually using the group information is not included here, but David
Woodhouse has expressed an interest in using a structure like this to
represent operations in iommu_ops more correctly.

Some tracking of groups is a prerequisite for safe passthrough of
devices to guests or userspace, such as done by VFIO.  Current VFIO
patches use the iommu_ops->device_group mechanism for this.  However,
that mechanism is awkward, because without an in-kernel concrete
representation of groups, enumerating a group requires traversing
every device on a given bus type.  It also fails to cover some very
plausible IOMMU topologies, because its groups cannot span devices on
multiple bus types.
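
As a concrete illustration of how the bridge-side code in patches 2/3 and
3/3 uses this (simplified sketch only; example_init_group() is an invented
name, while the two device_isolation_* calls are the interfaces this series
adds):

/* Simplified from patches 2/3 and 3/3: the host bridge / IOMMU code
 * creates one group per isolation domain and adds each device behind it. */
#ifdef CONFIG_DEVICE_ISOLATION
static void example_init_group(struct device_isolation_group *group,
                               struct pci_dev *pdev, u64 id)
{
        if (device_isolation_group_init(group, "example:%llx", id) < 0)
                return;         /* the real patches BUG_ON() here */
        device_isolation_dev_add(group, &pdev->dev);
}
#endif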



[PATCH 2/3] device_isolation: Support isolation on POWER p5ioc2 bridges

2012-01-31 Thread David Gibson
This patch adds code to the powernv platform code to create
and populate isolation groups on hardware using the p5ioc2 PCI host
bridge used on some IBM POWER systems.

Signed-off-by: Alexey Kardashevskiy 
Signed-off-by: David Gibson 
---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   14 +-
 arch/powerpc/platforms/powernv/pci.h|3 +++
 2 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 2649677..e5bb3a6 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -88,10 +89,21 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
 static void __devinit pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
   struct pci_dev *pdev)
 {
-   if (phb->p5ioc2.iommu_table.it_map == NULL)
+   if (phb->p5ioc2.iommu_table.it_map == NULL) {
iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
+#ifdef CONFIG_DEVICE_ISOLATION
+   phb->p5ioc2.di_group = kzalloc(sizeof(*(phb->p5ioc2.di_group)),
+  GFP_KERNEL);
+   BUG_ON(!phb->p5ioc2.di_group ||
+  (device_isolation_group_init(phb->p5ioc2.di_group,
+   "p5ioc2:%llx", 
phb->opal_id) < 0));
+#endif
+   }
 
set_iommu_table_base(&pdev->dev, &phb->p5ioc2.iommu_table);
+#ifdef CONFIG_DEVICE_ISOLATION
+   device_isolation_dev_add(phb->p5ioc2.di_group, &pdev->dev);
+#endif
 }
 
 static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np,
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 8bc4796..64ede1e 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -87,6 +87,9 @@ struct pnv_phb {
union {
struct {
struct iommu_table iommu_table;
+#ifdef CONFIG_DEVICE_ISOLATION
+   struct device_isolation_group *di_group;
+#endif
} p5ioc2;
 
struct {
-- 
1.7.8.3



[PATCH 3/3] device_isolation: Support isolation on POWER p7ioc (IODA) bridges

2012-01-31 Thread David Gibson
This patch adds code to the powernv platform code to create
and populate isolation groups on hardware using the p7ioc (aka IODA) PCI host
bridge used on some IBM POWER systems.

Signed-off-by: Alexey Kardashevskiy 
Signed-off-by: David Gibson 
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   18 --
 arch/powerpc/platforms/powernv/pci.h  |6 ++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5e155df..4648475 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -877,6 +878,9 @@ static void __devinit pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
set_iommu_table_base(&dev->dev, &pe->tce32_table);
if (dev->subordinate)
pnv_ioda_setup_bus_dma(pe, dev->subordinate);
+#ifdef CONFIG_DEVICE_ISOLATION
+   device_isolation_dev_add(&pe->di_group, &dev->dev);
+#endif
}
 }
 
@@ -957,11 +961,21 @@ static void __devinit pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
}
iommu_init_table(tbl, phb->hose->node);
 
-   if (pe->pdev)
+#ifdef CONFIG_DEVICE_ISOLATION
+   BUG_ON(device_isolation_group_init(&pe->di_group, "ioda:rid%x-pe%x",
+  pe->rid, pe->pe_number) < 0);
+#endif
+
+   if (pe->pdev) {
set_iommu_table_base(&pe->pdev->dev, tbl);
-   else
+#ifdef CONFIG_DEVICE_ISOLATION
+   device_isolation_dev_add(&pe->di_group, &pe->pdev->dev);
+#endif
+   } else
pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
+
+
return;
  fail:
/* XXX Failure: Try to fallback to 64-bit only ? */
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 64ede1e..3e282b7 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -1,6 +1,8 @@
 #ifndef __POWERNV_PCI_H
 #define __POWERNV_PCI_H
 
+#include 
+
 struct pci_dn;
 
 enum pnv_phb_type {
@@ -60,6 +62,10 @@ struct pnv_ioda_pe {
 
/* Link in list of PE#s */
struct list_headlink;
+
+#ifdef CONFIG_DEVICE_ISOLATION
+   struct device_isolation_group di_group;
+#endif
 };
 
 struct pnv_phb {
-- 
1.7.8.3



[PATCH 1/3] Device isolation group infrastructure (v3)

2012-01-31 Thread David Gibson
In order to safely drive a device with a userspace driver, or to pass
it through to a guest system, we must first make sure that the device
is isolated in such a way that it cannot interfere with other devices
on the system.  This isolation is only available on some systems and
will generally require an iommu, and might require other support in
bridges or other system hardware.

Often, it's not possible to isolate every device from every other
device in the system.  For example, certain PCI/PCIe bridge
configurations mean that an iommu cannot reliably distinguish which
device behind the bridge initiated a DMA transaction.  Similarly some
buggy PCI multifunction devices initiate all DMAs as function 0, so
the functions cannot be isolated from each other, even if the IOMMU
normally allows this.

Therefore, the user, and code to allow userspace drivers or guest
passthrough, needs a way to determine which devices can be isolated
from which others.  This patch adds infrastructure to handle this by
introducing the concept of a "device isolation group" - a group of
devices which can, as a unit, be safely isolated from the rest of the
system and therefore can be, as a unit, safely assigned to an
unprivileged user or guest.  That is, the groups represent the minimum
granularity with which devices may be assigned to untrusted
components.

This code manages groups, but does not create them or allow use of
grouped devices by a guest.  Creating groups would be done by iommu or
bridge drivers, using the interface this patch provides.  It's
expected that the groups will be used in future by the in-kernel iommu
interface, and would also be used by VFIO or other subsystems to allow
safe passthrough of devices to userspace or guests.

Signed-off-by: Alexey Kardashevskiy 
Signed-off-by: David Gibson 
---
 drivers/base/Kconfig |3 +
 drivers/base/Makefile|1 +
 drivers/base/base.h  |3 +
 drivers/base/core.c  |6 ++
 drivers/base/device_isolation.c  |  184 ++
 drivers/base/init.c  |2 +
 include/linux/device.h   |5 +
 include/linux/device_isolation.h |  100 +
 8 files changed, 304 insertions(+), 0 deletions(-)
 create mode 100644 drivers/base/device_isolation.c
 create mode 100644 include/linux/device_isolation.h

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 7be9f79..a52f2db 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -189,4 +189,7 @@ config DMA_SHARED_BUFFER
  APIs extension; the file's descriptor can then be passed on to other
  driver.
 
+config DEVICE_ISOLATION
+   bool "Enable isolating devices for safe pass-through to guests or user 
space."
+
 endmenu
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 2c8272d..5daef29 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_MODULES) += module.o
 endif
 obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
 obj-$(CONFIG_REGMAP)   += regmap/
+obj-$(CONFIG_DEVICE_ISOLATION) += device_isolation.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
 
diff --git a/drivers/base/base.h b/drivers/base/base.h
index b858dfd..713e168 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -25,6 +25,9 @@
  * bus_type/class to be statically allocated safely.  Nothing outside of the
  * driver core should ever touch these fields.
  */
+
+#include 
+
 struct subsys_private {
struct kset subsys;
struct kset *devices_kset;
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 4a67cc0..18edcb1 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "base.h"
 #include "power/power.h"
@@ -644,6 +645,9 @@ void device_initialize(struct device *dev)
lockdep_set_novalidate_class(&dev->mutex);
spin_lock_init(&dev->devres_lock);
INIT_LIST_HEAD(&dev->devres_head);
+#ifdef CONFIG_DEVICE_ISOLATION
+   dev->di_group = NULL;
+#endif
device_pm_init(dev);
set_dev_node(dev, -1);
 }
@@ -1047,6 +1051,8 @@ int device_add(struct device *dev)
class_intf->add_dev(dev, class_intf);
mutex_unlock(&dev->class->p->mutex);
}
+
+   device_isolation_dev_update_sysfs(dev);
 done:
put_device(dev);
return error;
diff --git a/drivers/base/device_isolation.c b/drivers/base/device_isolation.c
new file mode 100644
index 000..4f1f17e
--- /dev/null
+++ b/drivers/base/device_isolation.c
@@ -0,0 +1,184 @@
+/*
+ * device_isolation.c
+ *
+ * Handling of device isolation groups, groups of hardware devices
+ * which are sufficiently isolated by an IOMMU from the rest of the
+ * system that they can be safely given (as a unit) to an unprivileged
+ * user process or guest system to

Re: RFC: Device isolation groups

2012-02-01 Thread David Gibson
On Wed, Feb 01, 2012 at 01:08:39PM -0700, Alex Williamson wrote:
> On Wed, 2012-02-01 at 15:46 +1100, David Gibson wrote:
> > This patch series introduces a new infrastructure to the driver core
> > for representing "device isolation groups".  That is, groups of
> > devices which can be "isolated" in such a way that the rest of the
> > system can be protected from them, even in the presence of userspace
> > or a guest OS directly driving the devices.
> > 
> > Isolation will typically be due to an IOMMU which can safely remap DMA
> > and interrupts coming from these devices.  We need to represent whole
> > groups, rather than individual devices, because there are a number of
> > cases where the group can be isolated as a whole, but devices within
> > it cannot be safely isolated from each other - this usually occurs
> > because the IOMMU cannot reliably distinguish which device in the
> > group initiated a transaction.  In other words, isolation groups
> > represent the minimum safe granularity for passthrough to guests or
> > userspace.
> > 
> > This series provides the core infrastructure for tracking isolation
> > groups, and example implementations initializing the groups
> > appropriately for two PCI bridges (which include IOMMUs) found on IBM
> > POWER systems.
> > 
> > Actually using the group information is not included here, but David
> > Woodhouse has expressed an interest in using a structure like this to
> > represent operations in iommu_ops more correctly.
> > 
> > Some tracking of groups is a prerequisite for safe passthrough of
> > devices to guests or userspace, such as done by VFIO.  Current VFIO
> > patches use the iommu_ops->device_group mechanism for this.  However,
> > that mechanism is awkward, because without an in-kernel concrete
> > representation of groups, enumerating a group requires traversing
> > every device on a given bus type.  It also fails to cover some very
> > plausible IOMMU topologies, because its groups cannot span devices on
> > multiple bus types.
> 
> So far so good, but there's not much meat on the bone yet.

True..

>  The sysfs
> linking and a list of devices in a group is all pretty straight forward
> and obvious.  I'm not sure yet how this solves the DMA quirks kind of
> issues though.

It doesn't, yet.

>  For instance if we have the ricoh device that uses the
> wrong source ID for DMA from function 1 and we put functions 0 & 1 in an
> isolation group... then what?  And who does device quirk grouping?  Each
> IOMMU driver?

I think so.  It'd be nicer to have this in a common place, but I can't
see a reasonable way of doing that - the IOMMU driver really needs to
have control over group allocation.  We can make it easy for the IOMMU
drivers, though by having a common header quirk and flag which the
IOMMU driver can check.

The other thing I'm considering is whether we should actually take a
whitelist approach, rather than a blacklist approach.  Chances are
that when you factor in various debug registers many, maybe most,
multifunction devices won't actually safely isolate the functions from
each other.  So it might be a better approach to have IOMMU drivers
generally treat a single slot as a group unless the specific device is
whitelisted as having function isolation (and SR-IOV VFs would be
whitelisted, of course).
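
Roughly what I have in mind, as a sketch only (can_isolate_function() and
the PCI_DEV_FLAGS_FUNCTION_ISOLATED flag are purely hypothetical; only
pdev->is_virtfn exists today):

/* Rough sketch of the whitelist idea; PCI_DEV_FLAGS_FUNCTION_ISOLATED
 * is hypothetical, nothing like it exists. */
static bool can_isolate_function(struct pci_dev *pdev)
{
        if (pdev->is_virtfn)
                return true;    /* SR-IOV VFs: whitelisted by definition */
        if (pdev->dev_flags & PCI_DEV_FLAGS_FUNCTION_ISOLATED)  /* hypothetical */
                return true;
        return false;           /* default: group at (at least) slot granularity */
}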

> For the iommu_device_group() interface, I had imagined that we'd have
> something like:
> 
> struct device *device_dma_alias_quirk(struct device *dev)
> {
>   if (/* dev has a DMA alias quirk */) return /* the device it aliases */;
> 
>   return dev;
> }
> 
> Then iommu_device_group turns into:
> 
> int iommu_device_group(struct device *dev, unsigned int *groupid)
> {
>   dev = device_dma_alias_quirk(dev);
> if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
> return dev->bus->iommu_ops->device_group(dev, groupid);
> 
> return -ENODEV;
> }
> 
> and device_dma_alias_quirk() is available for dma_ops too.
> 
> So maybe a struct device_isolation_group not only needs a list of
> devices, but it also needs the representative device to do mappings
> identified.

Perhaps.  For now, I was assuming just taking the first list element
would suffice for these cases.

>  dma_ops would then just use dev->di_group->dma_dev for
> mappings, and I assume we call iommu_alloc() with a di_group and instead
> of iommu_attach/detach_device, we'd have iommu_attach/detach_group?

That's the idea, yes.
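
i.e. something shaped like the following (proposal only, nothing like this
exists yet; the implementation would simply walk the group's device list
calling the existing iommu_attach_device()/iommu_detach_device()):

/* Proposed shape only; not merged anywhere. */
struct iommu_domain;
struct device_isolation_group;

int iommu_attach_group(struct iommu_domain *domain,
                       struct device_isolation_group *group);
void iommu_detach_group(struct iommu_domain *domain,
                        struct device_isolation_group *group);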

> What I'm really curious about is where you now stand on what's going to
> happen in device_isolation_bind().  How do 

Re: [PATCH 3/3] device_isolation: Support isolation on POWER p7ioc (IODA) bridges

2012-02-01 Thread David Gibson
On Wed, Feb 01, 2012 at 12:17:05PM -0700, Alex Williamson wrote:
> On Wed, 2012-02-01 at 15:46 +1100, David Gibson wrote:
> > This patch adds code to the code for the powernv platform to create
> > and populate isolation groups on hardware using the p7ioc (aka IODA) PCI 
> > host
> > bridge used on some IBM POWER systems.
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > Signed-off-by: David Gibson 
> > ---
> >  arch/powerpc/platforms/powernv/pci-ioda.c |   18 --
> >  arch/powerpc/platforms/powernv/pci.h  |6 ++
> >  2 files changed, 22 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> > b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index 5e155df..4648475 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -20,6 +20,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -877,6 +878,9 @@ static void __devinit pnv_ioda_setup_bus_dma(struct 
> > pnv_ioda_pe *pe,
> > set_iommu_table_base(&dev->dev, &pe->tce32_table);
> > if (dev->subordinate)
> > pnv_ioda_setup_bus_dma(pe, dev->subordinate);
> > +#ifdef CONFIG_DEVICE_ISOLATION
> > +   device_isolation_dev_add(&pe->di_group, &dev->dev);
> > +#endif
> > }
> >  }
> >  
> > @@ -957,11 +961,21 @@ static void __devinit 
> > pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> > }
> > iommu_init_table(tbl, phb->hose->node);
> >  
> > -   if (pe->pdev)
> > +#ifdef CONFIG_DEVICE_ISOLATION
> > +   BUG_ON(device_isolation_group_init(&pe->di_group, "ioda:rid%x-pe%x",
> > +  pe->rid, pe->pe_number) < 0);
> > +#endif
> > +
> > +   if (pe->pdev) {
> > set_iommu_table_base(&pe->pdev->dev, tbl);
> > -   else
> > +#ifdef CONFIG_DEVICE_ISOLATION
> > +   device_isolation_dev_add(&pe->di_group, &pe->pdev->dev);
> > +#endif
> > +   } else
> > pnv_ioda_setup_bus_dma(pe, pe->pbus);
> 
> Blech, #ifdefs.

Hm, yeah.  The problem is the di_group member not even existing when
!DEVICE_ISOLATION.  Might be able to avoid that with an empty
structure in that case.
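
Something like this in the header ought to let pci-ioda.c lose the
#ifdefs entirely (sketch only, signatures inferred from the calls in
the patch above):

#ifdef CONFIG_DEVICE_ISOLATION
struct device_isolation_group {
        /* real contents */
};
int device_isolation_group_init(struct device_isolation_group *group,
                                const char *fmt, ...);
void device_isolation_dev_add(struct device_isolation_group *group,
                              struct device *dev);
#else
/* empty, so it can still be embedded in pnv_ioda_pe */
struct device_isolation_group { };

static inline int device_isolation_group_init(
                struct device_isolation_group *group, const char *fmt, ...)
{
        return 0;
}

static inline void device_isolation_dev_add(
                struct device_isolation_group *group, struct device *dev)
{
}
#endif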

> > +
> > +
> > return;
> >   fail:
> > /* XXX Failure: Try to fallback to 64-bit only ? */
> > diff --git a/arch/powerpc/platforms/powernv/pci.h 
> > b/arch/powerpc/platforms/powernv/pci.h
> > index 64ede1e..3e282b7 100644
> > --- a/arch/powerpc/platforms/powernv/pci.h
> > +++ b/arch/powerpc/platforms/powernv/pci.h
> > @@ -1,6 +1,8 @@
> >  #ifndef __POWERNV_PCI_H
> >  #define __POWERNV_PCI_H
> >  
> > +#include 
> > +
> >  struct pci_dn;
> >  
> >  enum pnv_phb_type {
> > @@ -60,6 +62,10 @@ struct pnv_ioda_pe {
> >  
> > /* Link in list of PE#s */
> > struct list_headlink;
> > +
> > +#ifdef CONFIG_DEVICE_ISOLATION
> > +   struct device_isolation_group di_group;
> > +#endif
> 
> Embedding the struct means we need to know the size, which means we
> can't get rid of the #ifdef.  Probably better to use a pointer if we
> don't mind adding a few bytes in the #ifndef case.  Thanks,

I've been back and forth a few times on this, and I've convinced
myself that allowing the group structure to be embedded is a better
idea.  It's a particular help when you need to construct one from
platform or bridge init code that runs before mem_init_done.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/3] Device isolation group infrastructure (v3)

2012-02-08 Thread David Gibson
On Wed, Feb 08, 2012 at 04:27:48PM +0100, Joerg Roedel wrote:
> On Wed, Feb 01, 2012 at 03:46:52PM +1100, David Gibson wrote:
> > In order to safely drive a device with a userspace driver, or to pass
> > it through to a guest system, we must first make sure that the device
> > is isolated in such a way that it cannot interfere with other devices
> > on the system.  This isolation is only available on some systems and
> > will generally require an iommu, and might require other support in
> > bridges or other system hardware.
> > 
> > Often, it's not possible to isolate every device from every other
> > device in the system.  For example, certain PCI/PCIe bridge
> > configurations mean that an iommu cannot reliably distinguish which
> > device behind the bridge initiated a DMA transaction.  Similarly some
> > buggy PCI multifunction devices initiate all DMAs as function 0, so
> > the functions cannot be isolated from each other, even if the IOMMU
> > normally allows this.
> > 
> > Therefore, the user, and code to allow userspace drivers or guest
> > passthrough, needs a way to determine which devices can be isolated
> > from which others.  This patch adds infrastructure to handle this by
> > introducing the concept of a "device isolation group" - a group of
> > devices which can, as a unit, be safely isolated from the rest of the
> > system and therefore can be, as a unit, safely assigned to an
> > unprivileged used or guest.  That is, the groups represent the minimum
> > granularity with which devices may be assigned to untrusted
> > components.
> > 
> > This code manages groups, but does not create them or allow use of
> > grouped devices by a guest.  Creating groups would be done by iommu or
> > bridge drivers, using the interface this patch provides.  It's
> > expected that the groups will be used in future by the in-kernel iommu
> > interface, and would also be used by VFIO or other subsystems to allow
> > safe passthrough of devices to userspace or guests.
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > Signed-off-by: David Gibson 
> > ---
> >  drivers/base/Kconfig |3 +
> >  drivers/base/Makefile|1 +
> >  drivers/base/base.h  |3 +
> >  drivers/base/core.c  |6 ++
> >  drivers/base/device_isolation.c  |  184 
> > ++
> >  drivers/base/init.c  |2 +
> >  include/linux/device.h   |5 +
> >  include/linux/device_isolation.h |  100 +
> 
> Again, device grouping is done by the IOMMU drivers, so this all belongs
> into the generic iommu-code rather than the driver core.
> 
> I think it makes sense to introduce a device->iommu pointer which
> depends on CONFIG_IOMMU_API and put the group information into it.
> This also has the benefit that we can consolidate all the
> device->arch.iommu pointers into device->iommu as well.

Well, not quite.  In the two example setups in the subsequent patches
the grouping is done by the bridge driver, which in these cases is not
IOMMU_API aware.  They probably should become so, but that's another
project - and relies on the IOMMU_API becoming group aware.

Note that although iommus are the main source of group constraints,
they're not necessarily the only one. Bridge error isolation semantics
may also play a part, for one.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/3] Device isolation group infrastructure (v3)

2012-02-09 Thread David Gibson
On Thu, Feb 09, 2012 at 12:28:05PM +0100, Joerg Roedel wrote:
> On Thu, Feb 09, 2012 at 08:39:28AM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2012-02-08 at 16:27 +0100, Joerg Roedel wrote:
> > > Again, device grouping is done by the IOMMU drivers, so this all
> > > belongs
> > > into the generic iommu-code rather than the driver core.
> > 
> > Except that there isn't really a "generic iommu code"... discovery,
> > initialization & matching of iommu vs. devices etc... that's all
> > implemented in the arch specific iommu code.
> 
> The whole point of moving the iommu drivers to drivers/iommu was to
> factor out common code. We are not where we want to be yet but the goal
> is to move more code to the generic part.
> 
> For the group-code this means that the generic code should iterate over
> all devices on a bus and build up group structures based on isolation
> information provided by the arch specific code.

And how exactly do you suggest it provide that information?  I really
can't see how an iommu driver would specify its isolation constraints
generally enough, except in the form of code and then we're back to
where we are now.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: RFC: Device isolation groups

2012-03-08 Thread David Gibson
On Wed, Feb 29, 2012 at 12:30:55PM -0700, Alex Williamson wrote:
> On Thu, 2012-02-02 at 12:24 +1100, David Gibson wrote:
> > On Wed, Feb 01, 2012 at 01:08:39PM -0700, Alex Williamson wrote:
[snip]
> Any update to this series?  It would be great if we could map out the
> functionality to the point of breaking down and distributing work... or
> determining if the end result has any value add to what VFIO already
> does.  Thanks,

Yes and no.

No real change on the isolation code per se.  I had been hoping for
feedback from David Woodhouse, but I guess he's been too busy.

In the meantime, however, Alexey has been working on a different
approach to doing PCI passthrough which is more suitable for our
machines.  It is based on passing through an entire virtual host
bridge (which could be a whole host side host bridge, or a subset,
depending on host isolation capabilities), rather than individual
devices.  This makes it substantially simpler than VFIO (we don't need
to virtualize config space or device addresses), and it provides
better enforcement of isolation guarantees (VFIO isolation can be
broken if devices have MMIO backdoors to config space, or if they can
be made to DMA to other devices' MMIO addresses instead of RAM
addresses), but does require suitable bridge hardware - pSeries has
such hardware, x86 mostly doesn't (although it wouldn't surprise me if
large server class x86 machines do or will provide the necessary
things).  Even on this sort of hardware the device-centred VFIO
approach may have uses, since it might allow finer grained division,
at the cost of isolation enforcement.

This provides a more concrete case for the isolation infrastructure,
since it would allow the virtual-PHB and VFIO approaches to co-exist.
As Alexey's prototype comes into shape, it should illuminate what
other content we need in the isolation infrastructure to make it fully
usable.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 0/2] RFC Isolation API

2012-03-13 Thread David Gibson
On Mon, Mar 12, 2012 at 04:32:46PM -0600, Alex Williamson wrote:
> VFIO is completely stalled waiting on a poorly defined device isolation
> infrastructure to take shape.  Rather than waiting any longer, I've
> decided to write my own.  This is nowhere near ready for upstream, but
> attempts to hash out some of the interactions of isolation groups.

Sigh, yeah, I've had trouble getting to this amongst other things,
thanks for pushing something forwards.

> To recap, an isolation group is a set of devices, between which there
> is no guaranteed isolation.  Between isolation groups, we do have
> isolation, generally provided by an IOMMU.  On x86 systems with either
> VT-d or AMD-Vi, we often have function level isolation (assuming you
> trust the device).  Hardware topologies such as PCIe-to-PCI bridges
> can decrease this granularity.  PowerPC systems often set the delimiter
> at the PCI bridge level and refer to these as Partitionable
> Endpoints.

So, this is a bit of an aside from the isolation infrastructure
itself, but it's pretty important I think, and the reason we don't
think the VFIO device-function centric approach is appropriate for
Power.  Function level isolation requires trusting the device a *lot*.
In addition to obvious bugs like those multi-function devices which
use function 0's RID for DMAs from all functions, isolation can be
broken if any of these is true:

* IO or MMIO allows unvirtualized access to the device's config
space - this is a common debug/undocumented feature

* The device can be made to cause a bus-wide error.  Given the
general quality of commodity hardware, I suspect this will be quite
common too.

* There is a multifunction device with any kind of crosstalk
between the functions - again, (possibly undocumented) debug registers
which are shared between functions is pretty common.

* The device can generate DMA bus cycles which might get decoded
by something other than the host bridge.  I suspect this one means
that a DMA capable pre-Express PCI device can never be truly isolated
from things on the same physical bus segment.

And those are just the ones I've thought of so far.  SR-IOV VFs are
probably ok (modulo the inevitable implementation bugs), but other
than that I suspect PCI devices which can really be trusted for
function-level isolation will be pretty rare.  If you have trustable
P2P bridges (as we do on Power servers), though, you can put any PCI
device behind it and have the bridge enforce isolation.  This is why
most pSeries setups have every PCI slot behind a separate P2P bridge,
or in many cases an entire separate host bridge.

> VFIO is a userspace driver interface tailored as a replacement for
> KVM device assignment.  In order to provide secure userspace driver
> interfaces, we must first ensure that we have an isolated device
> infrastructure.  This attempts to define the basics of such an
> interface.
> 
> In addition to isolation groups, this series also introduces the idea
> of an isolation "provider".  This is simply a driver which defines
> isolation groups, for example intel-iommu.  This interface supports
> multiple providers simultaneously.  We also have the idea of a "manager"
> for an isolation group.  When a manager is set for an isolation group,
> it changes the way driver matching works for devices.  We only allow
> matching to drivers registered by the isolation group manager.  Once all
> of the devices in an isolation group are bound to manager registered
> drivers (or no driver), the group is "locked" under manager control.

Yeah, I really should have posted my draft patch a while back which
added isolation group "binders", pretty much equivalent to your
"managers".

> This proposal is far from complete, but I hope it can re-fire some
> discussion and work in this space.  Please let me know what you like,
> what you don't like, and ideas for the gaps.  Thanks,

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/2] Isolation groups

2012-03-13 Thread David Gibson
+ if (driver->drv == drv) {
> + list_del(&driver->list);
> + kfree(driver);
> + break;
> + }
> + }
> +
> + mutex_unlock(&isolation_lock);
> +}
> +
> +/*
> + * Test whether a driver is a "managed" driver for the group.  This allows
> + * us to preempt normal driver matching and only let our drivers in.
> + */
> +bool isolation_group_manager_driver(struct isolation_group *group,
> + struct device_driver *drv)
> +{
> + struct isolation_manager_driver *driver;
> + bool found = false;
> +
> + if (!group->manager)
> + return found;
> +
> + mutex_lock(&isolation_lock);
> +
> + list_for_each_entry(driver, &group->manager->drivers, list) {
> + if (driver->drv == drv) {
> + found = true;
> + break;
> + }
> + }
> +
> + mutex_unlock(&isolation_lock);
> +
> + return found;
> +}
> +
> +/*
> + * Does the group manager have control of all of the devices in the group?
> + * We consider driver-less devices to be ok here as they don't do DMA and
> + * don't present any interfaces to subvert the rest of the group.
> + */
> +bool isolation_group_manager_locked(struct isolation_group *group)
> +{
> + struct isolation_device *device;
> + struct isolation_manager_driver *driver;
> + bool found, locked = true;
> +
> + if (!group->manager)
> + return false;
> +
> + mutex_lock(&isolation_lock);
> +
> + list_for_each_entry(device, &group->devices, list) {
> + found = false;
> +
> + if (!device->dev->driver)
> + continue;
> +

You could simplify this using isolation_group_manager_driver(),
couldn't you?
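
i.e. roughly the below.  Note that isolation_group_manager_driver() as
written takes isolation_lock itself, so you'd want to split the list
walk out into an unlocked __isolation_group_manager_driver() helper
first (sketch only):

bool isolation_group_manager_locked(struct isolation_group *group)
{
        struct isolation_device *device;
        bool locked = true;

        if (!group->manager)
                return false;

        mutex_lock(&isolation_lock);

        list_for_each_entry(device, &group->devices, list) {
                if (!device->dev->driver)
                        continue;

                /* same check as isolation_group_manager_driver(),
                 * minus the locking */
                if (!__isolation_group_manager_driver(group,
                                                      device->dev->driver)) {
                        locked = false;
                        break;
                }
        }

        mutex_unlock(&isolation_lock);

        return locked;
}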

> + list_for_each_entry(driver, &group->manager->drivers, list) {
> + if (device->dev->driver == driver->drv) {
> + found = true;
> + break;
> + }
> + }
> +
> + if (!found) {
> + locked = false;
> + break;
> + }
> + }
> +
> + mutex_unlock(&isolation_lock);
> +
> + return locked;
> +}
> +
> +static int __init isolation_init(void)
> +{
> + isolation_kset = kset_create_and_add("isolation", NULL, NULL);
> + 
> + WARN(!isolation_kset, "Failed to initialize isolation group kset\n");
> +
> + return isolation_kset ? 0 : -1;

I'd be tempted to just BUG() here if you can't add the kset - I can't
see any reason it would fail unless you're so short of RAM that you
have bigger problems.  Making this a fatal fail would save having to
double check if the kset is around in the later paths.
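
i.e. just:

static int __init isolation_init(void)
{
        isolation_kset = kset_create_and_add("isolation", NULL, NULL);
        BUG_ON(!isolation_kset);

        return 0;
}
subsys_initcall(isolation_init);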

> +}
> +subsys_initcall(isolation_init);
> diff --git a/include/linux/device.h b/include/linux/device.h
> index b63fb39..5805c56 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -663,6 +663,10 @@ struct device {
>  
>   struct device_dma_parameters *dma_parms;
>  
> +#ifdef CONFIG_ISOLATION_GROUPS
> + struct isolation_group  *isolation_group;
> +#endif
> +
>   struct list_headdma_pools;  /* dma pools (if dma'ble) */
>  
>   struct dma_coherent_mem *dma_mem; /* internal for coherent mem
> diff --git a/include/linux/isolation.h b/include/linux/isolation.h
> new file mode 100644
> index 000..1d87caf
> --- /dev/null
> +++ b/include/linux/isolation.h
> @@ -0,0 +1,138 @@
> +/*
> + * Isolation group interface
> + *
> + * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + */
> +
> +#ifndef _LINUX_ISOLATION_H
> +#define _LINUX_ISOLATION_H
> +
> +#define ISOLATION_NOTIFY_ADD_DEVICE  1
> +#define ISOLATION_NOTIFY_DEL_DEVICE  2
> +
> +#ifdef CONFIG_ISOLATION_GROUPS
> +
> +extern struct isolation_group *isolation_create_group(void);
> +extern int isolation_free_group(struct isolation_group *group);
> +extern int isolation_group_add_dev(struct isolation_group *group,
> +struct device *dev);
> +extern int isolation_group_del_dev(struct device *dev);
> +extern int isolation_register_notifier(s

Re: [PATCH 1/2] Isolation groups

2012-03-14 Thread David Gibson
On Tue, Mar 13, 2012 at 10:49:47AM -0600, Alex Williamson wrote:
> On Wed, 2012-03-14 at 01:33 +1100, David Gibson wrote:
> > On Mon, Mar 12, 2012 at 04:32:54PM -0600, Alex Williamson wrote:
> > > Signed-off-by: Alex Williamson 
> > > ---
> > > 
> > >  drivers/base/Kconfig  |   10 +
> > >  drivers/base/Makefile |1 
> > >  drivers/base/base.h   |5 
> > >  drivers/base/isolation.c  |  798 
> > > +
> > >  include/linux/device.h|4 
> > >  include/linux/isolation.h |  138 
> > >  6 files changed, 956 insertions(+), 0 deletions(-)
> > >  create mode 100644 drivers/base/isolation.c
> > >  create mode 100644 include/linux/isolation.h
> > > 
> > > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> > > index 7be9f79..e98a5f3 100644
> > > --- a/drivers/base/Kconfig
> > > +++ b/drivers/base/Kconfig
> > > @@ -189,4 +189,14 @@ config DMA_SHARED_BUFFER
> > > APIs extension; the file's descriptor can then be passed on to other
> > > driver.
> > >  
> > > +config ISOLATION_GROUPS
> > > + bool "Enable Isolation Group API"
> > > + default n
> > > + depends on EXPERIMENTAL && IOMMU_API
> > > + help
> > > +   This option enables grouping of devices into Isolation Groups
> > > +   which may be used by other subsystems to perform quirks across
> > > +   sets of devices as well as userspace drivers for guaranteeing
> > > +   devices are isolated from the rest of the system.
> > > +
> > >  endmenu
> > > diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> > > index 610f999..047b5f9 100644
> > > --- a/drivers/base/Makefile
> > > +++ b/drivers/base/Makefile
> > > @@ -19,6 +19,7 @@ obj-$(CONFIG_MODULES)   += module.o
> > >  endif
> > >  obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
> > >  obj-$(CONFIG_REGMAP) += regmap/
> > > +obj-$(CONFIG_ISOLATION_GROUPS)   += isolation.o
> > >  
> > >  ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
> > >  
> > > diff --git a/drivers/base/base.h b/drivers/base/base.h
> > > index b858dfd..376758a 100644
> > > --- a/drivers/base/base.h
> > > +++ b/drivers/base/base.h
> > > @@ -1,4 +1,5 @@
> > >  #include 
> > > +#include 
> > >  
> > >  /**
> > >   * struct subsys_private - structure to hold the private to the driver 
> > > core portions of the bus_type/class structure.
> > > @@ -108,6 +109,10 @@ extern int driver_probe_device(struct device_driver 
> > > *drv, struct device *dev);
> > >  static inline int driver_match_device(struct device_driver *drv,
> > > struct device *dev)
> > >  {
> > > + if (isolation_group_managed(to_isolation_group(dev)) &&
> > > + !isolation_group_manager_driver(to_isolation_group(dev), drv))
> > > + return 0;
> > > +
> > >   return drv->bus->match ? drv->bus->match(dev, drv) : 1;
> > >  }
> > >  
> > > diff --git a/drivers/base/isolation.c b/drivers/base/isolation.c
> > > new file mode 100644
> > > index 000..c01365c
> > > --- /dev/null
> > > +++ b/drivers/base/isolation.c
> > > @@ -0,0 +1,798 @@
> > > +/*
> > > + * Isolation group interface
> > > + *
> > > + * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
> > > + * Author: Alex Williamson 
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + */
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +static struct kset *isolation_kset;
> > > +/* XXX add more complete locking, maybe rcu */
> > > +static DEFINE_MUTEX(isolation_lock);
> > > +static LIST_HEAD(isolation_groups);
> > > +static LIST_HEAD(isolation_notifiers);
> > > +
> > > +/* Keep these private */
> > > +struct isolation_manager_driver {
> > > + struct device_driver *drv;
> > > + struct list_head list;
> > > +};
> > > +
> > 

Re: [PATCH 1/2] Isolation groups

2012-03-15 Thread David Gibson
On Thu, Mar 15, 2012 at 02:15:01PM -0600, Alex Williamson wrote:
> On Wed, 2012-03-14 at 20:58 +1100, David Gibson wrote:
> > On Tue, Mar 13, 2012 at 10:49:47AM -0600, Alex Williamson wrote:
> > > On Wed, 2012-03-14 at 01:33 +1100, David Gibson wrote:
> > > > On Mon, Mar 12, 2012 at 04:32:54PM -0600, Alex Williamson wrote:
> > > > > +/*
> > > > > + * Add a device to an isolation group.  Isolation groups start empty 
> > > > > and
> > > > > + * must be told about the devices they contain.  Expect this to be 
> > > > > called
> > > > > + * from isolation group providers via notifier.
> > > > > + */
> > > > 
> > > > Doesn't necessarily have to be from a notifier, particularly if the
> > > > provider is integrated into host bridge code.
> > > 
> > > Sure, a provider could do this on it's own if it wants.  This just
> > > provides some infrastructure for a common path.  Also note that this
> > > helps to eliminate all the #ifdef CONFIG_ISOLATION in the provider.  Yet
> > > to be seen whether that can reasonably be the case once isolation groups
> > > are added to streaming DMA paths.
> > 
> > Right, but other than the #ifdef safety, which could be achieved more
> > simply, I'm not seeing what benefit the infrastructure provides over
> > directly calling the bus notifier function.  The infrastructure groups
> > the notifiers by bus type internally, but AFAICT exactly one bus
> > notifier call would become exactly one isolation notifier call, and
> > the notifier callback itself would be almost identical.
> 
> I guess I don't see this as a fundamental design point of the proposal,
> it's just a convenient way to initialize groups as a side-band addition
> until isolation groups become a more fundamental part of the iommu
> infrastructure.  If you want to do that level of integration in your
> provider, do it and make the callbacks w/o the notifier.  If nobody ends
> up using them, we'll remove them.  Maybe it will just end up being a
> bootstrap.  In the typical case, yes, one bus notifier is one isolation
> notifier.  It does however also allow one bus notifier to become
> multiple isolation notifiers, and includes some filtering that would
> just be duplicated if every provider decided to implement their own bus
> notifier.

Uh.. I didn't notice any filtering?  That's why I'm asking.

> > > > > +int isolation_group_add_dev(struct isolation_group *group, struct 
> > > > > device *dev)
> > > > > +{
> > > > > + struct isolation_device *device;
> > > > > + int ret = 0;
> > > > > +
> > > > > + mutex_lock(&isolation_lock);
> > > > > +
> > > > > + if (dev->isolation_group) {
> > > > > + ret = -EBUSY;
> > > > > + goto out;
> > > > 
> > > > This should probably be a BUG_ON() - the isolation provider shouldn't
> > > > be putting the same device into two different groups.
> > > 
> > > Yeah, probably.
> > > 
> > > > > + }
> > > > > +
> > > > > + device = kzalloc(sizeof(*device), GFP_KERNEL);
> > > > > + if (!device) {
> > > > > + ret = -ENOMEM;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + device->dev = dev;
> > > > > +
> > > > > + /* Cross link the device in sysfs */
> > > > > + ret = sysfs_create_link(&dev->kobj, &group->kobj,
> > > > > + "isolation_group");
> > > > > + if (ret) {
> > > > > + kfree(device);
> > > > > + goto out;
> > > > > + }
> > > > > + 
> > > > > + ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> > > > > + kobject_name(&dev->kobj));
> > > > 
> > > > So, a problem both here and in my version is what to name the device
> > > > links.  Because they could potentially be quite widely scattered,
> > > > I'm not sure that kobject_name() is guaranteed to be sufficiently
> > > > unique.
> > > 
> > > Even if the name is not, we're pointing to a unique sysfs location.  I

Re: [PATCH 1/2] Isolation groups

2012-03-16 Thread David Gibson
On Fri, Mar 16, 2012 at 01:31:18PM -0600, Alex Williamson wrote:
> On Fri, 2012-03-16 at 14:45 +1100, David Gibson wrote:
> > On Thu, Mar 15, 2012 at 02:15:01PM -0600, Alex Williamson wrote:
> > > On Wed, 2012-03-14 at 20:58 +1100, David Gibson wrote:
> > > > On Tue, Mar 13, 2012 at 10:49:47AM -0600, Alex Williamson wrote:
> > > > > On Wed, 2012-03-14 at 01:33 +1100, David Gibson wrote:
> > > > > > On Mon, Mar 12, 2012 at 04:32:54PM -0600, Alex Williamson wrote:
> > > > > > > +/*
> > > > > > > + * Add a device to an isolation group.  Isolation groups start 
> > > > > > > empty and
> > > > > > > + * must be told about the devices they contain.  Expect this to 
> > > > > > > be called
> > > > > > > + * from isolation group providers via notifier.
> > > > > > > + */
> > > > > > 
> > > > > > Doesn't necessarily have to be from a notifier, particularly if the
> > > > > > provider is integrated into host bridge code.
> > > > > 
> > > > > Sure, a provider could do this on it's own if it wants.  This just
> > > > > provides some infrastructure for a common path.  Also note that this
> > > > > helps to eliminate all the #ifdef CONFIG_ISOLATION in the provider.  
> > > > > Yet
> > > > > to be seen whether that can reasonably be the case once isolation 
> > > > > groups
> > > > > are added to streaming DMA paths.
> > > > 
> > > > Right, but other than the #ifdef safety, which could be achieved more
> > > > simply, I'm not seeing what benefit the infrastructure provides over
> > > > directly calling the bus notifier function.  The infrastructure groups
> > > > the notifiers by bus type internally, but AFAICT exactly one bus
> > > > notifier call would become exactly one isolation notifier call, and
> > > > the notifier callback itself would be almost identical.
> > > 
> > > I guess I don't see this as a fundamental design point of the proposal,
> > > it's just a convenient way to initialize groups as a side-band addition
> > > until isolation groups become a more fundamental part of the iommu
> > > infrastructure.  If you want to do that level of integration in your
> > > provider, do it and make the callbacks w/o the notifier.  If nobody ends
> > > up using them, we'll remove them.  Maybe it will just end up being a
> > > bootstrap.  In the typical case, yes, one bus notifier is one isolation
> > > notifier.  It does however also allow one bus notifier to become
> > > multiple isolation notifiers, and includes some filtering that would
> > > just be duplicated if every provider decided to implement their own bus
> > > notifier.
> > 
> > Uh.. I didn't notice any filtering?  That's why I'm asking.
> 
> Not much, but a little:
> 
> +   switch (action) {
> +   case BUS_NOTIFY_ADD_DEVICE:
> +   if (!dev->isolation_group)
> +   blocking_notifier_call_chain(&notifier->notifier,
> +   ISOLATION_NOTIFY_ADD_DEVICE, dev);
> +   break;
> +   case BUS_NOTIFY_DEL_DEVICE:
> +   if (dev->isolation_group)
> +   blocking_notifier_call_chain(&notifier->notifier,
> +   ISOLATION_NOTIFY_DEL_DEVICE, dev);
> +   break;
> +   }


Ah, I see, fair enough.

A couple of tangential observations.  First, I suspect using
BUS_NOTIFY_DEL_DEVICE is a very roundabout way of handling hot-unplug;
it might be better to have an unplug callback in the group instead.

Second, I don't think aborting the call chain early for hot-plug is
actually a good idea.  I can't see a clear guarantee on the order, so
individual providers couldn't rely on that short-cut behaviour.  Which
means that if two providers would have attempted to claim the same
device, something is seriously wrong and we should probably report
that.
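
Concretely (sketch only), I'd drop the !dev->isolation_group test from
the dispatcher above and have isolation_group_add_dev() complain when
it sees a second claim, something like:

        mutex_lock(&isolation_lock);

        /* two providers claiming one device is a bug, not something
         * to skip over quietly */
        if (WARN(dev->isolation_group,
                 "device %s claimed by a second isolation provider\n",
                 dev_name(dev))) {
                mutex_unlock(&isolation_lock);
                return -EBUSY;
        }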

> ...
> > > > > > So, somewhere, I think we need a fallback path, but I'm not sure
> > > > > > exactly where.  If an isolation provider doesn't explicitly put a
> > > > > > device into a group, the device should go into the group of its 
> > > > > > parent
> > > > > > bridge.  This covers the case of a bus with IOMMU which has below 
> > > > > > it a

Re: [PATCH 1/2] Isolation groups

2012-03-26 Thread David Gibson
On Wed, Mar 21, 2012 at 03:12:58PM -0600, Alex Williamson wrote:
> On Sat, 2012-03-17 at 15:57 +1100, David Gibson wrote:
> > On Fri, Mar 16, 2012 at 01:31:18PM -0600, Alex Williamson wrote:
> > > On Fri, 2012-03-16 at 14:45 +1100, David Gibson wrote:
> > > > On Thu, Mar 15, 2012 at 02:15:01PM -0600, Alex Williamson wrote:
> > > > > On Wed, 2012-03-14 at 20:58 +1100, David Gibson wrote:
> > > > > > On Tue, Mar 13, 2012 at 10:49:47AM -0600, Alex Williamson wrote:
> > > > > > > On Wed, 2012-03-14 at 01:33 +1100, David Gibson wrote:
> > > > > > > > On Mon, Mar 12, 2012 at 04:32:54PM -0600, Alex Williamson wrote:
> > > > > > > > > +/*
> > > > > > > > > + * Add a device to an isolation group.  Isolation groups 
> > > > > > > > > start empty and
> > > > > > > > > + * must be told about the devices they contain.  Expect this 
> > > > > > > > > to be called
> > > > > > > > > + * from isolation group providers via notifier.
> > > > > > > > > + */
> > > > > > > > 
> > > > > > > > Doesn't necessarily have to be from a notifier, particularly if 
> > > > > > > > the
> > > > > > > > provider is integrated into host bridge code.
> > > > > > > 
> > > > > > > Sure, a provider could do this on it's own if it wants.  This just
> > > > > > > provides some infrastructure for a common path.  Also note that 
> > > > > > > this
> > > > > > > helps to eliminate all the #ifdef CONFIG_ISOLATION in the 
> > > > > > > provider.  Yet
> > > > > > > to be seen whether that can reasonably be the case once isolation 
> > > > > > > groups
> > > > > > > are added to streaming DMA paths.
> > > > > > 
> > > > > > Right, but other than the #ifdef safety, which could be achieved 
> > > > > > more
> > > > > > simply, I'm not seeing what benefit the infrastructure provides over
> > > > > > directly calling the bus notifier function.  The infrastructure 
> > > > > > groups
> > > > > > the notifiers by bus type internally, but AFAICT exactly one bus
> > > > > > notifier call would become exactly one isolation notifier call, and
> > > > > > the notifier callback itself would be almost identical.
> > > > > 
> > > > > I guess I don't see this as a fundamental design point of the 
> > > > > proposal,
> > > > > it's just a convenient way to initialize groups as a side-band 
> > > > > addition
> > > > > until isolation groups become a more fundamental part of the iommu
> > > > > infrastructure.  If you want to do that level of integration in your
> > > > > provider, do it and make the callbacks w/o the notifier.  If nobody 
> > > > > ends
> > > > > up using them, we'll remove them.  Maybe it will just end up being a
> > > > > bootstrap.  In the typical case, yes, one bus notifier is one 
> > > > > isolation
> > > > > notifier.  It does however also allow one bus notifier to become
> > > > > multiple isolation notifiers, and includes some filtering that would
> > > > > just be duplicated if every provider decided to implement their own 
> > > > > bus
> > > > > notifier.
> > > > 
> > > > Uh.. I didn't notice any filtering?  That's why I'm asking.
> > > 
> > > Not much, but a little:
> > > 
> > > +   switch (action) {
> > > +   case BUS_NOTIFY_ADD_DEVICE:
> > > +   if (!dev->isolation_group)
> > > +   blocking_notifier_call_chain(&notifier->notifier,
> > > +   ISOLATION_NOTIFY_ADD_DEVICE, dev);
> > > +   break;
> > > +   case BUS_NOTIFY_DEL_DEVICE:
> > > +   if (dev->isolation_group)
> > > +   blocking_notifier_call_chain(&notifier->notifier,
> > > +   ISOLATION_NOTIFY_DEL_DEVICE, dev);
> > > +   break;
> > > +   }
> > 
> > 
> > Ah, I see, fair enough.
> > 
> > A 

Re: [PATCH 1/2] Isolation groups

2012-03-29 Thread David Gibson
On Tue, Mar 27, 2012 at 01:34:43PM -0600, Alex Williamson wrote:
[snip]
> > > > this case, it gets a bit complex.  When the FooBus isolation provider
> > > > is active, the FooBus devices would be in their own groups, not the
> > > > group of the FooBridge and its sibling.  When the FooBus isolation
> > > > provider is removed, it would have to configure the FooBus IOMMU to a
> > > > passthrough mode, and revert the FooBus devices to the parent's
> > > > group.  Hm.  Complicated.
> > > 
> > > Yep.  I think we're arriving at the same point.  Groups are
> > > hierarchical, but ownership via a manager cannot be nested.  So to
> > > manage a group, we need to walk the entire tree of devices below each
> > > device checking that none of the groups are managed and all the devices
> > > are using the right driver, then walk up from the group to verify no
> > > group of a parent device is managed.  Thanks,
> > 
> > Blargh.  I really, really hope we can come up with a simpler model
> > than that.
> 
> Yep, I'm pretty well at the end of this experiment.  Honestly, I think
> isolation groups are the wrong approach.  We're trying to wrap too many
> concepts together and it's completely unmanageable.  I cannot see adding
> the complexity we're talking about here to the core device model for
> such a niche usage.  I think we're better off going back to the
> iommu_device_group() and building that out into something more complete,
> starting with group based iommu ops and a dma quirk infrastructure.
> From there we can add some basic facilities to toggle driver autoprobe,
> maybe setup notifications for the group, and hopefully include a way to
> share iommu mappings between groups.  Anything much beyond that we
> should probably leave for something like the vfio driver.  Thanks,

Yes, well, I was hoping for a simpler model that didn't involve simply
sweeping all the issues under a rug.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/3] iommu: Introduce iommu_group

2012-04-18 Thread David Gibson
On Mon, Apr 02, 2012 at 03:14:40PM -0600, Alex Williamson wrote:
> IOMMUs often do not have visibility of individual devices in the
> system.  Due to IOMMU design, bus topology, or device quirks, we
> can often only identify groups of devices.  Examples include
> Intel VT-d & AMD-Vi which often have function level visibility
> compared to POWER partitionable endpoints which have bridge level
> granularity.

That's a significant oversimplification of the situation on POWER,
although it doesn't really matter in this context.  On older (i.e. pre
PCI-E) hardware, PEs have either host bridge (i.e. domain)
granularity, or, IIUC, in some cases p2p bridge granularity, using
special p2p bridges, since that's the only real way to do iommu
differentiation without the PCI-E requestor IDs.  This isn't as coarse
as it seems in practice, because the hardware is usually built with a
bridge per physical PCI slot.

On newer PCI-E hardware, the PE granularity is basically a firmware
decision, and can go down to function level.  I believe pHyp puts the
granularity at the bridge level.  Our non-virtualized Linux "firmware"
currently does put it at the function level, but Ben is thinking about
changing that to bridge level: again, because of the hardware design
that isn't as coarse as it seems, and at this level we can hardware
guarantee isolation to a degree that's not possible at the function
level.

>  PCIe-to-PCI bridges also often cloud the IOMMU
> visibility as it cannot distiguish devices behind the bridge.
> Devices can also sometimes hurt themselves by initiating DMA using
> the wrong source ID on a multifunction PCI device.
> 
> IOMMU groups are meant to help solve these problems and hopefully
> become the working unit of the IOMMI API.

So far, so simple.  No objections here.  I am trying to work out what
the real difference in approach is in this series from either your or
my earlier isolation group series.  AFAICT it's just that this
approach is explicitly only about IOMMU identity, ignoring (here) any
other factors which might affect isolation.  Or am I missing
something?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 2/3] iommu: Create basic group infrastructure and update AMD-Vi & Intel VT-d

2012-04-18 Thread David Gibson
idr, GFP_KERNEL))) {
> + kfree(group);
> + mutex_unlock(&iommu_group_mutex);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + if (-EAGAIN == idr_get_new(&iommu_group_idr, group, &group->id))
> + goto again;
> +
> + mutex_unlock(&iommu_group_mutex);
> +
> + ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
> +NULL, "%d", group->id);
> + if (ret) {
> + iommu_group_release(&group->kobj);
> + return ERR_PTR(ret);
> + }
> +
> + group->devices_kobj = kobject_create_and_add("devices", &group->kobj);
> +if (!group->devices_kobj) {
> + kobject_put(&group->kobj);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + dma_dev->iommu_group = group;
> +
> + return group;
> +}
> +
> +void iommu_group_free(struct iommu_group *group)
> +{
> + group->dma_dev->iommu_group = NULL;
> + WARN_ON(atomic_read(&group->refcnt));

This kind of implies that the representative device doesn't count
against the refcount, which seems wrong.

> + kobject_put(group->devices_kobj);
> + kobject_put(&group->kobj);
> +}
> +
> +bool iommu_group_empty(struct iommu_group *group)
> +{
> + return (0 == atomic_read(&group->refcnt));
> +}
> +
> +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> +{
> + int ret;
> +
> + atomic_inc(&group->refcnt);
> +
> + ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> + if (ret)
> + return ret;
> +
> + ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> +kobject_name(&dev->kobj));
> + if (ret) {
> + sysfs_remove_link(&dev->kobj, "iommu_group");
> + return ret;
> + }
> +
> + dev->iommu_group = group;
> +
> + return 0;
> +}
> +
> +void iommu_group_remove_device(struct device *dev)
> +{
> + sysfs_remove_link(dev->iommu_group->devices_kobj,
> +   kobject_name(&dev->kobj));
> + sysfs_remove_link(&dev->kobj, "iommu_group");
> +
> + atomic_dec(&dev->iommu_group->refcnt);
> + dev->iommu_group = NULL;
>  }
>  
>  /**
> @@ -335,9 +473,23 @@ EXPORT_SYMBOL_GPL(iommu_unmap);
>  
>  int iommu_device_group(struct device *dev, unsigned int *groupid)
>  {
> - if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
> - return dev->bus->iommu_ops->device_group(dev, groupid);
> + if (dev->iommu_group) {
> + *groupid = dev->iommu_group->id;
> + return 0;
> + }
>  
>   return -ENODEV;
>  }
>  EXPORT_SYMBOL_GPL(iommu_device_group);
> +
> +static int __init iommu_init(void)
> +{
> + iommu_group_kset = kset_create_and_add("iommu_groups", NULL, NULL);
> + idr_init(&iommu_group_idr);
> + mutex_init(&iommu_group_mutex);
> +
> + BUG_ON(!iommu_group_kset);
> +
> + return 0;
> +}
> +subsys_initcall(iommu_init);
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 2ee375c..24004d6 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -60,6 +60,8 @@ struct iommu_domain {
>   * @iova_to_phys: translate iova to physical address
>   * @domain_has_cap: domain capabilities query
>   * @commit: commit iommu domain
> + * @add_device: add device to iommu grouping
> + * @remove_device: remove device from iommu grouping
>   * @pgsize_bitmap: bitmap of supported page sizes
>   */
>  struct iommu_ops {
> @@ -76,10 +78,25 @@ struct iommu_ops {
>   int (*domain_has_cap)(struct iommu_domain *domain,
> unsigned long cap);
>   int (*device_group)(struct device *dev, unsigned int *groupid);
> + int (*add_device)(struct device *dev);
> + void (*remove_device)(struct device *dev);
>   unsigned long pgsize_bitmap;
>  };
>  
> +/**
> + * struct iommu_group - groups of devices representing iommu visibility
> + * @dma_dev: all dma from the group appears to the iommu using this source id
> + * @kobj: iommu group node in sysfs
> + * @devices_kobj: sysfs subdir node for linking devices
> + * @refcnt: number of devices in group
> + * @id: unique id number for the group (for easy sysfs listing)
> + */
>  struct iommu_group {
> + struct device *dma_dev;

So, a "representative" device works very nicely for the AMD and Intel
IOMMUs.  But I'm not sure it makes sense for every IOMMU - we could
point it to the bridge in the Power case, but I'm not sure if that
really makes sense (particularly when it's a host bridge, not a p2p
bridge).  In embedded cases it may make even less sense.

I think it would be better to move this into iommu driver specific
data (which I guess would mean adding a facility for iommu driver
specific data, either with a private pointer or using container_of).
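
For example (names invented, just to illustrate the container_of
variant):

/* IOMMU driver private wrapper: only drivers which really do have a
 * single source-id device need a dma_dev field at all */
struct intel_iommu_group {
        struct iommu_group group;
        struct device *dma_dev;
};

static struct device *intel_group_dma_dev(struct iommu_group *group)
{
        return container_of(group, struct intel_iommu_group, group)->dma_dev;
}

With that the core code never needs to know about dma_dev at all.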

> + struct kobject kobj;
> + struct kobject *devices_kobj;

The devices_kobj has the same lifetime as the group, so it might as
well be a static member.

> + atomic_t refcnt;
> + int id;
>  };
>  
>  extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
> @@ -101,6 +118,12 @@ extern int iommu_domain_has_cap(struct iommu_domain 
> *domain,
>  extern void iommu_set_fault_handler(struct iommu_domain *domain,
>   iommu_fault_handler_t handler);
>  extern int iommu_device_group(struct device *dev, unsigned int *groupid);
> +extern struct iommu_group *iommu_group_alloc(struct device *dev);
> +extern void iommu_group_free(struct iommu_group *group);
> +extern bool iommu_group_empty(struct iommu_group *group);
> +extern int iommu_group_add_device(struct iommu_group *group,
> +   struct device *dev);
> +extern void iommu_group_remove_device(struct device *dev);
>  
>  /**
>   * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 02/13] iommu: IOMMU Groups

2012-05-13 Thread David Gibson
f --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2198b2d..f75004e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -26,60 +26,404 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +
> +static struct kset *iommu_group_kset;
> +static struct ida iommu_group_ida;
> +static struct mutex iommu_group_mutex;
> +
> +struct iommu_group {
> + struct kobject kobj;
> + struct kobject *devices_kobj;
> + struct list_head devices;
> + struct mutex mutex;
> + struct blocking_notifier_head notifier;
> + int id;

I think you should add some sort of name string to the group as well
(supplied by the iommu driver creating the group).  That would make it
easier to connect the arbitrary assigned IDs to any platform-standard
naming convention for these things.
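
Something as simple as this would do (sketch, assuming a name field is
added to struct iommu_group and initialised to NULL at alloc time):

int iommu_group_set_name(struct iommu_group *group, const char *name)
{
        char *new;

        new = kstrdup(name, GFP_KERNEL);
        if (!new)
                return -ENOMEM;

        kfree(group->name);
        group->name = new;

        return 0;
}

The iommu driver would call it at group creation time, and the string
would show up as a sysfs attribute and in any printk()s about the
group.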

[snip]
> +/**
> + * iommu_group_add_device - add a device to an iommu group
> + * @group: the group into which to add the device (reference should be held)
> + * @dev: the device
> + *
> + * This function is called by an iommu driver to add a device into a
> + * group.  Adding a device increments the group reference count.
> + */
> +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> +{
> + int ret;
> + struct iommu_device *device;
> +
> + device = kzalloc(sizeof(*device), GFP_KERNEL);
> + if (!device)
> + return -ENOMEM;
> +
> + device->dev = dev;
> +
> + ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> + if (ret) {
> + kfree(device);
> + return ret;
> + }
> +
> + ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> + kobject_name(&dev->kobj));
> + if (ret) {
> + sysfs_remove_link(&dev->kobj, "iommu_group");
> + kfree(device);
> + return ret;
> + }

So, it's not clear that the kobject_name() here has to be unique
across all devices in the group.  It might be better to use an
arbitrary index here instead of a name to avoid that problem.

[snip]
> +/**
> + * iommu_group_remove_device - remove a device from it's current group
> + * @dev: device to be removed
> + *
> + * This function is called by an iommu driver to remove the device from
> + * it's current group.  This decrements the iommu group reference count.
> + */
> +void iommu_group_remove_device(struct device *dev)
> +{
> + struct iommu_group *group = dev->iommu_group;
> + struct iommu_device *device;
> +
> + /* Pre-notify listeners that a device is being removed. */
> + blocking_notifier_call_chain(&group->notifier,
> +  IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
> +
> + mutex_lock(&group->mutex);
> + list_for_each_entry(device, &group->devices, list) {
> + if (device->dev == dev) {
> + list_del(&device->list);
> + kfree(device);
> + break;
> + }
> + }
> + mutex_unlock(&group->mutex);
> +
> + sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> + sysfs_remove_link(&dev->kobj, "iommu_group");
> +
> + dev->iommu_group = NULL;

I suspect the dev -> group pointer should be cleared first, under the
group lock, but I'm not certain about that.

[snip]
> +/**
> + * iommu_group_for_each_dev - iterate over each device in the group
> + * @group: the group
> + * @data: caller opaque data to be passed to callback function
> + * @fn: caller supplied callback function
> + *
> + * This function is called by group users to iterate over group devices.
> + * Callers should hold a reference count to the group during
> callback.

Probably also worth noting in this doco that the group lock will be
held across the callback.

[snip]
> +static int iommu_bus_notifier(struct notifier_block *nb,
> +   unsigned long action, void *data)
>  {
>   struct device *dev = data;
> + struct iommu_ops *ops = dev->bus->iommu_ops;
> + struct iommu_group *group;
> + unsigned long group_action = 0;
> +
> + /*
> +  * ADD/DEL call into iommu driver ops if provided, which may
> +  * result in ADD/DEL notifiers to group->notifier
> +  */
> + if (action == BUS_NOTIFY_ADD_DEVICE) {
> + if (ops->add_device)
> + return ops->add_device(dev);
> + } else if (action == BUS_NOTIFY_DEL_DEVICE) {
> + if (ops->remove_device && dev

Re: [PATCH 02/13] iommu: IOMMU Groups

2012-05-14 Thread David Gibson
On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
[snip]
> > > +struct iommu_group {
> > > + struct kobject kobj;
> > > + struct kobject *devices_kobj;
> > > + struct list_head devices;
> > > + struct mutex mutex;
> > > + struct blocking_notifier_head notifier;
> > > + int id;
> > 
> > I think you should add some sort of name string to the group as well
> > (supplied by the iommu driver creating the group).  That would make it
> > easier to connect the arbitrary assigned IDs to any platform-standard
> > naming convention for these things.
> 
> When would the name be used and how is it exposed?

I'm thinking of this basically as a debugging aid.  So I'd expect it
to appear in a 'name' (or 'description') sysfs property on the group,
and in printk messages regarding the group.

[snip]
> > So, it's not clear that the kobject_name() here has to be unique
> > across all devices in the group.  It might be better to use an
> > arbitrary index here instead of a name to avoid that problem.
> 
> Hmm, that loses useful convenience when they are unique, such as on PCI.
> I'll look and see if sysfs_create_link will fail on duplicate names and
> see about adding some kind of instance to it.

Ok.  Is the name necessarily unique even for PCI, if the group crosses
multiple domains?

[snip]
> > > + mutex_lock(&group->mutex);
> > > + list_for_each_entry(device, &group->devices, list) {
> > > + if (device->dev == dev) {
> > > + list_del(&device->list);
> > > + kfree(device);
> > > + break;
> > > + }
> > > + }
> > > + mutex_unlock(&group->mutex);
> > > +
> > > + sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > + sysfs_remove_link(&dev->kobj, "iommu_group");
> > > +
> > > + dev->iommu_group = NULL;
> > 
> > I suspect the dev -> group pointer should be cleared first, under the
> > group lock, but I'm not certain about that.
> 
> group->mutex is protecting the group's device list.  I think my
> assumption is that when a device is being removed, there should be no
> references to it for anyone to race with iommu_group_get(dev), but I'm
> not sure how valid that is.

What I'm concerned about here is someone grabbing the device by
non-group-related means, grabbing a pointer to its group and that
racing with remove_device().  It would then end up with a group
pointer it thinks is right for the device, when the group no longer
thinks it owns the device.

Doing it under the lock is so that on the other side, group aware code
doesn't traverse the group member list and grab a reference to a
device which no longer points back to the group.
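
i.e. roughly:

        mutex_lock(&group->mutex);

        dev->iommu_group = NULL;        /* cleared while we hold the lock */

        list_for_each_entry(device, &group->devices, list) {
                if (device->dev == dev) {
                        list_del(&device->list);
                        kfree(device);
                        break;
                }
        }

        mutex_unlock(&group->mutex);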

> > [snip]
> > > +/**
> > > + * iommu_group_for_each_dev - iterate over each device in the group
> > > + * @group: the group
> > > + * @data: caller opaque data to be passed to callback function
> > > + * @fn: caller supplied callback function
> > > + *
> > > + * This function is called by group users to iterate over group devices.
> > > + * Callers should hold a reference count to the group during
> > > callback.
> > 
> > Probably also worth noting in this doco that the group lock will be
> > held across the callback.
> 
> Yes; calling iommu_group_remove_device through this would be a bad idea.

Or anything which blocks.

> > [snip]
> > > +static int iommu_bus_notifier(struct notifier_block *nb,
> > > +   unsigned long action, void *data)
> > >  {
> > >   struct device *dev = data;
> > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > + struct iommu_group *group;
> > > + unsigned long group_action = 0;
> > > +
> > > + /*
> > > +  * ADD/DEL call into iommu driver ops if provided, which may
> > > +  * result in ADD/DEL notifiers to group->notifier
> > > +  */
> > > + if (action == BUS_NOTIFY_ADD_DEVICE) {
> > > + if (ops->add_device)
> > > + return ops->add_device(dev);
> > > + } else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > > + if (ops->remove_device && dev->iommu_group) {
> > > + ops->remove_device(dev);
> > > + return 0;
> > > + }

Re: [PATCH 02/13] iommu: IOMMU Groups

2012-05-16 Thread David Gibson
On Tue, May 15, 2012 at 12:34:03AM -0600, Alex Williamson wrote:
> On Tue, 2012-05-15 at 12:03 +1000, David Gibson wrote:
> > On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> > > On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > > > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> > [snip]
> > > > > +struct iommu_group {
> > > > > + struct kobject kobj;
> > > > > + struct kobject *devices_kobj;
> > > > > + struct list_head devices;
> > > > > + struct mutex mutex;
> > > > > + struct blocking_notifier_head notifier;
> > > > > + int id;
> > > > 
> > > > I think you should add some sort of name string to the group as well
> > > > (supplied by the iommu driver creating the group).  That would make it
> > > > easier to connect the arbitrary assigned IDs to any platform-standard
> > > > naming convention for these things.
> > > 
> > > When would the name be used and how is it exposed?
> > 
> > I'm thinking of this basically as a debugging aid.  So I'd expect it
> > to appear in a 'name' (or 'description') sysfs property on the group,
> > and in printk messages regarding the group.
> 
> Ok, so long as it's only descriptive/debugging I don't have a problem
> adding something like that.
> 
> > [snip]
> > > > So, it's not clear that the kobject_name() here has to be unique
> > > > across all devices in the group.  It might be better to use an
> > > > arbitrary index here instead of a name to avoid that problem.
> > > 
> > > Hmm, that loses useful convenience when they are unique, such as on PCI.
> > > I'll look and see if sysfs_create_link will fail on duplicate names and
> > > see about adding some kind of instance to it.
> > 
> > Ok.  Is the name necessarily unique even for PCI, if the group crosses
> > multiple domains?
> 
> Yes, it includes the domain in the :bb:dd.f form.  I've found I can
> just use sysfs_create_link_nowarn and add a .# index when we have a name
> collision.

Ok, that sounds good.

> > [snip]
> > > > > + mutex_lock(&group->mutex);
> > > > > + list_for_each_entry(device, &group->devices, list) {
> > > > > + if (device->dev == dev) {
> > > > > + list_del(&device->list);
> > > > > + kfree(device);
> > > > > + break;
> > > > > + }
> > > > > + }
> > > > > + mutex_unlock(&group->mutex);
> > > > > +
> > > > > + sysfs_remove_link(group->devices_kobj, 
> > > > > kobject_name(&dev->kobj));
> > > > > + sysfs_remove_link(&dev->kobj, "iommu_group");
> > > > > +
> > > > > + dev->iommu_group = NULL;
> > > > 
> > > > I suspect the dev -> group pointer should be cleared first, under the
> > > > group lock, but I'm not certain about that.
> > > 
> > > group->mutex is protecting the group's device list.  I think my
> > > assumption is that when a device is being removed, there should be no
> > > references to it for anyone to race with iommu_group_get(dev), but I'm
> > > not sure how valid that is.
> > 
> > What I'm concerned about here is someone grabbing the device by
> > non-group-related means, grabbing a pointer to its group and that
> > racing with remove_device().  It would then end up with a group
> > pointer it thinks is right for the device, when the group no longer
> > thinks it owns the device.
> > 
> > Doing it under the lock is so that on the other side, group aware code
> > doesn't traverse the group member list and grab a reference to a
> > device which no longer points back to the group.
> 
> Our for_each function does grab the lock, as you noticed below, so
> removing it from the list under lock prevents that path.  Where it gets
> fuzzy is if someone can call iommu_group_get(dev) to get a group
> reference in this gap.

Right, that's what I'm concerned about.

>  Whether we clear the iommu_group pointer under
> lock or not doesn't matter for that path since it doesn't retrieve it
> under lock.  The assumption there is that the caller is going to have a
> reference to the device and therefore the de

Re: [PATCH 03/13] iommu: IOMMU groups for VT-d and AMD-Vi

2012-05-16 Thread David Gibson
On Fri, May 11, 2012 at 04:55:48PM -0600, Alex Williamson wrote:
> Add back group support for AMD & Intel.  amd_iommu already tracks
> devices and has init and uninit routines to manage groups.
> intel-iommu does this on the fly, so we make use of the notifier
> support built into iommu groups to create and remove groups.

Looks ok to me.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 04/13] pci: New pci_dma_quirk()

2012-05-16 Thread David Gibson
On Fri, May 11, 2012 at 04:55:55PM -0600, Alex Williamson wrote:
> Integrating IOMMU groups more closely into the driver core allows
> us to more easily work around DMA quirks.  The Ricoh multifunction
> controller is a favorite example of devices that are currently
> incompatible with IOMMU isolation as all the functions use the
> requestor ID of function 0 for DMA.  Passing this device into
> pci_dma_quirk returns the PCI device to use for DMA.  The IOMMU
> driver can then construct an IOMMU group including both devices.
> 
> Signed-off-by: Alex Williamson 
> ---
> 
>  drivers/pci/quirks.c |   22 ++
>  include/linux/pci.h  |2 ++
>  2 files changed, 24 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4bf7102..6f9f7f9 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3109,3 +3109,25 @@ int pci_dev_specific_reset(struct pci_dev *dev, int 
> probe)
>  
>   return -ENOTTY;
>  }
> +
> +struct pci_dev *pci_dma_quirk(struct pci_dev *dev)
> +{
> + struct pci_dev *dma_dev = dev;
> +
> + /*
> +  * https://bugzilla.redhat.com/show_bug.cgi?id=605888
> +  *
> +  * Some Ricoh devices use the function 0 source ID for DMA on
> +  * other functions of a multifunction device.  The DMA device
> +  * is therefore function 0, which has implications for the
> +  * iommu grouping of these devices.
> +  */
> + if (dev->vendor == PCI_VENDOR_ID_RICOH &&
> + (dev->device == 0xe822 || dev->device == 0xe230 ||
> +  dev->device == 0xe832 || dev->device == 0xe476)) {
> + dma_dev = pci_get_slot(dev->bus,
> +PCI_DEVFN(PCI_SLOT(dev->devfn), 0));
> + }

Hrm.  This seems like a very generic name for a function performing a
very specific test.  We could well have devices with the same problem
in future, so shouldn't this be set up so that the same quirk can be
easily added to new device IDs without changing the function code itself?
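
Something table driven would cover that, e.g. (just a sketch):

    static const struct pci_device_id dma_func0_quirks[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe822) },
        { PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe230) },
        { PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe832) },
        { PCI_DEVICE(PCI_VENDOR_ID_RICOH, 0xe476) },
        { }
    };

    struct pci_dev *pci_dma_quirk(struct pci_dev *dev)
    {
        /* These devices do DMA with the requester ID of function 0 */
        if (pci_match_id(dma_func0_quirks, dev))
            return pci_get_slot(dev->bus,
                                PCI_DEVFN(PCI_SLOT(dev->devfn), 0));

        return dev;
    }

Then the next device with the same problem is a one-line table
addition rather than a change to the logic.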

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()

2019-10-13 Thread David Gibson
On Fri, Oct 11, 2019 at 06:25:18PM -0700, Ram Pai wrote:
> From: Thiago Jung Bauermann 
> 
> In order to safely use the DMA API, virtio needs to know whether DMA
> addresses are in fact physical addresses and for that purpose,
> dma_addr_is_phys_addr() is introduced.
> 
> cc: Benjamin Herrenschmidt 
> cc: David Gibson 
> cc: Michael Ellerman 
> cc: Paul Mackerras 
> cc: Michael Roth 
> cc: Alexey Kardashevskiy 
> cc: Paul Burton 
> cc: Robin Murphy 
> cc: Bartlomiej Zolnierkiewicz 
> cc: Marek Szyprowski 
> cc: Christoph Hellwig 
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Ram Pai 
> Signed-off-by: Thiago Jung Bauermann 

The change itself looks ok, so

Reviewed-by: David Gibson 

However, I would like to see the commit message (and maybe the inline
comments) expanded a bit on what the distinction here is about.  Some
of the text from the next patch would be suitable, about DMA addresses
usually being in a different address space but not in the case of
bounce buffering.
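
Something along these lines is the distinction I mean (a sketch, not
the actual patch text):

    dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

    if (dma_addr_is_phys_addr(dev)) {
        /*
         * handle is also a CPU physical address - on a secure
         * pseries guest it is the physical address of the SWIOTLB
         * bounce buffer - so a device which only understands guest
         * physical addresses can still reach it.
         */
    } else {
        /*
         * handle lives in a separate bus/IO virtual address space;
         * only the IOMMU can translate it, so it must never be
         * treated as a physical address.
         */
    }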

> ---
>  arch/powerpc/include/asm/dma-mapping.h | 21 +
>  arch/powerpc/platforms/pseries/Kconfig |  1 +
>  include/linux/dma-mapping.h| 20 
>  kernel/dma/Kconfig |  3 +++
>  4 files changed, 45 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 565d6f7..f92c0a4b 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -5,6 +5,8 @@
>  #ifndef _ASM_DMA_MAPPING_H
>  #define _ASM_DMA_MAPPING_H
>  
> +#include 
> +
>  static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type 
> *bus)
>  {
>   /* We don't handle the NULL dev case for ISA for now. We could
> @@ -15,4 +17,23 @@ static inline const struct dma_map_ops 
> *get_arch_dma_ops(struct bus_type *bus)
>   return NULL;
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
> +/**
> + * dma_addr_is_phys_addr - check whether a device DMA address is a physical
> + *   address
> + * @dev: device to check
> + *
> + * Returns %true if any DMA address for this device happens to also be a 
> valid
> + * physical address (not necessarily of the same page).
> + */
> +static inline bool dma_addr_is_phys_addr(struct device *dev)
> +{
> + /*
> +  * Secure guests always use the SWIOTLB, therefore DMA addresses are
> +  * actually the physical address of the bounce buffer.
> +  */
> + return is_secure_guest();
> +}
> +#endif
> +
>  #endif   /* _ASM_DMA_MAPPING_H */
> diff --git a/arch/powerpc/platforms/pseries/Kconfig 
> b/arch/powerpc/platforms/pseries/Kconfig
> index 9e35cdd..0108150 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -152,6 +152,7 @@ config PPC_SVM
>   select SWIOTLB
>   select ARCH_HAS_MEM_ENCRYPT
>   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> + select ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
>   help
>There are certain POWER platforms which support secure guests using
>the Protected Execution Facility, with the help of an Ultravisor
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index f7d1eea..6df5664 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -693,6 +693,26 @@ static inline bool dma_addressing_limited(struct device 
> *dev)
>   dma_get_required_mask(dev);
>  }
>  
> +#ifndef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
> +/**
> + * dma_addr_is_phys_addr - check whether a device DMA address is a physical
> + *   address
> + * @dev: device to check
> + *
> + * Returns %true if any DMA address for this device happens to also be a 
> valid
> + * physical address (not necessarily of the same page).
> + */
> +static inline bool dma_addr_is_phys_addr(struct device *dev)
> +{
> + /*
> +  * Except in very specific setups, DMA addresses exist in a different
> +  * address space from CPU physical addresses and cannot be directly used
> +  * to reference system memory.
> +  */
> + return false;
> +}
> +#endif
> +
>  #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
>  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
>   const struct iommu_ops *iommu, bool coherent);
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 9decbba..6209b46 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -51,6 +51,9 @@ config ARCH_HAS_DMA_MMAP_PGPROT
>  config ARCH_HAS_FORCE_DMA_UNENCRYPTED
>   bool
>  
> +config ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
> + bool
> +
>  config DMA_NONCOHERENT_CACHE_SYNC
>   bool
>  

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 2/2] virtio_ring: Use DMA API if memory is encrypted

2019-10-13 Thread David Gibson
On Fri, Oct 11, 2019 at 06:25:19PM -0700, Ram Pai wrote:
> From: Thiago Jung Bauermann 
> 
> Normally, virtio enables DMA API with VIRTIO_F_IOMMU_PLATFORM, which must
> be set by both device and guest driver. However, as a hack, when DMA API
> returns physical addresses, guest driver can use the DMA API; even though
> device does not set VIRTIO_F_IOMMU_PLATFORM and just uses physical
> addresses.
> 
> Doing this works-around POWER secure guests for which only the bounce
> buffer is accessible to the device, but which don't set
> VIRTIO_F_IOMMU_PLATFORM due to a set of hypervisor and architectural bugs.
> To guard against platform changes, breaking any of these assumptions down
> the road, we check at probe time and fail if that's not the case.
> 
> cc: Benjamin Herrenschmidt 
> cc: David Gibson 
> cc: Michael Ellerman 
> cc: Paul Mackerras 
> cc: Michael Roth 
> cc: Alexey Kardashevskiy 
> cc: Jason Wang 
> cc: Christoph Hellwig 
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Ram Pai 
> Signed-off-by: Thiago Jung Bauermann 

Reviewed-by: David Gibson 

I don't know that this is the most elegant solution possible.  But
it's simple, gets the job done, and is pretty unlikely to cause mysterious
breakage down the road.
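
For the archives, my mental model of the probe-time check is roughly
the following (the actual patch may spell it differently, and dma_dev
below stands for whatever struct device virtio actually does DMA with):

    if (virtio_has_iommu_quirk(vdev) && force_dma_unencrypted(dma_dev)) {
        /*
         * Device bypasses the IOMMU but can't reach guest memory
         * directly.  That only works if DMA addresses are also
         * physical addresses, i.e. everything goes through the
         * SWIOTLB bounce buffer.
         */
        if (!dma_addr_is_phys_addr(dma_dev))
            return -EINVAL;    /* refuse to drive the device */

        use_dma_api = true;
    }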

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 2/2] virtio_ring: Use DMA API if memory is encrypted

2019-10-21 Thread David Gibson
On Tue, Oct 15, 2019 at 09:35:01AM +0200, Christoph Hellwig wrote:
> On Fri, Oct 11, 2019 at 06:25:19PM -0700, Ram Pai wrote:
> > From: Thiago Jung Bauermann 
> > 
> > Normally, virtio enables DMA API with VIRTIO_F_IOMMU_PLATFORM, which must
> > be set by both device and guest driver. However, as a hack, when DMA API
> > returns physical addresses, guest driver can use the DMA API; even though
> > device does not set VIRTIO_F_IOMMU_PLATFORM and just uses physical
> > addresses.
> 
> Sorry, but this is a complete bullshit hack.  Driver must always use
> the DMA API if they do DMA, and if virtio devices use physical addresses
> that needs to be returned through the platform firmware interfaces for
> the dma setup.  If you don't do that yet (which based on previous
> informations you don't), you need to fix it, and we can then quirk
> old implementations that already are out in the field.
> 
> In other words: we finally need to fix that virtio mess and not pile
> hacks on top of hacks.

Christoph, if I understand correctly, your objection isn't so much to
the proposed change here of itself, except insofar as it entrenches
virtio's existing code allowing it to either use the DMA api or bypass
it and use physical addresses directly.  Is that right, or have I
missed something?

Where do you envisage the decision to bypass the IOMMU being made?
The virtio spec more or less states that virtio devices use hypervisor
magic to access physical addresses directly, rather than using normal
DMA channels.  The F_IOMMU_PLATFORM flag then overrides that, since it
obviously won't work for hardware devices.

The platform code isn't really in a position to know that virtio
devices are (usually) magic.  So were you envisaging the virtio driver
explicitly telling the platform to use bypassing DMA operations?
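
For reference, the code that currently makes that decision lives in the
driver itself and is roughly this (from memory, so take the details
with a grain of salt):

    static bool vring_use_dma_api(struct virtio_device *vdev)
    {
        if (!virtio_has_iommu_quirk(vdev))
            return true;    /* VIRTIO_F_IOMMU_PLATFORM negotiated */

        /* Otherwise we're guessing; Xen always needs the DMA API */
        if (xen_domain())
            return true;

        return false;       /* legacy behaviour: bypass, use GPAs */
    }

So at the moment the "hypervisor magic" assumption is hard-coded in
virtio, and the platform doesn't get a say.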

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [RFC PATCH v5 5/5] vfio-pci: Allow to expose MSI-X table to userspace when safe

2017-08-09 Thread David Gibson
On Mon, Aug 07, 2017 at 05:25:48PM +1000, Alexey Kardashevskiy wrote:
> Some devices have a MSIX BAR not aligned to the system page size
> greater than 4K (like 64k for ppc64) which at the moment prevents
> such MMIO pages from being mapped to the userspace for the sake of
> the MSIX BAR content protection. If such page happens to share
> the same system page with some frequently accessed registers,
> the entire system page will be emulated which can seriously affect
> performance.
> 
> This allows mapping of MSI-X tables to userspace if hardware provides
> MSIX isolation via interrupt remapping or filtering; in other words
> allowing direct access to the MSIX BAR won't do any harm to other devices
> or cause spurious interrupts visible to the kernel.
> 
> This adds a wrapping helper to check if a capability is supported by
> an IOMMU group.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  include/linux/vfio.h |  1 +
>  drivers/vfio/pci/vfio_pci.c  | 20 +---
>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 -
>  drivers/vfio/vfio.c  | 15 +++
>  4 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 586809abb273..7110bca2fb60 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -46,6 +46,7 @@ struct vfio_device_ops {
>  
>  extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
>  extern void vfio_iommu_group_put(struct iommu_group *group, struct device 
> *dev);
> +extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long 
> cap);

This diff probably belongs in the earlier patch adding the function,
rather than here where it's first used.  Not worth respinning just for
that, though.

>  extern int vfio_add_group_dev(struct device *dev,
> const struct vfio_device_ops *ops,
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index d87a0a3cda14..c4c39ed64b1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device 
> *vdev,
>   struct vfio_region_info_cap_sparse_mmap *sparse;
>   size_t end, size;
>   int nr_areas = 2, i = 0, ret;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>   end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>  
> - /* If MSI-X table is aligned to the start or end, only one area */
> - if (((vdev->msix_offset & PAGE_MASK) == 0) ||
> + /*
> +  * If MSI-X table is allowed to mmap because of the capability
> +  * of IRQ remapping or aligned to the start or end, only one area
> +  */
> + if (is_msix_isolated ||
> + ((vdev->msix_offset & PAGE_MASK) == 0) ||
>   (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
>   nr_areas = 1;
>  
> @@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device 
> *vdev,
>  
>   sparse->nr_areas = nr_areas;
>  
> + if (is_msix_isolated) {
> + sparse->areas[i].offset = 0;
> + sparse->areas[i].size = end;
> + return 0;
> + }
> +
>   if (vdev->msix_offset & PAGE_MASK) {
>   sparse->areas[i].offset = 0;
>   sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
> @@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   unsigned int index;
>   u64 phys_len, req_len, pgoff, req_start;
>   int ret;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>   index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>  
> @@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   if (req_start + req_len > phys_len)
>   return -EINVAL;
>  
> - if (index == vdev->msix_bar) {
> + if (index == vdev->msix_bar && !is_msix_isolated) {
>   /*
>* Disallow mmaps overlapping the MSI-X table; users don't
>* get to touch this directly.  We could find somewhere
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c 
> b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 357243d76f10..7514206a5ea7 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
>  #include 
>

Re: [RFC PATCH v5 1/5] iommu: Add capabilities to a group

2017-08-09 Thread David Gibson
On Mon, Aug 07, 2017 at 05:25:44PM +1000, Alexey Kardashevskiy wrote:
> This introduces capabilities to IOMMU groups. The first defined
> capability is IOMMU_GROUP_CAP_ISOLATE_MSIX which tells the IOMMU
> group users that a particular IOMMU group is capable of MSIX message
> filtering; this is useful when deciding whether or not to allow mapping
> of MSIX table to the userspace. Various architectures will enable it
> when they decide that it is safe to do so.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

This seems like a reasonable concept that's probably useful for
something, whether or not it's the best approach for the problem at
hand.
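
Just to check I've understood the intended use: an IOMMU driver which
knows it can remap or filter MSI-X writes would do something like this
(sketch; the predicate is made up):

    if (iommu_hw_can_filter_msix(iommu))    /* hypothetical check */
        iommu_group_set_caps(group, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);

and then VFIO (via the wrapper added later in the series) keys off
iommu_group_is_capable(group, IOMMU_GROUP_CAP_ISOLATE_MSIX) when
deciding whether the MSI-X BAR may be mmapped.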

> ---
>  include/linux/iommu.h | 20 
>  drivers/iommu/iommu.c | 28 
>  2 files changed, 48 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 2cb54adc4a33..6b6f3c2f4904 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -155,6 +155,9 @@ struct iommu_resv_region {
>   enum iommu_resv_typetype;
>  };
>  
> +/* IOMMU group capabilities */
> +#define IOMMU_GROUP_CAP_ISOLATE_MSIX (1U)
> +
>  #ifdef CONFIG_IOMMU_API
>  
>  /**
> @@ -312,6 +315,11 @@ extern void *iommu_group_get_iommudata(struct 
> iommu_group *group);
>  extern void iommu_group_set_iommudata(struct iommu_group *group,
> void *iommu_data,
> void (*release)(void *iommu_data));
> +extern void iommu_group_set_caps(struct iommu_group *group,
> +  unsigned long clearcaps,
> +  unsigned long setcaps);
> +extern bool iommu_group_is_capable(struct iommu_group *group,
> +unsigned long cap);
>  extern int iommu_group_set_name(struct iommu_group *group, const char *name);
>  extern int iommu_group_add_device(struct iommu_group *group,
> struct device *dev);
> @@ -513,6 +521,18 @@ static inline void iommu_group_set_iommudata(struct 
> iommu_group *group,
>  {
>  }
>  
> +static inline void iommu_group_set_caps(struct iommu_group *group,
> + unsigned long clearcaps,
> + unsigned long setcaps)
> +{
> +}
> +
> +static inline bool iommu_group_is_capable(struct iommu_group *group,
> +   unsigned long cap)
> +{
> + return false;
> +}
> +
>  static inline int iommu_group_set_name(struct iommu_group *group,
>  const char *name)
>  {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3f6ea160afed..6b2c34fe2c3d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -52,6 +52,7 @@ struct iommu_group {
>   void (*iommu_data_release)(void *iommu_data);
>   char *name;
>   int id;
> + unsigned long caps;
>   struct iommu_domain *default_domain;
>   struct iommu_domain *domain;
>  };
> @@ -447,6 +448,33 @@ void iommu_group_set_iommudata(struct iommu_group 
> *group, void *iommu_data,
>  EXPORT_SYMBOL_GPL(iommu_group_set_iommudata);
>  
>  /**
> + * iommu_group_set_caps - Change the group capabilities
> + * @group: the group
> + * @clearcaps: capabilities mask to remove
> + * @setcaps: capabilities mask to add
> + *
> + * IOMMU groups can be capable of various features which device drivers
> + * may read and adjust the behavior.
> + */
> +void iommu_group_set_caps(struct iommu_group *group,
> + unsigned long clearcaps, unsigned long setcaps)
> +{
> + group->caps &= ~clearcaps;
> + group->caps |= setcaps;
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_set_caps);
> +
> +/**
> + * iommu_group_is_capable - Returns if a group capability is present
> + * @group: the group
> + */
> +bool iommu_group_is_capable(struct iommu_group *group, unsigned long cap)
> +{
> + return !!(group->caps & cap);
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_is_capable);
> +
> +/**
>   * iommu_group_set_name - set name for a group
>   * @group: the group
>   * @name: name

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-08-12 Thread David Gibson
On Sun, Aug 11, 2019 at 07:56:07AM +0200, Christoph Hellwig wrote:
> sev_active() is gone now in linux-next, at least as a global API.
> 
> And once again this is entirely going in the wrong direction.  The only
> way using the DMA API is going to work at all is if the device is ready
> for it.  So we need a flag on the virtio device, exposed by the
> hypervisor (or hardware for hw virtio devices) that says:  hey, I'm real,
> don't take a shortcut.

There still seems to be a failure to understand each other here.  The
limitation here simply *is not* a property of the device.  In fact,
it's effectively a property of the memory the virtio device would be
trying to access (because it's in secure mode, it can't be directly
accessed via the hypervisor).  There absolutely are cases where this
is a device property (a physical virtio device being the obvious one),
but this isn't one of them.

Unfortunately, we're kind of stymied by the feature negotiation model
of virtio.  AIUI the hypervisor / device presents a bunch of feature
bits of which the guest / driver selects a subset.

AFAICT we already kind of abuse this for the VIRTIO_F_IOMMU_PLATFORM,
because, to handle cases where it *is* a device limitation, we
assume that if the hypervisor presents VIRTIO_F_IOMMU_PLATFORM then
the guest *must* select it.

What we actually need here is for the hypervisor to present
VIRTIO_F_IOMMU_PLATFORM as available, but not required.  Then we need
a way for the platform core code to communicate to the virtio driver
that *it* requires the IOMMU to be used, so that the driver can select
the feature bit (or not) on that basis.
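
In pseudo-code, what I'm after is something like this in the virtio
core - the helper names are made up, the point is that the platform,
not the device, supplies the answer:

    if (arch_virtio_requires_iommu_platform()) {    /* e.g. secure guest */
        if (!virtio_device_offers(vdev, VIRTIO_F_IOMMU_PLATFORM))
            return -ENODEV;    /* can't operate safely, fail probe */

        /* otherwise the driver acks VIRTIO_F_IOMMU_PLATFORM as usual */
    }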

> And that means on power and s390 qemu will always have to set thos if
> you want to be ready for the ultravisor and co games.  It's not like we
> haven't been through this a few times before, have we?

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-08-13 Thread David Gibson
On Tue, Aug 13, 2019 at 03:26:17PM +0200, Christoph Hellwig wrote:
> On Mon, Aug 12, 2019 at 07:51:56PM +1000, David Gibson wrote:
> > AFAICT we already kind of abuse this for the VIRTIO_F_IOMMU_PLATFORM,
> > because to handle for cases where it *is* a device limitation, we
> > assume that if the hypervisor presents VIRTIO_F_IOMMU_PLATFORM then
> > the guest *must* select it.
> > 
> > What we actually need here is for the hypervisor to present
> > VIRTIO_F_IOMMU_PLATFORM as available, but not required.  Then we need
> > a way for the platform core code to communicate to the virtio driver
> > that *it* requires the IOMMU to be used, so that the driver can select
> > or not the feature bit on that basis.
> 
> I agree with the above, but that just brings us back to the original
> issue - the whole bypass of the DMA OPS should be an option that the
> device can offer, not the other way around.  And we really need to
> fix that root cause instead of doctoring around it.

I'm not exactly sure what you mean by "device" in this context.  Do
you mean the hypervisor (qemu) side implementation?

You're right that this was the wrong way around to begin with, but as
well as being hard to change now, I don't see how it really addresses
the current problem.  The device could default to IOMMU and allow
bypass, but the driver would still need to get information from the
platform to know that it *can't* accept that option in the case of a
secure VM.  Reversed sense, but the same basic problem.

The hypervisor does not, and can not be aware of the secure VM
restrictions - only the guest side platform code knows that.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-30 Thread David Gibson
> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used with
> + * IOMMU_IOAS_MAP.
> + */
> +struct iommu_ioas_copy {
> + __u32 size;
> + __u32 flags;
> + __u32 dst_ioas_id;
> + __u32 src_ioas_id;
> + __aligned_u64 length;
> + __aligned_u64 dst_iova;
> + __aligned_u64 src_iova;
> +};
> +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)

Since it can only copy a single mapping, what's the benefit of this
over just repeating an IOAS_MAP in the new IOAS?

> +/**
> + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @ioas_id: IOAS ID to change the mapping of
> + * @iova: IOVA to start the unmapping at
> + * @length: Number of bytes to unmap
> + *
> + * Unmap an IOVA range. The iova/length must exactly match a range
> + * used with IOMMU_IOAS_PAGETABLE_MAP, or be the values 0 & U64_MAX.
> + * In the latter case all IOVAs will be unmaped.
> + */
> +struct iommu_ioas_unmap {
> + __u32 size;
> + __u32 ioas_id;
> + __aligned_u64 iova;
> + __aligned_u64 length;
> +};
> +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
>  #endif

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-27 Thread David Gibson
On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> 
> > > +/**
> > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > + * @ioas_id: IOAS ID to read ranges from
> > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > + * @__reserved: Must be 0
> > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the 
> > > smaller
> > > + *   of out_num_iovas or the length implied by size.
> > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > + *
> > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these 
> > > ranges is
> > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > + * size should include the allocated flex array.
> > > + */
> > > +struct iommu_ioas_iova_ranges {
> > > + __u32 size;
> > > + __u32 ioas_id;
> > > + __u32 out_num_iovas;
> > > + __u32 __reserved;
> > > + struct iommu_valid_iovas {
> > > + __aligned_u64 start;
> > > + __aligned_u64 last;
> > > + } out_valid_iovas[];
> > > +};
> > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > 
> > Is the information returned by this valid for the lifetime of the IOAS,
> > or can it change?  If it can change, what events can change it?
> >
> > If it *can't* change, then how do we have enough information to
> > determine this at ALLOC time, since we don't necessarily know which
> > (if any) hardware IOMMU will be attached to it.
> 
> It is a good point worth documenting. It can change. Particularly
> after any device attachment.

Right.. this is vital and needs to be front and centre in the
comments/docs here.  Really, I think an interface that *doesn't* have
magically changing status would be better (which is why I was
advocating that the user set the constraints, and the kernel either
supplied them or failed outright).  Still, I recognize that has its own problems.

> I added this:
> 
>  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these 
> ranges
>  * is not allowed. out_num_iovas will be set to the total number of iovas and
>  * the out_valid_iovas[] will be filled in as space permits. size should 
> include
>  * the allocated flex array.
>  *
>  * The allowed ranges are dependent on the HW path the DMA operation takes, 
> and
>  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
>  * full range, and each attached device will narrow the ranges based on that
>  * devices HW restrictions.

I think you need to be even more explicit about this: which exact
operations on the fd can invalidate exactly which items in the
information from this call?  Can it only ever be narrowed, or can it
be broadened with any operations?
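
For what it's worth, the usage pattern userspace ends up with looks
something like the below - I'm assuming the kernel still fills
out_num_iovas when the supplied array is too small, which is also
worth stating explicitly:

    struct iommu_ioas_iova_ranges hdr = {
        .size = sizeof(hdr),
        .ioas_id = ioas_id,
    };

    /* first call just to learn out_num_iovas */
    ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, &hdr);

    size_t sz = sizeof(hdr) +
                hdr.out_num_iovas * sizeof(struct iommu_valid_iovas);
    struct iommu_ioas_iova_ranges *r = calloc(1, sz);

    r->size = sz;
    r->ioas_id = ioas_id;
    ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, r);

    /* ...and all of this must be redone after every ATTACH */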

> > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > 
> > Since it can only copy a single mapping, what's the benefit of this
> > over just repeating an IOAS_MAP in the new IOAS?
> 
> It causes the underlying pin accounting to be shared and can avoid
> calling GUP entirely.

If that's the only purpose, then that needs to be right here in the
comments too.  So is the expected best practice to IOAS_MAP everything you
might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
mappings you actually end up wanting into the "real" IOASes for use?
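
i.e. something like this (wrapper names are mine, not the real ioctls):

    /* once, up front: pin and account the whole guest RAM block */
    scratch = ioas_alloc(iommufd);
    ioas_map(iommufd, scratch, guest_ram_hva, gpa_base, guest_ram_size);

    /* per device / per vIOMMU domain, no further GUP or accounting */
    domain_ioas = ioas_alloc(iommufd);
    ioas_copy(iommufd, domain_ioas, iova, scratch, gpa, len);

If so, that pattern really wants to be spelled out next to IOAS_COPY.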

Seems like it would be nicer for the interface to just figure it out
for you: I can see there being sufficient complications with that to
have this slightly awkward interface, but I think it needs a rationale
to accompany it.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-04-28 Thread David Gibson
this because the interface was completely broken for most of its
lifetime.  EEH is a fancy error handling feature of IBM PCI hardware
somewhat similar in concept, though not interface, to PCIe AER.  I have
a very strong impression that while this was a much-touted checkbox
feature for RAS, no-one, ever, actually used it.  As evidenced by the
fact that there was, I believe, over a *decade* in which all the
interfaces were completely broken by design, and apparently no-one
noticed.

So, cynically, you could probably get away with making this a no-op as
well.  If you wanted to do it properly... well... that would require
training up yet another person to actually understand this and hoping
they get it done before they run screaming.  This one gets very ugly
because the EEH operations have to operate on the hardware (or
firmware) "Partitionable Endpoints" (PEs) which correspond one to one
with IOMMU groups, but not necessarily with VFIO containers, and
there's not really any sensible way to expose that to users.

You might be able to do this by simply failing this outright if
there's anything other than exactly one IOMMU group bound to the
container / IOAS (which I think might be what VFIO itself does now).
Handling that with a device centric API gets somewhat fiddlier, of
course.
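
i.e. the crude version is just something like (sketch, helper name
made up):

    if (iommufd_ioas_group_count(ioas) != 1)
        return -ENOTTY;    /* EEH only makes sense for a single group/PE */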

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-04-28 Thread David Gibson
On Fri, Apr 29, 2022 at 01:21:30AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 28, 2022 11:11 PM
> > 
> > 
> > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > 2 IOVA
> > > windows, which aren't contiguous with each other.  The base addresses
> > > of each of these are fixed, but the size of each window, the pagesize
> > > (i.e. granularity) of each window and the number of levels in the
> > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > changes in how these windows are handled.  Typical Linux guests will
> > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > can't count on that; the guest can use them however it wants.
> > 
> > As part of nesting iommufd will have a 'create iommu_domain using
> > iommu driver specific data' primitive.
> > 
> > The driver specific data for PPC can include a description of these
> > windows so the PPC specific qemu driver can issue this new ioctl
> > using the information provided by the guest.
> > 
> > The main issue is that internally to the iommu subsystem the
> > iommu_domain aperture is assumed to be a single window. This kAPI will
> > have to be improved to model the PPC multi-window iommu_domain.
> > 
> 
> From the point of nesting probably each window can be a separate
> domain then the existing aperture should still work?

Maybe.  There might be several different ways to represent it, but the
vital piece is that any individual device (well, group, technically)
must atomically join/leave both windows at once.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-28 Thread David Gibson
On Thu, Apr 28, 2022 at 11:22:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> > On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > > 
> > > > > +/**
> > > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > > + * @ioas_id: IOAS ID to read ranges from
> > > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > > + * @__reserved: Must be 0
> > > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is 
> > > > > the smaller
> > > > > + *   of out_num_iovas or the length implied by size.
> > > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > > + *
> > > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside 
> > > > > these ranges is
> > > > > + * not allowed. out_num_iovas will be set to the total number of 
> > > > > iovas
> > > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > > + * size should include the allocated flex array.
> > > > > + */
> > > > > +struct iommu_ioas_iova_ranges {
> > > > > + __u32 size;
> > > > > + __u32 ioas_id;
> > > > > + __u32 out_num_iovas;
> > > > > + __u32 __reserved;
> > > > > + struct iommu_valid_iovas {
> > > > > + __aligned_u64 start;
> > > > > + __aligned_u64 last;
> > > > > + } out_valid_iovas[];
> > > > > +};
> > > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, 
> > > > > IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > > 
> > > > Is the information returned by this valid for the lifetime of the IOAS,
> > > > or can it change?  If it can change, what events can change it?
> > > >
> > > > If it *can't* change, then how do we have enough information to
> > > > determine this at ALLOC time, since we don't necessarily know which
> > > > (if any) hardware IOMMU will be attached to it.
> > > 
> > > It is a good point worth documenting. It can change. Particularly
> > > after any device attachment.
> > 
> > Right.. this is vital and needs to be front and centre in the
> > comments/docs here.  Really, I think an interface that *doesn't* have
> > magically changing status would be better (which is why I was
> > advocating that the user set the constraints, and the kernel supplied
> > or failed outright).  Still I recognize that has its own problems.
> 
> That is a neat idea, it could be a nice option, it lets userspace
> further customize the kernel allocator.
> 
> But I don't have a use case in mind? The simplified things I know
> about want to attach their devices then allocate valid IOVA, they
> don't really have a notion about what IOVA regions they are willing to
> accept, or necessarily do hotplug.

The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
emulation code knows the IOVA windows that are expected of the vIOMMU
(because that's a property of the emulated platform), and requests
them of the host IOMMU.  If the host can supply that, you're good
(this doesn't necessarily mean the host windows match exactly, just
that the requested windows fit within the host windows).  If not,
you report an error.  This can be done at any point when the host
windows might change - so try to attach a device that can't support
the requested windows, and it will fail.  Attach a device which
shrinks the windows but still fits the requested windows within them,
and you're still good to go.
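
Schematically, all qemu has to do with the range information is a
containment check along these lines (sketch):

    struct win { uint64_t start, last; };

    /*
     * Every window promised to the guest must fit inside some window
     * the host IOMMU can actually provide.
     */
    static bool windows_ok(struct win *req, int nreq,
                           struct win *host, int nhost)
    {
        for (int i = 0; i < nreq; i++) {
            bool fits = false;

            for (int j = 0; j < nhost; j++)
                if (req[i].start >= host[j].start &&
                    req[i].last <= host[j].last)
                    fits = true;

            if (!fits)
                return false;    /* fail realize / reject hotplug */
        }
        return true;
    }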

For a typical direct userspace case you don't want that.  However, it
probably *does* make sense for userspace to specify how large a window
it wants.  So some form that allows you to specify size without base
address also makes sense.  In that case the kernel would set a base
address according to the host IOMMU's capabilities, or fail if it
can't supply any window of the requested size.  When to allocate that
base address is a bit unclear though.  If you do it at window request
time, then you might pick something that a later device can't work
with.  If you do it later, it's less clear how to sensibly report it
t

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-04-29 Thread David Gibson
On Thu, Apr 28, 2022 at 12:10:37PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote:
> 
> > 2) Costly GUPs.  pseries (the most common ppc machine type) always
> > expects a (v)IOMMU.  That means that unlike the common x86 model of a
> > host with IOMMU, but guests with no-vIOMMU, guest initiated
> > maps/unmaps can be a hot path.  Accounting in that path can be
> > prohibitive (and on POWER8 in particular it prevented us from
> > optimizing that path the way we wanted).  We had two solutions for
> > that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> > based on the IOVA window sizes.  That was improved in the v2 which
> > used the concept of preregistration.  IIUC iommufd can achieve the
> > same effect as preregistration using IOAS_COPY, so this one isn't
> > really a problem either.
> 
> I think PPC and S390 are solving the same problem here. I think S390
> is going to go to a SW nested model where it has an iommu_domain
> controlled by iommufd that is populated with the pinned pages, eg
> stored in an xarray.
> 
> Then the performance map/unmap path is simply copying pages from the
> xarray to the real IOPTEs - and this would be modeled as a nested
> iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> 
> Perhaps this is agreeable for PPC too?

Uh.. maybe?  Note that I'm making these comments based on working on
this some years ago (the initial VFIO for ppc implementation in
particular).  I'm no longer actively involved in ppc kernel work.

> > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > windows, which aren't contiguous with each other.  The base addresses
> > of each of these are fixed, but the size of each window, the pagesize
> > (i.e. granularity) of each window and the number of levels in the
> > IOMMU pagetable are runtime configurable.  Because it's true in the
> > hardware, it's also true of the vIOMMU interface defined by the IBM
> > hypervisor (and adpoted by KVM as well).  So, guests can request
> > changes in how these windows are handled.  Typical Linux guests will
> > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > can't count on that; the guest can use them however it wants.
> 
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
> 
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.

Hmm.. not sure if that works.  At the moment, qemu (for example) needs
to set up the domains/containers/IOASes as it constructs the machine,
because that's based on the virtual hardware topology.  Initially they
use the default windows (0..2GiB first window, second window
disabled).  Only once the guest kernel is up and running does it issue
the hypercalls to set the final windows as it prefers.  In theory the
guest could change them during runtime though it's unlikely in
practice.  They could change during machine lifetime in practice,
though, if you rebooted from one guest kernel to another that uses a
different configuration.

*Maybe* IOAS construction can be deferred somehow, though I'm not sure
because the assigned devices need to live somewhere.

> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window. This kAPI will
> have to be improved to model the PPC multi-window iommu_domain.

Right.

> If this API is not used then the PPC driver should choose some
> sensible default windows that makes things like DPDK happy.
> 
> > Then, there's handling existing qemu (or other software) that is using
> > the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> > should be a goal or not: as others have noted, working actively to
> > port qemu to the new interface at the same time as making a
> > comprehensive in-kernel compat layer is arguably redundant work.
> 
> At the moment I think I would stick with not including the SPAPR
> interfaces in vfio_compat, but there does seem to be a path if someone
> with HW wants to build and test them?
> 
> > You might be able to do this by simply failing this outright if
> > there's anything other than exactly one IOMMU group bound to the
> > container / IOAS (which I think might be what VFIO itself does now).
> > Handling that with a device centric API gets somewhat fiddlier, of
> > course.
> 
> Maybe every device gets a co

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-30 Thread David Gibson
On Fri, Apr 29, 2022 at 09:54:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:00:14PM +1000, David Gibson wrote:
> > > But I don't have a use case in mind? The simplified things I know
> > > about want to attach their devices then allocate valid IOVA, they
> > > don't really have a notion about what IOVA regions they are willing to
> > > accept, or necessarily do hotplug.
> > 
> > The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
> > emulation code knows the IOVA windows that are expected of the vIOMMU
> > (because that's a property of the emulated platform), and requests
> > them of the host IOMMU.  If the host can supply that, you're good
> > (this doesn't necessarily mean the host windows match exactly, just
> > that the requested windows fit within the host windows).  If not,
> > you report an error.  This can be done at any point when the host
> > windows might change - so try to attach a device that can't support
> > the requested windows, and it will fail.  Attaching a device which
> > shrinks the windows, but still fits the requested windows within, and
> > you're still good to go.
> 
> We were just talking about this in another area - Alex said that qemu
> doesn't know the IOVA ranges? Is there some vIOMMU cases where it does?

Uh.. what?  We certainly know (or, rather, choose) the IOVA ranges for
ppc.  That is to say we set up the default IOVA ranges at machine
construction (those defaults have changed with machine version a
couple of times).  If the guest uses dynamic DMA windows we then
update those ranges based on the hypercalls, but at any point we know
what the IOVA windows are supposed to be.  I don't really see how x86
or anything else could not know the IOVA ranges.  Who else *could* set
the ranges when implementing a vIOMMU in TCG mode?

In the non-vIOMMU case IOVA==GPA, so whatever qemu knows about
the GPA space it also knows about the IOVA space.  Which, come to
think of it, means memory hotplug also complicates things.

> Even if yes, qemu is able to manage this on its own - it doesn't use
> the kernel IOVA allocator, so there is not a strong reason to tell the
> kernel what the narrowed ranges are.

I don't follow.  The problem for the qemu case here is if you hotplug
a device which narrows down the range to something smaller than the
guest expects.  If qemu has told the kernel the ranges it needs, that
can just fail (which is the best you can do).  If the kernel adds the
device but narrows the ranges, then you may have just put the guest
into a situation where the vIOMMU cannot do what the guest expects it
to.  If qemu can only query the windows, not specify them then it
won't know that adding a particular device will conflict with its
guest side requirements until after it's already added.  That could
mess up concurrent guest initiated map operations for existing devices
in the same guest side domain, so I don't think reversing the hotplug
after the problem is detected is enough.
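
Which is why I keep coming back to letting userspace state its
requirements up front.  Purely hypothetically - this ioctl does not
exist in the RFC - something like:

    struct iommu_ioas_require_iovas req = {
        .size = sizeof(req),
        .ioas_id = ioas_id,
        .num_iovas = 2,
        /*
         * .required[0] = the 32-bit PAPR window,
         * .required[1] = the 64-bit PAPR window
         */
    };
    ioctl(iommufd, IOMMU_IOAS_REQUIRE_IOVAS, &req);

After that, any attach that couldn't honour those ranges would simply
fail, instead of silently narrowing the usable IOVA space.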

> > > That is one possibility, yes. qemu seems to be using this to establish
> > > a clone ioas of an existing operational one which is another usage
> > > model.
> > 
> > Right, for qemu (or other hypervisors) the obvious choice would be to
> > create a "staging" IOAS where IOVA == GPA, then COPY that into the various
> > emulated bus IOASes.  For a userspace driver situation, I'm guessing
> > you'd map your relevant memory pool into an IOAS, then COPY to the
> > IOAS you need for whatever specific devices you're using.
> 
> qemu seems simpler, it juggled multiple containers so it literally
> just copies when it instantiates a new container and does a map in
> multi-container.

I don't follow you.  Are you talking about the vIOMMU or non-vIOMMU
case?  In the vIOMMU case the different containers can be for
different guest side iommu domains with different guest-IOVA spaces,
so you can't just copy from one to another.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-02 Thread David Gibson
On Fri, Apr 29, 2022 at 09:50:30AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote:
> > On Fri, Apr 29, 2022 at 01:21:30AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 28, 2022 11:11 PM
> > > > 
> > > > 
> > > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > > > 2 IOVA
> > > > > windows, which aren't contiguous with each other.  The base addresses
> > > > > of each of these are fixed, but the size of each window, the pagesize
> > > > > (i.e. granularity) of each window and the number of levels in the
> > > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > > > changes in how these windows are handled.  Typical Linux guests will
> > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > > can't count on that; the guest can use them however it wants.
> > > > 
> > > > As part of nesting iommufd will have a 'create iommu_domain using
> > > > iommu driver specific data' primitive.
> > > > 
> > > > The driver specific data for PPC can include a description of these
> > > > windows so the PPC specific qemu driver can issue this new ioctl
> > > > using the information provided by the guest.
> > > > 
> > > > The main issue is that internally to the iommu subsystem the
> > > > iommu_domain aperture is assumed to be a single window. This kAPI will
> > > > have to be improved to model the PPC multi-window iommu_domain.
> > > > 
> > > 
> > > From the point of nesting probably each window can be a separate
> > > domain then the existing aperture should still work?
> > 
> > Maybe.  There might be several different ways to represent it, but the
> > vital piece is that any individual device (well, group, technically)
> > must atomically join/leave both windows at once.
> 
> I'm not keen on the multi-iommu_domains because it means we have to
> create the idea that a device can be attached to multiple
> iommu_domains, which we don't have at all today.
> 
> Since iommu_domain allows PPC to implement its special rules, like the
> atomicness above.

I tend to agree; I think extending the iommu domain concept to
incorporate multiple windows makes more sense than extending to allow
multiple domains per device.  I'm just saying there might be other
ways of representing this, and that's not a sticking point for me as
long as the right properties can be preserved.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-02 Thread David Gibson
On Fri, Apr 29, 2022 at 09:48:38AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote:
> 
> > > I think PPC and S390 are solving the same problem here. I think S390
> > > is going to go to a SW nested model where it has an iommu_domain
> > > controlled by iommufd that is populated with the pinned pages, eg
> > > stored in an xarray.
> > > 
> > > Then the performance map/unmap path is simply copying pages from the
> > > xarray to the real IOPTEs - and this would be modeled as a nested
> > > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> > > 
> > > Perhaps this is agreeable for PPC too?
> > 
> > Uh.. maybe?  Note that I'm making these comments based on working on
> > this some years ago (the initial VFIO for ppc implementation in
> > particular).  I'm no longer actively involved in ppc kernel work.
> 
> OK
>  
> > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 
> > > > IOVA
> > > > windows, which aren't contiguous with each other.  The base addresses
> > > > of each of these are fixed, but the size of each window, the pagesize
> > > > (i.e. granularity) of each window and the number of levels in the
> > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > > changes in how these windows are handled.  Typical Linux guests will
> > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > can't count on that; the guest can use them however it wants.
> > > 
> > > As part of nesting iommufd will have a 'create iommu_domain using
> > > iommu driver specific data' primitive.
> > > 
> > > The driver specific data for PPC can include a description of these
> > > windows so the PPC specific qemu driver can issue this new ioctl
> > > using the information provided by the guest.
> > 
> > Hmm.. not sure if that works.  At the moment, qemu (for example) needs
> > to set up the domains/containers/IOASes as it constructs the machine,
> > because that's based on the virtual hardware topology.  Initially they
> > use the default windows (0..2GiB first window, second window
> > disabled).  Only once the guest kernel is up and running does it issue
> > the hypercalls to set the final windows as it prefers.  In theory the
> > guest could change them during runtime though it's unlikely in
> > practice.  They could change during machine lifetime in practice,
> > though, if you rebooted from one guest kernel to another that uses a
> > different configuration.
> > 
> > *Maybe* IOAS construction can be deferred somehow, though I'm not sure
> > because the assigned devices need to live somewhere.
> 
> This is a general requirement for all the nesting implementations, we
> start out with some default nested page table and then later the VM
> does the vIOMMU call to change it. So nesting will have to come along
> with some kind of 'switch domains IOCTL'
> 
> In this case I would guess PPC could do the same and start out with a
> small (nested) iommu_domain and then create the VM's desired
> iommu_domain from the hypercall, and switch to it.
> 
> It is a bit more CPU work since maps in the lower range would have to
> be copied over, but conceptually the model matches the HW nesting.

Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
fixed windows, but we fake dynamic windows in the userspace
implementation by flipping the devices over to a new IOAS with the new
windows.  Is that right?
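
If so, the sequence on a guest window-change hypercall would be
roughly this (made-up wrapper names):

    new_ioas = ioas_alloc_with_windows(iommufd, win0, win1);
    ioas_copy_all_mappings(iommufd, new_ioas, old_ioas);

    /* all devices in the PE must move together */
    for_each_vfio_device_in_pe(dev) {
        detach(dev, old_ioas);
        attach(dev, new_ioas);
    }

    ioas_destroy(iommufd, old_ioas);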

Where exactly would the windows be specified?  My understanding was
that when creating a back-end specific IOAS, that would typically be
for the case where you're using a user / guest managed IO pagetable,
with the backend specifying the format for that.  In the ppc case we'd
need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
operations to manage the mappings.  The PAPR vIOMMU is
paravirtualized, so all updates come via hypercalls, so there's no
user/guest managed data structure.

That should work from the point of view of the userspace and guest
side interfaces.  It might be fiddly from the point of view of the
back end.  The ppc iommu doesn't really have the notion of
configurable domains

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-05 Thread David Gibson
On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:
> 
> > > It is a bit more CPU work since maps in the lower range would have to
> > > be copied over, but conceptually the model matches the HW nesting.
> > 
> > Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
> > fixed windows, but we fake dynamic windows in the userspace
> > implementation by flipping the devices over to a new IOAS with the new
> > windows.  Is that right?
> 
> Yes
> 
> > Where exactly would the windows be specified?  My understanding was
> > that when creating a back-end specific IOAS, that would typically be
> > for the case where you're using a user / guest managed IO pagetable,
> > with the backend specifying the format for that.  In the ppc case we'd
> > need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> > operations to manage the mappings.  The PAPR vIOMMU is
> > paravirtualized, so all updates come via hypercalls, so there's no
> > user/guest managed data structure.
> 
> When the iommu_domain is created I want to have a
> iommu-driver-specific struct, so PPC can customize its iommu_domain
> however it likes.

This requires that the client be aware of the host side IOMMU model.
That's true in VFIO now, and it's nasty; I was really hoping we could
*stop* doing that.

Note that I'm talking here *purely* about the non-optimized case where
all updates to the host side IO pagetables are handled by IOAS_MAP /
IOAS_COPY, with no direct hardware access to user or guest managed IO
pagetables.  The optimized case obviously requires end-to-end
agreement on the pagetable format amongst other domain properties.

What I'm hoping is that qemu (or whatever) can use this non-optimized
mode as a fallback case where it doesn't need to know the properties of
whatever host side IOMMU models there are.  It just requests what it
needs based on the vIOMMU properties it needs to replicate and the
host kernel either can supply it or can't.

In many cases it should be perfectly possible to emulate a PPC style
vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
aperture that it will encompass wherever the ppc apertures end
up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
host (currently somewhere between awkward and impossible) by placing
the host apertures to cover guest memory.

Admittedly those are pretty niche cases, but allowing for them gives
us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
the future, and AFAICT, ARM are much less conservative than x86 about
maintaining similar hw interfaces over time.  That's why I think
considering these ppc cases will give a more robust interface for
other future possibilities as well.

> > That should work from the point of view of the userspace and guest
> > side interfaces.  It might be fiddly from the point of view of the
> > back end.  The ppc iommu doesn't really have the notion of
> > configurable domains - instead the address spaces are the hardware or
> > firmware fixed PEs, so they have a fixed set of devices.  At the bare
> > metal level it's possible to sort of do domains by making the actual
> > pagetable pointers for several PEs point to a common place.
> 
> I'm not sure I understand this - a domain is just a storage container
> for an IO page table, if the HW has IOPTEs then it should be able to
> have a domain?
> 
> Making page table pointers point to a common IOPTE tree is exactly
> what iommu_domains are for - why is that "sort of" for ppc?

Ok, fair enough, it's only "sort of" in the sense that the hw specs /
docs don't present any equivalent concept.

> > However, in the future, nested KVM under PowerVM is likely to be the
> > norm.  In that situation the L1 as well as the L2 only has the
> > paravirtualized interfaces, which don't have any notion of domains,
> > only PEs.  All updates take place via hypercalls which explicitly
> > specify a PE (strictly speaking they take a "Logical IO Bus Number"
> > (LIOBN), but those generally map one to one with PEs), so it can't use
> > shared pointer tricks either.
> 
> How does the paravirtualized interfaces deal with the page table? Does
> it call a map/unmap hypercall instead of providing guest IOPTEs?

Sort of.  The main interface is H_PUT_TCE ("TCE" - Translation Control
Entry - being IBMese for an IOPTE). This takes an LIOBN (which selects
which PE and aperture), an IOVA and a TCE value - which is a guest
physical address plus some permission bits.  The
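
Schematically, handling one such hypercall is no more than the
following; the window struct and the io_pt_*() helpers are invented
stand-ins for whatever actually backs the host IO pagetable, not real
PAPR or kernel symbols:

#include <stdint.h>
#include <errno.h>

#define TCE_READ   0x1ULL
#define TCE_WRITE  0x2ULL
#define TCE_PERMS  (TCE_READ | TCE_WRITE)

/* invented stand-in for the window/aperture a LIOBN selects */
struct tce_window {
    uint64_t iova_base;
    uint64_t iova_size;
    unsigned page_shift;
};

/* invented stand-ins for the host IO pagetable update */
int io_pt_map(struct tce_window *w, uint64_t iova, uint64_t gpa,
              uint64_t perms);
int io_pt_unmap(struct tce_window *w, uint64_t iova);

/* one H_PUT_TCE maps (or clears) exactly one IO page */
static long h_put_tce(struct tce_window *w, uint64_t ioba, uint64_t tce)
{
    uint64_t page_mask = (1ULL << w->page_shift) - 1;
    uint64_t gpa   = tce & ~page_mask;   /* guest physical address */
    uint64_t perms = tce & TCE_PERMS;    /* read/write bits */

    if (ioba < w->iova_base || ioba - w->iova_base >= w->iova_size)
        return -EINVAL;                  /* outside this LIOBN's aperture */

    return perms ? io_pt_map(w, ioba, gpa, perms) : io_pt_unmap(w, ioba);
}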

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-08 Thread David Gibson
On Fri, May 06, 2022 at 10:42:21AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Friday, May 6, 2022 1:25 PM
> > 
> > >
> > > When the iommu_domain is created I want to have a
> > > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > > however it likes.
> > 
> > This requires that the client be aware of the host side IOMMU model.
> > That's true in VFIO now, and it's nasty; I was really hoping we could
> > *stop* doing that.
> 
> that model is anyway inevitable when talking about user page table,

Right, but I'm explicitly not talking about the user managed page
table case.  I'm talking about the case where the IO pagetable is
still managed by the kernel and we update it via IOAS_MAP and similar
operations.

> i.e. when nesting is enabled.

I don't really follow the connection you're drawing between a user
managed table and nesting.

> > Note that I'm talking here *purely* about the non-optimized case where
> > all updates to the host side IO pagetables are handled by IOAS_MAP /
> > IOAS_COPY, with no direct hardware access to user or guest managed IO
> > pagetables.  The optimized case obviously requires end-to-end
> > agreement on the pagetable format amongst other domain properties.
> > 
> > What I'm hoping is that qemu (or whatever) can use this non-optimized
> > as a fallback case where it doesn't need to know the properties of
> > whatever host side IOMMU models there are.  It just requests what it
> > needs based on the vIOMMU properties it needs to replicate and the
> > host kernel either can supply it or can't.
> > 
> > In many cases it should be perfectly possible to emulate a PPC style
> > vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
> > aperture that it will encompass wherever the ppc apertures end
> > up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
> > host (currently somewhere between awkward and impossible) by placing
> > the host apertures to cover guest memory.
> > 
> > Admittedly those are pretty niche cases, but allowing for them gives
> > us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> > the future, and AFAICT, ARM are much less conservative than x86 about
> > maintaining similar hw interfaces over time.  That's why I think
> > considering these ppc cases will give a more robust interface for
> > other future possibilities as well.
> 
> It's not niche cases. We already have virtio-iommu which can work
> on both ARM and x86 platforms, i.e. what current iommufd provides
> is already generic enough except on PPC.
> 
> Then IMHO the key open here is:
> 
> Can PPC adapt to the current iommufd proposal if it can be
> refactored to fit the standard iommu domain/group concepts?

Right...  and I'm still trying to figure out whether it can adapt to
either part of that.  We absolutely need to allow for multiple IOVA
apertures within a domain.  If we have that I *think* we can manage
(if suboptimally), but I'm trying to figure out the corner cases to
make sure I haven't missed something.

> If not, what is the remaining gap after PPC becomes a normal
> citizen in the iommu layer and is it worth solving it in the general
> interface or via iommu-driver-specific domain (given this will
> exist anyway)?
> 
> to close that open I'm with Jason:
> 
>"Fundamentally PPC has to fit into the iommu standard framework of
>group and domains, we can talk about modifications, but drifting too
>far away is a big problem."
> 
> Directly jumping to the iommufd layer for what changes might be
> applied to all platforms sounds counter-intuitive if we haven't tried 
> to solve the gap in the iommu layer in the first place, as even
> there is argument that certain changes in iommufd layer can find
> matching concept on other platforms it still sort of looks redundant
> since those platforms already work with the current model.

I don't really follow what you're saying here.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-08 Thread David Gibson
On Fri, May 06, 2022 at 09:48:37AM -0300, Jason Gunthorpe wrote:
> On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> > On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> 
> > > When the iommu_domain is created I want to have a
> > > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > > however it likes.
> > 
> > This requires that the client be aware of the host side IOMMU model.
> > That's true in VFIO now, and it's nasty; I was really hoping we could
> > *stop* doing that.
> 
> iommufd has two modes, the 'generic interface' which what this patch
> series shows that does not require any device specific knowledge.

Right, and I'm speaking specifically to that generic interface.  But
I'm thinking particularly about the qemu case where we do have
specific knowledge of the *guest* vIOMMU, but we want to avoid having
specific knowledge of the host IOMMU, because they might not be the same.

It would be good to have a way of seeing if the guest vIOMMU can be
emulated on this host IOMMU without qemu having to have separate
logic for every host IOMMU.

> The default iommu_domain that the iommu driver creates will be used
> here, it is up to the iommu driver to choose something reasonable for
> use by applications like DPDK. ie PPC should probably pick its biggest
> x86-like aperture.

So, using the big aperture means a very high base IOVA
(1<<59)... which means that it won't work at all if you want to attach
any devices that aren't capable of 64-bit DMA.  Using the maximum
possible window size would mean we either potentially waste a lot of
kernel memory on pagetables, or we use unnecessarily large number of
levels to the pagetable.

Basically we don't have enough information to make a good decision
here.

More generally, the problem with the interface advertising limitations
and it being up to userspace to work out if those are ok or not is
that it's fragile.  It's pretty plausible that some future IOMMU model
will have some new kind of limitation that can't be expressed in the
query structure we invented now.  That means that to add support for
that we need some kind of gate to prevent old userspace using the new
IOMMU (e.g. only allowing the new IOMMU to be used if userspace uses
newly added queries to get the new limitations).  That's true even if
what userspace was actually doing with the IOMMU would fit just fine
into those new limitations.

But if userspace requests the capabilities it wants, and the kernel
acks or nacks that, we can support the new host IOMMU with existing
software just fine.  They won't be able to use any *new* features or
capabilities of the new hardware, of course, but they'll be able to
use what it does that overlaps with what they needed before.

ppc - or more correctly, the POWER and PAPR IOMMU models - is just
acting here as an example of an IOMMU with limitations and
capabilities that don't fit into the current query model.
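
As a strawman, the request could be as simple as the following; every
name and field here is invented purely to show the shape of the
negotiation, none of it is proposed uAPI:

#include <linux/types.h>

/* Application states what it needs; the kernel acks or nacks it. */
struct ioas_require_caps {
    __u32 size;                   /* sizeof(struct ioas_require_caps) */
    __u32 ioas_id;
    __aligned_u64 min_iova_size;  /* total IOVA space the app needs */
    __aligned_u64 base_iova;      /* only meaningful with FIXED_BASE */
    __aligned_u64 min_page_size;  /* smallest granule it must map */
    __u32 flags;
#define IOAS_REQUIRE_FIXED_BASE (1 << 0)
    __u32 __reserved;
};

/*
 * ioctl(ioas_fd, IOAS_REQUIRE_CAPS, &req) == 0 would mean the kernel
 * guarantees these properties for the life of the IOAS; failure means
 * this host IOMMU can't provide them, and userspace finds out up front
 * instead of via a mysterious failure after devices are attached.
 */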

> The iommu-driver-specific struct is the "advanced" interface and
> allows a user-space IOMMU driver to tightly control the HW with full
> HW specific knowledge. This is where all the weird stuff that is not
> general should go.

Right, but forcing anything more complicated than "give me some IOVA
region" to go through the advanced interface means that qemu (or any
hypervisor where the guest platform need not identically match the
host) has to have n^2 complexity to match each guest IOMMU model to
each host IOMMU model.

> > Note that I'm talking here *purely* about the non-optimized case where
> > all updates to the host side IO pagetables are handled by IOAS_MAP /
> > IOAS_COPY, with no direct hardware access to user or guest managed IO
> > pagetables.  The optimized case obviously requires end-to-end
> > agreement on the pagetable format amongst other domain properties.
> 
> Sure, this is how things are already..
> 
> > What I'm hoping is that qemu (or whatever) can use this non-optimized
> > as a fallback case where it doesn't need to know the properties of
> > whatever host side IOMMU models there are.  It just requests what it
> > needs based on the vIOMMU properties it needs to replicate and the
> > host kernel either can supply it or can't.
> 
> There aren't really any negotiable vIOMMU properties beyond the
> ranges, and the ranges are not *really* negotiable.

Errr.. how do you figure?  On ppc the ranges and pagesizes are
definitely negotiable.  I'm not really familiar with other models, but
anything which allows *any* variations in the pagetable structure will
effectively have at least some negotiable properties.

Even if any individual host IOMMU doesn't have negotiable properties
(wh

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-10 Thread David Gibson
On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> 
> > > The default iommu_domain that the iommu driver creates will be used
> > > here, it is up to the iommu driver to choose something reasonable for
> > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > x86-like aperture.
> > 
> > So, using the big aperture means a very high base IOVA
> > (1<<59)... which means that it won't work at all if you want to attach
> > any devices that aren't capable of 64-bit DMA.
> 
> I'd expect to include the 32 bit window too..

I'm not entirely sure what you mean.  Are you working on the
assumption that we've extended to allowing multiple apertures, so we'd
default to advertising both a small/low aperture and a large/high
aperture?

> > Using the maximum possible window size would mean we either
> > potentially waste a lot of kernel memory on pagetables, or we use
> > unnecessarily large number of levels to the pagetable.
> 
> All drivers have this issue to one degree or another. We seem to be
> ignoring it - in any case this is a micro optimization, not a
> functional need?

Ok, fair point.

> > More generally, the problem with the interface advertising limitations
> > and it being up to userspace to work out if those are ok or not is
> > that it's fragile.  It's pretty plausible that some future IOMMU model
> > will have some new kind of limitation that can't be expressed in the
> > query structure we invented now.
> 
> The basic API is very simple - the driver needs to provide ranges of
> IOVA and map/unmap - I don't think we have a future problem here we
> need to try and guess and solve today.

Well.. maybe.  My experience of encountering hardware doing weird-arse
stuff makes me less sanguine.

> Even PPC fits this just fine, the open question for DPDK is more
> around optimization, not functional.
> 
> > But if userspace requests the capabilities it wants, and the kernel
> > acks or nacks that, we can support the new host IOMMU with existing
> > software just fine.
> 
> No, this just makes it fragile in the other direction because now
> userspace has to know what platform specific things to ask for *or it
> doesn't work at all*. This is not an improvement for the DPDK cases.

Um.. no.  The idea is that userspace requests *what it needs*, not
anything platform specific.  In the case of DPDK that would be nothing
more than the (minimum) aperture size.  Nothing platform specific
about that.

> Kernel decides, using all the kernel knowledge it has and tells the
> application what it can do - this is the basic simplified interface.
> 
> > > The iommu-driver-specific struct is the "advanced" interface and
> > > allows a user-space IOMMU driver to tightly control the HW with full
> > > HW specific knowledge. This is where all the weird stuff that is not
> > > general should go.
> > 
> > Right, but forcing anything more complicated than "give me some IOVA
> > region" to go through the advanced interface means that qemu (or any
> > hypervisor where the guest platform need not identically match the
> > host) has to have n^2 complexity to match each guest IOMMU model to
> > each host IOMMU model.
> 
> I wouldn't say n^2, but yes, qemu needs to have a userspace driver for
> the platform IOMMU, and yes it needs this to reach optimal
> behavior. We already know this is a hard requirement for using nesting
> as acceleration, I don't see why apertures are so different.

For one thing, because we only care about optimal behaviour on the
host ~= guest KVM case.  That means it's not n^2, just (roughly) one
host driver for each matching guest driver.  I'm considering the
general X on Y case - we don't need to optimize it, but it would be
nice for it to work without considering every combination separately.

> > Errr.. how do you figure?  On ppc the ranges and pagesizes are
> > definitely negotiable.  I'm not really familiar with other models, but
> > anything which allows *any* variations in the pagetable structure will
> > effectively have at least some negotiable properties.
> 
> As above, if you ask for the wrong thing then you don't get
> anything. If DPDK asks for something that works on ARM like 0 -> 4G
> then PPC and x86 will always fail. How is this improving anything to
> require applications to carefully ask for exactly the right platform
> specific ranges?

Hm, looks like I didn't sufficiently emphasize that the base address
would be optional for userspace to supply.  So userspace 

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-10 Thread David Gibson
On Tue, May 10, 2022 at 04:00:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > > 
> > > > > The default iommu_domain that the iommu driver creates will be used
> > > > > here, it is up to the iommu driver to choose something reasonable for
> > > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > > x86-like aperture.
> > > > 
> > > > So, using the big aperture means a very high base IOVA
> > > > (1<<59)... which means that it won't work at all if you want to attach
> > > > any devices that aren't capable of 64-bit DMA.
> > > 
> > > I'd expect to include the 32 bit window too..
> > 
> > I'm not entirely sure what you mean.  Are you working on the
> > assumption that we've extended to allowing multiple apertures, so we'd
> > default to advertising both a small/low aperture and a large/high
> > aperture?
> 
> Yes

Ok, that works assuming we can advertise multiple windows.

> > > No, this just makes it fragile in the other direction because now
> > > userspace has to know what platform specific things to ask for *or it
> > > doesn't work at all*. This is not an improvement for the DPDK cases.
> > 
> > Um.. no.  The idea is that userspace requests *what it needs*, not
> > anything platform specific.  In the case of DPDK that would be nothing
> > more than the (minimum) aperture size.  Nothing platform specific
> > about that.
> 
> Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
> platform can do more, but it varies how much more, etc.
> 
> There is no constant value DPDK could stuff in this request, unless it
> needs a really small amount of IOVA, like 1G or something.

Well, my assumption was that DPDK always wanted an IOVA window to
cover its hugepage buffer space.  So not "constant" exactly, but a
value it will know at start up time.  But I think we cover that more
closely below.

> > > It isn't like there is some hard coded value we can put into DPDK that
> > > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > > don't see any feasible alternative.
> > 
> > Yes, hence *optionally specified* base address only.
> 
> Okay, so imagine we've already done this and DPDK is not optionally
> specifying anything :)
> 
> The structs can be extended so we can add this as an input to creation
> when a driver can implement it.
> 
> > > The ppc specific driver would be on the generic side of qemu in its
> > > viommu support framework. There is lots of host driver optimization
> > > possible here with knowledge of the underlying host iommu HW. It
> > > should not be connected to the qemu target.
> > 
> > Thinking through this...
> > 
> > So, I guess we could have basically the same logic I'm suggesting be
> > in the qemu backend iommu driver instead.  So the target side (machine
> > type, strictly speaking) would request of the host side the apertures
> > it needs, and the host side driver would see if it can do that, based
> > on both specific knowledge of that driver and the query reponses.
> 
> Yes, this is what I'm thinking
> 
> > ppc on x86 should work with that.. at least if the x86 aperture is
> > large enough to reach up to ppc's high window.  I guess we'd have the
> > option here of using either the generic host driver or the
> > x86-specific driver.  The latter would mean qemu maintaining an
> > x86-format shadow of the io pagetables; mildly tedious, but doable.
> 
> The appeal of having userspace page tables is performance, so it is
> tedious to shadow, but it should run faster.

I doubt the difference is meaningful in the context of an emulated
guest, though.

> > So... is there any way of backing out of this gracefully.  We could
> > detach the device, but in the meantime ongoing DMA maps from
> > previous devices might have failed.  
> 
> This sounds like a good use case for qemu to communicate ranges - but
> as I mentioned before Alex said qemu didn't know the ranges..

Yeah, I'm a bit baffled by that, and I don't know the context.  Note
that there are at least two very different users of the host
IOMMU backends: one is for emulation of guest DMA (with or without
a vIOMMU).  In that case the details of the guest platform should let
qemu know the ranges.  Ther

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-12 Thread David Gibson
On Wed, May 11, 2022 at 03:15:22AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, May 11, 2022 3:00 AM
> > 
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplfies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s).  This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> > 
> > We already did both of these, the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > select the IOVA internally.
> > 
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> > 
> > Yes
> > 
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> > >   MAPs won't use it, but doesn't actually put anything into the IO
> > >   pagetables.
> > > - Like a regular mapping, ATTACHes that are incompatible with an
> > >   IOMAP_RESERVEed region will fail
> > > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >   mapping
> > 
> > Yeah, this seems OK, I'm thinking a new API might make sense because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> > 
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > 
> > struct iommu_ioas_require_iova {
> > __u32 size;
> > __u32 ioas_id;
> > __u32 num_iovas;
> > __u32 __reserved;
> > struct iommu_required_iovas {
> > __aligned_u64 start;
> > __aligned_u64 last;
> > } required_iovas[];
> > };
> 
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I don't think so.  In fact the way I was envisaging this,
non-FIXED mappings will *never* go into the reserved ranges.  This is
for the benefit of any use cases that need both mappings where they
don't care about the IOVA and those which do.

Essentially, reserving a region here is saying to the kernel "I want
to manage this IOVA space; make sure nothing else touches it".  That
means both that the kernel must disallow any hw associated changes
(like ATTACH) which would impinge on the reserved region, and also any
IOVA allocations that would take parts away from that space.

Whether we want to restrict FIXED mappings to the reserved regions is
an interesting question.  I wasn't thinking that would be necessary
(just as you can use mmap() MAP_FIXED anywhere).  However.. much as
MAP_FIXED is very dangerous to use if you don't previously reserve
address space, I think IOMAP_FIXED is dangerous if you haven't
previously reserved space.  So maybe it would make sense to only allow
FIXED mappings within reserved regions.

Strictly dividing the IOVA space into kernel managed and user managed
regions does make a certain amount of sense.
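
To make the intended flow concrete, a hypervisor-style user would do
something like the sketch below.  ioas_map() is a stand-in wrapper for
the IOAS_MAP ioctl, and IOMAP_RESERVE is only the flag proposed above -
it doesn't exist today:

#include <stdint.h>
#include <stddef.h>

#define IOMAP_FIXED   (1u << 0)   /* caller picks the IOVA */
#define IOMAP_RESERVE (1u << 1)   /* claim IOVA space, map nothing yet */

/* stand-in wrapper: returns the IOVA used, or UINT64_MAX on error */
uint64_t ioas_map(int ioas_fd, uint32_t flags, uint64_t iova, size_t len,
                  void *va);

static int reserve_guest_window(int ioas_fd, uint64_t win_base,
                                size_t win_size)
{
    /* Nothing is mapped yet, but kernel-chosen (non-FIXED) allocations
     * now stay out of this range, and an ATTACH that would shrink the
     * aperture below it fails. */
    if (ioas_map(ioas_fd, IOMAP_RESERVE, win_base, win_size, NULL) ==
        UINT64_MAX)
        return -1;
    return 0;
}

static int map_for_guest(int ioas_fd, uint64_t guest_iova, size_t len,
                         void *host_va)
{
    /* FIXED mappings land exactly where the guest expects, inside the
     * previously reserved window. */
    return ioas_map(ioas_fd, IOMAP_FIXED, guest_iova, len, host_va) ==
           UINT64_MAX ? -1 : 0;
}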

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.
> 
> otherwise we may need another dedicated hint for that optimization.

Right.  With the revised model where reserving windows is optional,
not required, I don't think we can quite re-use this for optimization
hints.  Which is a bit unfortunate.

I can't immediately see a way to tweak this which handles both more
neatly, but I like the idea if we can figure out a way.

> > > So, for DPDK the sequence would be:
> > >
> > > 1. Create IOAS
> > > 2. ATTACH devices
> > > 3. IOAS_MAP some stuff
> > > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > >
> > > (Note, not even any need for QUERY in simple cases)
> > 
>
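
Fleshed out, that four step sequence is nothing more than the
following; the wrapper names are invented for illustration and don't
correspond to settled uAPI:

#include <stdint.h>
#include <stddef.h>

/* invented wrappers around the eventual uAPI */
int      ioas_create(int iommu_fd);                    /* returns an ioas_id */
int      ioas_attach_device(int device_fd, int ioas_id);
uint64_t ioas_map_anywhere(int iommu_fd, int ioas_id, void *va,
                           size_t len);                /* kernel picks the IOVA */

static int dpdk_style_init(int iommu_fd, int device_fd, void *hugepage_buf,
                           size_t buf_len, uint64_t *iova_out)
{
    int ioas_id = ioas_create(iommu_fd);               /* 1. create IOAS */

    if (ioas_id < 0)
        return -1;
    if (ioas_attach_device(device_fd, ioas_id) < 0)    /* 2. attach devices */
        return -1;
    *iova_out = ioas_map_anywhere(iommu_fd, ioas_id,   /* 3. map some stuff */
                                  hugepage_buf, buf_len);
    if (*iova_out == UINT64_MAX)
        return -1;
    return 0;                                   /* 4. DMA using *iova_out */
}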

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-24 Thread David Gibson
On Tue, May 24, 2022 at 10:25:53AM -0300, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
> 
> > Which means the guest RAM does not need to be all mapped in that base IOAS
> > suggested down this thread as that would mean all memory is pinned and
> > powervm won't be able to swap it out (yeah, it can do such thing now!). Not
> > sure if we really want to support this or stick to a simpler design.
> 
> Huh? How can it swap? Calling GUP is not optional. Either you call GUP
> at the start and there is no swap, or you call GUP for each vIOMMU
> hypercall.
> 
> Since everyone says PPC doesn't call GUP during the hypercall - how is
> it working?

The current implementation does GUP during the pre-reserve.  I think
Alexey's talking about a new PowerVM (IBM hypervisor) feature; I don't
know how that works.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-26 Thread David Gibson
On Fri, Apr 23, 2021 at 07:28:03PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 23, 2021 at 10:38:51AM -0600, Alex Williamson wrote:
> > On Thu, 22 Apr 2021 20:39:50 -0300
> 
> > > /dev/ioasid should understand the group concept somehow, otherwise it
> > > is incomplete and maybe even security broken.
> > > 
> > > So, how do I add groups to, say, VDPA in a way that makes sense? The
> > > only answer I come to is broadly what I outlined here - make
> > > /dev/ioasid do all the group operations, and do them when we enjoin
> > > the VDPA device to the ioasid.
> > > 
> > > Once I have solved all the groups problems with the non-VFIO users,
> > > then where does that leave VFIO? Why does VFIO need a group FD if
> > > everyone else doesn't?
> > 
> > This assumes there's a solution for vDPA that doesn't just ignore the
> > problem and hope for the best.  I can't speak to a vDPA solution.
> 
> I don't think we can just ignore the question and succeed with
> /dev/ioasid.
> 
> Guess it should get answered as best it can for ioasid "in general"
> then we can decide if it makes sense for VFIO to use the group FD or
> not when working in ioasid mode.
> 
> Maybe a better idea will come up
> 
> > an implicit restriction.  You've listed a step in the description about
> > a "list of devices in the group", but nothing in the pseudo code
> > reflects that step.
> 
> I gave it below with the readdir() - it isn't in the pseudo code
> because the applications I looked through didn't use it, and wouldn't
> benefit from it. I tried to show what things were doing today.

And chances are they will break cryptically if you give them a device
in a multi-device group.  That's not something we want to encourage.

> 
> > I expect it would be a subtly missed by any userspace driver
> > developer unless they happen to work on a system where the grouping
> > is not ideal.
> 
> I'm still unclear - what are be the consequence if the application
> designer misses the group detail? 
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-26 Thread David Gibson
On Thu, Apr 22, 2021 at 08:39:50PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:
> 
> > Because it's fundamental to the isolation of the device?  What you're
> > proposing doesn't get around the group issue, it just makes it implicit
> > rather than explicit in the uapi.
> 
> I'm not even sure it makes it explicit or implicit, it just takes away
> the FD.
> 
> There are four group IOCTLs, I see them mapping to /dev/ioasid follows:
>  VFIO_GROUP_GET_STATUS - 
>+ VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
>+ VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
>  kernel/iomm_groups, or could be an IOCTL on /dev/ioasid
>IOASID_ALL_DEVICES_VIABLE
> 
>  VFIO_GROUP_SET_CONTAINER -
>+ This happens implicitly when the device joins the IOASID
>  so it gets moved to the vfio_device FD:
>   ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
> 
>  VFIO_GROUP_UNSET_CONTAINER -
>+ Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD
> 
>  VFIO_GROUP_GET_DEVICE_FD -
>+ Replaced by opening /dev/vfio/deviceX
>  Learn the deviceX which will be the cdev sysfs shows as:
>   /sys/devices/pci:00/:00:01.0/:01:00.0/vfio/deviceX/dev
> Open /dev/vfio/deviceX
> 
> > > How do we model the VFIO group security concept to something like
> > > VDPA?
> > 
> > Is it really a "VFIO group security concept"?  We're reflecting the
> > reality of the hardware, not all devices are fully isolated.  
> 
> Well, exactly.
> 
> /dev/ioasid should understand the group concept somehow, otherwise it
> is incomplete and maybe even security broken.
> 
> So, how do I add groups to, say, VDPA in a way that makes sense? The
> only answer I come to is broadly what I outlined here - make
> /dev/ioasid do all the group operations, and do them when we enjoin
> the VDPA device to the ioasid.
> 
> Once I have solved all the groups problems with the non-VFIO users,
> then where does that leave VFIO? Why does VFIO need a group FD if
> everyone else doesn't?
> 
> > IOMMU group.  This is the reality that any userspace driver needs to
> > play in, it doesn't magically go away because we drop the group file
> > descriptor.  
> 
> I'm not saying it does, I'm saying it makes the uAPI more regular and
> easier to fit into /dev/ioasid without the group FD.
> 
> > It only makes the uapi more difficult to use correctly because
> > userspace drivers need to go outside of the uapi to have any idea
> > that this restriction exists.  
> 
> I don't think it makes any substantive difference one way or the
> other.
> 
> With the group FD: the userspace has to read sysfs, find the list of
> devices in the group, open the group fd, create device FDs for each
> device using the name from sysfs.
> 
> Starting from a BDF the general pseudo code is
>  group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
>  group_name = basename(group_path)
>  group_fd = open("/dev/vfio/"+group_name)
>  device_fd = ioctl(VFIO_GROUP_GET_DEVICE_FD, BDF);
> 
> Without the group FD: the userspace has to read sysfs, find the list
> of devices in the group and then open the device-specific cdev (found
> via sysfs) and link them to a /dev/ioasid FD.
> 
> Starting from a BDF the general pseudo code is:
>  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
>  device_fd = open("/dev/vfio/"+device_name)
>  ioasidfd = open("/dev/ioasid")
>  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)

This line is the problem.

[Historical aside: Alex's early drafts for the VFIO interface looked
quite similar to this.  Ben Herrenschmidt and myself persuaded him it
was a bad idea, and groups were developed instead.  I still think it's
a bad idea, and not just for POWER]

As Alex says, if this line fails because of the group restrictions,
that's not great because it's not very obvious what's gone wrong.  But
IMO, the success path on a multi-device group is kind of worse:
you've now made a meaningful and visible change to the setup of
devices which are not mentioned in this line *at all*.  If you've
changed the DMA address space of this device you've also changed it
for everything else in the group - there's no getting around that.

For both those reasons, I absolutely agree with Alex that retaining
the explicit group model is valuable.

Yes, it makes set up more of a pain, but it's necessary complexity to
actually understand what's going on here.
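
For comparison, this is roughly what the group-first setup looks like
with today's VFIO uAPI (error handling elided to keep it short); the
point is that the two steps acting on the group FD really do affect
every device in the group:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* group_path like "/dev/vfio/26", bdf like "0000:01:00.0" */
static int group_first_open(const char *group_path, const char *bdf)
{
    struct vfio_group_status status = { .argsz = sizeof(status) };
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open(group_path, O_RDWR);

    ioctl(group, VFIO_GROUP_GET_STATUS, &status);
    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        return -1;      /* some device in the group isn't bound to vfio */

    /* These two act on the *group*: every device in it moves into the
     * container's DMA address space, whether it's named here or not. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* only now is it safe to hand out individual device FDs */
    return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, bdf);
}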


> These two routes can have identical outcomes and identical security
>

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-26 Thread David Gibson
On Thu, Apr 22, 2021 at 11:13:37AM -0600, Alex Williamson wrote:
> On Wed, 21 Apr 2021 20:03:01 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Apr 21, 2021 at 01:33:12PM -0600, Alex Williamson wrote:
> > 
> > > > I still expect that VFIO_GROUP_SET_CONTAINER will be used to connect
> > > > /dev/{ioasid,vfio} to the VFIO group and all the group and device
> > > > logic stays inside VFIO.  
> > > 
> > > But that group and device logic is also tied to the container, where
> > > the IOMMU backend is the interchangeable thing that provides the IOMMU
> > > manipulation for that container.  
> > 
> > I think that is an area where the discussion would need to be focused.
> > 
> > I don't feel very prepared to have it in details, as I haven't dug
> > into all the group and iommu micro-operation very much.
> > 
> > But, it does seem like the security concept that VFIO is creating with
> > the group also has to be present in the lower iommu layer too.
> > 
> > With different subsystems joining devices to the same ioasid's we
> > still have to enforce the security propery the vfio group is creating.
> > 
> > > If you're using VFIO_GROUP_SET_CONTAINER to associate a group to a
> > > /dev/ioasid, then you're really either taking that group outside of
> > > vfio or you're re-implementing group management in /dev/ioasid.   
> > 
> > This sounds right.
> > 
> > > > Everything can be switched to ioasid_container all down the line. If
> > > > it wasn't for PPC this looks fairly simple.  
> > > 
> > > At what point is it no longer vfio?  I'd venture to say that replacing
> > > the container rather than invoking a different IOMMU backend is that
> > > point.  
> > 
> > sorry, which is no longer vfio?
> 
> I'm suggesting that if we're replacing the container/group model with
> an ioasid then we're effectively creating a new thing that really only
> retains the vfio device uapi.
> 
> > > > Since getting rid of PPC looks a bit hard, we'd be stuck with
> > > > accepting a /dev/ioasid and then immediately wrappering it in a
> > > > vfio_container an shimming it through a vfio_iommu_ops. It is not
> > > > ideal at all, but in my look around I don't see a major problem if
> > > > type1 implementation is moved to live under /dev/ioasid.  
> > > 
> > > But type1 is \just\ an IOMMU backend, not "/dev/vfio".  Given that
> > > nobody flinched at removing NVLink support, maybe just deprecate SPAPR
> > > now and see if anyone objects ;)  
> > 
> > Would simplify this project, but I wonder :)
> > 
> > In any event, it does look like today we'd expect the SPAPR stuff
> > would be done through the normal iommu APIs, perhaps enhanced a bit,
> > which makes me suspect an enhanced type1 can implement SPAPR.
> 
> David Gibson has argued for some time that SPAPR could be handled via a
> converged type1 model.  We has mapped that out at one point,
> essentially a "type2", but neither of us had any bandwidth to pursue it.

Right.  The sPAPR TCE backend is kind of an unfortunate accident of
history.  We absolutely could do a common interface, but no-one's had
time to work on it.

> > I say this because the SPAPR looks quite a lot like PASID when it has
> > APIs for allocating multiple tables and other things. I would be
> > interested to hear someone from IBM talk about what it is doing and
> > how it doesn't fit into today's IOMMU API.

Hm.  I don't think it's really like PASID.  Just like Type1, the TCE
backend represents a single DMA address space which all devices in the
container will see at all times.  The difference is that there can be
multiple (well, 2) "windows" of valid IOVAs within that address space.
Each window can have a different TCE (page table) layout.  For kernel
drivers, a smallish translated window at IOVA 0 is used for 32-bit
devices, and a large direct mapped (no page table) window is created
at a high IOVA for better performance with 64-bit DMA capable devices.

With the VFIO backend we create (but don't populate) a similar
smallish 32-bit window; userspace can create its own secondary window
if it likes, though obviously for userspace use there will always be a
page table.  Userspace can choose the total size (but not address),
page size and to an extent the page table format of the created
window.  Note that the TCE page table format is *not* the same as the
POWER CPU core's page table format.  Userspace can also remove the
default small window and create its own.
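
For concreteness, window creation in the existing sPAPR VFIO backend
looks roughly like this (field layout quoted from memory, so check
linux/vfio.h rather than trusting it; it also assumes a container
already set up with the sPAPR TCE v2 backend):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Create a second, large window: userspace picks the size, IO page size
 * and number of TCE levels; the kernel picks and returns the start
 * address (typically a high IOVA). */
static int create_huge_window(int container_fd, uint64_t *start_out)
{
    struct vfio_iommu_spapr_tce_create create = {
        .argsz       = sizeof(create),
        .page_shift  = 16,          /* 64KiB IO pages */
        .window_size = 1ULL << 40,  /* 1TiB of IOVA space */
        .levels      = 1,           /* single-level TCE table */
    };

    if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create))
        return -1;
    *start_out = create.start_addr;
    return 0;
}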

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-26 Thread David Gibson
o get the device_fd.
> > 
> 
> So your proposal sort of moves the entire container/group/domain 
> managment into /dev/ioasid and then leaves vfio only provide device
> specific uAPI. An ioasid represents a page table (address space), thus 
> is equivalent to the scope of VFIO container.

Right.  I don't really know how /dev/ioasid is supposed to work, and
so far I don't see how it conceptually differs from a container.  What
is it adding?

> Having the device join 
> an ioasid is equivalent to attaching a device to VFIO container, and 
> here the group integrity must be enforced. Then /dev/ioasid anyway 
> needs to manage group objects and their association with ioasid and 
> underlying iommu domain thus it's pointless to keep same logic within
> VFIO. Is this understanding correct?
> 
> btw one remaining open is whether you expect /dev/ioasid to be 
> associated with a single iommu domain, or multiple. If only a single 
> domain is allowed, the ioasid_fd is equivalent to the scope of VFIO 
> container. It is supposed to have only one gpa_ioasid_id since one 
> iommu domain can only have a single 2nd level pgtable. Then all other 
> ioasids, once allocated, must be nested on this gpa_ioasid_id to fit 
> in the same domain. if a legacy vIOMMU is exposed (which disallows 
> nesting), the userspace has to open an ioasid_fd for every group. 
> This is basically the VFIO way. On the other hand if multiple domains 
> is allowed, there could be multiple ioasid_ids each holding a 2nd level 
> pgtable and an iommu domain (or a list of pgtables and domains due to
> incompatibility issue as discussed in another thread), and can be
> nested by other ioasids respectively. The application only needs
> to open /dev/ioasid once regardless of whether vIOMMU allows 
> nesting, and has a single interface for ioasid allocation. Which way
> do you prefer to?
> 
> Thanks
> Kevin
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread David Gibson
On Tue, Apr 27, 2021 at 01:39:54PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 03:11:25PM +1000, David Gibson wrote:
> 
> > > So your proposal sort of moves the entire container/group/domain 
> > > managment into /dev/ioasid and then leaves vfio only provide device
> > > specific uAPI. An ioasid represents a page table (address space), thus 
> > > is equivalent to the scope of VFIO container.
> > 
> > Right.  I don't really know how /dev/ioasid is supposed to work, and
> > so far I don't see how it conceptually differs from a container.  What
> > is it adding?
> 
> There are three motivating topics:
>  1) /dev/vfio/vfio is only usable by VFIO and we have many interesting
> use cases now where we need the same thing usable outside VFIO
>  2) /dev/vfio/vfio does not support modern stuff like PASID and
> updating to support that is going to be a big change, like adding
> multiple IOASIDs so they can be modeled as as a tree inside a
> single FD
>  3) I understand there is some desire to revise the uAPI here a bit,
> ie Alex mentioned the poor mapping performance.
> 
> I would say it is not conceptually different from what VFIO calls a
> container, it is just a different uAPI with the goal to be cross
> subsystem.

Ok, that makes sense.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread David Gibson
On Tue, Apr 27, 2021 at 02:24:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 02:50:45PM +1000, David Gibson wrote:
> 
> > > > I say this because the SPAPR looks quite a lot like PASID when it has
> > > > APIs for allocating multiple tables and other things. I would be
> > > > interested to hear someone from IBM talk about what it is doing and
> > > > how it doesn't fit into today's IOMMU API.
> > 
> > Hm.  I don't think it's really like PASID.  Just like Type1, the TCE
> > backend represents a single DMA address space which all devices in the
> > container will see at all times.  The difference is that there can be
> > multiple (well, 2) "windows" of valid IOVAs within that address space.
> > Each window can have a different TCE (page table) layout.  For kernel
> > drivers, a smallish translated window at IOVA 0 is used for 32-bit
> > devices, and a large direct mapped (no page table) window is created
> > at a high IOVA for better performance with 64-bit DMA capable devices.
> >
> > With the VFIO backend we create (but don't populate) a similar
> > smallish 32-bit window; userspace can create its own secondary window
> > if it likes, though obviously for userspace use there will always be a
> > page table.  Userspace can choose the total size (but not address),
> > page size and to an extent the page table format of the created
> > window.  Note that the TCE page table format is *not* the same as the
> > POWER CPU core's page table format.  Userspace can also remove the
> > default small window and create its own.
> 
> So what do you need from the generic API? I'd suggest if userspace
> passes in the required IOVA range it would benefit all the IOMMU
> drivers to setup properly sized page tables and PPC could use that to
> drive a single window. I notice this is all DPDK did to support TCE.

Yes.  My proposed model for a unified interface would be that when you
create a new container/IOASID, *no* IOVAs are valid.  Before you can
map anything you would have to create a window with specified base,
size, pagesize (probably some flags for extension, too).  That could
fail if the backend IOMMU can't handle that IOVA range, it could be a
backend no-op if the requested window lies within a fixed IOVA range
the backend supports, or it could actually reprogram the back end for
the new window (such as for POWER TCEs).  Regardless of the hardware,
attempts to map outside the created window(s) would be rejected by
software.

I expect we'd need some kind of query operation to expose limitations
on the number of windows, addresses for them, available pagesizes etc.
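
Roughly, I'm imagining a pair of operations along these lines; all of
the structures and names here are hypothetical, they're only meant to
illustrate the model:

#include <linux/types.h>

/* A new container/IOASID starts with *no* valid IOVAs; each window has
 * to be asked for explicitly. */
struct ioas_create_window {
    __u32 size;              /* sizeof(struct ioas_create_window) */
    __u32 ioas_id;
    __aligned_u64 iova_base; /* requested base of the window */
    __aligned_u64 iova_size;
    __u32 page_shift;        /* IO page size for this window */
    __u32 flags;             /* room for extensions */
};

/* Query so userspace can learn the limits before asking: how many
 * windows, where they may sit, which IO page sizes are available. */
struct ioas_window_caps {
    __u32 size;
    __u32 ioas_id;
    __u32 max_windows;
    __u32 __reserved;
    __aligned_u64 supported_page_sizes;  /* bitmap: bit n => 2^n byte pages */
    __aligned_u64 min_iova_base;
    __aligned_u64 max_iova_end;
};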

> > The second wrinkle is pre-registration.  That lets userspace register
> > certain userspace VA ranges (*not* IOVA ranges) as being the only ones
> > allowed to be mapped into the IOMMU.  This is a performance
> > optimization, because on pre-registration we also pre-account memory
> > that will be effectively locked by DMA mappings, rather than doing it
> > at DMA map and unmap time.
> 
> This feels like nesting IOASIDs to me, much like a vPASID.
> 
> The pre-registered VA range would be the root of the tree and the
> vIOMMU created ones would be children of the tree. This could allow
> the map operations of the child to refer to already prepped physical
> memory held in the root IOASID avoiding the GUP/etc cost.

Huh... I never thought of it that way, but yeah, that sounds like it
could work.  More elegantly than the current system in fact.

> Seems fairly genericish, though I'm not sure about the kvm linkage..

I think it should be doable.  We'd basically need to give KVM a handle
on the parent AS, and the child AS, and the guest side handle (what
PAPR calls a "Logical IO Bus Number" - liobn).  KVM would then
translate H_PUT_TCE etc. hypercalls on that liobn into calls into the
IOMMU subsystem to map bits of the parent AS into the child.  We'd
probably have to have some requirements that either parent AS is
identity-mapped to a subset of the userspace AS (effectively what we
have now) or that parent AS is the same as guest physical address.
Not sure which would work better.

> > I like the idea of a common DMA/IOMMU handling system across
> > platforms.  However in order to be efficiently usable for POWER it
> > will need to include multiple windows, allowing the user to change
> > those windows and something like pre-registration to amortize
> > accounting costs for heavy vIOMMU load.
> 
> I have a feeling /dev/ioasid is going to end up with some HW specific
> escape hatch to create some HW specific IOASID types and operate on
> them in a HW specific way.
> 
> However, wh

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-27 Thread David Gibson
On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > Starting from a BDF the general pseudo code is:
> > >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > >  device_fd = open("/dev/vfio/"+device_name)
> > >  ioasidfd = open("/dev/ioasid")
> > >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > 
> > This line is the problem.
> > 
> > [Historical aside: Alex's early drafts for the VFIO interface looked
> > quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> > was a bad idea, and groups were developed instead.  I still think it's
> > a bad idea, and not just for POWER]
> 
> Spawning the VFIO device FD from the group FD is incredibly gross from
> a kernel design perspective. Since that was done the struct
> vfio_device missed out on a sysfs presence and doesn't have the
> typical 'struct device' member or dedicated char device you'd expect a
> FD based subsystem to have.
> 
> This basically traded normal usage of the driver core for something
> that doesn't serve a technical usage. Given we are now nearly 10 years
> later and see that real widely deployed applications are not doing
> anything special with the group FD it makes me question the wisdom of
> this choice.

I'm really not sure what "anything special" would constitute here.

> > As Alex says, if this line fails because of the group restrictions,
> > that's not great because it's not very obvious what's gone wrong.  
> 
> Okay, that is fair, but let's solve that problem directly. For
> instance netlink has been going in the direction of adding a "extack"
> from the kernel which is a descriptive error string. If the failing
> ioctl returned the string:
> 
>   "cannot join this device to the IOASID because device XXX in the
>same group #10 is in use"

Um.. is there a sane way to return strings from an ioctl()?

> Would you agree it is now obvious what has gone wrong? In fact would
> you agree this is a lot better user experience than what applications
> do today even though they have the group FD?
> 
> > But IMO, the success path on a multi-device group is kind of worse:
> > you've now made a meaningful and visible change to the setup of
> > devices which are not mentioned in this line *at all*.  
> 
> I don't think spawning a single device_fd from the guoup clearly says
> there are repercussions outside that immediate, single, device.

It's not the fact that the device fds are spawned from the group fd.
It's the fact that the "attach" operation - binding the group to the
container now, binding the whatever to the ioasid in future -
explicitly takes a group.  That's an operation that affects a group,
so the interface should reflect that.

Getting the device fds from the group fd kind of follows, because it's
unsafe to do basically anything on the device unless you already
control the group (which in this case means attaching it to a
container/ioasid).  I'm entirely open to ways of doing that that are
less inelegant from a sysfs integration point of view, but the point
is you must manage the group before you can do anything at all with
individual devices.

> That comes from understanding what the ioctls are doing, and reading
> the documentation. The same applies to some non-group FD world.
> 
> > Yes, it makes set up more of a pain, but it's necessary complexity to
> > actually understand what's going on here.
> 
> There is a real technical problem here - the VFIO group is the thing
> that spawns the device_fd and that is incompatible with the idea to
> centralize the group security logic in drivers/iommu/ and share it
> with multiple subsystems.

I don't see why.  I mean, sure, you don't want explicitly the *vfio*
group as such.  But IOMMU group is already a cross-subsystem concept
and you can explicitly expose that in a different way.

> We also don't have an obvious clean way to incorporate a group FD into
> other subsystems (nor would I want to).

If you don't have a group concept in other subsystems, there's a fair
chance they are broken.  There are a bunch of operations that are
inherently per-group.  Well.. per container/IOASID, but the
granularity of membership for that is the group.

> One option is VFIO can keep its group FD but nothing else will have
> anthing like it. However I don't much like the idea that VFIO will
> have a special and unique programming model to do that same things
> other subsystem will do. That will make it harder for userspace to
> 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread David Gibson
On Wed, Apr 28, 2021 at 09:21:49PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 28, 2021 at 11:23:39AM +1000, David Gibson wrote:
> 
> > Yes.  My proposed model for a unified interface would be that when you
> > create a new container/IOASID, *no* IOVAs are valid.
> 
> Hurm, it is quite tricky. All IOMMUs seem to have a dead zone around
> the MSI window, so negotiating this all in a general way is not going
> to be a very simple API.
> 
> To be general it would be nicer to say something like 'I need XXGB of
> IOVA space' 'I need 32 bit IOVA space' etc and have the kernel return
> ranges that sum up to at least that big. Then the kernel can do its
> all its optimizations.

Ah, yes, sorry.  We do need an API that lets the kernel make more of
the decisions too.  For userspace drivers it would generally be
sufficient to just ask for XXX size of IOVA space wherever you can get
it.  Handling guests requires more precision.  So, maybe a request
interface with a bunch of hint variables and a matching set of
MAP_FIXED-like flags to assert which ones aren't negotiable.

> I guess you are going to say that the qemu PPC vIOMMU driver needs
> more exact control..

*Every* vIOMMU driver needs more exact control.  The guest drivers
will expect to program the guest devices with IOVAs matching the guest
platform's IOMMU model.  Therefore the backing host IOMMU has to be
programmed to respond to those IOVAs.  If it can't be, there's no way
around it, and you want to fail out early.  With this model that will
happen when qemu (say) requests the host IOMMU window(s) to match the
guest's expected IOVA ranges.

Actually, come to that even guests without a vIOMMU need more exact
control: they'll expect IOVA to match GPA, so if your host IOMMU can't
be set up translate the full range of GPAs, again, you're out of luck.

The only reason x86 has been able to ignore this is that the
assumption has been that all IOMMUs can translate IOVAs from 0...  Once you really start to
look at what the limits are, you need the exact window control I'm
describing.

> > I expect we'd need some kind of query operation to expose limitations
> > on the number of windows, addresses for them, available pagesizes etc.
> 
> Is page size an assumption that hugetlbfs will always be used for backing
> memory or something?

So for TCEs (and maybe other IOMMUs out there), the IO page tables are
independent of the CPU page tables.  They don't have the same format,
and they don't necessarily have the same page size.  In the case of a
bare metal kernel working in physical addresses they can use that TCE
page size however they like.  For userspace you get another layer of
complexity.  Essentially to implement things correctly the backing
IOMMU needs to have a page size granularity that's the minimum of
whatever granularity the userspace or guest driver expects and the
host page size backing the memory.
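
As a trivial illustration of that rule (not real kernel code):

/* The usable IO page size is capped both by the granularity the guest
 * or userspace driver expects to map at and by the host page size
 * backing the memory (contiguity is only guaranteed per backing page). */
static inline unsigned io_page_shift(unsigned guest_page_shift,
                                     unsigned host_backing_page_shift)
{
    return guest_page_shift < host_backing_page_shift ?
           guest_page_shift : host_backing_page_shift;
}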

> > > As an ideal, only things like the HW specific qemu vIOMMU driver
> > > should be reaching for all the special stuff.
> > 
> > I'm hoping we can even avoid that, usually.  With the explicitly
> > created windows model I propose above, it should be able to: qemu will
> > create the windows according to the IOVA windows the guest platform
> > expects to see and they either will or won't work on the host platform
> > IOMMU.  If they do, generic maps/unmaps should be sufficient.  If they
> > don't well, the host IOMMU simply cannot emulate the vIOMMU so you're
> > out of luck anyway.
> 
> It is not just P9 that has special stuff, and this whole area of PASID
> seems to be quite different on every platform
> 
> If things fit very naturally and generally then maybe, but I've been
> down this road before of trying to make a general description of a
> group of very special HW. It ended in tears after 10 years when nobody
> could understand the "general" API after it was Frankenstein'd up with
> special cases for everything. Cautionary tale
> 
> There is a certain appeal to having some
> 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> information like windows that can be optionally called by the viommu
> driver and it remains well defined and described.

Windows really aren't ppc specific.  They're absolutely there on x86
and everything else as well - it's just that people are used to having
a window at 0.. that you can often get away with
treating it sloppily.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread David Gibson
On Wed, Apr 28, 2021 at 11:56:22AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 28, 2021 at 10:58:29AM +1000, David Gibson wrote:
> > On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > > > Starting from a BDF the general pseudo code is:
> > > > >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > > > >  device_fd = open("/dev/vfio/"+device_name)
> > > > >  ioasidfd = open("/dev/ioasid")
> > > > >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > > > 
> > > > This line is the problem.
> > > > 
> > > > [Historical aside: Alex's early drafts for the VFIO interface looked
> > > > quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> > > > was a bad idea, and groups were developed instead.  I still think it's
> > > > a bad idea, and not just for POWER]
> > > 
> > > Spawning the VFIO device FD from the group FD is incredibly gross from
> > > a kernel design perspective. Since that was done the struct
> > > vfio_device missed out on a sysfs presence and doesn't have the
> > > typical 'struct device' member or dedicated char device you'd expect a
> > > FD based subsystem to have.
> > > 
> > > This basically traded normal usage of the driver core for something
> > > that doesn't serve a technical usage. Given we are now nearly 10 years
> > > later and see that real widely deployed applications are not doing
> > > anything special with the group FD it makes me question the wisdom of
> > > this choice.
> > 
> > I'm really not sure what "anything special" would constitute here.
> 
> Well, really anything actually. All I see in, say, dpdk, is open the
> group fd, get a device fd, do the container dance and never touch the
> group fd again or care about groups in any way. It seems typical of
> this class of application.

Well, sure, the only operation you do on the group itself is attach it
to the container (and then every container operation can be thought of
as applying to all its attached groups).  But that attach operation
really is fundamentally about the group.  It always, unavoidably,
fundamentally affects every device in the group - including devices
you may not typically think about, like bridges and switches.

That is *not* true of the other device operations, like poking IO.

> If dpdk is exposing other devices to a risk it certainly hasn't done
> anything to make that obvious.

And in practice I suspect it will just break if you give it a >1
device group.

> > > Okay, that is fair, but let's solve that problem directly. For
> > > instance netlink has been going in the direction of adding a "extack"
> > > from the kernel which is a descriptive error string. If the failing
> > > ioctl returned the string:
> > > 
> > >   "cannot join this device to the IOASID because device XXX in the
> > >same group #10 is in use"
> > 
> > Um.. is there a sane way to return strings from an ioctl()?
> 
> Yes, it can be done, a string buffer pointer and length in the input
> for instance.

I suppose.  Rare enough that I expect everyone will ignore it, alas :/.

> > Getting the device fds from the group fd kind of follows, because it's
> > unsafe to do basically anything on the device unless you already
> > control the group (which in this case means attaching it to a
> > container/ioasid).  I'm entirely open to ways of doing that that are
> > less inelegant from a sysfs integration point of view, but the point
> > is you must manage the group before you can do anything at all with
> > individual devices.
> 
> I think most likely VFIO is going to be the only thing to manage a
> multi-device group.

You don't get to choose that.  You could explicitly limit other things
to only one-device groups, but that would be an explicit limitation.
Essentially any device can end up in a multi-device group, if you put
it behind a PCIe to PCI bridge, or a PCIe switch which doesn't support
access controls.

The groups are still there, whether or not other things want to deal
with them.

> I see things like VDPA being primarily about PASID, and an IOASID that
> is matched to a PASID is inherently a single device IOMMU group.

I don't know enough about PASID to make sense of that.

> > I don't see why.  I mean, sure, you don't want explicitly the *vfio*
> > group as such.  But IOMMU group is already a cr

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-03 Thread David Gibson
On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > There is a certain appeal to having some
> > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> > > information like windows that can be optionally called by the viommu
> > > driver and it remains well defined and described.
> > 
> > Windows really aren't ppc specific.  They're absolutely there on x86
> > and everything else as well - it's just that people are used to having
> > a window at 0.. that you can often get away with
> > treating it sloppily.
> 
> My point is this detailed control seems to go on to more than just
> windows. As you say the vIOMMU is emulating specific HW that needs to
> have kernel interfaces to match it exactly.

It's really not that bad.  The case of emulating the PAPR vIOMMU on
something else is relatively easy, because all updates to the IO page
tables go through hypercalls.  So, as long as the backend IOMMU can
map all the IOVAs that the guest IOMMU can, then qemu's implementation
of those hypercalls just needs to put an equivalent mapping in the
backend, which it can do with a generic VFIO_DMA_MAP.
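
As a rough sketch of that flow, assuming the guest TCE has already been
resolved to a process virtual address (that lookup, and the helper name,
are mine; only the VFIO ioctl and its structure are the real generic
uAPI, spelled VFIO_IOMMU_MAP_DMA there):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Sketch only: shadow one guest TCE update into the backend IOMMU.
     * 'vaddr' is the userspace address backing the guest page the TCE
     * points at (the lookup is omitted), 'iova' is the guest IOVA the
     * TCE covers. */
    static int shadow_guest_tce(int container_fd, __u64 iova, void *vaddr,
                                __u64 tce_page_size)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (__u64)(unsigned long)vaddr;
        map.iova  = iova;
        map.size  = tce_page_size;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }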

vIOMMUs with page tables in guest memory are harder, but only really
in the usual ways that a vIOMMU of that type is harder (needs cache
mode or whatever).  At whatever point you need to shadow from the
guest IO page tables to the host backend, you can again do that with
generic maps, as long as the backend supports the necessary IOVAs, and
has an IO page size that's equal to or a submultiple of the vIOMMU
page size.

> I'm remarking that trying to unify every HW IOMMU implementation that
> ever has/will exist into a generic API complete enough to allow the
> vIOMMU to be created is likely to result in an API too complicated to
> understand..

Maybe not every one, but I think we can get a pretty wide range with a
reasonable interface.  Explicitly handling IOVA windows does most of
it.  And we kind of need to handle that anyway to expose what ranges
the IOMMU is capable of translating.  I think making the valid IOVA
windows explicit keeps things simpler than having
per-backend-family interfaces to expose the limits of their
translation ranges, which is what's likely to happen without it.
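
Purely for illustration, the sort of record such an interface could
expose or accept - this is not an existing uAPI, just the shape of the
idea:

    #include <linux/types.h>

    /* Not an existing uAPI: one record per IOVA window the backend can
     * translate.  Userspace could query an array of these to learn the
     * limits, or request windows in the same terms. */
    struct iova_window {
        __u64 start;            /* first IOVA covered by the window */
        __u64 size;             /* length of the window in bytes */
        __u64 pgsize_bitmap;    /* IO page sizes usable in this window */
    };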

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-12 Thread David Gibson
On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > Again, I don't know enough about VDPA to make sense of that.  Are we
> > essentially talking non-PCI virtual devices here?  In which case you
> > could define the VDPA "bus" to always have one-device groups.
> 
> It is much worse than that.
> 
> What these non-PCI devices need is for the kernel driver to be part of
> the IOMMU group of the underlying PCI device but tell VFIO land that
> "groups don't matter"

I don't really see a semantic distinction between "always one-device
groups" and "groups don't matter".  Really the only way you can afford
to not care about groups is if they're singletons.

> Today mdev tries to fake this by using singleton iommu groups, but it
> is really horrible and direcly hacks up the VFIO IOMMU code to
> understand these special cases. Intel was proposing more special
> hacking in the VFIO IOMMU code to extend this to PASID.

At this stage I don't really understand why that would end up so
horrible.

> When we get to a /dev/ioasid this is all nonsense. The kernel device
> driver is going to have to tell drivers/iommu exactly what kind of
> ioasid it can accept, be it a PASID inside a kernel owned group, a SW
> emulated 'mdev' ioasid, or whatever.
> 
> In these cases the "group" idea has become a fiction that just creates
> a pain.

I don't see how the group is a fiction in this instance.  You can
still have devices that can't be isolated, therefore you can have
non-singleton groups.

> "Just reorganize VDPA to do something insane with the driver
> core so we can create a dummy group to satisfy an unnecessary uAPI
> restriction" is not a very compelling argument.
> 
> So if the nonsensical groups goes away for PASID/mdev, where does it
> leave the uAPI in other cases?
> 
> > I don't think simplified-but-wrong is a good goal.  The thing about
> > groups is that if they're there, you can't just "not care" about them,
> > they affect you whether you like it or not.
> 
> You really can. If one thing claims the group then all the other group
> devices become locked out.

Aside: I'm primarily using "group" to mean the underlying hardware
unit, not the vfio construct on top of it; I'm not sure that's been
clear throughout.

So.. your model assumes that every device has a safe quiescent state
where it won't do any harm until poked, whether its group is
currently kernel owned, or owned by a userspace that doesn't know
anything about it.

At minimum this does mean that in order to use one device in the group
you must have permission to use *all* the devices in the group -
otherwise you may be able to operate a device you don't have
permission to by DMAing to its registers from a device you do have
permission to.

Whatever scripts are managing ownership of devices also need to know
about groups, because they need to put all the devices into that
quiescent state before the group can change ownership.
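
The group boundary is already visible to such scripts through standard
sysfs: a device's iommu_group link points at a group directory whose
devices/ subdirectory lists every member.  A small sketch, with error
handling kept minimal:

    #include <dirent.h>
    #include <stdio.h>

    /* List every device sharing an IOMMU group with the given PCI
     * device (bdf is e.g. "0000:01:00.0").  Anything changing the
     * ownership of one member has to quiesce and rebind all of them. */
    static void list_group_members(const char *bdf)
    {
        char path[256];
        struct dirent *ent;
        DIR *dir;

        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/iommu_group/devices", bdf);
        dir = opendir(path);
        if (!dir) {
            perror(path);
            return;
        }
        while ((ent = readdir(dir)) != NULL) {
            if (ent->d_name[0] != '.')
                printf("group member: %s\n", ent->d_name);
        }
        closedir(dir);
    }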

> The main point to understand is that groups are NOT an application
> restriction! It is a whole system restriction that the operator needs
> to understand and deal with. This is why things like dpdk don't care
> about the group at all - there is nothing they can do with the
> information.
> 
> If the operator says to run dpdk on a specific device then the
> operator is the one that has to deal with all the other devices in the
> group getting locked out.

Ok, I think I see your point there.

> At best the application can make it more obvious that the operator is
> doing something dangerous, but the current kernel API doesn't seem to
> really support that either.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-12 Thread David Gibson
On Tue, May 04, 2021 at 03:15:37PM -0300, Jason Gunthorpe wrote:
> On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > > > There is a certain appeal to having some
> > > > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> > > > > information like windows that can be optionally called by the viommu
> > > > > driver and it remains well defined and described.
> > > > 
> > > > Windows really aren't ppc specific.  They're absolutely there on x86
> > > > and everything else as well - it's just that people are used to having
> > > > a window at 0.. that you can often get away with
> > > > treating it sloppily.
> > > 
> > > My point is this detailed control seems to go on to more than just
> > > windows. As you say the vIOMMU is emulating specific HW that needs to
> > > have kernel interfaces to match it exactly.
> > 
> > It's really not that bad.  The case of emulating the PAPR vIOMMU on
> > something else is relatively easy, because all updates to the IO page
> > tables go through hypercalls.  So, as long as the backend IOMMU can
> > map all the IOVAs that the guest IOMMU can, then qemu's implementation
> > of those hypercalls just needs to put an equivalent mapping in the
> > backend, which it can do with a generic VFIO_DMA_MAP.
> 
> So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU?

Well, I don't want to preclude it in the API.  I'm not sure about that
specific example, but in most cases it should be possible to run the
PAPR vIOMMU on an x86 IOMMU backend.  Obviously only something you'd
want to do for testing and experimentation, but it could be quite
useful for that.

> > vIOMMUs with page tables in guest memory are harder, but only really
> > in the usual ways that a vIOMMU of that type is harder (needs cache
> > mode or whatever).  At whatever point you need to shadow from the
> > guest IO page tables to the host backend, you can again do that with
> > generic maps, as long as the backend supports the necessary IOVAs, and
> > has an IO page size that's equal to or a submultiple of the vIOMMU
> > page size.
> 
> But this definitely all becomes HW specific.
> 
> For instance I want to have an ARM vIOMMU driver it needs to do some
> 
>  ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
>  if (ret == -EOPNOTSUPP)
>  ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
>  // and do completely different and more expensive emulation
> 
> I can get a little bit of generality, but at the end of the day the
> IOMMU must create a specific HW layout of the nested page table, if it
> can't, it can't.

Erm.. I don't really know how your IOASID interface works here.  I'm
thinking about the VFIO interface where maps and unmaps are via
explicit ioctl()s, which provides an obvious point to do translation
between page table formats.

But.. even if you're exposing page tables to userspace.. with hardware
that has explicit support for nesting you can probably expose the hw
tables directly, which is great for the cases it works for.  But
surely for older IOMMUs which don't do nesting you must have some way
of shadowing guest IO page tables to host IO page tables to translate
GPA to HPA at least?  If you're doing that, I don't see that
converting page table formats is really any harder.

> > > I'm remarking that trying to unify every HW IOMMU implementation that
> > > ever has/will exist into a generic API complete enough to allow the
> > > vIOMMU to be created is likely to result in an API too complicated to
> > > understand..
> > 
> > Maybe not every one, but I think we can get a pretty wide range with a
> > reasonable interface.  
> 
> It sounds like a reasonable guideline is if the feature is actually
> general to all IOMMUs and can be used by qemu as part of a vIOMMU
> emulation when compatible vIOMMU HW is not available.
> 
> Having 'requested window' support that isn't actually implemented in
> every IOMMU is going to mean the PAPR vIOMMU emulation won't work,
> defeating the whole point of making things general?

The trick is that you don't necessarily need dynamic window support in
the backend to emulate it in the vIOMMU.  If your backend has fixed
windows, then you emulate a window request as:
if (requested window is within backend windows)
no-op;
    else
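
Spelled out as a sketch (the refusal branch is my reading of the
obvious completion, not settled text): if the requested range fits an
existing backend window nothing needs to happen, otherwise the
emulation has to refuse the request because the host IOMMU simply
can't cover it.

    /* Sketch only: emulate a guest "create window" request against a
     * backend whose IOVA windows are fixed. */
    struct fixed_window { unsigned long start, size; };

    static int emulate_create_window(unsigned long req_start,
                                     unsigned long req_size,
                                     const struct fixed_window *win,
                                     int nr_win)
    {
        int i;

        for (i = 0; i < nr_win; i++) {
            if (req_start >= win[i].start &&
                req_start + req_size <= win[i].start + win[i].size)
                return 0;   /* already covered by the backend: no-op */
        }
        return -1;          /* backend can't cover it: fail the request */
    }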
 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-12 Thread David Gibson
On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> 
> > This is a good feature in general when let's say there is a linux supported
> > device which has a proprietary device firmware update tool which only exists
> > as an x86 binary and your hardware is not x86 - running qemu + vfio in full
> > emulation would provide a way to run the tool to update a physical device.
> 
> That specific use case doesn't really need a vIOMMU though, does it?

Possibly not, but the mechanics needed to do vIOMMU on different host
IOMMU aren't really different from what you need for a no-vIOMMU
guest.  With a vIOMMU you need to map guest IOVA space into the host
IOVA space.  With no vIOMMU you need to map guest physical
addresses into the host IOVA space.  In either case the GPA/gIOVA to
userspace and userspace to HPA mappings are basically arbitrary.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-24 Thread David Gibson
On Thu, May 13, 2021 at 10:50:30AM -0300, Jason Gunthorpe wrote:
> On Thu, May 13, 2021 at 04:07:07PM +1000, David Gibson wrote:
> > On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> > > 
> > > > This is a good feature in general when let's say there is a linux 
> > > > supported
> > > > device which has a proprietary device firmware update tool which only 
> > > > exists
> > > > as an x86 binary and your hardware is not x86 - running qemu + vfio in 
> > > > full
> > > > emulation would provide a way to run the tool to update a physical 
> > > > device.
> > > 
> > > That specific use case doesn't really need a vIOMMU though, does it?
> > 
> > Possibly not, but the mechanics needed to do vIOMMU on different host
> > IOMMU aren't really different from what you need for a no-vIOMMU
> > guest.  
> 
> For very simple vIOMMUs this might be true, but these new features of nesting,
> PASID, migration, etc, all make the vIOMMU complicated and
> emulating it completely a lot harder.

Well, sure, emulating a complex vIOMMU is complex. But "very simple
vIOMMUs" covers the vast majority of currently deployed hardware, and
several are already emulated by qemu.

> Stuffing a vfio-pci into a guest and creating a physical map using a
> single IOASID is comparably trivial.

Note that for PAPR (POWER guest) systems this is not an option: the
PAPR platform *always* has a vIOMMU.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-24 Thread David Gibson
On Thu, May 13, 2021 at 10:59:38AM -0300, Jason Gunthorpe wrote:
> On Thu, May 13, 2021 at 03:48:19PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > > > Again, I don't know enough about VDPA to make sense of that.  Are we
> > > > essentially talking non-PCI virtual devices here?  In which case you
> > > > could define the VDPA "bus" to always have one-device groups.
> > > 
> > > It is much worse than that.
> > > 
> > > What these non-PCI devices need is for the kernel driver to be part of
> > > the IOMMU group of the underlying PCI device but tell VFIO land that
> > > "groups don't matter"
> > 
> > I don't really see a semantic distinction between "always one-device
> > groups" and "groups don't matter".  Really the only way you can afford
> > to not care about groups is if they're singletons.
> 
> The kernel driver under the mdev may not be in an "always one-device"
> group.

I don't really understand what you mean by that.

> It is a kernel driver so the only thing we know and care about is that
> all devices in the HW group are bound to kernel drivers.
> 
> The vfio device that spawns from this kernel driver is really a
> "groups don't matter" vfio device because at the IOMMU layer it should
> be riding on the physical group of the kernel driver.  At the VFIO
> layer we no longer care about the group abstraction because the system
> guarantees isolation in some other way.

Uh.. I don't really know how mdevs are isolated from each other.  I
thought it was because the physical device providing the mdevs
effectively had an internal IOMMU (or at least DMA permissioning) to
isolate the mdevs, even though the physical device may not be fully
isolated.

In that case the virtual mdev is effectively in a singleton group,
which is different from the group of its parent device.

If the physical device had a bug which meant the mdevs *weren't*
properly isolated from each other, then those mdevs would share a
group, and you *would* care about it.  Depending on how the isolation
failed the mdevs might or might not also share a group with the parent
physical device.

> The issue is a software one of tightly coupling IOMMU HW groups to
> VFIO's API and then introducing an entire class of VFIO mdev devices
> that no longer care about IOMMU HW groups at all.

They don't necessarily care about the IOMMU groups of the parent
physical hardware, but they have their own IOMMU groups as virtual
hardware devices.

> Currently mdev tries to trick this by creating singleton groups, but
> it is very ugly and very tightly coupled to a specific expectation of
> the few existing mdev drivers. Trying to add PASID made it alot worse.
> 
> > Aside: I'm primarily using "group" to mean the underlying hardware
> > unit, not the vfio construct on top of it, I'm not sure that's been
> > clear throughout.
> 
> Sure, that is obviously fixed, but I'm not interested in that.
> 
> I'm interested in having a VFIO API that makes sense for vfio-pci
> which has a tight coupling to the HW notion of a IOMMU and also vfio
> mdev's that have no concept of a HW IOMMU group.
> 
> > So.. your model assumes that every device has a safe quiescent state
> > where it won't do any harm until poked, whether its group is
> > currently kernel owned, or owned by a userspace that doesn't know
> > anything about it.
> 
> This is today's model, yes. When you run dpdk on a multi-group device
> vfio already ensures that all the device groups remained parked and
> inaccessible.

I'm not really following what you're saying there.

If you have a multi-device group, and dpdk is using one device in it,
VFIO *does not* (and cannot) ensure that other devices in the group
are parked and inaccessible.  It ensures that they're parked at the
moment the group moves from kernel to userspace ownership, but it
can't prevent dpdk from accessing and unparking those devices via peer
to peer DMA.

> > At minimum this does mean that in order to use one device in the group
> > you must have permission to use *all* the devices in the group -
> > otherwise you may be able to operate a device you don't have
> > permission to by DMAing to its registers from a device you do have
> > permission to.
> 
> If the administator configures the system with different security
> labels for different VFIO devices then yes removing groups makes this
> more tricky as all devices in the group

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > 
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > VF (mdev device == VF device) then that mdev device has same iommu
> > > protection scope as VF associated to it.
> > 
> > This doesn't require, and certainly shouldn't create, a fake group.
> > 
> > Only the VF's real IOMMU group should be used to model an iommu domain
> > linked to a VF. Injecting fake groups that are proxies for real groups
> > only opens the possibility of security problems like David is
> > concerned with.
> > 
> 
> I think this security issue should be addressed by letting mdev device
> inherit its parent's iommu_group, i.e. VF's iommu_group here.

No, that doesn't work.  AIUI part of the whole point of mdevs is to
allow chunks of a single PCI function to be handed out to different
places, because they're isolated from each other not by the system
IOMMU, but by a combination of MMU hardware in the device itself (e.g. in a
GPU card) and software in the mdev driver.  If mdevs inherited the
group of their parent device they wouldn't count as isolated from each
other, which they should.

> 
> Kirti
> 
> > Max's series approaches this properly by fully linking the struct
> > pci_device of the VF throughout the entire VFIO scheme, including the
> > group and container, while still allowing override of various VFIO
> > operations.
> > 
> > Jason
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Mon, May 24, 2021 at 08:37:44PM -0300, Jason Gunthorpe wrote:
> On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> 
> > > > I don't really see a semantic distinction between "always one-device
> > > > groups" and "groups don't matter".  Really the only way you can afford
> > > > to not care about groups is if they're singletons.
> > > 
> > > The kernel driver under the mdev may not be in an "always one-device"
> > > group.
> > 
> > I don't really understand what you mean by that.
> 
> I mean the group of the mdev's actual DMA device may have multiple
> things in it.
>  
> > > It is a kernel driver so the only thing we know and care about is that
> > > all devices in the HW group are bound to kernel drivers.
> > > 
> > > The vfio device that spawns from this kernel driver is really a
> > > "groups don't matter" vfio device because at the IOMMU layer it should
> > > be riding on the physical group of the kernel driver.  At the VFIO
> > > layer we no longer care about the group abstraction because the system
> > > guarantees isolation in some other way.
> > 
> > Uh.. I don't really know how mdevs are isolated from each other.  I
> > thought it was because the physical device providing the mdevs
> > effectively had an internal IOMMU (or at least DMA permissioning) to
> > isolate the mdevs, even though the physical device may not be fully
> > isolated.
> > 
> > In that case the virtual mdev is effectively in a singleton group,
> > which is different from the group of its parent device.
> 
> That is one way to view it, but it means creating a whole group
> infrastructure and abusing the IOMMU stack just to create this
> nonsense fiction.

It's a nonsense fiction until it's not, at which point it will bite
you in the arse.

> We also abuse the VFIO container stuff to hackily
> create several different types pf IOMMU uAPIs for the mdev - all of
> which are unrelated to drivers/iommu.
> 
> Basically, there is no drivers/iommu thing involved, thus is no really
> iommu group, for mdev it is all a big hacky lie.

Well, "iommu" group might not be the best name, but hardware isolation
is still a real concern here, even if it's not entirely related to the
IOMMU.

> > If the physical device had a bug which meant the mdevs *weren't*
> > properly isolated from each other, then those mdevs would share a
> > group, and you *would* care about it.  Depending on how the isolation
> > failed the mdevs might or might not also share a group with the parent
> > physical device.
> 
> That isn't a real scenario.. mdevs that can't be isolated just
> wouldn't be useful to exist

Really?  So what do you do when you discover some mdevs you thought
were isolated actually aren't due to a hardware bug?  Drop support
from the driver entirely?  In which case what do you say to the people
who understandably complain "but... we had all the mdevs in one guest
anyway, we don't care if they're not isolated"?

> > > This is today's model, yes. When you run dpdk on a multi-group device
> > > vfio already ensures that all the device groups remained parked and
> > > inaccessible.
> > 
> > I'm not really following what you're saying there.
> > 
> > If you have a multi-device group, and dpdk is using one device in it,
> > VFIO *does not* (and cannot) ensure that other devices in the group
> > are parked and inaccessible.  
> 
> I mean in the sense that no other user space can open those devices
> and no kernel driver can later be attached to them.

Ok.

> > It ensures that they're parked at the moment the group moves from
> > kernel to userspace ownership, but it can't prevent dpdk from
> > accessing and unparking those devices via peer to peer DMA.
> 
> Right, and adding all this group stuff did nothing to alert the poor
> admin that is running DPDK to this risk.

Didn't it?  Seems to me that in order to give the group to DPDK, the
admin had to find and unbind all the things in it... and is therefore
aware that they're giving everything in it to DPDK.

> > > If the administator configures the system with different security
> > > labels for different VFIO devices then yes removing groups makes this
> > > more tricky as all devices in the group should have the same label.
> > 
> > That seems a bigger problem than "more tricky".  How would you propose
> > addressing this with your device-first model?
> 
> You pu

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> 
> > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > VF (mdev device == VF device) then that mdev device has same iommu
> > protection scope as VF associated to it. 
> 
> This doesn't require, and certainly shouldn't create, a fake group.

It's only fake if you start with a narrow view of what a group is.  A
group is a set of devices (in the kernel sense of "device", not
necessarily the hardware sense) which can't be isolated from each
other.  The mdev device is a kernel device, and if working as intended
it can be isolated from everything else, and is therefore in an
absolute bona fide group of its own.

> Only the VF's real IOMMU group should be used to model an iommu domain
> linked to a VF. Injecting fake groups that are proxies for real groups
> only opens the possibility of security problems like David is
> concerned with.

It's not a proxy for a real group, it's a group of its own.  If you
discover that (due to a hardware bug, for example) the mdev is *not*
properly isolated from its parent PCI device, then both the mdev
virtual device *and* the physical PCI device are in the same group.
Groups including devices of different types and on different buses
were considered from the start, and are precedented, if rare.

> Max's series approaches this properly by fully linking the struct
> pci_device of the VF throughout the entire VFIO scheme, including the
> group and container, while still allowing override of various VFIO
> operations.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > 
> > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > protection scope as VF associated to it. 
> > > 
> > > This doesn't require, and certainly shouldn't create, a fake group.
> > 
> > It's only fake if you start with a narrow view of what a group is. 
> 
> A group is connected to drivers/iommu. A group object without *any*
> relation to drivers/iommu is just a complete fiction, IMHO.

That might be where we differ.  As I've said, by "group" I'm primarily
meaning the fundamental hardware unit of isolation.  *Usually* that's
determined by the capabilities of an IOMMU, but in some cases it might
not be.  In either case, the boundaries still matter.

> > > Only the VF's real IOMMU group should be used to model an iommu domain
> > > linked to a VF. Injecting fake groups that are proxies for real groups
> > > only opens the possibility of security problems like David is
> > > concerned with.
> > 
> > It's not a proxy for a real group, it's a group of its own.  If you
> > discover that (due to a hardware bug, for example) the mdev is *not*
> 
> What Kirti is talking about here is the case where a mdev is wrapped
> around a VF and the DMA isolation stems directly from the SRIOV VF's
> inherent DMA isolation, not anything the mdev wrapper did.
> 
> The group providing the isolation is the VF's group.

Yes, in that case the mdev absolutely should be in the VF's group -
having its own group is not just messy but incorrect.

> The group mdev implicitly creates is just a fake proxy that comes
> along with mdev API. It doesn't do anything and it doesn't mean
> anything.

But.. the case of multiple mdevs managed by a single PCI device with
an internal IOMMU also exists, and then the mdev groups are *not*
proxies but true groups independent of the parent device.  Which means
that the group structure of mdevs can vary, which is an argument *for*
keeping it, not against.

> > properly isolated from its parent PCI device, then both the mdev
> > virtual device *and* the physical PCI device are in the same group.
> > Groups including devices of different types and on different buses
> > were considered from the start, and are precedented, if rare.
> 
> This is far too theoretical for me. A security broken mdev is
> functionally useless.

Is it, though?  Again, I'm talking about the case of multiple mdevs
with a single parent device (because that's the only case I was aware
of until recently).  Isolation comes from a device-internal
IOMMU... that turns out to be broken.  But if your security domain
happens to include all the mdevs on the device anyway, then you don't
care.

Are you really going to say people can't use their fancy hardware in
this mode because it has a security flaw that's not relevant to their
usecase?


And then.. there's Kirti's case.  In that case the mdev should belong
to its parent PCI device's group since that's what's providing
isolation.  But in that case the parent device can be in a
multi-device group for any of the usual reasons (PCIe-to-PCI bridge,
PCIe switch with broken ACS, multifunction device with crosstalk).
Which means the mdev also shares a group with those other device.  So
again, the group structure matters and is not a fiction.

> We don't need to support it, and we don't need complicated software to
> model it.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 04:06:20PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote:
> 
> > > > If the physical device had a bug which meant the mdevs *weren't*
> > > > properly isolated from each other, then those mdevs would share a
> > > > group, and you *would* care about it.  Depending on how the isolation
> > > > failed the mdevs might or might not also share a group with the parent
> > > > physical device.
> > > 
> > > That isn't a real scenario.. mdevs that can't be isolated just
> > > wouldn't be useful to exist
> > 
> > Really?  So what do you do when you discover some mdevs you thought
> > were isolated actually aren't due to a hardware bug?  Drop support
> > from the driver entirely?  In which case what do you say to the people
> > who understandably complain "but... we had all the mdevs in one guest
> > anyway, we don't care if they're not isolated"?
> 
> I've never said to eliminate groups entirely. 
> 
> What I'm saying is that all the cases we have for mdev today do not
> require groups, but are forced to create a fake group anyhow just to
> satisfy the odd VFIO requirement to have a group FD.
> 
> If some future mdev needs groups then sure, add the appropriate group
> stuff.
> 
> But that doesn't effect the decision to have a VFIO group FD, or not.
> 
> > > > It ensures that they're parked at the moment the group moves from
> > > > kernel to userspace ownership, but it can't prevent dpdk from
> > > > accessing and unparking those devices via peer to peer DMA.
> > > 
> > > Right, and adding all this group stuff did nothing to alert the poor
> > > admin that is running DPDK to this risk.
> > 
> > Didn't it?  Seems to me that in order to give the group to DPDK, the
> > admin had to find and unbind all the things in it... and is therefore
> > aware that they're giving everything in it to DPDK.
> 
> Again, I've never said the *group* should be removed. I'm only
> concerned about the *group FD*

Ok, that wasn't really clear to me.

I still wouldn't say the group for mdevs is a fiction, though.. rather
that the group used for mdevs (the no-internal-IOMMU case) is just
plain wrong.

> When the admin found and unbound they didn't use the *group FD* in any
> way.

No, they are likely to have changed permissions on the group device
node as part of the process, though.

> > > You put the same security labels you'd put on the group to the devices
> > > that consitute the group. It is only more tricky in the sense that the
> > > script that would have to do this will need to do more than ID the
> > > group to label but also ID the device members of the group and label
> > > their char nodes.
> > 
> > Well, I guess, if you take the view that root is allowed to break the
> > kernel.  I tend to prefer that although root can obviously break the
> > kernel if they intend to, we should make it hard to do by accident -
> > which in this case would mean the kernel *enforcing* that the devices
> > in the group have the same security labels, which I can't really see
> > how to do without an exposed group.
> 
> How is this "break the kernel"? It has nothing to do with the
> kernel. Security labels are a user space concern.

*thinks*... yeah, ok, that was much too strong an assertion.  What I
was thinking of is that guarantees you'd normally expect the kernel to
enforce can be obviated by bad configuration: chown-ing a device to
root doesn't actually protect it if there's another device in the same
group exposed to other users.

But I guess you could say the same about, say, an unauthenticated nbd
export of a root-owned block device, so I guess that's not something
the kernel can reasonably enforce.


Ok.. you might be finally convincing me, somewhat.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 11:55:00PM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/27/2021 10:30 AM, David Gibson wrote:
> > On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> > > 
> > > 
> > > On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > 
> > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created 
> > > > > per
> > > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > > protection scope as VF associated to it.
> > > > 
> > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > 
> > > > Only the VF's real IOMMU group should be used to model an iommu domain
> > > > linked to a VF. Injecting fake groups that are proxies for real groups
> > > > only opens the possibility of security problems like David is
> > > > concerned with.
> > > > 
> > > 
> > > I think this security issue should be addressed by letting mdev device
> > > inherit its parent's iommu_group, i.e. VF's iommu_group here.
> > 
> > No, that doesn't work.  AIUI part of the whole point of mdevs is to
> > allow chunks of a single PCI function to be handed out to different
> > places, because they're isolated from each other not by the system
> > IOMMU, but by a combination of MMU hardware in the hardware (e.g. in a
> > GPU card) and software in the mdev driver.
> 
> That's correct for non-iommu backed mdev devices.
> 
> > If mdevs inherited the
> > group of their parent device they wouldn't count as isolated from each
> > other, which they should.
> > 
> 
> For iommu backed mdev devices for SRIOV, where there can be single mdev
> device for its parent, here parent device is VF, there can't be multiple
> mdev devices associated with that VF. In this case mdev can inherit the
> group of parent device.

Ah, yes, if there's just one mdev for the PCI function, and the
function doesn't have an internal memory protection unit then this
makes sense.

Which means we *do* have at least two meaningfully different group
configurations for mdev:
  * mdev is in a singleton group independent of the parent PCI device
  * mdev shares a group with its parent PCI device

Which means even in the case of mdevs, the group structure is *not* a
meaningless fiction.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Tue, Jun 01, 2021 at 02:56:43PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:38:00AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > > 
> > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired 
> > > > w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use 
> > > > cases:
> > > >
> > > > ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a 
> > > > non-virtualization
> > > > scenario.
> > > 
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> > 
> > Jason, want to confirm here. Per earlier discussion we remain an
> > impression that you want VFIO to be a pure device driver thus
> > container/group are used only for legacy application.
> 
> Let me call this a "nice wish".
> 
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
> 
> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

I don't think presence or absence of a group fd makes a lot of
difference to this design.  Having a group fd just means we attach
groups to the ioasid instead of individual devices, and we no longer
need the bookkeeping of "partial" devices.

> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
> 
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.

Right.  I'd assume that for compatibility, creating a container would
create a single IOASID under the hood with a compatibility layer
translating the container operations to ioasid operations.
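
As a very rough sketch of that shim - the /dev/ioasid map ioctl and
argument layout below are placeholders, since that interface isn't
settled:

    #include <sys/ioctl.h>

    /* Placeholder request code and argument layout: stand-ins for
     * whatever the final /dev/ioasid map ioctl ends up being. */
    #define IOASID_MAP_DMA_PLACEHOLDER      _IO('i', 0x40)

    struct ioasid_map_args {            /* illustrative only */
        int ioasid;
        unsigned long long iova, vaddr, size;
    };

    struct legacy_container {
        int ioasid_fd;                  /* opened at container creation */
        int ioasid;                     /* the single IOASID behind it */
    };

    /* A legacy type1 map on the container translates 1:1 into a map on
     * the underlying IOASID. */
    static int legacy_container_map(struct legacy_container *c,
                                    unsigned long long iova,
                                    unsigned long long vaddr,
                                    unsigned long long size)
    {
        struct ioasid_map_args args = { c->ioasid, iova, vaddr, size };

        return ioctl(c->ioasid_fd, IOASID_MAP_DMA_PLACEHOLDER, &args);
    }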

> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

Again, I don't think it makes much difference.  The model doesn't
really change even if you allow both ATTACH_GROUP and ATTACH_DEVICE on
the IOASID.  Basically ATTACH_GROUP would just be equivalent to
attaching all the constituent devices.
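
i.e. something like the sketch below; the ioctl name is a placeholder,
the point is only the equivalence:

    #include <sys/ioctl.h>

    #define VFIO_ATTACH_IOASID_PLACEHOLDER  _IO('v', 0x7f)  /* stand-in */

    /* ATTACH_GROUP expressed as attaching every constituent device: the
     * model is the same either way, the group form just batches it. */
    static int attach_group_to_ioasid(int ioasid, const int *device_fds, int n)
    {
        int i, ret;

        for (i = 0; i < n; i++) {
            ret = ioctl(device_fds[i], VFIO_ATTACH_IOASID_PLACEHOLDER,
                        &ioasid);
            if (ret)
                return ret;     /* a real version would unwind */
        }
        return 0;
    }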

> I didn't try to chart this out carefully.
> 
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface
> so it would be quite reasonable to harmonize these somehow.
> 
> But that is too complicated and far out for me at least to guess on at
> this point..
> 
> > > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > > there any scenario where we want different vpasid's for the same
> > > IOASID? I guess it is OK like this. Hum.
> > 
> > Yes, it's completely sane that the guest links a I/O page table to 
> > different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> > that when multiple devices share an I/O page table they must use
> > the same PASID#. 
> 
> Ok..
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
l to 
> > complete the pending fault;
> > 
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved 
> > in 
> > ioasid_data->fault_data, and then calls iommu api to complete it with
> > {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > 
> > 
> > PASID table is put in the GPA space on some platform, thus must be updated
> > by the guest. It is treated as another user page table to be bound with the 
> > IOMMU.
> > 
> > As explained earlier, the user still needs to explicitly bind every user 
> > I/O 
> > page table to the kernel so the same pgtable binding protocol (bind, cache 
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which, once 
> > enabled, requires the guest to invalidate PASID cache for any change on the 
> > PASID table. This allows Qemu to track the lifespan of guest I/O page 
> > tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> > 
> > /* After boots */
> > /* Make vPASID space nested on GPA space */
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> > 
> > /* Attach dev1 to pasidtbl_ioasid */
> > at_data = { .ioasid = pasidtbl_ioasid};
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 
> > /* Bind PASID table */
> > bind_data = {
> > .ioasid = pasidtbl_ioasid;
> > .addr   = gpa_pasid_table;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> > 
> > /* vIOMMU detects a new GVA I/O space created */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> > 
> > /* Attach dev1 to the new address space, with gpasid1 */
> > at_data = {
> > .ioasid = gva_ioasid;
> > .flag   = IOASID_ATTACH_USER_PASID;
> > .user_pasid = gpasid1;
> > };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > 
> > /* Bind guest I/O page table. Because SET_PASID_TABLE has been
> >   * used, the kernel will not update the PASID table. Instead, just
> >   * track the bound I/O page table for handling invalidation and
> >   * I/O page faults.
> >   */
> > bind_data = {
> > .ioasid = gva_ioasid;
> > .addr   = gva_pgtable1;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I still don't quite get the benifit from doing this.
> 
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
> 
> Cache invalidate seems easy enough to support
> 
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
> 
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
 most of these attributes of the IOASID after
> attaching to a device.

Yes... but as above, we have no idea what the IOMMU's capabilities are
until devices are attached.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
> 
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
> 
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.
> 
> device bind should fail if the device somehow isn't compatible with
> the scheme the user is tring to use.

[snip]
> > 2.2. /dev/vfio uAPI
> > 
> 
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?
> 
> > /*
> >* Bind a vfio_device to the specified IOASID fd
> >*
> >* Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >* vfio device should not be bound to multiple ioasid_fd's.
> >*
> >* Input parameters:
> >*  - ioasid_fd;
> >*
> >* Return: 0 on success, -errno on failure.
> >*/
> > #define VFIO_BIND_IOASID_FD   _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

The group number could be used for that, even if there are no group
fds.  You generally can't identify things more narrowly than group
anyway.


-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Fri, May 28, 2021 at 02:35:38PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
[snip]
> > With above design /dev/ioasid uAPI is all about I/O address spaces. 
> > It doesn't include any device routing information, which is only 
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). 
> 
> I agree with Jean-Philippe - at the very least erasing this
> information needs a major rational - but I don't really see why it
> must be erased? The HW reports the originating device, is it just a
> matter of labeling the devices attached to the /dev/ioasid FD so it
> can be reported to userspace?

HW reports the originating device as far as it knows.  In many cases
where you have multiple devices in an IOMMU group, it's because
although they're treated as separate devices at the kernel level, they
have the same RID at the HW level.  Which means a RID for something in
the right group is the closest you can count on supplying.

[snip]
> > However this way significantly 
> > violates the philosophy in this /dev/ioasid proposal. It is not one IOASID 
> > one address space any more. Device routing information (indirectly 
> > marking hidden I/O spaces) has to be carried in iotlb invalidation and 
> > page faulting uAPI to help connect vIOMMU with the underlying 
> > pIOMMU. This is one design choice to be confirmed with ARM guys.
> 
> I'm confused by this rational.
> 
> For a vIOMMU that has IO page tables in the guest the basic
> choices are:
>  - Do we have a hypervisor trap to bind the page table or not? (RID
>and PASID may differ here)
>  - Do we have a hypervisor trap to invaliate the page tables or not?
> 
> If the first is a hypervisor trap then I agree it makes sense to create a
> child IOASID that points to each guest page table and manage it
> directly. This should not require walking guest page tables as it is
> really just informing the HW where the page table lives. HW will walk
> them.
> 
> If there are no hypervisor traps (does this exist?) then there is no
> way to involve the hypervisor here and the child IOASID should simply
> be a pointer to the guest's data structure that describes binding. In
> this case that IOASID should claim all PASIDs when bound to a
> RID. 

And in that case I think we should call that object something other
than an IOASID, since it represents multiple address spaces.

> Invalidation should be passed up the to the IOMMU driver in terms of
> the guest tables information and either the HW or software has to walk
> to guest tables to make sense of it.
> 
> Events from the IOMMU to userspace should be tagged with the attached
> device label and the PASID/substream ID. This means there is no issue
> to have a a 'all PASID' IOASID.
> 
> > Notes:
> > -   It might be confusing as IOASID is also used in the kernel (drivers/
> > iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> > find a better name later to differentiate.
> 
> +1 on Jean-Philippe's remarks
> 
> > -   PPC has not be considered yet as we haven't got time to fully understand
> > its semantics. According to previous discussion there is some 
> > generality 
> > between PPC window-based scheme and VFIO type1 semantics. Let's 
> > first make consensus on this proposal and then further discuss how to 
> > extend it to cover PPC's requirement.
> 
> From what I understood PPC is not so bad, Nesting IOASID's did its
> preload feature and it needed a way to specify/query the IOVA range a
> IOASID will cover.
> 
> > -   There is a protocol between vfio group and kvm. Needs to think about
> > how it will be affected following this proposal.
> 
> Ugh, I always stop looking when I reach that boundary. Can anyone
> summarize what is going on there?
> 
> Most likely passing the /dev/ioasid into KVM's FD (or vicevera) is the
> right answer. Eg if ARM needs to get the VMID from KVM and set it to
> ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
> reasonable. Certainly better than the symbol get sutff we have right
> now.
> 
> I will read through the detail below in another email
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic 
> to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
>   
> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 
> 
> It ends up being a long write-up due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the writeup.  I'm giving this a first pass review, note
that I haven't read all the existing replies in detail yet.

> 
> TOC
> 
> 1. Terminologies and Concepts
> 2. uAPI Proposal
> 2.1. /dev/ioasid uAPI
> 2.2. /dev/vfio uAPI
> 2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
> 5.1. A simple example
> 5.2. Multiple IOASIDs (no nesting)
> 5.3. IOASID nesting (software)
> 5.4. IOASID nesting (hardware)
> 5.5. Guest SVA (vSVA)
> 5.6. I/O page fault
> 5.7. BIND_PASID_TABLE
> 
> 
> 1. Terminologies and Concepts
> -
> 
> IOASID FD is the container holding multiple I/O address spaces. User 
> manages those address spaces through FD operations. Multiple FD's are 
> allowed per process, but with this proposal one FD should be sufficient for 
> all intended usages.
> 
> IOASID is the FD-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).

Is there a compelling reason to have all the IOASIDs handled by one
FD?  Simply on the grounds that handles to kernel internal objects are
usually fds, having an fd per ioasid seems like an obvious alternative.
In that case plain open() would replace IOASID_ALLOC.  Nesting could be
handled either by 1) having a CREATE_NESTED on the parent fd which
spawns a new fd or 2) opening /dev/ioasid again for a new fd and doing
a SET_PARENT before doing anything else.

I may be bikeshedding here..
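
To make that concrete, the two variants might look something like this
from userspace (a sketch only; IOASID_CREATE_NESTED and IOASID_SET_PARENT
are invented names, nothing like them exists in the proposal yet):

    /* Variant 1: a CREATE_NESTED on the parent fd spawns a child fd */
    parent_fd = open("/dev/ioasid", O_RDWR);
    child_fd = ioctl(parent_fd, IOASID_CREATE_NESTED);

    /* Variant 2: open /dev/ioasid again and link it before anything else */
    parent_fd = open("/dev/ioasid", O_RDWR);
    child_fd = open("/dev/ioasid", O_RDWR);
    ioctl(child_fd, IOASID_SET_PARENT, parent_fd);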

> I/O address space can be managed through two protocols, according to 
> whether the corresponding I/O page table is constructed by the kernel or 
> the user. When kernel-managed, a dma mapping protocol (similar to 
> existing VFIO iommu type1) is provided for the user to explicitly specify 
> how the I/O address space is mapped. Otherwise, a different protocol is 
> provided for the user to bind a user-managed I/O page table to the 
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> handling. 
> 
> Pgtable binding protocol can be used only on the child IOASID's, implying 
> IOASID nesting must be enabled. This is because the kernel doesn't trust 
> userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> through the parent IOASID.

To clarify, I'm guessing that's a restriction of likely practice,
rather than a fundamental API restriction.  I can see a couple of
theoretical future cases where a user-managed pagetable for a "base"
IOASID would be feasible:

  1) On some fancy future MMU allowing free nesting, where the kernel
 would insert an implicit extra layer translating user addresses
 to physical addresses, and the userspace manages a pagetable with
 its own VAs being the target AS
  2) For a purely software virtual device, where its virtual DMA
     engine can interpret user addresses fine

> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both child/parent page tables operated through 
> the dma mapping protocol, so any change in either level can be captured 
> by the kernel to update the corresponding shadow mapping.

As Jason also said, I don't think you need to restrict software
nesting to only kernel managed L2 tables - you already need hooks for
cache invalidation, and you can use those to trigger reshadows.
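
In other words, a user-managed L2 under software nesting could follow
the same flow as the hardware-nested case, as long as every guest page
table update is followed by an invalidation the kernel can use to
reshadow.  Sketch only - the invalidation ioctl and struct fields are
invented, IOASID_BIND_PGTABLE is from the proposal:

    /* bind the user/guest owned page table as a nested child */
    bind_data = {
            .ioasid = child_ioasid,
            .addr   = user_pgtable,
            // and format information
    };
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

    /* after any PTE change, report the affected IOVA range so the
     * kernel can rewalk it and update the shadow page table */
    inv_data = {
            .ioasid = child_ioasid,
            .iova   = changed_iova,
            .size   = changed_size,
    };
    ioctl(ioasid_fd, IOASID_INVALIDATE, &inv_data);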

> An I/O address space takes effect in the IOMMU only after it is attached 
> to a device. The device in the /dev/i

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> 
> > I don't think presence or absence of a group fd makes a lot of
> > difference to this design.  Having a group fd just means we attach
> > groups to the ioasid instead of individual devices, and we no longer
> > need the bookkeeping of "partial" devices.
> 
> Oh, I think we really don't want to attach the group to an ioasid, or
> at least not as a first-class idea.
> 
> The fundamental problem that got us here is we now live in a world
> where there are many ways to attach a device to an IOASID:

I'm not seeing that that's necessarily a problem.

>  - A RID binding
>  - A RID,PASID binding
>  - A RID,PASID binding for ENQCMD

I have to admit I haven't fully grasped the differences between these
modes.  I'm hoping we can consolidate at least some of them into the
same sort of binding onto different IOASIDs (which may be linked in
parent/child relationships).

>  - A SW TABLE binding
>  - etc
> 
> The selection of which mode to use is based on the specific
> driver/device operation. Ie the thing that implements the 'struct
> vfio_device' is the thing that has to select the binding mode.

I thought userspace selected the binding mode - although not all modes
will be possible for all devices.

> group attachment was fine when there was only one mode. As you say it
> is fine to just attach every group member with RID binding if RID
> binding is the only option.
> 
> When SW TABLE binding was added the group code was hacked up - now the
> group logic is choosing between RID/SW TABLE in a very hacky and mdev
> specific way, and this is just a mess.

Sounds like it.  What do you propose instead to handle backwards
compatibility for group-based VFIO code?

> The flow must carry the IOASID from the /dev/iommu to the vfio_device
> driver and the vfio_device implementation must choose which binding
> mode and parameters it wants based on driver and HW configuration.
> 
> eg if two PCI devices are in a group then it is perfectly fine that
> one device uses RID binding and the other device uses RID,PASID
> binding.

Um... I don't see how that can be.  They could well be in the same
group because their RIDs cannot be distinguished from each other.

> The only place I see for a "group bind" in the uAPI is some compat
> layer for the vfio container, and the implementation would be quite
> different, we'd have to call each vfio_device driver in the group and
> execute the IOASID attach IOCTL.
> 
> > > I would say no on the container. /dev/ioasid == the container, having
> > > two competing objects at once in a single process is just a mess.
> > 
> > Right.  I'd assume that for compatibility, creating a container would
> > create a single IOASID under the hood with a compatibility layer
> > translating the container operations to ioasid operations.
> 
> It is a nice dream for sure
> 
> /dev/vfio could be a special case of /dev/ioasid just with a different
> uapi and ending up with only one IOASID. They could be interchangeable
> from then on, which would simplify the internals of VFIO if it
> consistently dealt with these new ioasid objects everywhere. But last I
> looked it was complicated enough to best be done later on
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Thu, Jun 03, 2021 at 01:29:58AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 12:09 AM
> > 
> > On Wed, Jun 02, 2021 at 01:33:22AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, June 2, 2021 1:42 AM
> > > >
> > > > On Tue, Jun 01, 2021 at 08:10:14AM +, Tian, Kevin wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > > > >
> > > > > > On Thu, May 27, 2021 at 07:58:12AM +, Tian, Kevin wrote:
> > > > > >
> > > > > > > IOASID nesting can be implemented in two ways: hardware nesting
> > and
> > > > > > > software nesting. With hardware support the child and parent I/O
> > page
> > > > > > > tables are walked consecutively by the IOMMU to form a nested
> > > > translation.
> > > > > > > When it's implemented in software, the ioasid driver is 
> > > > > > > responsible
> > for
> > > > > > > merging the two-level mappings into a single-level shadow I/O page
> > > > table.
> > > > > > > Software nesting requires both child/parent page tables operated
> > > > through
> > > > > > > the dma mapping protocol, so any change in either level can be
> > > > captured
> > > > > > > by the kernel to update the corresponding shadow mapping.
> > > > > >
> > > > > > Why? A SW emulation could do this synchronization during
> > invalidation
> > > > > > processing if invalidation contained an IOVA range.
> > > > >
> > > > > In this proposal we differentiate between host-managed and user-
> > > > > managed I/O page tables. If host-managed, the user is expected to use
> > > > > map/unmap cmd explicitly upon any change required on the page table.
> > > > > If user-managed, the user first binds its page table to the IOMMU and
> > > > > then use invalidation cmd to flush iotlb when necessary (e.g. 
> > > > > typically
> > > > > not required when changing a PTE from non-present to present).
> > > > >
> > > > > We expect user to use map+unmap and bind+invalidate respectively
> > > > > instead of mixing them together. Following this policy, map+unmap
> > > > > must be used in both levels for software nesting, so changes in either
> > > > > level are captured timely to synchronize the shadow mapping.
> > > >
> > > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > > it is created. If you put two different types in a tree then each IOASID
> > > > must continue to use its own operation mode.
> > > >
> > > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> > >
> > > only for software nesting. With hardware support the parent uses map
> > > while the child uses bind.
> > >
> > > Yes, the policy is specified per IOASID. But if the policy violates the
> > > requirement in a specific nesting mode, then nesting should fail.
> > 
> > I don't get it.
> > 
> > If the IOASID is a page table then it is bind/invalidate. SW or not SW
> > doesn't matter at all.
> > 
> > > >
> > > > A software emulated two level page table where the leaf level is a
> > > > bound page table in guest memory should continue to use
> > > > bind/invalidate to maintain the guest page table IOASID even though it
> > > > is a SW construct.
> > >
> > > with software nesting the leaf should be a host-managed page table
> > > (or metadata). A bind/invalidate protocol doesn't require the user
> > > to notify the kernel of every page table change.
> > 
> > The purpose of invalidate is to inform the implementation that the
> > page table has changed so it can flush the caches. If the page table
> > is changed and invalidation is not issued then then the implementation
> > is free to ignore the changes.
> > 
> > In this way the SW mode is the same as a HW mode with an infinite
> > cache.
> > 
> > The collaposed shadow page table is really just a cache.
> > 
> 
> OK. One additional thing is that we may need a 'caching_mode"
> thing reported by /dev/ioasid, indicating whether invalidation is

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Thu, Jun 03, 2021 at 02:49:56AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, June 3, 2021 12:59 AM
> > 
> > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > >   /* Bind guest I/O page table  */
> > > > >   bind_data = {
> > > > >   .ioasid = gva_ioasid;
> > > > >   .addr   = gva_pgtable1;
> > > > >   // and format information
> > > > >   };
> > > > >   ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > >
> > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > there any reason to split these things? The only advantage to the
> > > > split is the device is known, but the device shouldn't impact
> > > > anything..
> > >
> > > I'm pretty sure the device(s) could matter, although they probably
> > > won't usually.
> > 
> > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > devices first. This prevents wildly incompatible devices from being
> > joined together, and allows some "get info" to report the capability
> > union of all devices if we want to do that.
> 
> I would expect the capability reported per-device via /dev/iommu. 
> Incompatible devices can bind to the same fd but cannot attach to
> the same IOASID. This allows incompatible devices to share locked
> page accounting.

Yeah... I'm not convinced that everything relevant here can be
reported per-device.  I think we may have edge cases where
combinations of devices have restrictions that individual devices in
the set do not.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > /* Bind guest I/O page table  */
> > > > bind_data = {
> > > > .ioasid = gva_ioasid;
> > > > .addr   = gva_pgtable1;
> > > > // and format information
> > > > };
> > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > 
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> > 
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually. 
> 
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

Right.. but I've not been convinced that having a /dev/iommu fd
instance be the boundary for these types of things actually makes
sense.  For example if we were doing the preregistration thing
(whether by child ASes or otherwise) then that still makes sense
across wildly different devices, but we couldn't share that layer if
we have to open different instances for each of them.

It really seems to me that it's at the granularity of the address
space (including extended RID+PASID ASes) that we need to know what
devices we have, and therefore what capabilities we have for that AS.

> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.
> 
> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

Right, but at this stage I'm just not seeing a really clear (across
platforms and device types) boundary for what things have to be per
IOASID container and what have to be per IOASID, so I'm just not sure
the /dev/iommu instance grouping makes any sense.

> I don't know if that small advantage is worth the extra complexity
> though.
> 
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats.  Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept.  And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
> 
> The basic process for a user space owned pgtable mode would be:
> 
>  1) qemu has to figure out what format of pgtable to use
> 
> Presumably it uses query functions using the device label.

No... in the qemu case it would always select the page table format
that it needs to present to the guest.  That's part of the
guest-visible platform that's selected by qemu's configuration.

There's no negotiation here: either the kernel can supply what qemu
needs to pass to the guest, or it can't.  If it can't, qemu will have
to either emulate in SW (if possible, probably using a kernel-managed
IOASID to back it) or fail outright.

> The
> kernel code should look at the entire device path through all the
> IOMMU HW to determine what is possible.
> 
> Or it already knows because the VM's vIOMMU is running in some
> fixed page table format, or the VM's vIOMMU already told it, or
> something.

Again, I think you have the order a bit backwards.  The user selects
the capabilities that the vIOMMU will present to the guest as part of
the qemu configuration.  Qemu then requests that of the host kernel,
and either the host kernel supplies it, qemu emulates it in SW, or
qemu fails to start.

Guest visible properties of the platform never (or *should* never)
depend implicitly on host capabilities - it's impossible to sanely
support migration in such an environment.

>  2) qemu creates an IOASID and based on #1 and says 'I want this format'

Right.

>  3) qemu binds the IOASID to the device. 
> 
> If qmeu gets it wrong then it just fails.

Right, though it may fall back to (partial) software emulation.

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 01:16:48PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote:
> > > I agree with Jean-Philippe - at the very least erasing this
> > > information needs a major rational - but I don't really see why it
> > > must be erased? The HW reports the originating device, is it just a
> > > matter of labeling the devices attached to the /dev/ioasid FD so it
> > > can be reported to userspace?
> > 
> > HW reports the originating device as far as it knows.  In many cases
> > where you have multiple devices in an IOMMU group, it's because
> > although they're treated as separate devices at the kernel level, they
> > have the same RID at the HW level.  Which means a RID for something in
> > the right group is the closest you can count on supplying.
> 
> Granted there may be cases where exact fidelity is not possible, but
> that doesn't excuse eliminating fidelity where it does exist..
> 
> > > If there are no hypervisor traps (does this exist?) then there is no
> > > way to involve the hypervisor here and the child IOASID should simply
> > > be a pointer to the guest's data structure that describes binding. In
> > > this case that IOASID should claim all PASIDs when bound to a
> > > RID. 
> > 
> > And in that case I think we should call that object something other
> > than an IOASID, since it represents multiple address spaces.
> 
> Maybe.. It is certainly a special case.
> 
> We can still consider it a single "address space" from the IOMMU
> perspective. What has happened is that the address table is not just a
> 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

True.  This does complexify how we represent what IOVA ranges are
valid, though.  I'll bet you most implementations don't actually
implement a full 64-bit IOVA, which means we effectively have a large
number of windows from (0..max IOVA) for each valid pasid.  This adds
another reason I don't think my concept of IOVA windows is just a
power specific thing.

> If we are already going in the direction of having the IOASID specify
> the page table format and other details, specifying that the page
> tabnle format is the 80 bit "PASID, IOVA" format is a fairly small
> step.

Well, rather I think userspace needs to request what page table format
it wants and the kernel tells it whether it can oblige or not.

> I wouldn't twist things into knots to create a difference, but if it
> is easy to do it wouldn't hurt either.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Wed, Jun 02, 2021 at 02:19:30PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:
> 
> > Is there a compelling reason to have all the IOASIDs handled by one
> > FD?
> 
> There was an answer on this, if every PASID needs an IOASID then there
> are too many FDs.

Too many in what regard?  fd limits?  Something else?

It seems to be there are two different cases for PASID handling here.
One is where userspace explicitly creates each valid PASID and
attaches a separate pagetable for each (or handles each with
MAP/UNMAP).  In that case I wouldn't have expected there to be too
many fds.

Then there's the case where we register a whole PASID table, in which
case I think you only need the one FD.  We can treat that as creating
an 84-bit IOAS, whose pagetable format is (PASID table + a bunch of
pagetables for each PASID).
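
Roughly, I'd picture the whole-table case as a single bind whose
"pagetable format" is the PASID table itself.  Sketch only - the
proposal's TOC does list a BIND_PASID_TABLE operation, but the names
and fields here are invented:

    pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    bind_data = {
            .ioasid = pasidtbl_ioasid,
            .addr   = guest_pasid_table,    /* guest address of the PASID table */
            // format: "PASID table + per-PASID I/O page tables"
    };
    ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);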

> It is difficult to share the get_user_pages cache across FDs.

Ah... hrm, yes I can see that.

> There are global properties in the /dev/iommu FD, like what devices
> are part of it, that are important for group security operations. This
> becomes confused if it is split to many FDs.

I'm still not seeing those.  I'm really not seeing any well-defined
meaning to devices being attached to the fd, but not to a particular
IOAS.

> > > I/O address space can be managed through two protocols, according to 
> > > whether the corresponding I/O page table is constructed by the kernel or 
> > > the user. When kernel-managed, a dma mapping protocol (similar to 
> > > existing VFIO iommu type1) is provided for the user to explicitly specify 
> > > how the I/O address space is mapped. Otherwise, a different protocol is 
> > > provided for the user to bind a user-managed I/O page table to the 
> > > IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
> > > handling. 
> > > 
> > > Pgtable binding protocol can be used only on the child IOASID's, implying 
> > > IOASID nesting must be enabled. This is because the kernel doesn't trust 
> > > userspace. Nesting allows the kernel to enforce its DMA isolation policy 
> > > through the parent IOASID.
> > 
> > To clarify, I'm guessing that's a restriction of likely practice,
> > rather than a fundamental API restriction.  I can see a couple of
> > theoretical future cases where a user-managed pagetable for a "base"
> > IOASID would be feasible:
> > 
> >   1) On some fancy future MMU allowing free nesting, where the kernel
> >  would insert an implicit extra layer translating user addresses
> >  to physical addresses, and the userspace manages a pagetable with
> >  its own VAs being the target AS
> 
> I would model this by having a "SVA" parent IOASID. A "SVA" IOASID is one
> where the IOVA == process VA and the kernel maintains this mapping.

That makes sense.  It needs a different name to avoid Intel and PCI
specificity, but having a trivial "pagetable format" which just says
IOVA == user address is a nice idea.
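
Something like this is what I have in mind, purely as a sketch (the
format constant and the idea of passing it at alloc time are invented
for illustration):

    /* an IOASID whose only "format" rule is IOVA == process VA, so the
     * kernel keeps it in sync with the CPU page table */
    alloc_data = {
            .format = IOASID_FORMAT_PROCESS_VA,   /* invented name */
    };
    sva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);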

> Since the uAPI is so general I do have a general expectation that the
> drivers/iommu implementations might need to be a bit more complicated,
> like if the HW can optimize certain specific graphs of IOASIDs we
> would still model them as graphs and the HW driver would have to
> "compile" the graph into the optimal hardware.
> 
> This approach has worked reasonably well in other kernel areas.

That seems sensible.

> >   2) For a purely software virtual device, where its virtual DMA
> >  engine can interpret user addresses fine
> 
> This also sounds like an SVA IOASID.

Ok.

> Depending on HW if a device can really only bind to a very narrow kind
> of IOASID then it should ask for that (probably platform specific!)
> type during its attachment request to drivers/iommu.
> 
> eg "I am special hardware and only know how to do PLATFORM_BLAH
> transactions, give me an IOASID compatible with that". If the only way
> to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
> hardwired to the CPU ASID  then that is just how it is.

Fair enough.

> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations.  e.g.
> > 
> > 'prereg' IOAS
> > |
> > \- 'rid' IOAS
> >|
> >\- 'pasid' IOAS (maybe)
> > 
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA).  qemu's vIOMMU driver would then mirror the gues

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-02 Thread David Gibson
On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> Hi Jason,
> 
> On 2021/5/29 7:36, Jason Gunthorpe wrote:
> > > /*
> > >* Bind a user-managed I/O page table with the IOMMU
> > >*
> > >* Because user page table is untrusted, IOASID nesting must be enabled
> > >* for this ioasid so the kernel can enforce its DMA isolation policy
> > >* through the parent ioasid.
> > >*
> > >* Pgtable binding protocol is different from DMA mapping. The latter
> > >* has the I/O page table constructed by the kernel and updated
> > >* according to user MAP/UNMAP commands. With pgtable binding the
> > >* whole page table is created and updated by userspace, thus different
> > >* set of commands are required (bind, iotlb invalidation, page fault, 
> > > etc.).
> > >*
> > >* Because the page table is directly walked by the IOMMU, the user
> > >* must  use a format compatible to the underlying hardware. It can
> > >* check the format information through IOASID_GET_INFO.
> > >*
> > >* The page table is bound to the IOMMU according to the routing
> > >* information of each attached device under the specified IOASID. The
> > >* routing information (RID and optional PASID) is registered when a
> > >* device is attached to this IOASID through VFIO uAPI.
> > >*
> > >* Input parameters:
> > >*  - child_ioasid;
> > >*  - address of the user page table;
> > >*  - formats (vendor, address_width, etc.);
> > >*
> > >* Return: 0 on success, -errno on failure.
> > >*/
> > > #define IOASID_BIND_PGTABLE   _IO(IOASID_TYPE, IOASID_BASE + 
> > > 9)
> > > #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
> > Also feels backwards, why wouldn't we specify this, and the required
> > page table format, during alloc time?
> > 
> 
> Thinking of the required page table format, perhaps we should shed more
> light on the page table of an IOASID. So far, an IOASID might represent
> one of the following page tables (might be more):
> 
>  1) an IOMMU format page table (a.k.a. iommu_domain)
>  2) a user application CPU page table (SVA for example)
>  3) a KVM EPT (future option)
>  4) a VM guest managed page table (nesting mode)
> 
> This version only covers 1) and 4). Do you think we need to support 2),

Isn't (2) the equivalent of using the host-managed pagetable and
then doing a giant MAP of all your user address space into it?  But
maybe we should identify that case explicitly in case the host can
optimize it.

> 3) and beyond? If so, it seems that we need some in-kernel helpers and
> uAPIs to support pre-installing a page table to IOASID. From this point
> of view an IOASID is actually not just a variant of iommu_domain, but an
> I/O page table representation in a broader sense.
> 
> Best regards,
> baolu
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
> 
> > > We can still consider it a single "address space" from the IOMMU
> > > perspective. What has happened is that the address table is not just a
> > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> > 
> > True.  This does complexify how we represent what IOVA ranges are
> > valid, though.  I'll bet you most implementations don't actually
> > implement a full 64-bit IOVA, which means we effectively have a large
> > number of windows from (0..max IOVA) for each valid pasid.  This adds
> > another reason I don't think my concept of IOVA windows is just a
> > power specific thing.
> 
> Yes
> 
> Things rapidly get into weird hardware specific stuff though, the
> request will be for things like:
>   "ARM PASID&IO page table format from SMMU IP block vXX"

So, I'm happy enough for the choice of a user-managed pagetable format to
imply the set of valid IOVA ranges (though a query might be nice).

I'm mostly thinking of representing (and/or choosing) valid IOVA
ranges as something for the kernel-managed pagetable style
(MAP/UNMAP).
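
For that style, the query could just be a window list hung off
IOASID_GET_INFO (which the proposal already has); the fields below are
invented to illustrate the idea:

    info = { .ioasid = my_ioasid, };
    ioctl(ioasid_fd, IOASID_GET_INFO, &info);

    /* hypothetical output: the IOVA windows MAP_DMA will accept */
    for (i = 0; i < info.nr_iova_ranges; i++)
            printf("window %u: 0x%llx..0x%llx\n", i,
                   info.iova_ranges[i].start, info.iova_ranges[i].last);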

> Which may have a bunch of (possibly very weird!) format specific data
> to describe and/or configure it.
> 
> The uAPI needs to be suitably general here. :(
> 
> > > If we are already going in the direction of having the IOASID specify
> > > the page table format and other details, specifying that the page
> > > tabnle format is the 80 bit "PASID, IOVA" format is a fairly small
> > > step.
> > 
> > Well, rather I think userspace needs to request what page table format
> > it wants and the kernel tells it whether it can oblige or not.
> 
> Yes, this is what I ment.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Fri, Jun 04, 2021 at 09:30:54AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult 
> wrote:
> > On 02.06.21 19:24, Jason Gunthorpe wrote:
> > 
> > Hi,
> > 
> > >> If I understand this correctly, /dev/ioasid is a kind of "common
> > supplier"
> > >> to other APIs / devices. Why can't the fd be acquired by the
> > >> consumer APIs (eg. kvm, vfio, etc) ?
> > >
> > > /dev/ioasid would be similar to /dev/vfio, and everything already
> > > deals with exposing /dev/vfio and /dev/vfio/N together
> > >
> > > I don't see it as a problem, just more work.
> > 
> > One of the problems I'm seeing is in container environments: when
> > passing in an vfio device, we now also need to pass in /dev/ioasid,
> > thus increasing the complexity in container setup (or orchestration).
> 
> Containers already needed to do this today. Container orchestration is
> hard.

Right, to use VFIO a container already needs both /dev/vfio and one or
more /dev/vfio/NNN group devices.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-07 Thread David Gibson
On Tue, Jun 01, 2021 at 09:57:12AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote:
> > On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > > 
> > > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created 
> > > > > > per
> > > > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > > > protection scope as VF associated to it. 
> > > > > 
> > > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > 
> > > > It's only fake if you start with a narrow view of what a group is. 
> > > 
> > > A group is connected to drivers/iommu. A group object without *any*
> > > relation to drivers/iommu is just a complete fiction, IMHO.
> > 
That might be where we differ.  As I've said, by "group" I primarily
mean the fundamental hardware unit of isolation.  *Usually* that's
> > determined by the capabilities of an IOMMU, but in some cases it might
> > not be.  In either case, the boundaries still matter.
> 
> As in my other email we absolutely need a group concept, it is just a
> question of how the user API is designed around it.
> 
> > > The group mdev implicitly creates is just a fake proxy that comes
> > > along with mdev API. It doesn't do anything and it doesn't mean
> > > anything.
> > 
> > But.. the case of multiple mdevs managed by a single PCI device with
> > an internal IOMMU also exists, and then the mdev groups are *not*
> > proxies but true groups independent of the parent device.  Which means
> > that the group structure of mdevs can vary, which is an argument *for*
> > keeping it, not against.
> 
> If VFIO becomes more "vfio_device" centric then the vfio_device itself
> has some properties. One of those can be "is it inside a drivers/iommu
> group, or not?".
> 
> If the vfio_device is not using a drivers/iommu IOMMU interface then
> it can just have no group at all - no reason to lie. This would mean
> that the device has perfect isolation.

When you say "not using a drivers/iommu IOMMU interface" do you
basically mean the device doesn't do DMA?  I can see some benefit to
that, but some drawbacks too.  The *main* form of isolation (or lack
thereof) that groups are about is the IOMMU, but groups can also represent
other forms of isolation failure: e.g. a multifunction device, where
function 0 has some debug registers which affect other functions.
That's relevant whether or not any of those functions use DMA.

Now, we could represent those different sorts of isolation separately,
but at the time our thinking was that we should group together devices
that can't be safely isolated for *any* reason, since the practical
upshot is the same: you can't safely split those devices between
different owners.

> What I don't like is forcing certain things depending on how the
> vfio_device was created - for instance forcing a IOMMU group as part
> and forcing an ugly "SW IOMMU" mode in the container only as part of
> mdev_device.

I don't really see how this depends on how the device is created.
The current VFIO model is that every device always belongs to a group
- but that group might be a singleton.  That seems less complicated to
me than having some devices with a group and others without.

> These should all be properties of the vfio_device itself.
> 
> Again this is all about the group fd - and how to fit in with the
> /dev/ioasid proposal from Kevin:
> 
> https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/
> 
> Focusing on vfio_device and skipping the group fd smooths out some
> rough edges.
> 
> Code wise we are not quite there, but I have mapped out eliminating
> the group from the vfio_device centric API and a few other places it
> has crept in.
> 
> The group can exist in the background to enforce security without
> being a cornerstone of the API design.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 06:49:20AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 3, 2021 1:09 PM
> [...]
> > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > cache.
> > > >
> > > > The collaposed shadow page table is really just a cache.
> > > >
> > >
> > > OK. One additional thing is that we may need a 'caching_mode"
> > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > required when changing non-present to present. For hardware
> > > nesting it's not reported as the hardware IOMMU will walk the
> > > guest page table in cases of iotlb miss. For software nesting
> > > caching_mode is reported so the user must issue invalidation
> > > upon any change in guest page table so the kernel can update
> > > the shadow page table timely.
> > 
> > For the fist cut, I'd have the API assume that invalidates are
> > *always* required.  Some bypass to avoid them in cases where they're
> > not needed can be an additional extension.
> > 
> 
> Isn't the typical TLB semantic that non-present entries are not
> cached, thus invalidation is not required when making non-present
> to present?

Usually, but not necessarily.

> It's true to both CPU TLB and IOMMU TLB.

I don't think it's entirely true of the CPU TLB on all ppc MMU models
(of which there are far too many).

> In reality
> I feel there are more usages built on hardware nesting than software
> nesting thus making default following hardware TLB behavior makes
> more sense...

I'm arguing for always-require-invalidate because it's strictly more
general.  Requiring the invalidate will support models that don't
require it in all cases; we just make the invalidate a no-op.  The
reverse is not true, so we should tackle the general case first, then
optimize.
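
Concretely, the userspace contract I'm suggesting is just this
(invalidation ioctl and struct names invented, as before):

    /* update the bound page table, including non-present -> present */
    set_pte(user_pgtable, iova, new_pte);

    /* always tell the kernel; implementations that don't need the
     * flush (e.g. hardware nesting) can treat it as a no-op */
    inv_data = { .ioasid = child_ioasid, .iova = iova, .size = pgsize, };
    ioctl(ioasid_fd, IOASID_INVALIDATE, &inv_data);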

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Tue, Jun 01, 2021 at 04:22:25PM -0600, Alex Williamson wrote:
> On Tue, 1 Jun 2021 07:01:57 +
> "Tian, Kevin"  wrote:
> > 
> > I summarized five opens here, about:
> > 
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> > 
> ...
> > 
> > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> 
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is
> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.
> 
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
> 
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.

To clarify, that's a guest-side logical vIOMMU page table which is
partially managed by KVM.  This is an optimization - things can work
without it, but it means guest iomap/unmap becomes a hot path because
each map/unmap hypercall has to go
guest -> KVM -> qemu -> VFIO

So there are multiple context transitions.

> > Because KVM deduces some policy based on the fact of assigned device, 
> > it needs to hold a reference to related vfio group. this part is irrelevant
> > to this RFC. 
> 
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.
> 
> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. Looks it also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> > group. I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
> 
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
> to know how to handle those for in-kernel acceleration.  Thanks,

For PAPR guests, which is the common case, yes.  Bare metal POWER
hosts have their own page table format.  And probably some of the
newer embedded ppc models have some different IOMMU model entirely,
but I'm not familiar with it.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 09:28:32AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:23:17PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> > > 
> > > > I don't think presence or absence of a group fd makes a lot of
> > > > difference to this design.  Having a group fd just means we attach
> > > > groups to the ioasid instead of individual devices, and we no longer
> > > > need the bookkeeping of "partial" devices.
> > > 
> > > Oh, I think we really don't want to attach the group to an ioasid, or
> > > at least not as a first-class idea.
> > > 
> > > The fundamental problem that got us here is we now live in a world
> > > where there are many ways to attach a device to an IOASID:
> > 
> > I'm not seeing that that's necessarily a problem.
> > 
> > >  - A RID binding
> > >  - A RID,PASID binding
> > >  - A RID,PASID binding for ENQCMD
> > 
> > I have to admit I haven't fully grasped the differences between these
> > modes.  I'm hoping we can consolidate at least some of them into the
> > same sort of binding onto different IOASIDs (which may be linked in
> > parent/child relationships).
> 
> What I would like is that the /dev/iommu side managing the IOASID
> doesn't really care much, but the device driver has to tell
> drivers/iommu what it is going to do when it attaches.

By the device driver, do you mean the userspace or guest device
driver?  Or do you mean the vfio_pci or mdev "shim" device driver?

> It makes sense, in PCI terms, only the driver knows what TLPs the
> device will generate. The IOMMU needs to know what TLPs it will
> receive to configure properly.
> 
> PASID or not is major device specific variation, as is the ENQCMD/etc
> 
> Having the device be explicit when it tells the IOMMU what it is going
> to be sending is a major plus to me. I actually don't want to see this
> part of the interface be made less strong.

Ok, if I'm understanding this right, a PASID-capable IOMMU will be able
to process *both* transactions with just a RID and transactions with a
RID+PASID.

So if we're thinking of this notional 84ish-bit address space, then
that includes "no PASID" as well as all the possible PASID values.
Yes?  Or am I confused?

> 
> > > The selection of which mode to use is based on the specific
> > > driver/device operation. Ie the thing that implements the 'struct
> > > vfio_device' is the thing that has to select the binding mode.
> > 
> > I thought userspace selected the binding mode - although not all modes
> > will be possible for all devices.
> 
> /dev/iommu is concerned with setting up the IOAS and filling the IO
> page tables with information
> 
> The driver behind "struct vfio_device" is responsible to "route" its
> HW into that IOAS.
> 
> They are two halfs of the problem, one is only the io page table, and one
> the is connection of a PCI TLP to a specific io page table.
> 
> Only the driver knows what format of TLPs the device will generate so
> only the driver can specify the "route"

Ok.  I'd really like if we can encode this in a way that doesn't build
PCI-specific structure into the API, though.

>  
> > > eg if two PCI devices are in a group then it is perfectly fine that
> > > one device uses RID binding and the other device uses RID,PASID
> > > binding.
> > 
> > Um... I don't see how that can be.  They could well be in the same
> > group because their RIDs cannot be distinguished from each other.
> 
> Inability to match the RID is rare, certainly I would expect any IOMMU
> HW that can do PCIe PASID matching can also do RID matching.

It's not just up to the IOMMU.  The obvious case is a PCIe-to-PCI
bridge.  All transactions show the RID of the bridge, because vanilla
PCI doesn't have them.  Same situation with a buggy multifunction
device which uses function 0's RID for all functions.

It may be rare, but we still have to deal with it one way or another.

I really don't think we want to support multiple binding types for a
single group.

> With
> such HW the above is perfectly fine - the group may not be secure
> between members (eg !ACS), but the TLPs still carry valid RIDs and
> PASID and the IOMMU can still discriminate.

They carry RIDs; whether they're valid depends on how buggy your
hardware is.

> I think you are talking about really old IOMMU's that could only
> isolate base

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > > > But it would certainly be possible for a system to have two
> > > > different host bridges with two different IOMMUs with different
> > > > pagetable formats.  Until you know which devices (and therefore
> > > > which host bridge) you're talking about, you don't know what formats
> > > > of pagetable to accept.  And if you have devices from *both* bridges
> > > > you can't bind a page table at all - you could theoretically support
> > > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > > in both formats, but it would be pretty reasonable not to support
> > > > that.
> > > 
> > > The basic process for a user space owned pgtable mode would be:
> > > 
> > >  1) qemu has to figure out what format of pgtable to use
> > > 
> > > Presumably it uses query functions using the device label.
> > 
> > No... in the qemu case it would always select the page table format
> > that it needs to present to the guest.  That's part of the
> > guest-visible platform that's selected by qemu's configuration.
> > 
> > There's no negotiation here: either the kernel can supply what qemu
> > needs to pass to the guest, or it can't.  If it can't, qemu will have
> > to either emulate in SW (if possible, probably using a kernel-managed
> > IOASID to back it) or fail outright.
> > 
> > > The
> > > kernel code should look at the entire device path through all the
> > > IOMMU HW to determine what is possible.
> > > 
> > > Or it already knows because the VM's vIOMMU is running in some
> > > fixed page table format, or the VM's vIOMMU already told it, or
> > > something.
> > 
> > Again, I think you have the order a bit backwards.  The user selects
> > the capabilities that the vIOMMU will present to the guest as part of
> > the qemu configuration.  Qemu then requests that of the host kernel,
> > and either the host kernel supplies it, qemu emulates it in SW, or
> > qemu fails to start.
> 
> Hm, how fine a capability are we talking about?  If it's just "give me
> VT-d capabilities" or "give me Arm capabilities" that would work, but
> probably isn't useful. Anything finer will be awkward because userspace
> will have to try combinations of capabilities to see what sticks, and
> supporting new hardware will drop compatibility for older one.

For the qemu case, I would imagine a two stage fallback:

1) Ask for the exact IOMMU capabilities (including pagetable
   format) that the vIOMMU has.  If the host can supply, you're
   good

2) If not, ask for a kernel managed IOAS.  Verify that it can map
   all the IOVA ranges the guest vIOMMU needs, and has an equal or
   smaller pagesize than the guest vIOMMU presents.  If so,
   software emulate the vIOMMU by shadowing guest io pagetable
   updates into the kernel managed IOAS.

3) You're out of luck, don't start.

For both (1) and (2) I'd expect it to be asking this question *after*
saying what devices are attached to the IOAS, based on the virtual
hardware configuration.  That doesn't cover hotplug, of course, for
that you have to just fail the hotplug if the new device isn't
supportable with the IOAS you already have.

One can imagine optimizations where for certain intermediate cases you
could do a lighter SW emu if the host supports a model that's close to
the vIOMMU one, and you're able to trap and emulate the differences.
In practice I doubt anyone's going to have time to look for such cases
and implement the logic for it.
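
In pseudo-code the fallback above looks roughly like this (the helper
names are invented, and the capability checks are hand-waved):

    if (ioasid_supports_format(ioasid_fd, viommu_pgtable_format)) {
            /* 1) host can walk the guest format directly */
            use_hw_nesting();
    } else if (ioasid_kernel_managed_ok(ioasid_fd, viommu_iova_ranges,
                                        viommu_min_pagesize)) {
            /* 2) shadow guest IO pagetable updates into a
             *    kernel-managed IOAS */
            use_sw_shadowing();
    } else {
            /* 3) out of luck */
            error_report("cannot provide the configured vIOMMU");
            exit(1);
    }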

> For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
> completely changes the capabilities offered to the guest (some v2
> implementations support nesting page tables, but never PASID nor PRI
> unlike v3.) The same vIOMMU could support either, presenting different
> capabilities to the guest, even multiple page table formats if we wanted
> to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
> needs to know early on what the hardware is precisely. Then some new page
> table format shows up and, although the vIOMMU can support that in
> addition to older ones, QEMU will have to pick a single one, that it
> assumes the guest knows how to drive?
> 
> I think once it binds a device to an IOASID fd, QEMU will want to probe
> what hardware features are available before going further with the vIO

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 07:17:23AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Wednesday, June 2, 2021 2:15 PM
> > 
> [...] 
> > > An I/O address space takes effect in the IOMMU only after it is attached
> > > to a device. The device in the /dev/ioasid context always refers to a
> > > physical one or 'pdev' (PF or VF).
> > 
> > What you mean by "physical" device here isn't really clear - VFs
> > aren't really physical devices, and the PF/VF terminology also doesn't
> > extend to non-PCI devices (which I think we want to consider for the
> > API, even if we're not implementing it any time soon).
> 
> Yes, it's not very clear, and more in PCI context to simplify the 
> description. A "physical" one here means an PCI endpoint function
> which has a unique RID. It's more to differentiate with later mdev/
> subdevice which uses both RID+PASID. Naming is always a hard
> exercise to me... Possibly I'll just use device vs. subdevice in future
> versions.
> 
> > 
> > Now, it's clear that we can't program things into the IOMMU before
> > attaching a device - we might not even know which IOMMU to use.
> 
> yes
> 
> > However, I'm not sure if its wise to automatically make the AS "real"
> > as soon as we attach a device:
> > 
> >  * If we're going to attach a whole bunch of devices, could we (for at
> >least some IOMMU models) end up doing a lot of work which then has
> >to be re-done for each extra device we attach?
> 
> which extra work did you specifically refer to? each attach just implies
> writing the base address of the I/O page table to the IOMMU structure
> corresponding to this device (either being a per-device entry, or per
> device+PASID entry).
> 
> and generally device attach should not be in a hot path.
> 
> > 
> >  * With kernel managed IO page tables could attaching a second device
> >(at least on some IOMMU models) require some operation which would
> >require discarding those tables?  e.g. if the second device somehow
> >forces a different IO page size
> 
> Then the attach should fail and the user should create another IOASID
> for the second device.

Couldn't this make things weirdly order dependent though?  If device A
has strictly more capabilities than device B, then attaching A then B
will be fine, but B then A will trigger a new ioasid fd.

> > For that reason I wonder if we want some sort of explicit enable or
> > activate call.  Device attaches would only be valid before, map or
> > attach pagetable calls would only be valid after.
> 
> I'm interested in learning a real example requiring explicit enable...
> 
> > 
> > > One I/O address space could be attached to multiple devices. In this case,
> > > /dev/ioasid uAPI applies to all attached devices under the specified 
> > > IOASID.
> > >
> > > Based on the underlying IOMMU capability one device might be allowed
> > > to attach to multiple I/O address spaces, with DMAs accessing them by
> > > carrying different routing information. One of them is the default I/O
> > > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > > remaining are routed by RID + Process Address Space ID (PASID) or
> > > Stream+Substream ID. For simplicity the following context uses RID and
> > > PASID when talking about the routing information for I/O address spaces.
> > 
> > I'm not really clear on how this interacts with nested ioasids.  Would
> > you generally expect the RID+PASID IOASes to be children of the base
> > RID IOAS, or not?
> 
> No. With Intel SIOV both parent/children could be RID+PASID, e.g.
> when one enables vSVA on a mdev.

Hm, ok.  I really haven't understood how the PASIDs fit into this
then.  I'll try again on v2.

> > If the PASID ASes are children of the RID AS, can we consider this not
> > as the device explicitly attaching to multiple IOASIDs, but instead
> > attaching to the parent IOASID with awareness of the child ones?
> > 
> > > Device attachment is initiated through passthrough framework uAPI (use
> > > VFIO for simplicity in the following context). VFIO is responsible for
> > > identifying the routing information and registering it with the ioasid
> > > driver when calling the ioasid attach helper function. It could be RID
> > > if the assigned device is a pdev (PF/VF) or RID+PASID if the device is
> > > mediated (mdev). In addition, user might also provide its view of vir

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 09:11:05AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > > /* Bind guest I/O page table  */
> > > > > > bind_data = {
> > > > > > .ioasid = gva_ioasid;
> > > > > > .addr   = gva_pgtable1;
> > > > > > // and format information
> > > > > > };
> > > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > > > 
> > > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > > there any reason to split these things? The only advantage to the
> > > > > split is the device is known, but the device shouldn't impact
> > > > > anything..
> > > > 
> > > > I'm pretty sure the device(s) could matter, although they probably
> > > > won't usually. 
> > > 
> > > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > > devices first. This prevents wildly incompatible devices from being
> > > joined together, and allows some "get info" to report the capability
> > > union of all devices if we want to do that.
> > 
> > Right.. but I've not been convinced that having a /dev/iommu fd
> > instance be the boundary for these types of things actually makes
> > sense.  For example if we were doing the preregistration thing
> > (whether by child ASes or otherwise) then that still makes sense
> > across wildly different devices, but we couldn't share that layer if
> > we have to open different instances for each of them.
> 
> It is something that still seems up in the air.. What seems clear for
> /dev/iommu is that it
>  - holds a bunch of IOASID's organized into a tree
>  - holds a bunch of connected devices

Right, and it's still not really clear to me what devices connected to
the same /dev/iommu instance really need to have in common, as
distinct from what devices connected to the same specific ioasid need
to have in common.

>  - holds a pinned memory cache
> 
> One thing it must do is enforce IOMMU group security. A device cannot
> be attached to an IOASID unless all devices in its IOMMU group are
> part of the same /dev/iommu FD.

Well, you can't attach a device to an individual IOASID unless all
devices in its group are attached to the same individual IOASID
either, so I'm not clear what benefit there is to enforcing it at the
/dev/iommu instance as well as at the individual ioasid level.
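
Put another way, the attach path presumably needs a group-level check
along these lines anyway (pure sketch - all of the ioasid_* names are
invented; the iommu_group helpers are the existing kernel ones):

	static int ioasid_attach_device(struct ioasid *ioasid, struct device *dev)
	{
		struct iommu_group *grp = iommu_group_get(dev);
		int ret = -EBUSY;

		/* every device in the group must end up in this same IOASID,
		 * so the group - not the /dev/iommu fd - is the natural unit
		 * for the viability/ownership check */
		if (ioasid_group_viable(ioasid, grp))
			ret = __ioasid_attach(ioasid, dev);

		iommu_group_put(grp);
		return ret;
	}

and once that per-IOASID check exists, an extra fd-level rule doesn't
seem to buy anything.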

> The big open question is what parameters govern allowing devices to
> connect to the /dev/iommu:
>  - all devices can connect and we model the differences inside the API
>somehow.
>  - Only sufficiently "similar" devices can be connected
>  - The FD's capability is the minimum of all the connected devices
> 
> There are some practical problems here: when an IOASID is created the
> kernel does need to allocate a page table for it, and that has to be
> in some definite format.
> 
> It may be that we had a false start thinking the FD container should
> be limited. Perhaps creating an IOASID should pass in a list
> of the "device labels" that the IOASID will be used with and that can
> guide the kernel on what to do?
> 
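
FWIW, I'd picture that as something like the below - entirely made up,
just to pin the idea down:

	/* hypothetical IOASID_ALLOC payload; the labels name devices that were
	 * registered on this /dev/iommu fd earlier, so the kernel can pick a
	 * page table format that all of them can walk */
	struct ioasid_alloc_args {
		__u32 argsz;
		__u32 flags;
		__u32 nr_labels;
		__u32 pad;
		__u64 labels_uptr;   /* userspace array of __u32 device labels */
	};

	__u32 labels[] = { label_devA, label_devB };
	struct ioasid_alloc_args alloc = {
		.argsz       = sizeof(alloc),
		.nr_labels   = 2,
		.labels_uptr = (__u64)(uintptr_t)labels,
	};
	int ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);

An attach from a device whose label wasn't in the list could then fail
cleanly up front, rather than hitting a page table format mismatch
later.
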
> > Right, but at this stage I'm just not seeing a really clear (across
> > platforms and device types) boundary for what things have to be per
> > IOASID container and what have to be per IOASID, so I'm just not sure
> > the /dev/iommu instance grouping makes any sense.
> 
> I would push as much stuff as possible to be per-IOASID..

I agree.  But the question is what's *not* possible to be per-IOASID,
so what's the semantic boundary that defines when things have to be in
the same /dev/iommu instance, but not the same IOASID.

> > > I don't know if that small advantage is worth the extra complexity
> > > though.
> > > 
> > > > But it would certainly be possible for a system to have two
> > > > different host bridges with two different IOMMUs with different
> > > > pagetable formats.  Until you know which devices (and therefore
> > > > which host bridge) you're talking about, you don't know what formats
> > > > of pagetable to accept.  And if you have devices from *both* bridges
> > > > you can't bind a page table at all - you could theoretically supp
