Jason,
Hey!
On 3/18/25 8:23 PM, Jason Gunthorpe wrote:
On Tue, Mar 18, 2025 at 05:22:51PM -0400, Donald Dutile wrote:
I agree with Eric that 'accel' isn't needed -- this should be
ascertained from the pSMMU that a physical device is attached to.
I seem to remember the point was made that we don't actually know if
accel is possible, or desired, especially in the case of hotplug.
In the case of hw-passthrough hot-plug, what isn't known?
a) domain:b:d.f is known
b) thus its hierarchy and SMMUv3 association in the host is known
c) thus, if the (accel) features of the SMMUv3 were exposed (known),
then the proper setup (separate vSMMUv3 vs. system-wide emulated SMMUv3;
association of the allocated/configured vSMMUv3 to its pSMMUv3) would be known/made.
What else is missing?
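As a side note, (a) and (b) can already be walked mechanically from sysfs; a
minimal sketch below, assuming the 'iommu' symlink the IOMMU core creates for
devices bound behind an IOMMU (placeholder BDF, minimal error handling):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          const char *bdf = "0000:01:00.0";   /* placeholder domain:b:d.f */
          char path[256], target[256];
          ssize_t n;

          snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu", bdf);
          n = readlink(path, target, sizeof(target) - 1);
          if (n < 0) {
                  perror("readlink");         /* no IOMMU association exposed */
                  return 1;
          }
          target[n] = '\0';
          /* the link ends in the IOMMU instance name, e.g. "smmu3.0x..." */
          printf("%s sits behind %s\n", bdf, strrchr(target, '/') + 1);
          return 0;
  }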
The accelerated mode has a number of limitations that the software
mode does not have. I think it does make sense that the user would
deliberately choose to use a more restrictive operating mode and then
would have to meet the requirements, e.g. by creating the required
number and configuration of vSMMUs.
At the qemu-cmd level, yes, one has to spell out the right number & config of
smmuv3's; but libvirt, if it had the above info, could auto-generate the right
number of smmuv3's (stages, accel features, etc.) ... just as it does today in
expanding simple(r) device specs into more complete qemu configs with the
right number of pcie buses, RPs, etc.
Now... how does the vfio (?; why not qemu?) layer determine that? --
where are the SMMUv3 'accel' features exposed: (a) in the device
struct (for the smmuv3), or (b) somewhere under sysfs? ... I couldn't
find anything under either on my g-h system, but would appreciate a
ptr if there is one.
I think it is not discoverable yet other than through
try-and-fail. Discoverability would probably be some bits in an
iommufd GET_INFO ioctl or something like that.
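For concreteness, a minimal sketch of what such a query could look like via
the IOMMU_GET_HW_INFO ioctl iommufd already exposes (struct layout per recent
<linux/iommufd.h>; dev_id is assumed to come from a prior VFIO/iommufd bind;
whether an explicit "accel" bit ever shows up here is exactly the open
question):

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Query the physical SMMUv3's ID registers for the device behind dev_id. */
  static int probe_smmu_info(int iommufd, __u32 dev_id)
  {
          struct iommu_hw_info_arm_smmuv3 smmu = {};
          struct iommu_hw_info cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id,
                  .data_len = sizeof(smmu),
                  .data_uptr = (uintptr_t)&smmu,
          };

          if (ioctl(iommufd, IOMMU_GET_HW_INFO, &cmd))
                  return -1;
          if (cmd.out_data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
                  return -1;

          /* IDR registers describe what the pSMMUv3 can do; a management
           * layer could derive "accel viable or not" from bits in these. */
          printf("SMMU_IDR0=0x%x SMMU_IDR5=0x%x\n", smmu.idr[0], smmu.idr[5]);
          return 0;
  }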
I don't see how iommufd would 'get-info' the needed info any better
than any other interface/subsystem. ...
And, like Eric, I find that although 'accel' is better than the
original 'nested', it's non-obvious which accel feature(s) are being
turned on, or not.
There is really only one accel feature - direct HW usage of the IO
Page table in the guest (no shadowing).
A secondary addon would be direct HW usage of an invalidation queue in
the guest.
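A rough sketch, at the iommufd uapi level, of what that first feature amounts
to with the SMMUv3 nesting support already merged in the kernel: the VMM hands
the guest-owned STE words straight down and the HW walks the guest's stage-1
tables, rather than the VMM shadowing them (the object IDs below are assumed
to have been set up earlier; for SMMUv3 the parent is expected to be the
vIOMMU object created over a stage-2 nesting parent):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Allocate a nested stage-1 HWPT that points the HW directly at the
   * guest's configuration (no shadow tables).  guest_ste[] holds the STE
   * words as written by the guest; parent_id is assumed to be the vIOMMU
   * object created over a stage-2 nesting parent earlier. */
  static int alloc_nested_s1(int iommufd, __u32 dev_id, __u32 parent_id,
                             const __u64 guest_ste[2], __u32 *out_hwpt_id)
  {
          struct iommu_hwpt_arm_smmuv3 arg = {
                  .ste = { guest_ste[0], guest_ste[1] },
          };
          struct iommu_hwpt_alloc cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id,
                  .pt_id = parent_id,
                  .data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
                  .data_len = sizeof(arg),
                  .data_uptr = (uintptr_t)&arg,
          };

          if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
                  return -1;
          *out_hwpt_id = cmd.out_hwpt_id;
          return 0;
  }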
And, if architected correctly, even with (device-specific) sw-provided tables,
the capability could be 'formatted' in a way that makes it discoverable by the
appropriate layers (libvirt, qemu).
Once discoverable, this whole separate accel device -- which is really an
attribute of an SMMUv3 -- can be generalized, and reduced, to a much
smaller, simpler sw footprint, with the concept of callbacks (as the series
uses) to enable hw accelerators to perform the shadow-ops that a fully-emulated
smmuv3 would have to do.
A kernel boot-param will be needed; if it's in sysfs, writing 0 to an
enable attribute (to disable it) may be an alternative as well. Bottom line:
we need (a) a way to ascertain the accel feature and (b) a way to disable
it when it is broken, so qemu's smmuv3 spec will 'just work'.
You'd turn it off by not asking qemu to use it; that is sort of the
reasoning behind the command line opt-in for accel or not.
It would make machine-level definitions far more portable if the
working/non-working state, and the one-accel, two-accel, three-accel, ...
features, were dynamically determined vs. a static (qemu) machine config
that would have to be manipulated each time it ran on a different machine.
e.g., cluster sw scans servers for machines with device-X.
create VMs, assigning some/all of device-X to a VM via its own smmuv3.
done.
Now, if the smmuv3 features were exposed all the way up to userspace,
then one could argue the cluster sw could scan for those features and add
them to the accel=x,y,z option of the smmuv3 associated with an assigned
device.
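Purely as an illustration of that scan (the sysfs attribute named here does
not exist today; "features" is a made-up name for whatever form the exposure
might take):

  #include <stdio.h>

  /* Hypothetical: if each SMMUv3 instance exported its accel-relevant
   * features under its existing /sys/class/iommu/<name>/ directory, cluster
   * sw or libvirt could build the accel=x,y,z list mechanically.  The
   * "features" attribute below is invented for illustration only. */
  static int read_smmu_features(const char *smmu, char *buf, size_t len)
  {
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path), "/sys/class/iommu/%s/features", smmu);
          f = fopen(path, "r");
          if (!f)
                  return -1;              /* not exposed by this kernel */
          if (!fgets(buf, (int)len, f)) {
                  fclose(f);
                  return -1;
          }
          fclose(f);
          return 0;
  }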
Potato/po-tah-to: whether cluster sw or libvirt or qemu or <something-else>
scans/reads it, discoverability of the features has to be done by
(a) a computer, or (b) an error-prone human.
... all that AI gone to waste ... ;-)
- Don
Jason