nvme: fix controller hotplugging

Hannes Reinecke Fri, 09 Jul 2021 01:51:48 -0700

On 7/9/21 8:55 AM, Klaus Jensen wrote:

On Jul  9 08:16, Hannes Reinecke wrote:

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen <k.jen...@samsung.com>


Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.
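
The two unplug cases described above can be sketched as a toy model. This is plain Python, not QEMU code: the classes `Subsystem`, `Controller` and `Namespace` and the method `unplug` are hypothetical names chosen for illustration, and doing the reassignment at unplug time is an assumption made for simplicity, not a statement about where the actual patch does it.

```python
class Namespace:
    def __init__(self, nsid):
        self.nsid = nsid
        self.parent = None  # whoever currently owns the bus we sit on


class Subsystem:
    def __init__(self, name):
        self.name = name
        self.namespaces = []  # namespaces lingering on the subsystem bus


class Controller:
    def __init__(self, name, subsys=None):
        self.name = name
        self.subsys = subsys  # optional link to a subsystem
        self.namespaces = []

    def attach(self, ns):
        # In this model a namespace starts out on the controller's bus.
        ns.parent = self
        self.namespaces.append(ns)

    def unplug(self):
        """Model of `device_del`; returns the namespaces that survive."""
        survivors = []
        for ns in self.namespaces:
            if self.subsys is not None:
                # Linked to a subsystem: move the namespace to its bus.
                ns.parent = self.subsys
                self.subsys.namespaces.append(ns)
                survivors.append(ns)
            else:
                # No subsystem: there is nowhere for the namespace to linger.
                ns.parent = None
        self.namespaces = []
        return survivors


subsys = Subsystem("subsys0")
linked = Controller("nvme0", subsys=subsys)
linked.attach(Namespace(1))
standalone = Controller("nvme1")
standalone.attach(Namespace(1))
print(len(linked.unplug()), len(standalone.unplug()))  # prints "1 0"
```

In the linked case the namespace ends up owned by the subsystem and can be picked up again by a re-added controller; in the standalone case it is gone together with the controller.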

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +++++++++-------
hw/nvme/ctrl.c   | 14 ++++++------
hw/nvme/ns.c     | 55 +++++++++++++++++++++++++++++++-----------------
hw/nvme/subsys.c |  9 ++++++++
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0


Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?

Without it this whole exercise is a bit pointless, seeing that it doesn't fix anything.

Patch 1-3 are fixes we need anyway, so I thought I might as well apply them :)

Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?

Yes, if acceptable, I would like to use patch 4 as an interim solution. We have a bug we need to fix for 6.1, and I believe this does the job.

Oh, yes, it does. But it's ever so slightly ugly with the reparenting stuff. But if that's considered an interim solution I'm fine with it.

You can add my 'Reviewed-by: Hannes Reinecke <h...@suse.de>' tag if you like.

I considered changing the existing nvme-bus to be on the main system bus, but then we break the existing behavior that the namespaces attach to the most recently defined controller in the absence of the shared parameter or an explicit bus parameter.

Do we?

My idea was to always attach a namespace to a subsystem (and, if not present, create one). The controller would then 'link' to that subsystem. The subsystem would have a 'shared' attribute, which would determine if more than one controller can be 'linked' to it.

That way we change the relationship between the controller and the namespace, as then the namespace would be a child of the subsystem, and the namespace would need to be detached separately from the controller.

But it fits neatly into the current device model, except the slightly awkward 'link' thingie.
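
As a rough illustration of the model proposed above (namespaces always owned by a subsystem, controllers merely linked, and a 'shared' attribute gating multiple links), here is a minimal Python sketch. It is not QEMU code; `Subsystem`, `link`, `unlink`, `attach_namespace` and the implicit subsystem naming are all hypothetical and only mirror the wording of the proposal.

```python
class Subsystem:
    def __init__(self, name, shared=False):
        self.name = name
        self.shared = shared
        self.namespaces = []   # namespaces are children of the subsystem
        self.controllers = []  # controllers are linked, not children

    def link(self, ctrl_name):
        # A non-shared subsystem accepts at most one linked controller.
        if self.controllers and not self.shared:
            raise ValueError(f"{self.name} is not shared")
        self.controllers.append(ctrl_name)

    def unlink(self, ctrl_name):
        # Removing a controller leaves the namespaces untouched; they
        # would have to be detached separately.
        self.controllers.remove(ctrl_name)


def attach_namespace(nsid, subsys=None):
    """Attach a namespace to a subsystem, creating one if none is given."""
    if subsys is None:
        subsys = Subsystem(f"implicit-subsys-{nsid}")  # hypothetical naming
    subsys.namespaces.append(nsid)
    return subsys


shared = Subsystem("subsys0", shared=True)
attach_namespace(1, shared)
shared.link("nvme0")
shared.link("nvme1")
shared.unlink("nvme0")
print(shared.controllers, shared.namespaces)  # prints "['nvme1'] [1]"
```

The point of the sketch is the ownership direction: the namespace list lives on the subsystem, so unlinking a controller cannot implicitly destroy a namespace.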

Wrt. "the plan", right now, I see two solutions going forward:

1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns
This is the approach that I am taking right now and it works well. It allows many-to-many relationships and separates the lifetimes of subsystems, namespaces and controllers like you mentioned.
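
What "many-to-many relationships" with separate lifetimes could mean in practice is sketched below as a toy Python model. It says nothing about how the RFC series is implemented; the `Registry` class and its method names are hypothetical and exist only to show attachments kept as a relation separate from the objects themselves.

```python
class Registry:
    """Toy model: controllers and namespaces live independently and are
    related only through a set of (controller, namespace) attachments."""

    def __init__(self):
        self.controllers = set()
        self.namespaces = set()
        self.attachments = set()

    def attach(self, ctrl, ns):
        self.controllers.add(ctrl)
        self.namespaces.add(ns)
        self.attachments.add((ctrl, ns))

    def remove_controller(self, ctrl):
        # Only the controller and its attachments go away.
        self.controllers.discard(ctrl)
        self.attachments = {(c, n) for (c, n) in self.attachments if c != ctrl}

    def namespaces_of(self, ctrl):
        return sorted(n for (c, n) in self.attachments if c == ctrl)


reg = Registry()
reg.attach("nvme0", "ns1")
reg.attach("nvme1", "ns1")  # one namespace, two controllers
reg.attach("nvme1", "ns2")  # one controller, two namespaces
reg.remove_controller("nvme0")
print(sorted(reg.namespaces), reg.namespaces_of("nvme1"))
# prints "['ns1', 'ns2'] ['ns1', 'ns2']"
```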


Ah. Would like to see that path, then.

Conceptually, I also really like that the subsystem and namespace are not "devices". One could argue that the namespace is comparable to a SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN actually "shows up" in the host, the nvme namespace does not.

Well, 'devices' really is an abstraction, and it really depends on what you declare a device is. But yes, in the QDEV sense with its strict inheritance the nvme topology is not a good fit, agreed.

As for SCSI: the namespace is quite comparable to a SCSI LUN; the NVMe controller is roughly equivalent to the 'initiator' on SCSI, and the subsystem would match up to the SCSI target.

The problem for NVMe is that the whole NVMe-over-Fabrics stuff was layered on top of the existing NVMe-PCI spec, so that the 'subsystem' only truly exists for NVMe-over-Fabrics; for PCI you don't actually need it, and indeed some NVMe PCI devices don't even fill out these values. And it makes things tricky for qemu, as the nvme emulation is actually based on the pre-fabrics spec, hence the subsystem concept was never implemented properly.

My series handles backwards compatibility by keeping -device "shims" around that just wrap the new objects but behave like they used to. The plan would be to deprecate these devices.

Or keeping the '-device' shims around for just nvme-pci, and require -object specification if one would want to use nvme-over-fabrics.

The downside to this approach is that it moves the subsystem and namespaces out of the "qdev tree (info qtree)" and into the pure QOM "/objects" tree. Instead of qtree, we can have QMP and HMP commands for introspection.


Serves them right for introducing tons of different abstractions.
Not a problem from my side.

2. Make the subsystem a "system bus device"
This way we add an "nvme-nvm-subsystems" bus as a direct child of the main system bus, and we can possibly get rid of the explicit -device nvme-subsys as well. We change the namespace device to plug into that instead. The nvme controller device still needs to plug into the PCI bus, so it cannot be a child of the subsystems bus, but can keep using a link parameter to hook into the subsystem and attach to any namespaces it would like.

I don't think we can or should do away with the subsystem; that's quite a central structure in the nvme-oF spec, and trying to create an abstraction without it will just lead to lots of duplicated identification, not to mention the increased complexity during lookup. (As per spec, the controller connects to a subsystem, and the subsystem presents the namespaces. Abstracting away the subsystem would mean that you have to have lots of tracking in the individual namespace, with lots of possibilities to get it wrong.)

But from my perspective it should be perfectly feasible to have the subsystem a child of the main/system bus, and the controller a child of the PCI bus.

As mentioned above, that would break the implicit destruction of the namespace when detaching the controller, but one could argue that that's exactly the point, seeing that several controllers can have access to the same namespace.

I'm unsure if we can do this without deprecating the existing namespace device, just like option 1.
I have not implemented this, so I need to look more into it. It seems like the main thing that this gives us compared to 1) is `info qtree` support and we still end up just "wiring" namespace attachment with backlinks anyway.


Yeah, we'll have to do wiring one way or other.

I'm not sure what I would prefer, but I've found that implementing it as -object's is a breath of fresh air and as I mentioned, conceptually, I like option 1 because namespaces are -objects and not -devices.

Sure. I just tend to leave the infrastructure questions to those actively working with the qemu community. I've found the qemu development process to be too unwieldy for me to make more than the random contribution.

And, by the way, thanks for chipping in on this Hannes, I had sort of crossed off option 2 before you showed up and threw some ideas in the air ;)


Sure.

I could give it a go at option 2); patch 4 should be a good starting point. And shouldn't be too hard to implement, either.


Then we can compare results and make a judgement call.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
h...@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging
