Quoting David Gibson (2015-09-09 20:18:01) > On Wed, Sep 09, 2015 at 12:19:22PM -0500, Michael Roth wrote: > > Quoting David Gibson (2015-09-08 23:10:41) > > > On Tue, Sep 08, 2015 at 06:44:55PM -0500, Michael Roth wrote: > > > > Logical resources start with allocation-state:UNUSABLE / > > > > isolation-state:ISOLATED. During hotplug, guests will transition > > > > them to allocate-state:USABLE, and then to isolate-state:UNISOLATED. > > > > The former transition does not seem to have any failure path for > > > > cases where a DRC does not have any resources associated with it to > > > > allocate for guest, but instead relies on the subsequent > > > > isolation-state:UNISOLATED transition to indicate failure in this > > > > situation. > > > > > > > > Currently DRC code does not implement this logic, but instead > > > > tries to indicate failure by refusing the allocation-state:USABLE > > > > transition. Unfortunately, since that's not a documented failure > > > > path, guests continue undeterred, causing undefined behavior in > > > > QEMU and guest code. > > > > > > > > Fix this by handling things as PAPR defines (13.7 and 13.7.3.1). > > > > > > > > Cc: qemu-...@nongnu.org > > > > Cc: David Gibson <da...@gibson.dropbear.id.au> > > > > Cc: Bharata B Rao <bhar...@linux.vnet.ibm.com> > > > > Signed-off-by: Michael Roth <mdr...@linux.vnet.ibm.com> > > > > --- > > > > v2: > > > > - actually include the full changeset in the patch > > > > > > Several queries for clarification: > > > > Looks like you've already picked this up, but just in case: > > > > > > > > * Is this intended to replace Bharata's patch "spapr_drc: > > > Return correct state for logical DR in entity_sense()" or to apply > > > on top of it? > > > > Yes this should replace that patch. > > > > > > > > * If I'm understanding correctly, the problem here is that although > > > the guest is supposed to check for failures to set the allocation > > > state, it's actually not? This patch is to make qemu gracefully > > > handle the guest's failure to do this? Is that right? > > > > The root issue is that allocation state setting doesn't actually have > > a documented failure path. Instead, the subsequent isolation state > > setting is where we surface an error if we're unable to actually > > provide the device to the guest (due to it not being attached > > to the DRC in question). > > Um.. 13.7.3.1 (step 2) implies checking for failures on setting > allocation state, and 13.5.3.5 Table 177 says "If the transition to > the usable state is not possible the -3 (no such indicator > implemented) status is returned." > > How is that not a documented failure path?
Hmm, I think I misunderpreted this statement from 13.7: "The OS may use the get-sensor-state RTAS call with the dr-entity-sense token to deter- mine if a given drc-index refers to a connector that is currently usable for DR operations. If the connector is not currently usable the return state is “DR entity unusable” (2). A set-indicator (isolation state) RTAS call to an unusable connector or (dr-indicator) to any logical resource connector results in a “No such indicator implemented” return sta- tus." To me it seemed to suggest that isolation-state is where we fail attempts to bring up an UNUSABLE resource, but the statement doesn't actually preclude us from returning error on setting allocation-state. I'm not sure how I misinterpreted 13.7.3.1 to support this assumption, it does seem to clearly indicate setting allocation-state can return error (and in fact mentions nothing about accounting for isolation-state errors in that case), and 13.5.3.5 makes it clear it should also be -3 in that case. I'll send a v3 with the set-allocation logic in place. I'll also look into returning RTAS errors directly. Thanks for the catch! > > > We weren't surfacing that error in the isolation state setting, so the > > guest was continuing on with configuring devices that aren't actually > > present. This patch should correct that. > > So, to be clear, I think the check in set isolation-state is a good > idea as well, to properly handle a misbehaving guest. > > > Personally it seems like allocation state setting is where that failure > > should occur, since that's the earliest point to surface the error, but > > that's how PAPR has it documented. > > .. but I'm not seeing how PAPR is documenting it as happening in set > isolation state rather than set allocation-state. > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson