On 09/05/2017 05:46 PM, Cornelia Huck wrote: > On Tue, 5 Sep 2017 17:24:19 +0200 > Halil Pasic <pa...@linux.vnet.ibm.com> wrote: > >> My problem with a program check (indicated by SCSW word 2 bit 10) is >> that, in my reading of the architecture, the semantic behind it is: The >> channel subsystem (not the cu or device) has detected, that the >> the channel program (previously submitted as an ORB) is erroneous. Which >> programs are erroneous is specified by the architecture. What we have >> here does not qualify. >> >> My idea was to rather blame the virtual hardware (device) and put no blame >> on the program nor he channel subsystem. This could be done using device >> status (unit check with command reject, maybe unit exception) or interface >> check. My train of thought was, the problem is not consistent across a >> device type, so it has to be device specific. > > Unit exception might be a better way to express what is happening here. > At least, it moves us away from cc 1 and not towards cc 3 :) >
I will do a follow up patch pursuing device exception. >> >> Of course blaming the device could mislead the person encountering the >> problem, and make him believe it's an non-virtual hardware problem. >> >> About the misleading, I think the best we can do is log out a message >> indicating what really happened. > > Just document it in the code? If it doesn't happen with Linux as a > guest, it is highly unlikely to be seen in the wild. > Well we have two problems here: 1) Unit exception can be already defined by the device type for the command (reference: http://publibfp.dhe.ibm.com/cgi-bin/bookmgr/BOOKS/dz9ar110/2.6.10?DT=19920904110920). I think this one is what you mean. And I agree that's best handled with comment in code. 2) The poor user/programmer is trying to figure out why things don't work (why are we getting the unit exception)? I think that's best remedied with producing something for the log (maybe a warning with warn_report which states that the implementation vfio-ccw requires the given flags). [..] >>>>>> @@ -1115,7 +1112,7 @@ static int do_subchannel_work(SubchDev *sch) >>>>>> if (sch->do_subchannel_work) { >>>>>> return sch->do_subchannel_work(sch); >>>>>> } else { >>>>>> - return -EINVAL; >>>>>> + return -ENODEV; >>>>> >>>>> This rather seems like a job for an assert? If we don't have a function >>>>> for the 'asynchronous' handling of the various functions assigned for a >>>>> subchannel, that looks like an internal error. >>>>> >>>> >>>> IMHO it depends. Aborting qemu is heavy handed, and as an user I would not >>>> be happy about it. But certainly it is an assert situation. We can look >>>> for >>>> an even better solution, but I think this is an improvement. The logic >>>> behind >>>> is that the device is broken and can't be talked to properly. >>> >>> We currently don't have a vast array of subchannel types (and are >>> unlikely to get more types that need a different handler function). We >>> know the current ones are fine, and an assert would catch programming >>> errors early. >>> >> >> Despite of that we already had a problem of this type: see 1728cff2ab >> ("s390x/3270: fix instruction interception handler", 2017-06-09) by >> Dong Jia. If we had some automated testing covering all the asserts >> I would not think twice about using an assert here. But I don't think >> we do and I'm reluctant (not positive that assert is superior to what >> we have now). Maybe we could agree on reported by again. > > Yes, we (as in generally 'we') are really lacking automated testing... > (it is somewhere on my todo list). > > Either leave it as-is, or do an assert. -ENODEV just feels wrong. > I think I will leave this one as is and maybe try to discuss with the folks here about reliable test coverage. Just spoke with Marc H., and according to that we have a long way to go.