On Mar 20, 2014, at 8:51 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Mar 19, 2014 at 11:04:19AM +1030, Rusty Russell wrote:
>> Dave Airlie <airl...@gmail.com> writes:
>>> So I'm looking at how best to do virtio gpu device error reporting,
>>> and how to deal with illegal stuff,
>>>
>>> I've two levels of errors I want to support,
>>>
>>> a) unrecoverable or bad guest kernel programming errors,
>>
>> The QEMU standard approach is to exit at this point. No, really.
>
> It's easy on the hypervisor but often not very friendly for driver writers
> who might not be qemu experts.
> QEMU's moving away from exiting on errors and it would be nice
> to have a robust way to report driver bugs.
> How about setting VIRTIO_CONFIG_S_DEVICE_FAILED ?
>
> Another idea that windows driver implemented is reporting
> failure reason hint. They wrote it out to ISR, specifically
> they notified host about watchdog timer expiration for net device
> in this way.

I removed it for now and would really like to have an official way to
bring it back.

Also, going back to the original question: Windows can handle graphics
card HW errors by reloading the driver and resetting the device
(starting from Vista).

>
>>> b) per 3D context errors from the renderer backend,
>>>
>>> (b) I can easily report in an event queue and the guest kernel can in
>>> theory blow away the offenders, this is how GL works with some
>>> extensions,
>>
>> That's probably sanest.
>
> If it's possible to identify the offenders, I agree
> a VQ is better than config space for that.
> Need to make sure the queue is big enough to avoid
> underrun of that queue though. Is that always possible?
>
>>> GPU control queue, the response should always be no error, but in some
>>> cases it will be because the guest hit some host resource error, or
>>> asked for something insane, (guest kernel drivers would be broken in
>>> most of these cases).
>>>
>>> Alternately I can use the separate event queue to send async errors
>>> when the guest does something bad,
>>>
>>> I'm also considering adding some sort of flag in config space saying
>>> the device needs a reset before it will continue doing anything,
>>
>> I generally dislike error codes which Never Happen; it's like making
>> every void function return int just in case: the caller has no idea what
>> to do if it fails.
>>
>> The litmus test: does *your* guest handle failures other than by giving
>> up on the device? If so, sure, you need to have a sane error-reporting
>> strategy.
>
> Right but driver development is also a valid need.
>
>>> The main reason I'm considering this stuff is for security reasons if
>>> the guest asks for something really illegal or crazy what should the
>>> expected behaviour of the host be? (at least secure I know that).
>>
>> If the guest userspace can do it, don't exit. If the kernel only, and
>> it's should have known better, abort is OK.
>
> I second that, at least for now.
> Maybe we will add more capabilities in virtio 1.0, or
> after that.
>
>> Sure that doesn't help much!
>> Rusty.
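
To make the two error levels discussed above a bit more concrete, here is a
rough sketch in C of how a device model might route them. It is only an
illustration under my own assumptions, not code from QEMU or from any driver:
the struct layout, the function names, the queue capacity and the
SKETCH_S_DEVICE_FAILED bit value are all hypothetical. Errors that guest
userspace can trigger go to a bounded event queue (case b); errors that only
a broken guest kernel can trigger latch a "failed" status bit until the device
is reset (case a). Counting dropped events when the queue is full is one
possible answer to the queue-size question, not a recommendation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bit named after the suggestion in this thread.
 * The value is chosen for illustration only. */
#define SKETCH_S_DEVICE_FAILED 0x80u

/* Hypothetical event queue depth; a real device would size this differently. */
#define EVQ_CAPACITY 4u

enum err_origin { ERR_FROM_USERSPACE, ERR_FROM_KERNEL };

struct ctx_error_event {
    uint32_t ctx_id;   /* offending 3D context */
    uint32_t code;     /* renderer-specific error code */
};

struct sketch_gpu {
    uint8_t status;                            /* device status bits */
    struct ctx_error_event evq[EVQ_CAPACITY];  /* ring of pending events */
    unsigned evq_head;                         /* index of the oldest event */
    unsigned evq_count;                        /* number of pending events */
    unsigned evq_dropped;                      /* events lost to a full ring */
};

/* Case (b): queue a per-context error. Returns false if the ring is full,
 * in which case the event is dropped and counted. */
static bool report_ctx_error(struct sketch_gpu *g, uint32_t ctx_id, uint32_t code)
{
    assert(g->evq_count <= EVQ_CAPACITY);
    if (g->evq_count == EVQ_CAPACITY) {
        g->evq_dropped++;
        return false;
    }
    unsigned tail = (g->evq_head + g->evq_count) % EVQ_CAPACITY;
    g->evq[tail].ctx_id = ctx_id;
    g->evq[tail].code = code;
    g->evq_count++;
    return true;
}

/* Case (a): latch the failed bit; the device stays unusable until reset. */
static void report_fatal(struct sketch_gpu *g)
{
    g->status |= SKETCH_S_DEVICE_FAILED;
}

/* Policy from the thread: if guest userspace can trigger it, report it and
 * keep going; if only the guest kernel could have done it, fail the device. */
static void handle_bad_request(struct sketch_gpu *g, enum err_origin origin,
                               uint32_t ctx_id, uint32_t code)
{
    if (origin == ERR_FROM_USERSPACE)
        (void)report_ctx_error(g, ctx_id, code);
    else
        report_fatal(g);
}

static bool device_accepts_commands(const struct sketch_gpu *g)
{
    return (g->status & SKETCH_S_DEVICE_FAILED) == 0;
}

/* A reset clears the status bits and discards all pending events. */
static void device_reset(struct sketch_gpu *g)
{
    *g = (struct sketch_gpu){0};
}
```

With something along these lines, the guest driver would check the status
after a failure and reset the device, which is close to what Windows already
does for graphics hardware errors.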