On Mon, May 08 2023, Markus Armbruster <arm...@redhat.com> wrote: > css_clear_io_interrupt() aborts on unexpected ioctl() errors, and I > wonder whether that's appropriate. Let's have a closer look: > > static void css_clear_io_interrupt(uint16_t subchannel_id, > uint16_t subchannel_nr) > { > Error *err = NULL; > static bool no_clear_irq; > S390FLICState *fs = s390_get_flic(); > S390FLICStateClass *fsc = s390_get_flic_class(fs); > int r; > > if (unlikely(no_clear_irq)) { > return; > } > r = fsc->clear_io_irq(fs, subchannel_id, subchannel_nr); > switch (r) { > case 0: > break; > case -ENOSYS: > no_clear_irq = true; > /* > * Ignore unavailability, as the user can't do anything > * about it anyway. > */ > break; > default: > error_setg_errno(&err, -r, "unexpected error condition"); > error_propagate(&error_abort, err); > } > } > > The default case is abort() with a liberal amount of lipstick applied. > Let's ignore the lipstick and focus on the abort(). > > fsc->clear_io_irq ist either qemu_s390_clear_io_flic() order > kvm_s390_clear_io_flic(). > > Only kvm_s390_clear_io_flic() can return non-zero: -errno when ioctl() > fails. > > The ioctl() is KVM_SET_DEVICE_ATTR for KVM_DEV_FLIC_CLEAR_IO_IRQ with > subchannel_id and subchannel_nr. I.e. we assume that this can only fail > with ENOSYS, und crash hard when the assumption turns out to be wrong. > > Is this error condition a programming error? I figure it can be one > only if the ioctl()'s contract promises us it cannot fail in any other > way unless we violate preconditions. > > Is the error condition fatal, i.e. continuing would be unsafe? > > If it's a fatal programming error, then abort() is appropriate. > > If it's fatal, but not a programming error, we should exit(1) instead. > > If it's a survivable programming error, use of abort() is a matter of > taste.
>From what I remember, this was introduced to clean up a potentially queued interrupt that is not supposed to be delivered, so the worst thing that could happen on failure is a spurious interrupt (same as what could happen if the kernel flic doesn't provide this function in the first place.) My main worry would be changes/breakages on the kernel side (while the QEMU side remains unchanged). So, I think we should continue to log the error in any case; but I don't have a strong opinion as to whether we should use exit(1) (as I wouldn't consider it a programming error) or just continue. Halil, your choice :)