Again FWIW:
No recurrence of the SCSI abort notices since increasing the timeout,
but I'm still getting guest userspace lockups. The guest kernel logs show
RCU "detected stalls" messages and NMIs being triggered across the CPUs.
These consistently show CPU 2 in the CFS scheduler, entered via the timer
interrupt and appearing to make some progress (i.e. RIP changes over
time), while the other CPUs all sit idle. Although the guest kernel keeps
running and logging these stalls, none of the guest userspace processes
make any progress at all over several hours.
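
In case it helps anyone checking for the same symptom, here is a rough sketch of counting the stall reports in a guest's kernel log. The function name count_rcu_stalls and the grep pattern are my own assumptions, not something taken from the logs above; the exact wording of the message varies between kernel versions, so adjust the pattern to match what your kernel prints.

```shell
# Count RCU stall reports in kernel log text read from stdin.
# Assumption: the pattern below is a guess meant to match both the
# "detected stalls" and "self-detected stall" wordings.
count_rcu_stalls() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so "|| true" keeps the function's status 0.
    grep -c -i 'rcu.*detected stall' || true
}

# Example, inside the guest:
#   dmesg | count_rcu_stalls
```

A count that keeps rising while userspace is stuck would be consistent with the behaviour described above, but it does not say anything about the cause.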
I'm upgrading to the QEMU version shipped with RHEV
(qemu-kvm-rhev-2.3.0-31.el7_2.7) to see if that helps - so far so good.
My best guess is that the RHEL 7 qemu 1.5.3 codebase is missing a
bugfix that is present upstream and in the RHEV QEMU release.
Cheers,
Jim
On 04/02/16 13:41, Jim Minter wrote:
FWIW, I've now done:
    echo 300 >/sys/block/sda/device/timeout
Not entirely sure whether it would help or not, but so far I haven't had
a recurrence.
Cheers,
Jim
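
For anyone wanting to try the same timeout change, here is a minimal sketch that wraps the echo above so the old and new values are visible. The function name set_scsi_timeout and its optional third argument (an alternative sysfs root, handy for trying it against a scratch directory) are hypothetical conveniences, not part of any existing tool. Only the /sys/block/<device>/device/timeout path comes from the command above.

```shell
# set_scsi_timeout DEVICE SECONDS [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys; the optional argument is a hypothetical
# convenience so the function can be tried against a scratch directory.
set_scsi_timeout() {
    dev=$1
    secs=$2
    root=${3:-/sys}
    file="$root/block/$dev/device/timeout"

    # Refuse to continue if the timeout file is missing or not writable.
    if [ ! -w "$file" ]; then
        echo "cannot write $file (wrong device, or not root?)" >&2
        return 1
    fi

    old=$(cat "$file")
    echo "$secs" > "$file"
    echo "$dev: timeout ${old}s -> $(cat "$file")s"
}

# Example, as root inside the guest:
#   set_scsi_timeout sda 300
```

Note that a value written through sysfs like this does not survive a reboot, so it would need to be reapplied at boot if it turns out to help.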