Thomas Lamprecht <t.lampre...@proxmox.com> writes:

> On 08.04.21 14:49, Markus Armbruster wrote:
>> Kevin Wolf <kw...@redhat.com> writes:
>>> On 08.04.2021 at 11:21, Markus Armbruster wrote:
>>>> Should this go into 6.0?
>>>
>>> This is something that the responsible maintainer needs to decide.
>>
>> Yes, and that's me.  I'm soliciting opinions.
>>
>>> If it helps you with the decision, and if I understand correctly, it is
>>> a regression from 5.1, but was already broken in 5.2.
>>
>> It helps.
>>
>> Even more helpful would be a risk assessment: what's the risk of
>> applying this patch now vs. delaying it?
>
> Stefan is on vacation this week, but I can share some information, maybe it
> helps.
>
>>
>> If I understand Stefan correctly, Proxmox observed VM hangs.  How
>> frequent are these hangs?  Did they result in data corruption?
>
>
> They were not highly frequent, but frequent enough to get a bit over a
> dozen reports in our forum, which normally means something is off but it's
> limited to certain HW, storage tech used, or load patterns.
>
> We initially had a hard time reproducing this, but a user could finally send
> us a backtrace of a hanging VM, and with that information we could pin it
> down enough and Stefan came up with a good reproducer (see v1 of this patch).
Excellent work, props!

> We didn't get any report of actual data corruption due to this, but the VM
> hangs completely, so a user killing it may theoretically produce that; but
> only for those programs running in the guest that were not made power-loss
> safe anyway...
>
>>
>> How confident do we feel about the fix?
>>
>
> Cannot comment from a technical POV, but can share the feedback we got with
> it.
>
> Some context about reach:
> We have rolled the fix out to all repository stages which already had a build
> of 5.2. That has a reach of about 100k to 300k installations, albeit we only
> have some rough stats about the sites that access the repository daily and
> cannot really tell who actually updated to the new versions. But there are
> some quite update-happy people in the community, so with that in mind and my
> experience of the feedback loop of rolling out updates, I'd figure a lower
> bound one can assume without going out on a limb is ~25k.
>
> Positive feedback from users:
> We got some positive feedback from people who ran into this at least once per
> week about the issue being fixed with that. In total almost a dozen users
> reported improvements, a good chunk of them those who reported the problem in
> the first place.
>
> Mixed feedback:
> We had one user who reported still getting QMP timeouts, but that their VMs
> did not hang anymore (could be high load or the like). Only one user reported
> that it did not help; we are still investigating there. They have quite high
> CPU pressure stats and it may actually be another issue, cannot tell for sure
> yet though.
>
> Negative feedback:
> We had no new users reporting new/worse problems in that direction, at least
> from what I'm aware of.
>
> Note, we do not use OOB currently, so the above does not speak for the OOB
> case at all.

Thanks!