Sorry, I have caught this letter too late.

On 12/18/2017 08:49 PM, John Snow wrote:
>
> On 12/14/2017 06:29 AM, Denis V. Lunev wrote:
>>> If this has been broken since 2.9, 2.11-rc3 is too late for a bandaid
>>> applied to something I can't diagnose. Let's discuss this for 2.12 and I
>>> will keep trying to figure out what the root cause is.
>> I have read the entire letter in 2 subsequent attempts, but
>> unfortunately I cannot say much more additionally :(
>>
> No problem, sometimes I don't understand myself. And the IDE code isn't
> exactly the nicest stuff to read. If I were smart enough I'd refactor the
> whole thing, but without breaking migration it's a little hard :(
>
>>> Some questions for you:
>>>
>>> (1) Is the guest Linux? Do we know why this one machine might be
>>> tripping up QEMU? (Is it running a fuzzer, a weird OS, etc...?)
>> This is run by the end-user, our customer, and we do not have
>> access to that machine or customer. This is an anonymized crash report
>> from the node, and it is not a single crash: we observe 1-2 reports
>> with this crash per day.
>>
> Yikes. Is this still on a 2.9-based VM, or have you upgraded to 2.10 or
> 2.11 at this point?
>
> (From memory this was a problem with a 2.9 based machine)

The problem is with 2.9.
>>> (2) Does the VM actually have a CDROM inserted at some point? Is it
>>> possible we're racing on some kind of eject or graph manipulation failure?
>> Unclear, but IMHO probable.
>>
> If they're using a 2.10+ based VM, could you look at some trace points?
>
> either:
> trace_ide_atapi_cmd (just scsi byte 0), or
> trace_ide_atapi_cmd_packet (the entire scsi cdb)
>
> and
>
> trace_ide_exec_cmd
>
> the actual command bytes never get saved in the state struct, so it's
> hard to tell from traces what commands were being processed, but these
> traces help.

Unfortunately I do not have access to the crashing node :( That is the
problem. (A sketch of how these trace events could be enabled is at the
end of this mail.)

>>> (3) Is this using AHCI or IDE?
>> IDE. This is known 120%. We do not provide the ability to enable AHCI
>> without manual tweaking.
>>
> At least that helps narrow down the path...
>
>>> If I can't figure it out within a week or so from here I'll just check
>>> in the band-aid with some /* FIXME */ comments attached.
>> No prob. We are going to ship my band-aid and watch the crash report
>> statistics.
>>
>> Thank you in advance,
>> Den
> I'll stage the band-aid with some FIXME comments, and maybe some scary
> error_report prints with some information in them. I'll send it to the list.

I do not see them merged. Have you sent them?

For now I have merged my patch downstream. I do not see how it could be
wrong. The release is scheduled for late this spring, and if the crashes
stop happening, I'll let you know.

Den
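
P.S. For reference, one possible way to capture the trace points John
mentions, should anyone get access to such a node: this is only a
sketch, assuming a 2.10+ QEMU built with the default "log" trace
backend (events are then printed to stderr), and the drive/device
options are placeholders for whatever the real VM uses:

    # enable the IDE/ATAPI trace events John listed at VM start
    qemu-system-x86_64 \
        -trace ide_atapi_cmd \
        -trace ide_atapi_cmd_packet \
        -trace ide_exec_cmd \
        ... usual -drive/-device IDE CD-ROM options ...

The same events can presumably also be toggled at runtime from the HMP
monitor with "trace-event <name> on", if restarting the VM is not an
option.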