вс, 17 окт. 2021 г. в 11:19, Alexander Motin <mav...@gmail.com>:
> It may be a noise, but comparing logs I see in reboot case also > "acpi_ec0: not getting interrupts, switched to polled mode". I am > thinking whether the problem may be caused not by SSD, but by some > resource conflict/misconfiguration in the system. Pavel, can you > compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any > differences? Are you running the latest BIOS? > > On 12.10.2021 15:56, Warner Losh wrote: > > One piece of data that would be good to have: > > > > nvmecontrol identify nvme0 > > > > There's an optional feature that none of my drives have, but that the > Linux > > driver does oddly > > weird things when enabled. The output of that command will help me > > understand if that may > > be in play. Maybe we need to do oddly weird things too :) > > > > Warner > > > > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <i...@bsdimp.com> wrote: > > > >> > >> > >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <tim...@gmail.com> > wrote: > >> > >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <i...@bsdimp.com>: > >>> > >>>> > >>>> > >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <tim...@gmail.com> wrote: > >>>> > >>>>> > >>>>> > >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <i...@bsdimp.com>: > >>>>> > >>>>>> > >>>>>> > >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <tim...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> > >>>>>>> > >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <i...@bsdimp.com>: > >>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <tim...@gmail.com> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Warner Losh <i...@bsdimp.com>: > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev < > tim...@gmail.com> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Pavel Timofeev <tim...@gmail.com>: > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Chuck Tuffli <ctuf...@gmail.com>: > >>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev < > >>>>>>>>>>> tim...@gmail.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hello > >>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the > latest > >>>>>>>>>>>>> 14.0-CURRENT > >>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it. > >>>>>>>>>>>>>> Despite other things the weird one which concerns me is > >>>>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>>>> message I get sometimes on the console. > >>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop, > >>>>>>>>>>> i. e. not > >>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least I > >>>>>>>>>>> haven't > >>>>>>>>>>>>> seen > >>>>>>>>>>>>>> them for now in such cases. > >>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage > of > >>>>>>>>>>>>>> nvmecontrol(8) quickly. > >>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to return > >>>>>>>>>>> the output. > >>>>>>>>>>>>> ... > >>>>>>>>>>>>>> dmesg when power cycled - > >>>>>>>>>>>>>> > >>>>>>>>>>> > https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ > >>>>>>>>>>>>>> dmesg when rebooted - > >>>>>>>>>>>>>> > >>>>>>>>>>> > https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh > >>>>>>>>>>>>> > >>>>>>>>>>>>> I'm sort of curious about the time stamps for the log > messages > >>>>>>>>>>> in the > >>>>>>>>>>>>> failing case. Something like: > >>>>>>>>>>>>> > >>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages > >>>>>>>>>>>>> > >>>>>>>>>>>>> --chuck > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I > >>>>>>>>>>> missing some > >>>>>>>>>>>> configuration for that? > >>>>>>>>>>>> > >>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages > >>>>>>>>>>>> nvme0: <Generic NVMe Device> mem > >>>>>>>>>>>> > >>>>>>>>>>> > 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device > >>>>>>>>>>>> 0.0 on pci6 > >>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported) > >>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X > >>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20 > >>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0, > >>>>>>>>>>> MPSMAX 0 > >>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3 > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvme0: Missing interrupt > >>>>>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace > >>>>>>>>>>>> GEOM: new disk nvd0 > >>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Ah, sorry, provided wrong output. > >>>>>>>>>>> Here is what you requested: > >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> > mem > >>>>>>>>>>> > 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff > >>>>>>>>>>> at device > >>>>>>>>>>> 0.0 on pci6 > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate > 5 > >>>>>>>>>>> MSI-X > >>>>>>>>>>> vectors (17 supported) > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for > >>>>>>>>>>> MSI-X > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES > >>>>>>>>>>> 1023, CQR, > >>>>>>>>>>> TO 20 > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: > DSTRD > >>>>>>>>>>> 0, NSSRS, > >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0 > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: > 1.3 > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix > >>>>>>>>>>> 512GB> NVMe > >>>>>>>>>>> namespace > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0 > >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512 > >>>>>>>>>>> byte > >>>>>>>>>>> sectors) > >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt > >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt > >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and > >>>>>>>>>> hw.cam.nda.nvd_compat=1 > >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd > >>>>>>>>>> was? Or does > >>>>>>>>>> it work? > >>>>>>>>>> > >>>>>>>>>> Something weird is going on in the interrupt assignment, I > think, > >>>>>>>>>> but I > >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first. > >>>>>>>>>> > >>>>>>>>>> Warner > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead > >>>>>>>>> of hw.cam.nda.nvd_compat? > >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now. > >>>>>>>>> > >>>>>>>>> So I tried to set hw.nvme.use_nvd to 1 as suggested, but I still > >>>>>>>>> see > >>>>>>>>> nvme0: Missing interrupt > >>>>>>>>> and now also > >>>>>>>>> Root mount waiting for: CAM > >>>>>>>>> messages besides those > >>>>>>>>> > >>>>>>>> > >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by > default > >>>>>>>> these > >>>>>>>> days. > >>>>>>>> > >>>>>>>> I'll take a look on monday starting at the differences in > interrupt > >>>>>>>> assignment that > >>>>>>>> are apparent when you cold boot vs reboot. > >>>>>>>> > >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also > >>>>>>>> didn't really > >>>>>>>> expect it to be. > >>>>>>>> > >>>>>>>> Warner > >>>>>>>> > >>>>>>>> > >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even > >>>>>>> worse now. > >>>>>>> So clean poweron works as before. > >>>>>>> But if rebooted nvme drive refuses to work, while before the code > >>>>>>> upgrade it was just complaining about missing interrupts. > >>>>>>> > >>>>>>> currently dmesg show this: > >>>>>>> nvme0: <Generic NVMe Device> mem > >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff > at device > >>>>>>> 0.0 on pci6 > >>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace > >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) > >>>>>>> nvme0: <Generic NVMe Device> mem > >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff > at device > >>>>>>> 0.0 on pci6 > >>>>>>> > >>>>>> > >>>>>> Why is this showing up twice? Or is everything above this line left > >>>>>> over from the first, working boot? > >>>>>> > >>>>>> > >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288 > >>>>>>> nvme0: timeout with nothing complete, resetting > >>>>>>> nvme0: Resetting controller due to a timeout. > >>>>>>> nvme0: RECOVERY_WAITING > >>>>>>> nvme0: resetting controller > >>>>>>> nvme0: aborting outstanding admin command > >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 > >>>>>>> cdw11:00000000 > >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 > >>>>>>> nvme0: nvme_identify_controller failed! > >>>>>>> nvme0: waiting > >>>>>>> > >>>>>> > >>>>>> Clearly something bad is going on with the drive here... We looked > >>>>>> into the completion queues when we didn't get an interrupt and > there was > >>>>>> nothing complete there.... > >>>>>> > >>>>>> The only thing I can think of is that this means there's a phase > error > >>>>>> between the drive and the system. I recently removed a second reset > and > >>>>>> made it an option NVME_2X_RESET. Can you see if adding > >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this? > >>>>>> > >>>>>> Warner > >>>>>> > >>>>>> > >>>>>>> nvme0: <Generic NVMe Device> mem > >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff > at device > >>>>>>> 0.0 on pci6 > >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561 > >>>>>>> nvme0: timeout with nothing complete, resetting > >>>>>>> nvme0: Resetting controller due to a timeout. > >>>>>>> nvme0: RECOVERY_WAITING > >>>>>>> nvme0: resetting controller > >>>>>>> nvme0: aborting outstanding admin command > >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 > >>>>>>> cdw11:00000000 > >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 > >>>>>>> nvme0: nvme_identify_controller failed! > >>>>>>> nvme0: waiting > >>>>>>> > >>>>>>> > >>>>> > >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's > >>>>> like: > >>>>> nvme0: <Generic NVMe Device> mem > >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at > device > >>>>> 0.0 on pci6 > >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423 > >>>>> nvme0: timeout with nothing complete, resetting > >>>>> nvme0: Resetting controller due to a timeout. > >>>>> nvme0: RECOVERY_WAITING > >>>>> nvme0: resetting controller > >>>>> nvme0: aborting outstanding admin command > >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 > cdw11:00000000 > >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 > >>>>> nvme0: nvme_identify_controller failed! > >>>>> nvme0: waiting > >>>>> > >>>>> Well, neither Windows not Linux have any problems with the device. I > >>>>> understand they may be hiding it or workaround somehow. > >>>>> > >>>> > >>>> Yea, I'm trying to figure out why your machine is different than mine, > >>>> and what Windows or Linux do that is different. It may be dodgy > hardware, > >>>> but others have no trouble... > >>>> > >>>> I'll try setting NVME_2X_RESET in the kernel config and report back > in a > >>>>> while. > >>>>> > >>>> > >>>> Thanks. If it helps, that tells me something. If it doesn't, that > tells > >>>> me something else. > >>>> > >>>> I suspect that it is somewhere else in the system, tbh, but I need to > >>>> find it systematically. > >>>> > >>>> Warner > >>>> > >>> > >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed > >>> anything. I. e it didn't help. > >>> > >> > >> While it would have been nice to have this be the fix, I'm not that > >> surprised either. > >> It was the biggest change of late, apart from the big re-arrangement > that > >> I'd done. > >> > >> So the other changes have moved from the occasional missing interrupt > >> message > >> (which the old code would get when a command wasn't completed in the > >> timeout > >> period, but that we found to be done when we scanned the completion > queue. > >> Now > >> the device is detected fine (as before), but then doesn't do I/O at all > >> (including not > >> completing the identify command!) and is worse. This is unexpected and > I'm > >> trying > >> understand what happens on reboot that 'changes'the working state and > why > >> the > >> new code behaves so differently. > >> > >> Warner > >> > > > > -- > Alexander Motin > Thanks for the reply. It's using the latest firmware. This is the first thing I do in such case. Attaching devinfo and lspci output. These are diffs showing the difference between clean boot and a reboot: $ diff -u devinfo.ok devinfo.nok --- devinfo.ok 2021-10-17 17:48:07.964346000 -0600 +++ devinfo.nok 2021-10-17 17:48:07.886881000 -0600 @@ -214,10 +214,6 @@ nvme0 pnpinfo vendor=0x1c5c device=0x1639 subvendor=0x1c5c subdevice=0x1639 class=0x010802 at slot=0 function=0 dbsf=pci0:59:0:0 handle=\_SB_.PCI0.RP13.PXSX Interrupt request lines: 0x85 - 0x86 - 0x87 - 0x88 - 0x89 pcib7 memory window: 0xcc100000-0xcc103fff 0xcc104000-0xcc104fff $ diff -u lspci.ok lspci.nok --- lspci.ok 2021-10-17 17:48:15.894470000 -0600 +++ lspci.nok 2021-10-17 17:48:15.341379000 -0600 @@ -132,7 +132,7 @@ Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ - Address: 00000000fee06000 Data: 0033 + Address: 00000000fee06000 Data: 0034 Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE- FLReset+