On Mon, 2020-12-14 at 20:19 +1100, Eyal Lebedinsky wrote:
> 
> 
> On 14/12/2020 13.20, Chris Murphy wrote:
> > On Sun, Dec 13, 2020 at 4:42 PM Eyal Lebedinsky
> > <fed...@eyal.emu.id.au> wrote:
> > > 
> > > I am not sure which list this should go to, so I am starting
> > > here.
> > > 
> > > I run f32 fully updated
> > >          5.9.13-100.fc32.x86_64
> > > on relatively new hardware
> > >          kernel: DMI: Gigabyte Technology Co., Ltd. Z390 UD/Z390 UD, BIOS F8 05/24/2019
> > 
> > > boot/root/swap/data is on nvme
> > >          WD Blue SN550 1TB M.2 2280 NVMe SSD WDS100T2B0C
> > 
> > I can't tell from WD's website if there's any newer firmware
> > available. They seem to hide this information within the
> > Windows-only software "Western Digital Dashboard". If you have
> > Windows already installed, it's straightforward to install this and
> > find out if the firmware is up to date.
> 
> Option 1) My NVMe disk is on the mobo, which has only one slot. I
> have access to a Windows laptop, but I will also need an external
> NVMe/USB adapter - will the Dashboard work that way? Will a fw
> update leave the disk content safe?
> 
> I will try something else first.
> 
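One way to sidestep the Dashboard question: the drive's current
firmware revision is readable from Linux with nvme-cli (package
nvme-cli on Fedora). A rough sketch, assuming the drive is /dev/nvme0
as in your logs:

    # firmware revision is the "fr" field
    sudo nvme id-ctrl /dev/nvme0 | grep ^fr
    # firmware slot log, including which slot is active
    sudo nvme fw-log /dev/nvme0
    # smartmontools reports it too
    sudo smartctl -i /dev/nvme0

That at least tells you what you are running before you decide whether
an update is worth the adapter dance. Firmware updates are supposed to
preserve user data, but I would still take a backup first.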
> > There is a boot parameter 'nvme_core.default_ps_max_latency_us'
> > which takes a value in usec, but I can't find a value specific to
> > this make/model of NVMe. My gut instinct is that it's a hack put in
> > by upstream kernel developers in lieu of a proper autodetect
> > solution between the PCIe interface and the drive. I would sooner
> > return the drive and get one known to work. I can vouch for
> > Crucial, Seagate, and Samsung SSD and NVMe for the most part.
> 
> Option 2) Reading the reports (and more) I decided to test the boot
> param
>         nvme_core.default_ps_max_latency_us=0
> which I understand turns off the APST feature.
> 
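If the test works out, the usual Fedora way to make that parameter
persistent across kernel updates is grubby. A minimal sketch, assuming
you want it applied to every installed kernel:

    # add the parameter to all installed kernels' boot entries
    sudo grubby --update-kernel=ALL \
        --args="nvme_core.default_ps_max_latency_us=0"
    # after a reboot, confirm it took effect
    cat /proc/cmdline
    cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

With the value at 0 the driver should never enable APST on the drive,
at the cost of somewhat higher idle power draw.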
> > Oh here's a bug report
> > https://bugzilla.redhat.com/show_bug.cgi?id=1844905
> > 
> > That leads here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=208123
> > 
> > Comment 1 is a more solid lead than comment 2, because comment 2
> > gives a value that is based on what? A guess? Reading the rest of
> > the thread, it's still uncertain.
> > 
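For anyone wondering where a non-zero value would come from: as I read
the nvme driver, it skips any power state whose advertised exit
latency exceeds the limit, so the usual approach is to dump the
drive's power state table and pick a cutoff just below the deepest
(most troublesome) state. A sketch, again assuming /dev/nvme0:

    # power states are listed at the end; enlat/exlat are in usec
    sudo nvme id-ctrl -H /dev/nvme0 | grep '^ps'

If, say, ps 4 reports exlat:10000, any limit below 10000 should keep
the drive out of ps 4 while still allowing the shallower states.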
> > 
> > > For the second time this disk stopped working (the first was
> > > about two months ago). It seems that the disk failed hard and
> > > could not be reset, so the machine was powered off/on. I think
> > > (not sure) that last time I just hit the reset button but it did
> > > not boot.
> > > 
> > > The machine was booted (after dnf update) around 8pm, and crashed
> > > at 4am.
> > > 
> > > Following the earlier crash, a serial console was set up, which
> > > is how I can see the failure messages.
> > > 
> > > == nvme related messages
> > > [    7.488638] nvme nvme0: pci function 0000:06:00.0
> > > [    7.536593] nvme nvme0: allocated 32 MiB host memory buffer.
> > > [    7.541819] nvme nvme0: 8/0/0 default/read/poll queues
> > > [    7.558122]  nvme0n1: p1 p2 p3 p4
> > > [   19.590010] EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: (null)
> > > [   20.653500] Adding 16777212k swap on /dev/nvme0n1p2.  Priority:-2 extents:1 across:16777212k SSFS
> > > [   20.820539] EXT4-fs (nvme0n1p3): re-mounted. Opts: (null)
> > > [   23.137206] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
> > > [   23.210717] EXT4-fs (nvme0n1p4): mounted filesystem with ordered data mode. Opts: (null)
> > > ## nothing unusual for 8 hours, then
> > > [28972.459036] nvme nvme0: I/O 840 QID 6 timeout, aborting
> > > [28972.464757] nvme nvme0: I/O 565 QID 7 timeout, aborting
> > > [28972.470277] nvme nvme0: I/O 566 QID 7 timeout, aborting
> > > [28973.291025] nvme nvme0: I/O 989 QID 1 timeout, aborting
> > > [28978.603061] nvme nvme0: I/O 990 QID 1 timeout, aborting
> > > [29002.667243] nvme nvme0: I/O 840 QID 6 timeout, reset controller
> > > [29032.875421] nvme nvme0: I/O 24 QID 0 timeout, reset controller
> > > [29074.097644] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> > > [29074.110354] nvme nvme0: Abort status: 0x371
> > > [29074.114953] nvme nvme0: Abort status: 0x371
> > > [29074.119523] nvme nvme0: Abort status: 0x371
> > > [29074.124114] nvme nvme0: Abort status: 0x371
> > > [29074.128710] nvme nvme0: Abort status: 0x371
> > > [29096.645478] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> > > [29096.652210] nvme nvme0: Removing after probe failure status: -19
> > > [29119.165921] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> > > ## many I/O errors on nvme0 (p2/p3/p4) repeating until a reboot at 8:30am
> > > ## one different message, appearing just once:
> > > [29123.800844] nvme nvme0: failed to set APST feature (-19)
> > 
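A side note on that last message: you can see whether APST is actually
enabled, and what transition table the kernel programmed, with
nvme-cli (feature 0x0c is Autonomous Power State Transition). A
sketch, assuming /dev/nvme0 again:

    # dump the APST feature; -H decodes the transition table
    sudo nvme get-feature /dev/nvme0 -f 0x0c -H

Comparing that output with and without the boot parameter set should
confirm whether the workaround really took effect.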
> > I'd take the position that it's defective and permit the
> > manufacturer a short leash to convince me otherwise via a tech
> > support call or email. But I really wouldn't just wait around for
> > another 2 months not knowing if it's going to fail again. I'd like
> > some kind of answer for this problem from support folks. And if
> > they can't give support, get rid of it.
> > 
> > The time frame for a repeat of the problem is why I'm taking a
> > slightly different view than the tinker-with-firmware view earlier.
> > It's not horrible to update firmware and give it a go if this
> > problem happens once a week or more often. But every two months?
> > Forget it. Make it their problem.
> > 
> > And seriously, I give them one chance. If they b.s. me and it
> > flakes out again in a month or two, no more chances. So the
> > quandary is, what's your return policy window? If it's about to
> > end, just return it now. It should just work out of the box. WDC
> > does contribute to the kernel. Whether this is a product supported
> > on Linux I don't know.
> 
> Option 3) Get a new disk from a reliable brand (as mentioned in this
> thread) and keep this one as a spare. I will do this if the problem
> happens again.
> 
> I will log an issue with WD and see what they have to say.
> 
> Thanks everyone
> 
> -- 
> Eyal Lebedinsky (fed...@eyal.emu.id.au)

You mentioned only a single NVMe slot on your motherboard. If you have
an available PCIe slot, there's a nifty adapter you can buy for a
second NVMe drive:
 
https://www.amazon.com/GLOTRENDS-Adapter-Aluminum-Heatsink-PA09_HS/dp/B07FN3YZ8P

--Doc Savage
    Fairview Heights, IL