Hi,

If QDV10130 predates February/March 2018, you may have suffered the same
firmware bug that existed on the DC S4600 series. I'm under NDA so I
can't bitch and moan about the specifics, but your symptoms sound very
familiar.

It's entirely possible that there's *something* about bluestore whose
access patterns differ from those of "regular" filesystems. We burnt
ourselves with the DC S4600, which (I was told) had been burn-in
tested - but that burn-in testing was probably done through filesystems
rather than ceph/bluestore.
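
As a rough illustration of what I mean by different access patterns
(this is only a sketch with placeholder paths, not how bluestore
actually issues its I/O): as I understand it, bluestore talks to the
raw block device with O_DIRECT, bypassing the page cache, while a
filesystem-based burn-in typically goes through the page cache and the
filesystem's own allocation. Something along these lines exercises the
two paths (Linux-only, python3):

# Sketch only: contrast O_DIRECT reads against a raw block device
# (roughly the way bluestore bypasses the page cache) with ordinary
# buffered reads through a file on a filesystem.  The paths are
# placeholders -- point them at a scratch device/file, never at a
# production OSD.
import mmap
import os

BLOCK = 4096  # O_DIRECT wants block-aligned buffers, offsets and sizes

def direct_reads(dev_path="/dev/nvme0n1", blocks=1024):
    """Block-aligned 4 KiB reads with O_DIRECT (no page cache)."""
    fd = os.open(dev_path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mmap is page-aligned
    try:
        for i in range(blocks):
            os.lseek(fd, i * BLOCK, os.SEEK_SET)
            os.readv(fd, [buf])
    finally:
        os.close(fd)

def buffered_reads(file_path="/mnt/scratch/testfile", blocks=1024):
    """Ordinary buffered reads through the filesystem/page cache."""
    with open(file_path, "rb") as f:
        for _ in range(blocks):
            f.read(BLOCK)

The point being that a burn-in that only ever took the second path may
never have hit whatever the firmware tripped over on the first.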

This was previously discussed on this list:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023835.html

On Mon, Feb 18, 2019 at 7:44 AM David Turner <drakonst...@gmail.com> wrote:
>
> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
> (partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are on
> 12.2.4, running CephFS and RBDs.  So in total we have 15 NVMes per cluster
> and 30 NVMes overall.  They were all built at the same time and were
> running firmware version QDV10130.  On this firmware version we had 2 disk
> failures early on, a few months later we had 1 more, and then a month after
> that (just a few weeks ago) we had 7 disk failures in 1 week.
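
(If I'm reading that right: 2 OSDs/disk x 3 disks/node x 5 nodes = 30
OSDs and 15 NVMe drives per cluster, so 30 drives across the two
clusters.)
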
>
> The failures are such that the disk is no longer visible to the OS.  This
> holds true across server reboots as well as after moving the failed disks
> into a new server.  With a firmware upgrade tool we got an error that pretty much
> said there's no way to get data back and to RMA the disk.  We upgraded all of 
> our remaining disks' firmware to QDV101D1 and haven't had any problems since 
> then.  Most of our failures happened while rebalancing the cluster after 
> replacing dead disks and we tested rigorously around that use case after 
> upgrading the firmware.  This firmware version seems to have resolved 
> whatever the problem was.
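
For anyone else wanting to check whether they still have drives on the
affected firmware, something along these lines should do it (a sketch;
it assumes nvme-cli is installed, and the JSON field names may differ
between nvme-cli versions, so adjust to whatever `nvme list -o json`
prints on your boxes):

# Sketch: flag NVMe drives still running the firmware revision that
# caused trouble.  "Devices"/"DevicePath"/"ModelNumber"/"Firmware" are
# the field names I'd expect from `nvme list -o json`; verify against
# your nvme-cli version.
import json
import subprocess

BAD_FIRMWARE = {"QDV10130"}

def drives_on_bad_firmware():
    out = subprocess.run(
        ["nvme", "list", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for dev in json.loads(out).get("Devices", []):
        if dev.get("Firmware") in BAD_FIRMWARE:
            yield dev.get("DevicePath"), dev.get("ModelNumber"), dev.get("Firmware")

if __name__ == "__main__":
    for path, model, fw in drives_on_bad_firmware():
        print(f"{path}: {model} still on firmware {fw}")
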
>
> We have about 100 more of these scattered among database servers and other
> servers that have never had this problem while running the QDV10130 firmware,
> as well as firmware versions between that one and the one we upgraded to.
> Bluestore on Ceph is the only use case where we've seen this sort of failure.
>
> Has anyone else come across this issue before?  Our current theory is that 
> Bluestore is accessing the disk in a way that is triggering a bug in the 
> older firmware version that isn't triggered by more traditional filesystems.  
> We have a scheduled call with Intel to discuss this, but their preliminary 
> searches into the bugfixes and known problems between firmware versions 
> didn't indicate the bug that we triggered.  It would be good to have some
> more information about what those differences in disk access patterns might
> be, so we can hopefully get a better answer from them as to what the problem
> is.
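
One cheap way to put some numbers on that before the call might be to
sample /proc/diskstats on a bluestore OSD device and on a comparable
filesystem-backed device, and compare average request sizes over the
same workload window.  A sketch (device names are placeholders):

# Sketch: sample /proc/diskstats twice and report I/O counts and
# average request sizes over the interval, as a crude way to compare
# how a bluestore OSD device is driven vs. a filesystem-backed one.
import time

def diskstats(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == dev:
                # reads completed, sectors read, writes completed, sectors written
                return (int(parts[3]), int(parts[5]),
                        int(parts[7]), int(parts[9]))
    raise ValueError(f"{dev} not found in /proc/diskstats")

def avg_request_sizes(dev, interval=10):
    r0, sr0, w0, sw0 = diskstats(dev)
    time.sleep(interval)
    r1, sr1, w1, sw1 = diskstats(dev)
    reads, writes = r1 - r0, w1 - w0
    avg_read_kib = (sr1 - sr0) * 512 / 1024 / reads if reads else 0.0
    avg_write_kib = (sw1 - sw0) * 512 / 1024 / writes if writes else 0.0
    return reads, avg_read_kib, writes, avg_write_kib

for dev in ("nvme0n1", "sda"):  # e.g. an OSD device and a filesystem device
    print(dev, avg_request_sizes(dev))

It won't show queue depths or flush behaviour, but it's a start, and
blktrace can fill in the rest if the averages look interesting.
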
>
>
> [1] 
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html



-- 
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
