I'm running some S4610s (SSDPE2KE064T8) with firmware VDV10140.

I haven't had any problems with them in the six months I've been running them.

But I remember that around September 2017, Supermicro warned me about a 
firmware bug on the S4600 (I don't know which firmware version).
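
For anyone wanting to double-check which firmware their drives are on, a 
couple of generic commands (assuming nvme-cli and smartmontools are 
installed; the device name is just an example):

    nvme list                                # one line per NVMe device: model, serial, firmware revision
    nvme id-ctrl /dev/nvme0 | grep -i '^fr ' # the "fr" field is the controller's firmware revision
    smartctl -a /dev/nvme0                   # shows a "Firmware Version" line plus SMART health data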



----- Original Message -----
From: "David Turner" <drakonst...@gmail.com>
To: "ceph-users" <ceph-users@lists.ceph.com>
Sent: Monday, 18 February 2019 16:44:18
Subject: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems 
causing dead disks

We have 2 clusters of these disks [1], with 2 Bluestore OSDs per disk 
(partitioned), 3 disks per node, and 5 nodes per cluster. The clusters are on 
12.2.4 running CephFS and RBDs. So in total we have 15 NVMes per cluster and 
30 NVMes overall. They were all built at the same time and were running 
firmware version QDV10130. On this firmware version we had 2 disk failures 
early on, 1 more a few months later, and then, a month after that (just a few 
weeks ago), 7 disk failures in 1 week. 

The failures are such that the disk is no longer visible to the OS. This 
persists across server reboots and even after moving the failed disks into a 
new server. A firmware upgrade tool gave us an error that essentially said 
there is no way to get the data back and to RMA the disk. We upgraded all of 
our remaining disks' firmware to QDV101D1 and haven't had any problems since. 
Most of our failures happened while rebalancing the cluster after replacing 
dead disks, and we tested rigorously around that use case after upgrading the 
firmware. This firmware version seems to have resolved whatever the problem 
was. 
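
For reference, a few generic ways to confirm whether an NVMe device is still 
being enumerated by the OS at all (nothing specific to our setup, just the 
usual checks):

    lspci | grep -i 'non-volatile'   # does the PCIe device still show up on the bus?
    ls -l /dev/nvme*                 # are the controller and namespace device nodes present?
    dmesg | grep -i nvme             # kernel messages about the controller dropping off or failing to init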

We have about 100 more of these disks scattered among database servers and 
other servers that have never had this problem, running QDV10130 as well as 
firmware versions between that one and the one we upgraded to. Bluestore on 
Ceph is the only use case in which we've seen this sort of failure so far. 

Has anyone else come across this issue before? Our current theory is that 
Bluestore is accessing the disk in a way that triggers a bug in the older 
firmware version which isn't triggered by more traditional filesystems. We 
have a scheduled call with Intel to discuss this, but their preliminary 
searches into the bugfixes and known problems between firmware versions didn't 
turn up the bug we triggered. It would be good to have more information about 
how those disk access patterns might differ, so we can hopefully get a better 
answer from them about what the problem is. 


[1] 
https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html

_______________________________________________ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
