No, but I do know that if the wear leveling isn't right, I wouldn't expect them to last long. Firmware updates on SSDs are very important.
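For what it's worth, a quick way to audit this across hosts is to parse `nvme list -o json` and flag anything still on an old firmware. A minimal sketch, assuming nvme-cli is installed; the JSON field names are what I've seen from common nvme-cli builds and may differ on yours, and the "suspect" firmware set is just the version from this thread:

#!/usr/bin/env python3
"""Rough sketch: list NVMe device path, model, and firmware on one host so
drives still running an old firmware stand out. Assumes nvme-cli is
installed; JSON layout and field names vary somewhat between nvme-cli
versions, so check `nvme list -o json` output on your own boxes first."""
import json
import subprocess

# Firmware versions to flag -- taken from this thread, adjust as needed.
SUSPECT_FIRMWARE = {"QDV10130"}

def list_nvme_devices():
    # `nvme list -o json` prints a {"Devices": [...]} document on the
    # nvme-cli versions I've used.
    out = subprocess.run(
        ["nvme", "list", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out).get("Devices", [])

def main():
    for dev in list_nvme_devices():
        path = dev.get("DevicePath", "?")
        model = dev.get("ModelNumber", "?")
        fw = dev.get("Firmware", "?")
        flag = "  <-- suspect firmware" if fw in SUSPECT_FIRMWARE else ""
        print(f"{path}  {model}  {fw}{flag}")

if __name__ == "__main__":
    main()

Run it on each OSD node (ssh loop, ansible, whatever you use) and you get a quick fleet-wide view of who is still on the bad firmware.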
On Mon, Feb 18, 2019 at 7:44 AM David Turner <drakonst...@gmail.com> wrote:
> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
> (partitioned), 3 disks per node, 5 nodes per cluster. The clusters are on
> 12.2.4, running CephFS and RBDs. So in total we have 15 NVMes per cluster
> and 30 NVMes in total. They were all built at the same time and were
> running firmware version QDV10130. On this firmware version we had 2 disk
> failures early on, a few months later we had 1 more, and then a month
> after that (just a few weeks ago) we had 7 disk failures in 1 week.
>
> The failures are such that the disk is no longer visible to the OS. This
> holds true beyond server reboots, as well as placing the failed disks into
> a new server. With a firmware upgrade tool we got an error that pretty
> much said there's no way to get the data back and to RMA the disk. We
> upgraded all of our remaining disks' firmware to QDV101D1 and haven't had
> any problems since then. Most of our failures happened while rebalancing
> the cluster after replacing dead disks, and we tested rigorously around
> that use case after upgrading the firmware. This firmware version seems to
> have resolved whatever the problem was.
>
> We have about 100 more of these scattered among database servers and other
> servers that have never had this problem while running the QDV10130
> firmware, as well as firmwares between that one and the one we upgraded
> to. Bluestore on Ceph is the only use case we've had so far with this sort
> of failure.
>
> Has anyone else come across this issue before? Our current theory is that
> Bluestore is accessing the disk in a way that triggers a bug in the older
> firmware version that isn't triggered by more traditional filesystems. We
> have a scheduled call with Intel to discuss this, but their preliminary
> searches into the bugfixes and known problems between firmware versions
> didn't turn up the bug we triggered. It would be good to have some more
> information about what those differences in disk access patterns might be,
> to hopefully get a better answer from them as to what the problem is.
>
> [1]
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
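Not an answer to the firmware question, but when one of these drops out it helps to be able to trace an OSD id back to a host and physical device quickly. A rough sketch along the lines of what we use; the bluestore_bdev_* keys are what I recall from Luminous metadata dumps and should be verified against your own `ceph osd metadata` output before relying on them:

#!/usr/bin/env python3
"""Rough sketch: map each OSD id to its host and BlueStore block device by
parsing `ceph osd metadata`, so a dead OSD can be matched to a physical
NVMe (and from there to its firmware). The bluestore_bdev_* key names are
an assumption based on Luminous-era output; confirm on your cluster."""
import json
import subprocess

def osd_metadata():
    # `ceph osd metadata` with no OSD id dumps an array with one entry
    # per OSD; --format json gives machine-readable output.
    out = subprocess.run(
        ["ceph", "osd", "metadata", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def main():
    for osd in osd_metadata():
        print(
            "osd.{id}  host={host}  dev={dev}  model={model}".format(
                id=osd.get("id", "?"),
                host=osd.get("hostname", "?"),
                dev=osd.get("bluestore_bdev_dev_node", "?"),
                model=osd.get("bluestore_bdev_model", "?"),
            )
        )

if __name__ == "__main__":
    main()

Combined with a per-host firmware listing, that makes it straightforward to say which firmware each failed OSD was sitting on.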
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com