Hi,

If QDV10130 pre-dates Feb/March 2018, you may have suffered the same firmware
bug as existed on the DC S4600 series. I'm under NDA so I can't bitch and moan
about specifics, but your symptoms sound very familiar.
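If you want to sanity-check which firmware revision the rest of your drives
report before going further, something along these lines should work (a rough,
untested sketch - it assumes nvme-cli is installed and that its JSON output has
the usual "Devices" / "Firmware" fields, which can vary between nvme-cli
versions):

    #!/usr/bin/env python3
    # Sketch: list NVMe devices and the firmware revision each one reports,
    # flagging the revision discussed in this thread. Assumes `nvme list -o json`
    # is available and emits a "Devices" array with "DevicePath", "ModelNumber"
    # and "Firmware" fields (field names may differ between nvme-cli versions).
    import json
    import subprocess

    SUSPECT = "QDV10130"  # firmware revision discussed in this thread

    out = subprocess.run(["nvme", "list", "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    for dev in json.loads(out).get("Devices", []):
        fw = dev.get("Firmware", "?")
        flag = "  <-- suspect firmware" if fw == SUSPECT else ""
        print(f'{dev.get("DevicePath", "?")}  {dev.get("ModelNumber", "?")}  {fw}{flag}')
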
It's entirely possible that there's *something* about bluestore's access
patterns that differs from "regular filesystems". We burnt ourselves with the
DC S4600, which had been burn-in tested (I was told) - but the burn-in testing
was probably done through filesystems rather than ceph/bluestore.

Previously discussed around here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023835.html

On Mon, Feb 18, 2019 at 7:44 AM David Turner <drakonst...@gmail.com> wrote:
>
> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
> (partitioned), 3 disks per node, 5 nodes per cluster. The clusters are
> 12.2.4 running CephFS and RBDs. So in total we have 15 NVMe's per cluster
> and 30 NVMe's in total. They were all built at the same time and were
> running firmware version QDV10130. On this firmware version we had 2 disk
> failures early on, a few months later we had 1 more, and then a month after
> that (just a few weeks ago) we had 7 disk failures in 1 week.
>
> The failures are such that the disk is no longer visible to the OS. This
> holds true beyond server reboots as well as placing the failed disks into a
> new server. With a firmware upgrade tool we got an error that pretty much
> said there's no way to get the data back and to RMA the disk. We upgraded
> all of our remaining disks' firmware to QDV101D1 and haven't had any
> problems since then. Most of our failures happened while rebalancing the
> cluster after replacing dead disks, and we tested rigorously around that
> use case after upgrading the firmware. This firmware version seems to have
> resolved whatever the problem was.
>
> We have about 100 more of these scattered among database servers and other
> servers that have never had this problem while running the QDV10130
> firmware, as well as firmwares between this one and the one we upgraded to.
> Bluestore on Ceph is the only use case we've had so far with this sort of
> failure.
>
> Has anyone else come across this issue before? Our current theory is that
> Bluestore is accessing the disk in a way that triggers a bug in the older
> firmware version that isn't triggered by more traditional filesystems. We
> have a scheduled call with Intel to discuss this, but their preliminary
> searches into the bugfixes and known problems between firmware versions
> didn't indicate the bug that we triggered. It would be good to have some
> more information about what those differences in disk access might be, to
> hopefully get a better answer from them as to what the problem is.
>
> [1]
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html

--
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com