> I have seen drives work absolutely fine past their "100% used" indicator in
> SMART, but on a power cycle they flat out refuse to enumerate on the bus.
> The team that ran into this was lucky enough to catch it on one machine first,
> so they could grab the data before rebooting the other hosts. This was also
> many years ago, so I hope their firmware does something different now.

In recent years the trend has been NOT to hard-disable drives at the end of their
rated PE cycles, but this is still decided on a SKU-by-SKU basis.
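Given the anecdote above, it's worth checking wear levels before any planned power cycle. A minimal sketch of parsing nvme-cli's JSON SMART log for the endurance counter, assuming `nvme smart-log /dev/nvme0 -o json` output (the sample values here are invented for illustration):

```python
import json

# Invented sample of what `nvme smart-log /dev/nvme0 -o json` might return;
# field names follow nvme-cli's JSON report, values are placeholders.
sample = '{"critical_warning": 0, "percent_used": 97, "media_errors": 0}'

log = json.loads(sample)
# percent_used can legitimately exceed 100 on drives run past their rated
# PE cycles -- exactly the drives most at risk on the next power cycle.
if log["percent_used"] >= 90:
    print(f"WARNING: {log['percent_used']}% of rated endurance used "
          "-- copy data off before the next power cycle")
```

In a real fleet you'd feed this from `subprocess` output per device rather than a string literal.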

> 
> What does the health look like on the remaining drives?   How long were the 
> dead ones in service?

That SKU is from 2018 but was rated at 10 DWPD, so I suspect it's not a lifetime
issue as such; perhaps firmware.

> 
> -paul
> 
> --
> 
> Paul Mezzanini
> Platform Engineer III
> Research Computing
> Rochester Institute of Technology
> 
> 
> 
> ________________________________________
> From: Frédéric Nass <frederic.n...@univ-lorraine.fr>
> Sent: Wednesday, April 23, 2025 10:09 AM
> To: Paul Browne
> Cc: ceph-users; Anthony D'Atri
> Subject: [ceph-users] Re: Cluster recovery: DC power failure killed OSD node 
> BlueStore block.DB devices
> 
> Hi Paul,
> 
> I'm continuing this thread here, following Anthony's insightful remarks and 
> offer to help.
> 
> I find it hard to believe that enterprise-grade NVMe drives would fail during 
> a power outage, unless there's an issue with the NVMe or HBA firmware. I 
> recommend opening a support case with DELL, HPE, or whatever manufacturer 
> made your server.
> 
> Before doing that, try these troubleshooting steps:
> 
> - Shut down the server completely
> - Disconnect all power cables for at least 10 minutes
> - Restart the server (this might resolve temporary discovery issues during 
> boot)
> 
> If the drives reappear during startup, you may need to 'import' them in the 
> boot process. Watch for a message on the console prompting you to do this.
> If these steps don't help, try upgrading all firmware on the server.
> 
> I've seen 'dead' DELL SSDs (Toshiba) come back to life after a firmware 
> upgrade, even when marked as dead in iDrac. See [1] for details.
> 
> Ultimately, your best course of action is to open a support case with your 
> hardware manufacturer.
> 
> Regards,
> Frédéric.
> 
> [1] https://www.spinics.net/lists/ceph-users/msg78647.html
> 
> ----- On 23 Apr 25, at 15:37, Anthony D'Atri anthony.da...@gmail.com wrote:
> 
>>> The failed SSD disks seem to be quite dead unfortunately, not visible to 
>>> the OS
>>> and also marked as dead in the node iDRAC BMC.
>> 
>> I’ve found that iDRAC’s view of SSDs is sometimes … imperfect, but not 
>> visible
>> to the OS is telling.
>> 
>> If you could send me
>> 
>>      storcli64 /c0 show termlog >/var/tmp/termlog.txt      # or perccli64
>>      storcli64 /c0 show all
>> 
>> I’d love to take a look and see if the HBA has any additional information.
>> 
>> One possible though unlikely scenario is that the lost drives had a firmware
>> flaw while the surviving drives had a newer revision.
>> 
>> 
>> 
>>> We haven't tried moving them to a different node to test though, I can try 
>>> that.
>>> 
>>> In this power event we lost all of the SSD devices on 2 out of 3 OSD nodes 
>>> in
>>> the cluster (it was a small testing cluster) and half of them on the 3rd OSD
>>> node.
>>> 
>>> So the vast majority of OSDs can't start here and the overall cluster state 
>>> is
>>> extremely degraded.
>>> 
>>> So if there is state contained within the old, dead DB devices that can't be
>>> directly replaced with the instantiation of new replacement DB devices, then
>>> it's looking like we've just lost too many DB devices in one fell swoop to
>>> ever
>>> recover this Ceph cluster, despite the OSD HDDs all being clean+untouched by
>>> the power event.
>>> 
>>> I had been hoping that the DB state was more ephemeral than it seems to be, 
>>> and
>>> so instantiation of new DB devices mapped to the correct OSD devices (via 
>>> LUKS
>>> key) would allow for restarting the down+out OSD devices. But that's
>>> increasingly looking to not be possible, from updates on this thread.
>>> 
>>> *******************
>>> Paul Browne
>>> Research Computing Platforms
>>> University Information Services
>>> Roger Needham Building
>>> JJ Thompson Avenue
>>> University of Cambridge
>>> Cambridge
>>> United Kingdom
>>> E-Mail: pf...@cam.ac.uk<mailto:pf...@cam.ac.uk>
>>> Tel: 0044-1223-746548
>>> *******************
>>> ________________________________
>>> From: Frédéric Nass <frederic.n...@univ-lorraine.fr>
>>> Sent: 23 April 2025 11:24
>>> To: Paul Browne <pf...@cam.ac.uk>
>>> Cc: ceph-users <ceph-users@ceph.io>
>>> Subject: Re: [ceph-users] Cluster recovery: DC power failure killed OSD node
>>> BlueStore block.DB devices
>>> 
>>> Hi Paul,
>>> 
>>> Could you provide more details about the 'SSD BlueStore block.DB devices 
>>> dead'
>>> issue?
>>> 
>>> Are these devices not seen or seen as defective at the hardware level 
>>> (through
>>> iLO, iDrac, etc.)? Or are they visible to the operating system but their
>>> associated OSDs are failing to start?
>>> If you can't bring these RocksDB devices back online, associated OSDs will 
>>> be
>>> permanently dead.
>>> 
>>> Regards,
>>> Frédéric.
>>> 
>>> ----- On 22 Apr 25, at 23:11, Paul Browne pf...@cam.ac.uk wrote:
>>> 
>>>> Hi ceph-users,
>>>> 
>>>> We recently suffered a total power failure at our main DC; fortunately, our
>>>> production Ceph cluster emerged unscathed but a smaller Ceph cluster came 
>>>> back
>>>> with the majority of its dedicated SSD BlueStore block.DB devices dead 
>>>> (but its
>>>> HDD OSD devices unharmed). This cluster underpinned a small OpenStack 
>>>> cloud, so
>>>> it would be preferable to recover it rather than writing it off.
>>>> 
>>>> In terms of deployment tooling, this ailing Ceph cluster is a fairly 
>>>> standard
>>>> Red Hat Ceph Storage 7 (so Quincy) cephadm deployed cluster, with the main
>>>> wrinkle about it being that both the DB and HDD OSD devices make use of the
>>>> cephadm supported LVM->LUKS layering above the BlueStore devices.
>>>> 
>>>> The dead BlueStore block.DB devices are of course blocking the surviving 
>>>> HDD OSD
>>>> daemons (in cephadm deployed containers) from coming up cleanly and so the 
>>>> Ceph
>>>> cluster status is currently very degraded (attached status for the ugly
>>>> picture)
>>>> 
>>>> I've kicked around some ideas of recovering the dead DB devices and 
>>>> restarting
>>>> down+out OSDs by;
>>>> 
>>>>   * Manually partitioning replacement SSDs into new DB device partitions+LVs
>>>>   * Installing on them the same LUKS keys, retrieved from the Ceph config DB,
>>>>     matching up against which OSD is on which OSD host
>>>>   * Manually changing over device links for OSDs to their DB device with
>>>>     "ceph-bluestore-tool bluefs-bdev-new-db" or similar
>>>>   * Restarting OSDs with updated links to the new LVM->LUKS->block.DB devices
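For reference, the attach step in the quoted plan could be sketched as below. This is a dry run only (it echoes the command instead of running it); the OSD id and device paths are placeholders, not values from this cluster, and note the caveat elsewhere in the thread that a brand-new, empty DB volume cannot restore metadata that existed only on the dead one:

```shell
#!/bin/sh
# Dry-run sketch of the "attach a new DB device" step; nothing here touches disks.
# OSD id and device paths are hypothetical placeholders.
OSD_ID=12
OSD_PATH=/var/lib/ceph/osd/ceph-$OSD_ID
NEW_DB_LV=/dev/mapper/luks-newdb-osd-$OSD_ID   # freshly created, LUKS-opened DB LV

# The command that would attach the new (empty) DB volume to the surviving
# block device -- echoed only; run it yourself once paths are verified:
CMD="ceph-bluestore-tool bluefs-bdev-new-db --path $OSD_PATH --dev-target $NEW_DB_LV"
echo "$CMD"
```

The OSD must be stopped before running the real command, and cephadm deployments typically need this wrapped in `cephadm shell` to see the OSD data directory.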
>>>> 
>>>> This approach seems highly messy and subject to needing to extract a lot of
>>>> information error-free from dumps of "ceph-volume lvm list" in order to
>>>> exactly match extant OSD UUIDs to newly created DB devicemapper devices.
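The error-prone matching step described above can at least be scripted rather than copied by hand. A hedged Python sketch, assuming the JSON report shape of `ceph-volume lvm list --format json` (a dict of OSD id to a list of LV entries; the ids, fsids, and paths below are invented):

```python
import json

# Invented excerpt of `ceph-volume lvm list --format json`; structure mirrors
# ceph-volume's JSON report, all values are placeholders.
sample = """
{
  "12": [
    {"type": "block", "lv_path": "/dev/ceph-hdd/osd-block-12",
     "tags": {"ceph.osd_fsid": "aaaa-1111", "ceph.osd_id": "12"}},
    {"type": "db", "lv_path": "/dev/ceph-ssd/osd-db-12",
     "tags": {"ceph.osd_fsid": "aaaa-1111", "ceph.osd_id": "12"}}
  ]
}
"""

# Build an OSD id -> (osd_fsid, db LV path) map, so a replacement DB device
# can be paired with the right surviving block device without hand-copying UUIDs.
mapping = {}
for osd_id, lvs in json.loads(sample).items():
    fsid = lvs[0]["tags"]["ceph.osd_fsid"]
    db = next((lv["lv_path"] for lv in lvs if lv["type"] == "db"), None)
    mapping[osd_id] = (fsid, db)

print(mapping)
```

In practice the input would come from running ceph-volume on each OSD host, and a `db` entry of `None` flags an OSD whose DB device is gone.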
>>>> 
>>>> Is there going to be some smarter/better/faster way to non-destructively 
>>>> recover
>>>> these intact HDD OSDs which have links to dead block.DB devices, using 
>>>> native
>>>> cephadm tooling rather than getting so low-level as all the above?
>>>> 
>>>> Many thanks for any advice,
>>>> 
>>>> *******************
>>>> Paul Browne
>>>> Research Computing Platforms
>>>> University Information Services
>>>> Roger Needham Building
>>>> JJ Thompson Avenue
>>>> University of Cambridge
>>>> Cambridge
>>>> United Kingdom
>>>> E-Mail: pf...@cam.ac.uk<mailto:pf...@cam.ac.uk>
>>>> Tel: 0044-1223-746548
>>>> *******************
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io