> When I first migrated to Ceph, my servers were all running CentOS 7, which I (wrongly) thought could not handle anything above Octopus,

Containerized deployments do have the advantage of less coupling to the underlying OS for dependencies, though the very latest CentOS 9 containers may have issues on the old CentOS kernel.

> and on top of that, I initially did legacy installs. So in order to run Pacific and to keep the overall clutter in the physical box configuration down, I made my Ceph hosts VMs. With cephadm, it's easier to run off the direct physical layer, but I'm likely to keep the VMs.

VMs conventionally present more overhead than containers, fwiw.

> I have added 1 or 2 hosts since then that don't virtualize ceph, but since my farm isn't big enough to justify a complete set of storage-only boxes, I'll likely continue with VMs for the foreseeable future.

Whatever floats your boat.

> This mobo is circa 2011, but the model has worked so well for my needs that I've made it the backbone for all the big boxes.

PCIe …. gen 2?

> There are 6 onboard SATA ports, capable of being set up as RAID in the BIOS, but I run them in basic mode. I finally got my ceph health totally clean this week, but I'd been seeing 1-2 PGs get corrupted overnight several times

As a function of scrubs?  Scrubs are often when latent issues are surfaced.
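If scrubs are what's turning them up, the cluster can usually point at the OSD serving the bad reads. A rough sketch, with a placeholder PG ID:

    ceph health detail                                        # lists the PGs flagged inconsistent
    rados list-inconsistent-obj 2.1a --format=json-pretty     # shows which shard/OSD reported the errors
    ceph pg repair 2.1a                                       # ask the primary to repair from an authoritative copy

When the same OSD keeps showing up in those reports, it's usually the drive or the path to it rather than Ceph.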
> and this morning came in and the entire box had powered itself off. Since I'd just pulled the CPU fan for its annual cat hair removal, there was no logical excuse for that

Don’t get me started about cats ;)

> and so I pulled the box and swapped the functioning drives to a new box. I'm going to test the RAM, and then probably swap the mobo on the retired box.
>
> One disk was definitely faulty, SMART or not, as it gave the same errors in the replacement box. The other OSD disk and the OS drive were also throwing occasional errors, but that went away on the replacement box. The I/O errors were being reported by the base OS, so I don't consider it a fault in the VM or in ceph. SMART has never been very good about giving me useful warnings before a disk blew out.

The overall pass/fail self-reported status isn’t worth much, but watching:

- SATA downshifts
- UDMA/CRC errors
- grown defects
- increased rate of LBA reallocation

can help predict some issues before the drive becomes a real problem.
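Concretely, something along these lines pulls those out of the raw output (device name is just an example):

    smartctl -a /dev/sda | grep -iE 'sata version|reallocated|pending_sector|offline_uncorrectable|udma_crc'

A link that has negotiated down (e.g. "6.0 Gb/s (current: 3.0 Gb/s)") or a CRC counter that keeps climbing tends to mean cable, backplane, or HBA trouble; climbing reallocated/pending sectors tend to mean the drive itself.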
>
> On metadata, yes, LV, VG, and PV metadata are stored in their respective storage definitions. The ceph metadata is in filesystem form on the host (/var/lib/ceph/...), but I've no doubt that ceph could find a way to replicate it into the OSD itself.

Filestore or BlueStore?
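If these are BlueStore OSDs built by ceph-volume, then as I understand it the identifying metadata is already carried on the device itself, both as LVM tags and in the BlueStore label, and /var/lib/ceph/osd/ceph-N is just a tmpfs that activation repopulates from them. Roughly (the device path below is a placeholder):

    lvs -o +lv_tags           # ceph.osd_id, ceph.osd_fsid, ceph.cluster_fsid, ...
    ceph-volume lvm list      # the same tags, grouped per OSD
    ceph-bluestore-tool show-label --dev /dev/ceph-<vg>/osd-block-<lv>   # label stored inside BlueStore

which is why a plain "ceph-volume lvm activate --all" on the new box is usually enough to bring a transplanted OSD back up, CRUSH placement aside.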
>
> Tim
>
> On 4/12/25 11:13, Anthony D'Atri wrote:
>>
>>> Apparently those UUIDs aren't as reliable as I thought.
>>>
>>> I've had problems with a server box that hosts a ceph VM.
>> VM?
>>
>>> Looks like the mobo disk controller is unreliable
>> Lemme guess, it is an IR / RoC / RAID type? As opposed to JBOD / IT?
>>
>> If the former and it’s an LSI SKU as most are, I’d love if you could send me privately the output of
>>
>> storcli64 /c0 show termlog >/tmp/termlog.txt
>>
>> Sometimes flakiness is actually with the drive backplane, especially when it has an embedded expander. In either case, updating HBA firmware sometimes makes a real difference.
>>
>> And drive firmware.
>>
>>> AND one of the disks passes SMART
>> I’m curious if it shows SATA downshifts.
>>
>>> but has interface problems. So I moved the disks to an alternate box.
>>>
>>> Between relocation and dropping the one disk, neither of the 2 OSDs for that host will come up. If everything was running solely on static UUIDs, the good disk should have been findable even if its physical disk device name shifted. But it wasn't.
>> Did you try
>>
>> ceph-volume lvm activate --all
>>
>> ?
>>
>>> Which brings up something I've wondered about for some time. Shouldn't it be possible for OSDs to be portable?
>> I haven’t tried it much, but that *should* be true, modulo CRUSH location.
>>
>>> That is, if a box goes bad, in theory I should be able to remove the drive and jack it into a hot-swap bay on another server and have that server able to import the relocated OSD.
>> I’ve effectively done a chassis swap, moving all the drives including the boot volume, but that admittedly was in the ceph-disk days.
>>
>>> True, the metadata for an OSD is currently located on its host, but it seems like it should be possible to carry a copy on the actual device.
>> My limited understanding is that *is* the case with LVM.
>>
>>> Tim
>>>
>>> On 4/11/25 16:23, Anthony D'Atri wrote:
>>>> Filestore, pre-ceph-volume, may have been entirely different. IIRC LVM is used these days to exploit persistent metadata tags.
>>>>
>>>>> On Apr 11, 2025, at 4:03 PM, Tim Holloway <t...@mousetech.com> wrote:
>>>>>
>>>>> I just checked an OSD and the "block" entry is indeed linked to storage using a /dev/mapper uuid LV, not a /dev/device. When ceph builds an LV-based OSD, it creates a VG whose name is "ceph-uuuuu", where "uuuu" is a UUID, and an LV named "osd-block-vvvv", where "vvvv" is also a uuid. So although you'd map the osd to something like /dev/vdb in a VM, the actual name ceph uses is uuid-based (and lvm-based) and thus not subject to change with alterations in the hardware, as the uuids are part of the metadata in VGs and LVs created by ceph.
>>>>>
>>>>> Since I got that from a VM, I can't vouch for all cases, but I thought it especially interesting that ceph was creating LVM counterparts even for devices that were not themselves LVM-based.
>>>>>
>>>>> And yeah, I understand that it's the amount of OSD replica data that counts more than the number of hosts, but when an entire host goes down and there are few hosts, that can take a large bite out of the replicas.
>>>>>
>>>>> Tim
>>>>>
>>>>> On 4/11/25 10:36, Anthony D'Atri wrote:
>>>>>> I thought those links were to the by-uuid paths for that reason?
>>>>>>
>>>>>>> On Apr 11, 2025, at 6:39 AM, Janne Johansson <icepic...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Fri, 11 Apr 2025 at 09:59, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>>>>>>>> Filestore IIRC used partitions, with cute hex GPT types for various states and roles. Udev activation was sometimes problematic, and LVM tags are more flexible and reliable than the prior approach. There no doubt is more to it but that’s what I recall.
>>>>>>> Filestore used to have softlinks towards the journal device (if used) which pointed to sdX where that X of course would jump around if you changed the number of drives on the box, or the kernel disk detection order changed, breaking the OSD.
>>>>>>>
>>>>>>> --
>>>>>>> May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io