When I first migrated to Ceph, my servers were all running CentOS 7,
which I (wrongly) thought could not handle anything above Octopus, and
on top of that, I initially did legacy installs. So in order to run
Pacific and to keep the overall clutter in the physical box
configuration down, I made my Ceph hosts VMs. With cephadm it's easier to
run directly on the physical hardware, but I'm likely to keep the VMs. I've
added one or two hosts since then that don't virtualize Ceph, but since my
farm isn't big enough to justify a complete set of storage-only boxes, I'll
likely continue with VMs for the foreseeable future.
This mobo is circa 2011, but the model has worked so well for my needs
that I've made it the backbone for all the big boxes. There are 6
onboard SATA ports, capable of being set up as RAID in the BIOS, but I
run them in basic mode. I finally got my Ceph health totally clean this
week, but I'd been seeing 1-2 PGs get corrupted overnight several times,
and this morning I came in to find that the entire box had powered itself off.
Since I'd just pulled the CPU fan for its annual cat hair removal, there
was no logical excuse for that, and so I pulled the box and swapped the
functioning drives to a new box. I'm going to test the RAM, and then
probably swap the mobo on the retired box.
One disk was definitely faulty, SMART or not, as it gave the same errors
in the replacement box. The other OSD disk and the OS drive were also
throwing occasional errors, but those went away in the replacement box.
The I/O errors were being reported by the base OS, so I don't consider
this a fault in the VM or in Ceph. SMART has never been very good about
giving me useful warnings before a disk blew out.
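For what it's worth, a rough way to cross-check that kind of failure is to
compare the kernel log with the drive's own counters, e.g.
dmesg -T | grep -iE 'i/o error|ata[0-9]'
smartctl -x /dev/sdX
where /dev/sdX is just a placeholder for the suspect device; that's a sketch
of the idea rather than a recipe, but it usually separates a dying disk from
a flaky controller or cable.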
On metadata, yes: LV, VG, and PV metadata are stored in their respective
storage definitions. The Ceph metadata is kept in filesystem form on the host
(/var/lib/ceph/...), but I've no doubt that Ceph could find a way to
replicate it onto the OSD device itself.
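As far as I can tell, part of it already does ride on the device: ceph-volume
writes its metadata as LVM tags on the OSD's LV, so something like
lvs -o lv_name,vg_name,lv_tags
should show tags along the lines of ceph.osd_id and ceph.osd_fsid (the exact
tag names may vary by release), which is what lets activation identify the OSD
no matter what device name it shows up under.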
Tim
On 4/12/25 11:13, Anthony D'Atri wrote:
Apparently those UUIDs aren't as reliable as I thought.
I've had problems with a server box that hosts a ceph VM.
VM?
Looks like the mobo disk controller is unreliable
Lemme guess, it is an IR / RoC / RAID type? As opposed to JBOD / IT?
If the former and it’s an LSI SKU as most are, I’d love if you could send me
privately the output of
storcli64 /c0 show termlog >/tmp/termlog.txt
Sometimes flakiness is actually with the drive backplane, especially when it
has an embedded expander. In either case, updating HBA firmware sometimes
makes a real difference.
And drive firmware.
AND one of the disks passes SMART
I’m curious if it shows SATA downshifts.
but has interface problems. So I moved the disks to an alternate box.
Between the relocation and dropping the one disk, neither of the two OSDs for that
host will come up. If everything were running solely on static UUIDs, the good
disk should have been findable even if its physical device name shifted.
But it wasn't.
Did you try
ceph-volume lvm activate --all
?
Which brings up something I've wondered about for some time. Shouldn't it be
possible for OSDs to be portable?
I haven’t tried it much, but that *should* be true, modulo CRUSH location.
That is, if a box goes bad, in theory I should be able to remove the drive and
jack it into a hot-swap bay on another server and have that server able to
import the relocated OSD.
I’ve effectively done a chassis swap, moving all the drives including the boot
volume, but that admittedly was in the ceph-disk days.
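These days I'd expect the ceph-volume route on the receiving host to be
roughly
ceph-volume lvm list
ceph-volume lvm activate --all
assuming the host already has the cluster's ceph.conf and can reach the mons;
that's a sketch from memory rather than something I've re-tested recently.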
True, the metadata for an OSD is currently located on its host, but it seems
like it should be possible to carry a copy on the actual device.
My limited understanding is that *is* the case with LVM.
Tim
On 4/11/25 16:23, Anthony D'Atri wrote:
Filestore, pre-ceph-volume, may have been entirely different. IIRC LVM is used
these days to exploit persistent metadata tags.
On Apr 11, 2025, at 4:03 PM, Tim Holloway <t...@mousetech.com> wrote:
I just checked an OSD, and the "block" entry is indeed linked to storage via a
/dev/mapper UUID-named LV, not a /dev/ device name. When Ceph builds an
LV-based OSD, it creates a VG named "ceph-uuuu", where "uuuu" is a UUID, and an
LV named "osd-block-vvvv", where "vvvv" is also a UUID. So although you'd map
the OSD to something like /dev/vdb in a VM, the actual name Ceph uses is
UUID-based (and LVM-based), and thus not subject to change when the hardware
changes, since the UUIDs are part of the metadata in the VGs and LVs that Ceph
creates.
Since I got that from a VM, I can't vouch for all cases, but I thought it
especially interesting that Ceph was creating LVM counterparts even for
devices that were not themselves LVM-based.
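For anyone who wants to reproduce the check, something along the lines of
readlink -f /var/lib/ceph/osd/ceph-*/block
lvs -o vg_name,lv_name
should do it (the first path will differ under a cephadm-managed layout, so
adjust as needed); the names should come back in the ceph-<uuid>/osd-block-<uuid>
form described above.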
And yeah, I understand that it's the amount of replicated OSD data that counts
more than the number of hosts, but when an entire host goes down and there are
few hosts, that can take a large bite out of the replicas.
Tim
On 4/11/25 10:36, Anthony D'Atri wrote:
I thought those links were to the by-uuid paths for that reason?
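Meaning the udev-maintained symlinks under /dev/disk/, e.g. what you see from
ls -l /dev/disk/by-uuid /dev/disk/by-id
which stay put even when the sdX names shuffle.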
On Apr 11, 2025, at 6:39 AM, Janne Johansson <icepic...@gmail.com> wrote:
Den fre 11 apr. 2025 kl 09:59 skrev Anthony D'Atri <anthony.da...@gmail.com>:
Filestore IIRC used partitions, with cute hex GPT types for various states and
roles. Udev activation was sometimes problematic, and LVM tags are more
flexible and reliable than the prior approach. There's no doubt more to it,
but that's what I recall.
Filestore used to have symlinks to the journal device (if used) that pointed
to sdX, where that X of course would jump around if you changed the number of
drives in the box or the kernel's disk-detection order changed, breaking the OSD.
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io