> When I first migrated to Ceph, my servers were all running CentOS 7, which I (wrongly) thought could not handle anything above Octopus,

Containerized deployments do have the advantage of less coupling to the underlying OS for dependencies, though the very latest CentOS 9 containers may have issues on the old CentOS kernel.

> and on top of that, I initially did legacy installs. So in order to run Pacific and to keep the overall clutter in the physical box configuration down, I made my Ceph hosts VMs. With cephadm, it's easier to run off the direct physical layer, but I'm likely to keep the VMs.

VMs conventionally present more overhead than containers, fwiw.

> I have added 1 or 2 hosts since then that don't virtualize ceph, but since my farm isn't big enough to justify a complete set of storage-only boxes, I'll likely continue with VMs for the foreseeable future.

Whatever floats your boat.

> This mobo is circa 2011, but the model has worked so well for my needs that I've made it the backbone for all the big boxes.

PCIe …. gen 2?

> There are 6 onboard SATA ports, capable of being set up as RAID in the BIOS, but I run them in basic mode. I finally got my ceph health totally clean this week, but I'd been seeing 1-2 PGs get corrupted overnight several times

As a function of scrubs?  Scrubs are often when latent issues are surfaced.
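If scrubs are what's turning them up, the cluster can usually point at the OSD serving the bad reads. A rough sketch, with a placeholder PG ID:

    ceph health detail                                        # lists the PGs flagged inconsistent
    rados list-inconsistent-obj 2.1a --format=json-pretty     # shows which shard/OSD reported the errors
    ceph pg repair 2.1a                                       # ask the primary to repair from an authoritative copy

When the same OSD keeps showing up in those reports, it's usually the drive or the path to it rather than Ceph.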
> and this morning came in and the entire box had powered itself off. Since I'd just pulled the CPU fan for its annual cat hair removal, there was no logical excuse for that

Don’t get me started about cats ;)

> and so I pulled the box and swapped the functioning drives to a new box. I'm going to test the RAM, and then probably swap the mobo on the retired box.
>
> One disk was definitely faulty, SMART or not, as it gave the same errors in the replacement box. The other OSD disk and the OS drive were also throwing occasional errors, but that went away on the replacement box. The I/O errors were being reported by the base OS, so I don't consider it a fault in the VM or in ceph. SMART has never been very good about giving me useful warnings before a disk blew out.

The overall pass/fail self-reported status isn’t worth much, but watching:

- SATA downshifts
- UDMA/CRC errors
- grown defects
- increased rate of LBA reallocation

can help predict some issues before the drive becomes a real problem.
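Concretely, something along these lines pulls those out of the raw output (device name is just an example):

    smartctl -a /dev/sda | grep -iE 'sata version|reallocated|pending_sector|offline_uncorrectable|udma_crc'

A link that has negotiated down (e.g. "6.0 Gb/s (current: 3.0 Gb/s)") or a CRC counter that keeps climbing tends to mean cable, backplane, or HBA trouble; climbing reallocated/pending sectors tend to mean the drive itself.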
>
> On metadata, yes, LV, VG, and PV metadata are stored in their respective storage definitions. The ceph metadata is in filesystem form on the host (/var/lib/ceph/...), but I've no doubt that ceph could find a way to replicate it into the OSD itself.

Filestore or BlueStore?
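If these are BlueStore OSDs built by ceph-volume, then as I understand it the identifying metadata is already carried on the device itself, both as LVM tags and in the BlueStore label, and /var/lib/ceph/osd/ceph-N is just a tmpfs that activation repopulates from them. Roughly (the device path below is a placeholder):

    lvs -o +lv_tags           # ceph.osd_id, ceph.osd_fsid, ceph.cluster_fsid, ...
    ceph-volume lvm list      # the same tags, grouped per OSD
    ceph-bluestore-tool show-label --dev /dev/ceph-<vg>/osd-block-<lv>   # label stored inside BlueStore

which is why a plain "ceph-volume lvm activate --all" on the new box is usually enough to bring a transplanted OSD back up, CRUSH placement aside.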
>
> Tim
>
> On 4/12/25 11:13, Anthony D'Atri wrote:
>>
>>> Apparently those UUIDs aren't as reliable as I thought.
>>>
>>> I've had problems with a server box that hosts a ceph VM.
>> VM?
>>
>>> Looks like the mobo disk controller is unreliable
>> Lemme guess, it is an IR / RoC / RAID type? As opposed to JBOD / IT?
>>
>> If the former and it’s an LSI SKU as most are, I’d love if you could send me privately the output of
>>
>> storcli64 /c0 show termlog >/tmp/termlog.txt
>>
>> Sometimes flakiness is actually with the drive backplane, especially when it has an embedded expander. In either case, updating HBA firmware sometimes makes a real difference.
>>
>> And drive firmware.
>>
>>> AND one of the disks passes SMART
>> I’m curious if it shows SATA downshifts.
>>
>>> but has interface problems. So I moved the disks to an alternate box.
>>>
>>> Between relocation and dropping the one disk, neither of the 2 OSDs for that host will come up. If everything was running solely on static UUIDs, the good disk should have been findable even if its physical disk device name shifted. But it wasn't.
>> Did you try
>>
>> ceph-volume lvm activate --all
>>
>> ?
>>
>>> Which brings up something I've wondered about for some time. Shouldn't it be possible for OSDs to be portable?
>> I haven’t tried it much, but that *should* be true, modulo CRUSH location.
>>
>>> That is, if a box goes bad, in theory I should be able to remove the drive and jack it into a hot-swap bay on another server and have that server able to import the relocated OSD.
>> I’ve effectively done a chassis swap, moving all the drives including the boot volume, but that admittedly was in the ceph-disk days.
>>
>>> True, the metadata for an OSD is currently located on its host, but it seems like it should be possible to carry a copy on the actual device.
>> My limited understanding is that *is* the case with LVM.
>>
>>> Tim
>>>
>>> On 4/11/25 16:23, Anthony D'Atri wrote:
>>>> Filestore, pre-ceph-volume, may have been entirely different. IIRC LVM is used these days to exploit persistent metadata tags.
>>>>
>>>>> On Apr 11, 2025, at 4:03 PM, Tim Holloway <t...@mousetech.com> wrote:
>>>>>
>>>>> I just checked an OSD and the "block" entry is indeed linked to storage using a /dev/mapper uuid LV, not a /dev/device. When ceph builds an LV-based OSD, it creates a VG whose name is "ceph-uuuuu", where "uuuu" is a UUID, and an LV named "osd-block-vvvv", where "vvvv" is also a uuid. So although you'd map the osd to something like /dev/vdb in a VM, the actual name ceph uses is uuid-based (and lvm-based) and thus not subject to change with alterations in the hardware, as the uuids are part of the metadata in VGs and LVs created by ceph.
>>>>>
>>>>> Since I got that from a VM, I can't vouch for all cases, but I thought it especially interesting that ceph was creating LVM counterparts even for devices that were not themselves LVM-based.
>>>>>
>>>>> And yeah, I understand that it's the amount of OSD replica data that counts more than the number of hosts, but when an entire host goes down and there are few hosts, that can take a large bite out of the replicas.
>>>>>
>>>>> Tim
>>>>>
>>>>> On 4/11/25 10:36, Anthony D'Atri wrote:
>>>>>> I thought those links were to the by-uuid paths for that reason?
>>>>>>
>>>>>>> On Apr 11, 2025, at 6:39 AM, Janne Johansson <icepic...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Fri, 11 Apr 2025 at 09:59, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>>>>>>>> Filestore IIRC used partitions, with cute hex GPT types for various states and roles. Udev activation was sometimes problematic, and LVM tags are more flexible and reliable than the prior approach. There no doubt is more to it but that’s what I recall.
>>>>>>> Filestore used to have softlinks towards the journal device (if used) which pointed to sdX where that X of course would jump around if you changed the number of drives on the box, or the kernel disk detection order changed, breaking the OSD.
>>>>>>>
>>>>>>> --
>>>>>>> May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io