[ceph-users] Re: nodes with high density of OSDs
OSDs are absolutely portable. I've moved them around by simply migrating the journal back into the spinner, moving the drive, pulling the journal back out, and then doing ceph-volume lvm activate --all. The /var/lib/ceph/ OSD directories are all tmpfs mounts generated on boot. This is for "physical" setups and not containers. YMMV.

--
Paul Mezzanini
Platform Engineer III
Research Computing
Rochester Institute of Technology
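A minimal sketch of the re-activation half of the relocation Paul describes, assuming a non-containerized, ceph-volume-managed host; the OSD id and fsid shown are hypothetical placeholders (the real values come from `ceph-volume lvm list`):

  # After physically moving the drive to the new host, the ceph-created VG
  # should be picked up by LVM automatically (run `pvscan --cache` if not).
  # This rebuilds the tmpfs directory under /var/lib/ceph/osd/ for every OSD
  # whose LVM tags are found on this host, and starts it:
  ceph-volume lvm activate --all

  # Or activate one OSD explicitly, by id and fsid:
  ceph-volume lvm activate 7 1e9f3d22-0000-0000-0000-000000000000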
[ceph-users] Re: nodes with high density of OSDs
For administered (container) OSDs, the setup would likely be similar. If my experience is indicative, the mere presence of an OSD's metadata directory under /var/lib/ceph/ should be enough to cause ceph to generate the container. So all that's necessary is to move the OSD metadata over there and probably restart ceph on that node, although a support utility would undoubtedly be useful to make the process fit more naturally into ordinary ceph operations.

The downside here is that since the metadata has to be pulled off the host before migrating, if the host - or at least the /var/lib/ceph drive - is dead, it's too late to capture that data, which is why it would be nice to replicate it on the actual OSD store.

On 4/12/25 17:45, Paul Mezzanini wrote:
> OSDs are absolutely portable. I've moved them around by simply migrating the journal back into the spinner, moving the drive, pulling the journal back out, and then doing ceph-volume lvm activate --all. The /var/lib/ceph/ OSD directories are all tmpfs mounts generated on boot. This is for "physical" setups and not containers. YMMV.
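For the cephadm/container case Tim describes, a hedged sketch; the host name "newhost" is hypothetical, and the orchestrator command shown is only available in relatively recent releases:

  # On the new host, check that the relocated OSD is visible from its LVM tags:
  cephadm ceph-volume lvm list

  # Then ask the orchestrator to (re)create containers for any existing but
  # inactive OSDs it finds on that host:
  ceph cephadm osd activate newhost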
[ceph-users] Re: nodes with high density of OSDs
Apparently those UUIDs aren't as reliable as I thought.

I've had problems with a server box that hosts a ceph VM. Looks like the mobo disk controller is unreliable AND one of the disks passes SMART but has interface problems. So I moved the disks to an alternate box.

Between relocation and dropping the one disk, neither of the 2 OSDs for that host will come up. If everything was running solely on static UUIDs, the good disk should have been findable even if its physical disk device name shifted. But it wasn't.

Which brings up something I've wondered about for some time. Shouldn't it be possible for OSDs to be portable? That is, if a box goes bad, in theory I should be able to remove the drive and jack it into a hot-swap bay on another server and have that server able to import the relocated OSD.

True, the metadata for an OSD is currently located on its host, but it seems like it should be possible to carry a copy on the actual device.

Tim

On 4/11/25 16:23, Anthony D'Atri wrote:
Filestore, pre-ceph-volume, may have been entirely different. IIRC LVM is used these days to exploit persistent metadata tags.

On Apr 11, 2025, at 4:03 PM, Tim Holloway wrote:
I just checked an OSD and the "block" entry is indeed linked to storage using a /dev/mapper uuid LV, not a /dev/device. When ceph builds an LV-based OSD, it creates a VG whose name is "ceph-<uuid>", where "<uuid>" is a UUID, and an LV named "osd-block-<uuid>", where "<uuid>" is also a uuid. So although you'd map the osd to something like /dev/vdb in a VM, the actual name ceph uses is uuid-based (and lvm-based) and thus not subject to change with alterations in the hardware, as the uuids are part of the metadata in the VGs and LVs created by ceph.

Since I got that from a VM, I can't vouch for all cases, but I thought it especially interesting that ceph was creating LVM counterparts even for devices that were not themselves LVM-based.

And yeah, I understand that it's the amount of replicated OSD data that counts more than the number of hosts, but when an entire host goes down and there are few hosts, that can take a large bite out of the replicas.

Tim

On 4/11/25 10:36, Anthony D'Atri wrote:
I thought those links were to the by-uuid paths for that reason?

On Apr 11, 2025, at 6:39 AM, Janne Johansson wrote:
On Fri, 11 Apr 2025 at 09:59, Anthony D'Atri wrote:
Filestore IIRC used partitions, with cute hex GPT types for various states and roles. Udev activation was sometimes problematic, and LVM tags are more flexible and reliable than the prior approach. There no doubt is more to it but that's what I recall.

Filestore used to have softlinks towards the journal device (if used) which pointed to sdX, where that X of course would jump around if you changed the number of drives on the box, or the kernel disk detection order changed, breaking the OSD.

--
May the most significant bit of your life be positive.
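Regarding the uuid-based VG/LV naming described above, a quick way to see those names and the metadata that ceph-volume stores as LVM tags (osd id, osd fsid, cluster fsid, and so on); the output will of course differ per system:

  # The ceph-created LVs, the VGs they live in, and their ceph.* tags:
  lvs -o lv_name,vg_name,devices,lv_tags | grep osd-block

  # The same information grouped per OSD, as ceph-volume interprets it:
  ceph-volume lvm list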
[ceph-users] Re: nodes with high density of OSDs
One possibility would be to have ceph simply set aside space on the OSD and echo the metadata there automatically. Then a mechanism could scan for un-adopted drives and import them as needed. So even a dead host would be OK as long as the device/LV was still usable. I've migrated non-ceph LVs, after all.

Tim

On 4/12/25 10:25, Gregory Orange wrote:
> It seems to me the theoretical way to do this would be to `ceph-volume lvm migrate` it to the HDD, then move the HDD to a new machine. That would require the box not be fatally bad though, so I'm not sure how much that helps.
>
> Are there lower level tools which could be used instead of the above?
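On carrying metadata on the device itself: for BlueStore OSDs a good deal of it is already there, both in the LVM tags and in a label at the start of the block device. A sketch of how to inspect it, with a placeholder LV path (the real one comes from `ceph-volume lvm list`):

  # Read the BlueStore label directly from the (possibly relocated) LV;
  # it includes the osd fsid, whoami (osd id), and cluster fsid:
  ceph-bluestore-tool show-label --dev /dev/ceph-<vg-uuid>/osd-block-<osd-fsid>

  # ceph-volume can likewise reconstruct an OSD's identity from LVM tags alone,
  # without anything under /var/lib/ceph on the new host:
  ceph-volume lvm list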
[ceph-users] Re: nodes with high density of OSDs
When I first migrated to Ceph, my servers were all running CentOS 7, which I (wrongly) thought could not handle anything above Octopus, and on top of that, I initially did legacy installs. So in order to run Pacific and to keep the overall clutter in the physical box configuration down, I made my Ceph hosts VMs. With cephadm, it's easier to run off the direct physical layer, but I'm likely to keep the VMs. I have added 1 or 2 hosts since then that don't virtualize ceph, but since my farm isn't big enough to justify a complete set of storage-only boxes, I'll likely continue with VMs for the foreseeable future.

This mobo is circa 2011, but the model has worked so well for my needs that I've made it the backbone for all the big boxes. There are 6 onboard SATA ports, capable of being set up as RAID in the BIOS, but I run them in basic mode.

I finally got my ceph health totally clean this week, but I'd been seeing 1-2 PGs get corrupted overnight several times, and this morning I came in and the entire box had powered itself off. Since I'd just pulled the CPU fan for its annual cat hair removal, there was no logical excuse for that, so I pulled the box and swapped the functioning drives to a new box. I'm going to test the RAM, and then probably swap the mobo on the retired box.

One disk was definitely faulty, SMART or not, as it gave the same errors in the replacement box. The other OSD disk and the OS drive were also throwing occasional errors, but those went away on the replacement box. The I/O errors were being reported by the base OS, so I don't consider it a fault in the VM or in ceph. SMART has never been very good about giving me useful warnings before a disk blew out.

On metadata, yes, LV, VG, and PV metadata are stored in their respective storage definitions. The ceph metadata is in filesystem form on the host (/var/lib/ceph/...), but I've no doubt that ceph could find a way to replicate it into the OSD itself.

Tim
[ceph-users] Re: nodes with high density of OSDs
On 12/4/25 20:56, Tim Holloway wrote:
> Which brings up something I've wondered about for some time. Shouldn't it be possible for OSDs to be portable? That is, if a box goes bad, in theory I should be able to remove the drive and jack it into a hot-swap bay on another server and have that server able to import the relocated OSD.
>
> True, the metadata for an OSD is currently located on its host, but it seems like it should be possible to carry a copy on the actual device.

It seems to me the theoretical way to do this would be to `ceph-volume lvm migrate` it to the HDD, then move the HDD to a new machine. That would require the box not be fatally bad though, so I'm not sure how much that helps.

Are there lower level tools which could be used instead of the above?

Greg.
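A hedged sketch of the `ceph-volume lvm migrate` step Greg mentions, for an OSD whose DB/WAL sits on a separate fast device; the OSD id, fsid, and target LV are hypothetical placeholders, and the subcommand requires a reasonably recent release (Pacific or later, if memory serves):

  # Stop the OSD, then fold its DB and WAL back onto its own block LV so the
  # drive is self-contained before it is pulled:
  systemctl stop ceph-osd@7
  ceph-volume lvm migrate --osd-id 7 \
      --osd-fsid 1e9f3d22-0000-0000-0000-000000000000 \
      --from db wal \
      --target ceph-<vg-uuid>/osd-block-<osd-fsid>

  # After moving the drive, activate it on the new host:
  ceph-volume lvm activate --all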
[ceph-users] Re: nodes with high density of OSDs
> Apparently those UUIDs aren't as reliable as I thought.
>
> I've had problems with a server box that hosts a ceph VM.

VM?

> Looks like the mobo disk controller is unreliable

Lemme guess, it is an IR / RoC / RAID type? As opposed to JBOD / IT?

If the former and it's an LSI SKU as most are, I'd love if you could send me privately the output of

storcli64 /c0 show termlog >/tmp/termlog.txt

Sometimes flakiness is actually with the drive backplane, especially when it has an embedded expander. In either case, updating HBA firmware sometimes makes a real difference.

And drive firmware.

> AND one of the disks passes SMART

I'm curious if it shows SATA downshifts.

> but has interface problems. So I moved the disks to an alternate box.
>
> Between relocation and dropping the one disk, neither of the 2 OSDs for that host will come up. If everything was running solely on static UUIDs, the good disk should have been findable even if its physical disk device name shifted. But it wasn't.

Did you try

ceph-volume lvm activate --all

?

> Which brings up something I've wondered about for some time. Shouldn't it be possible for OSDs to be portable?

I haven't tried it much, but that *should* be true, modulo CRUSH location.

> That is, if a box goes bad, in theory I should be able to remove the drive and jack it into a hot-swap bay on another server and have that server able to import the relocated OSD.

I've effectively done a chassis swap, moving all the drives including the boot volume, but that admittedly was in the ceph-disk days.

> True, the metadata for an OSD is currently located on its host, but it seems like it should be possible to carry a copy on the actual device.

My limited understanding is that *is* the case with LVM.

> Tim
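On the SATA downshift question, one way to check from the host OS; the device name is a placeholder and the exact wording of the output varies by drive vendor and kernel:

  # Negotiated vs. maximum link speed, plus the SATA Phy event counters
  # (CRC errors, link resets) that usually accompany downshifts:
  smartctl -x /dev/sda | grep -i -A2 'SATA Version'
  smartctl -x /dev/sda | grep -i -A15 'Phy Event'

  # Kernel-side evidence of link speed drops or resets:
  dmesg | grep -iE 'ata[0-9.]+: .*(limiting|reset|downgrad)'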
[ceph-users] Re: FS not mount after update to quincy
Hi Konstantine,

Perfect!!! It works.

Regards, I

--
Ibán Cabrillo Bartolomé
Instituto de Física de Cantabria (IFCA-CSIC)
Santander, Spain
Tel: +34942200969/+34669930421
Responsible for advanced computing service (RSC)

==========================================================================
All our suppliers must know and accept IFCA policy available at:
https://confluence.ifca.es/display/IC/Information+Security+Policy+for+External+Suppliers
==========================================================================