> On Jan 18, 2019, at 10:58 AM, Hector Martin <hec...@marcansoft.com> wrote:
> 
> Just to add a related experience: you still need 1.0 metadata (that's
> the 1.x variant at the end of the partition, like 0.9.0) for an
> mdadm-backed EFI system partition if you boot using UEFI. This generally
> works well, except on some Dell servers where the firmware inexplicably
> *writes* to the ESP, messing up the RAID mirroring. 

I love this list. You guys are great. I have to admit I was kind of intimidated 
at first, I felt a little unworthy in the face of such cutting-edge tech. 
Thanks to everyone who's helped with my posts.

Hector, one of the things I was thinking through last night, and finally pulled 
the trigger on today, was the overhead of the various subsystems. LVM itself does 
not add much overhead, but small mistakes in the initial layout can add up to a 
lot of wasted CPU over the lifetime of a deployment. So I wanted to review 
everything and thought I would share my notes here.

My main constraint is that I had four disks in a single machine to start with, 
and any one of those disks should be able to fail without affecting the machine's 
ability to boot; the failed disk should be replaceable without obscure admin 
skills, with the final recovery landing in the promised land of “HEALTH_OK”. A 
single-machine Ceph deployment is not much better than plain local storage, 
except for the ability to scale out later. That's the use case I'm addressing here.

My first exploration was how to strike a good balance between safety for the mon 
logs, disk usage, and performance for the boot partitions. As I learned, an OSD 
fits in a single partition with no spillover, so I had three partitions per disk 
to work with. `inotifywait -mr /var/lib/ceph/` gave me a good handle on what was 
being written to the logs, and how often, and I could see that the activity was 
almost entirely writes.
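If anyone wants to repeat that measurement, something like the following is 
enough to watch the write pattern (the explicit event list and format string are 
just a refinement; plain `inotifywait -mr` reports everything):

    # Watch Ceph's state directory recursively, logging modify/create events with a timestamp
    inotifywait -m -r -e modify -e create --timefmt '%T' --format '%T %w%f %e' /var/lib/ceph/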

https://theithollow.com/2012/03/21/understanding-raid-penalty/ gave me background 
I did not previously have on the RAID write penalty. I combined that with what I 
learned in 
https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328.
By the end of those two articles I felt like I knew the tradeoffs, but the final 
decision really came down to the penalty table in the first article: a “RAID 
penalty” of 2 for RAID 10, the same as the penalty for RAID 1, but with better 
storage efficiency (50% usable capacity on four disks versus 25% for a four-way 
RAID 1 mirror).
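As a quick sanity check on what that penalty means in practice, here is the usual 
back-of-the-envelope arithmetic (the 1000 IOPS workload and 70/30 read/write 
split are purely illustrative assumptions):

    # raw IOPS required = read IOPS + (write IOPS x RAID write penalty)
    # Hypothetical workload: 1000 IOPS total, 70% reads / 30% writes, RAID 10 penalty = 2
    echo $(( 700 + 300 * 2 ))    # 1300 raw IOPS needed from the four spindles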

For the boot partition, there are fewer choices. Anything other than RAID 1 will 
not keep every copy of /boot both up-to-date and ready to seamlessly restart the 
machine after a disk failure. Combined with RAID 10 for the root partition, we 
are left with a configuration that can reliably boot after any single drive 
failure (maybe two; I don't know what mdadm would do in a “less than perfect 
storm” where one mirror from each stripe is lost instead of both mirrors of the 
same stripe…)
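For reference, the array creation boils down to something like this (the device 
names and partition numbers are assumptions based on my layout, so adjust for 
yours):

    # First partition on each disk: RAID 1 for /boot
    mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1
    # Second partition on each disk: RAID 10 for the root filesystem
    mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]2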

With this setup, each disk uses exactly two partitions, and mdadm is using the 
current default MD metadata (1.2) because GRUB2 knows how to deal with 
everything. As well, `sfdisk -l /dev/sd[abcd]` shows the first partition on every 
disk marked as bootable. Milestone 1 success!
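A couple of quick checks confirm the arrays are in the state I expect (device 
names assumed, as above):

    cat /proc/mdstat                   # both md0 and md1 active with all four members
    mdadm --detail /dev/md0 /dev/md1   # metadata version, RAID level, and sync status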

The next piece I was unsure of (but didn't want to spam the list with, since I 
could just try it) was how many partitions an OSD would use. Hector mentioned 
that he was using LVM for BlueStore volumes. I privately wondered about the value 
of creating LVM VGs when the groups don't span disks, but that is exactly what 
the documented `ceph-deploy osd create` command does when creating BlueStore 
OSDs. Wiring up LVM by hand is not rocket science, but I wanted to avoid as many 
manual steps as possible. This was a biggie.
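Concretely, it ended up being one command per disk, along these lines (the 
hostname and the use of the third partition are specific to my layout):

    # ceph-volume (invoked by ceph-deploy) puts a single-PV VG and one LV on each partition
    ceph-deploy osd create --data /dev/sda3 node1
    ceph-deploy osd create --data /dev/sdb3 node1
    ceph-deploy osd create --data /dev/sdc3 node1
    ceph-deploy osd create --data /dev/sdd3 node1
    ceph health    # should report HEALTH_OK once the OSDs are in and peered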

And after adding the OSD partitions one after the other, “HEALTH_OK”. w00t!!! 
Final Milestone Success!!

I know there’s no perfect starter configuration for every hardware environment, 
but I thought I would share exactly what I ended up with here for future 
seekers. This has been a fun adventure. 

Next up: convert my existing two pre-production nodes to use this layout. 
Fortunately there's nothing on the second node except Ceph, and I can take that 
one down pretty easily. It will be good practice to gracefully shut down the four 
OSDs on that node without losing any data, reformat the node with this pattern, 
bring the cluster back to health, then migrate the mon (and the workloads) to it 
while I do the same for the first node. With that, I'll be able to remove these 
satanic SATADOMs and get back to some real work!! 
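My working plan for draining that node is roughly the standard OSD removal 
sequence (the OSD ID below is a placeholder, and I'd wait for recovery to finish 
and the cluster to return to HEALTH_OK between steps):

    # Repeat for each of the four OSDs on the node being rebuilt
    ceph osd out 4                              # let Ceph migrate its data elsewhere first
    # ...wait for recovery to complete and HEALTH_OK...
    systemctl stop ceph-osd@4                   # stop the daemon on the node
    ceph osd purge 4 --yes-i-really-mean-it     # remove it from the CRUSH map, OSD map, and auth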