> On Jan 18, 2019, at 10:58 AM, Hector Martin <hec...@marcansoft.com> wrote:
> 
> Just to add a related experience: you still need 1.0 metadata (that's
> the 1.x variant at the end of the partition, like 0.9.0) for an
> mdadm-backed EFI system partition if you boot using UEFI. This generally
> works well, except on some Dell servers where the firmware inexplicably
> *writes* to the ESP, messing up the RAID mirroring. 

I love this list. You guys are great. I have to admit I was kind of intimidated 
at first, I felt a little unworthy in the face of such cutting-edge tech. 
Thanks to everyone who's helped with my posts.

Hector, one of the things I was thinking through last night, and finally pulled 
the trigger on today, was the overhead of the various subsystems. LVM itself does 
not add much overhead, but small mistakes in the initial layout can add up to a 
lot of wasted CPU over the lifetime of a deployment. So I wanted to review 
everything and thought I would share my notes here.

My main constraint is that I had four disks in a single machine to start with, 
and any one of those disks should be able to fail without affecting the machine's 
ability to boot; the failed disk should be replaceable without obscure admin 
skills, with the final recovery landing in the promised land of “HEALTH_OK”. A 
single-machine Ceph deployment is not much better than plain local storage, 
except for the ability to scale out later. That's the use case I'm addressing here.

My first exploration was how to strike a good balance between safety for the mon 
logs, disk usage, and performance for the boot partitions. As I learned, an OSD 
fits in a single partition with no spillover, so I had three partitions per disk 
to work with. `inotifywait -mr /var/lib/ceph/` gave me a good handle on what was 
being written to the logs, and how often, and I could see that the activity was 
almost entirely writes.
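If anyone wants to repeat that measurement, something like the following is 
enough to watch the write pattern (the explicit event list and format string are 
just a refinement; plain `inotifywait -mr` reports everything):

    # Watch Ceph's state directory recursively, logging modify/create events with a timestamp
    inotifywait -m -r -e modify -e create --timefmt '%T' --format '%T %w%f %e' /var/lib/ceph/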

https://theithollow.com/2012/03/21/understanding-raid-penalty/ gave me background 
I did not previously have on the RAID write penalty. I combined that with what I 
learned in 
https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328.
By the end of those two articles I felt like I knew the tradeoffs, but the final 
decision really came down to the penalty table in the first article: a “RAID 
penalty” of 2 for RAID 10, the same as the penalty for RAID 1, but with better 
storage efficiency (50% usable capacity on four disks versus 25% for a four-way 
RAID 1 mirror).
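As a quick sanity check on what that penalty means in practice, here is the usual 
back-of-the-envelope arithmetic (the 1000 IOPS workload and 70/30 read/write 
split are purely illustrative assumptions):

    # raw IOPS required = read IOPS + (write IOPS x RAID write penalty)
    # Hypothetical workload: 1000 IOPS total, 70% reads / 30% writes, RAID 10 penalty = 2
    echo $(( 700 + 300 * 2 ))    # 1300 raw IOPS needed from the four spindles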

For the boot partition, there are fewer choices. Anything other than RAID 1 will 
not keep every copy of /boot both up-to-date and ready to seamlessly restart the 
machine after a disk failure. Combined with RAID 10 for the root partition, we 
are left with a configuration that can reliably boot after any single drive 
failure (maybe two; I don't know what mdadm would do in a “less than perfect 
storm” where one mirror from each stripe is lost instead of both mirrors of the 
same stripe…)
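For reference, the array creation boils down to something like this (the device 
names and partition numbers are assumptions based on my layout, so adjust for 
yours):

    # First partition on each disk: RAID 1 for /boot
    mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1
    # Second partition on each disk: RAID 10 for the root filesystem
    mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]2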

With this setup, each disk uses exactly two partitions, and mdadm is using the 
current default MD metadata (1.2) because GRUB2 knows how to deal with 
everything. As well, `sfdisk -l /dev/sd[abcd]` shows the first partition on every 
disk marked as bootable. Milestone 1 success!
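A couple of quick checks confirm the arrays are in the state I expect (device 
names assumed, as above):

    cat /proc/mdstat                   # both md0 and md1 active with all four members
    mdadm --detail /dev/md0 /dev/md1   # metadata version, RAID level, and sync status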

The next piece I was unsure of (but didn't want to spam the list with, since I 
could just try it) was how many partitions an OSD would use. Hector mentioned 
that he was using LVM for BlueStore volumes. I privately wondered about the value 
of creating LVM VGs when the groups don't span disks, but that is exactly what 
the documented `ceph-deploy osd create` command does when creating BlueStore 
OSDs. Wiring up LVM by hand is not rocket science, but I wanted to avoid as many 
manual steps as possible. This was a biggie.
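Concretely, it ended up being one command per disk, along these lines (the 
hostname and the use of the third partition are specific to my layout):

    # ceph-volume (invoked by ceph-deploy) puts a single-PV VG and one LV on each partition
    ceph-deploy osd create --data /dev/sda3 node1
    ceph-deploy osd create --data /dev/sdb3 node1
    ceph-deploy osd create --data /dev/sdc3 node1
    ceph-deploy osd create --data /dev/sdd3 node1
    ceph health    # should report HEALTH_OK once the OSDs are in and peered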

And after adding the OSD partitions one after the other, “HEALTH_OK”. w00t!!! 
Final Milestone Success!!

I know there’s no perfect starter configuration for every hardware environment, 
but I thought I would share exactly what I ended up with here for future 
seekers. This has been a fun adventure. 

Next up: convert my existing two pre-production nodes to use this layout. 
Fortunately there's nothing on the second node except Ceph, and I can take that 
one down pretty easily. It will be good practice to gracefully shut down the four 
OSDs on that node without losing any data, reformat the node with this pattern, 
bring the cluster back to health, then migrate the mon (and the workloads) to it 
while I do the same for the first node. With that, I'll be able to remove these 
satanic SATADOMs and get back to some real work!! 
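My working plan for draining that node is roughly the standard OSD removal 
sequence (the OSD ID below is a placeholder, and I'd wait for recovery to finish 
and the cluster to return to HEALTH_OK between steps):

    # Repeat for each of the four OSDs on the node being rebuilt
    ceph osd out 4                              # let Ceph migrate its data elsewhere first
    # ...wait for recovery to complete and HEALTH_OK...
    systemctl stop ceph-osd@4                   # stop the daemon on the node
    ceph osd purge 4 --yes-i-really-mean-it     # remove it from the CRUSH map, OSD map, and auth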