> 
> AD> What use-case(s)?  Are your pools R3, EC? Mix?
> 
> My use case is storage for virtual machines (Proxmox).

So probably all small-block RBD?

> AD> I like to solve first for at least 9-10 nodes, but assuming that you’re 
> using replicated size=3 pools, 5 is okay.
> 
> Yes, I am using replication=3

I was once pressured by a CIO to do EC on 3-node clusters (I refused), so I try 
to not assume ;)

> AD> Conventional wisdom is that when using NVMe SSDs to offload WAL+DB from 
> HDDs you want one NVMe SSD to back at most 10x HDD OSDs.  Do you have your 
> 1TB NVMe SSDs dedicating 250GB to each of the 4 HDDs?  Or do you have them 
> sliced smaller?  If you don’t have room on them for additional HDD OSDs that 
> complicates the proposition.
> 
> Oh, that is interesting; I thought you needed about 4% of the HDD space for 
> DB/WAL, so we sliced the SSDs into 250GB partitions.

That 4% figure has been … complicated since it emerged years ago. Until a few 
major releases ago RocksDB could only actually *use* certain stepped 
capacities, very roughly 30GB, 300GB, etc., because a whole level had to fit 
on the fast device for it to live there.  So a 50GB WAL+DB partition would 
only actually use 30GB or so.  RocksDB column family sharding was later 
introduced, which allows better use of arbitrarily sized partitions.
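
Here's a back-of-envelope sketch (Python, purely illustrative) of that stepped 
behavior.  It assumes the old defaults of a 256MB level base and a 10x size 
multiplier, and assumes a level only counted if it fit on the fast device in 
its entirety; exact figures varied by release:

GiB = 1024 ** 3
level_base = 256 * 1024 ** 2   # L1 target size (historical default)
multiplier = 10                # per-level growth factor

def usable_db_bytes(partition_bytes, levels=6):
    """How much of a DB partition pre-sharding sizing could actually use:
    the largest run of whole levels (L1..Ln) that fits entirely."""
    total = usable = 0
    for n in range(levels):
        total += level_base * multiplier ** n   # add level L(n+1)
        if total <= partition_bytes:
            usable = total                       # this level still fits
    return usable

for part_gib in (50, 250, 320, 640):
    used = usable_db_bytes(part_gib * GiB)
    print(f"{part_gib:>4} GiB partition -> ~{used / GiB:.1f} GiB actually used")

Run against 50GB or 250GB partitions it lands on roughly 28GB either way, 
which is roughly where the old 30GB / 300GB folklore comes from.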

So the 4% figure persists.  In practice the right size depends in part on the 
workload: RBD tends to need less than, say, an RGW workload with a lot of tiny 
S3 objects, so for a (presumably) RBD workload a 2% figure is often bandied 
about.  By that rule your 16TB OSDs would target roughly 320GB each; 4% would 
be about 640GB.  Your 250GB partitions are already a bit below the 2% mark, 
and there is a certain diminishing-returns factor.  You can look at your OSDs 
to see whether any RocksDB data has overflowed onto the slow device.  The 
lower levels are the most important performance-wise, and the DB is routinely 
compacted.  With recent releases changing the RocksDB defaults to enable 
compression, you should get better use out of any realistically sized WAL+DB 
partition than with OSDs built on older releases.  
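
If you want to check that programmatically rather than eyeballing ceph health, 
here's a rough Python sketch.  It assumes you run it on the OSD host via the 
admin socket, and that your release exposes the usual bluefs counters 
(db_total_bytes, db_used_bytes, slow_used_bytes) in the perf dump output; 
verify the counter names on your version before trusting it:

import json
import subprocess

GiB = 1024 ** 3

def bluefs_usage(osd_id):
    # Admin socket query; on recent releases "ceph tell osd.N perf dump"
    # should also work from any node.
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"], text=True)
    return json.loads(out).get("bluefs", {})

for osd_id in (0, 1, 2, 3):            # the HDD OSDs on this host
    b = bluefs_usage(osd_id)
    db_used = b.get("db_used_bytes", 0) / GiB
    db_total = b.get("db_total_bytes", 0) / GiB
    spill = b.get("slow_used_bytes", 0) / GiB
    print(f"osd.{osd_id}: DB {db_used:.1f}/{db_total:.1f} GiB, "
          f"spilled onto HDD: {spill:.1f} GiB")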

> Yes, we do have four very thin NVME slots available (u.3)

Since you’re calling them thin, I’m thinking that they’re probably E3.S.  U.3 
is the size of a conventional 2.5” SFF SSD or HDD.

>  and we are actually using an adapter for the SSD that holds the WAL/DB for 
> the HDDs.  We were going to add another SSD for any additional HDDs we might 
> add, as we didn't want to have *all* the HDDs' WAL/DB on a single SSD because 
> our understanding is that if that SSD fails before we are able to replace it, 
> then we lose all the OSDs that have their DB/WAL on it.

Yes, that’s one of the ramifications of shared devices for WAL+DB offload.

>  At this point in time, losing four OSDs is a reasonable risk, but if we were 
> to add four more additional HDDs to each node, we wouldn't want to lose eight 
> OSDs at one time.

Understandable, but you might think in terms of percentages.  If you add four 
HDD OSDs to each node and put all eight behind a single NVMe offload device, 
that device backs the same overall percentage of the cluster's HDD OSDs as it 
does today: 8 of 40 instead of 4 of 20.
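
The arithmetic, using hypothetical counts matching what you've described 
(5 nodes, 4 HDD OSDs each today, 8 each after expansion):

nodes = 5
for hdds_per_node, per_offload_ssd in ((4, 4), (8, 8)):
    total = nodes * hdds_per_node
    lost = per_offload_ssd        # HDD OSDs lost if that one offload SSD dies
    print(f"{hdds_per_node} HDDs/node, {per_offload_ssd} per SSD: "
          f"lose {lost}/{total} = {lost / total:.0%} of HDD OSDs")

Either way a single offload SSD failure takes out 20% of the HDD OSDs.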

>  Of course, we have monitoring on the system, so we will *hopefully* know 
> when we are about to lose one of the SSDs (via the SMART monitoring that 
> tells how much of the lifetime is used)

That helps if your firmware conveys those attributes properly, and if your 
monitoring is aware that some drives report lifetime *used* while others 
report lifetime *remaining*.  Most monitoring isn't.
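
If you wanted to normalize that yourself, something like the sketch below is a 
starting point.  It assumes smartctl 7+ with JSON output (-j); the vendor SATA 
attribute names are placeholders I've seen on various drives and may not match 
yours:

import json
import subprocess

# Vendor SATA attributes that report lifetime *remaining* in the normalized
# value, which has to be inverted to compare with NVMe's percentage_used.
REMAINING_STYLE = {"Media_Wearout_Indicator", "Percent_Lifetime_Remain",
                   "SSD_Life_Left", "Wear_Leveling_Count"}

def lifetime_used_pct(dev):
    # smartctl's exit status is a bitmask, so don't treat nonzero as fatal.
    proc = subprocess.run(["smartctl", "-j", "-a", dev],
                          capture_output=True, text=True)
    data = json.loads(proc.stdout)
    # NVMe: the spec defines percentage_used directly (it can exceed 100).
    nvme = data.get("nvme_smart_health_information_log")
    if nvme is not None:
        return float(nvme["percentage_used"])
    # SATA SSDs: walk vendor attributes and invert the "remaining" style ones.
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        if attr["name"] in REMAINING_STYLE:
            return 100.0 - float(attr["value"])
    return None    # drive/firmware that doesn't expose wear in a known way

print(lifetime_used_pct("/dev/nvme0n1"))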

> , but of course the SSD could just fail out of the blue as well.

My sense is that careful monitoring of reallocated block rate, grown defects, 
etc. can maybe predict half of failures; the rest come out of the blue.  
Depending on what one considers a failure, many SSD failures are actually 
firmware issues and thus can be fixed with an update.

> 
> AD> Are your NVMe drives enterprise-class?
> 
> The NVMEs we are using for actual storage in the cluster are 
> enterprise-class, but admittedly the SSDs we are using for the WAL/DB are not.

Tsk tsk tsk.

>  That is actually what I meant when I said we didn't have room for more NVME 
> drives, though we technically still have three "very thin" NVME slots 
> available as well (u.3 connector types, I think?), but we are currently using 
> them with M.2 adapters for the WAL/DB SSDs.

There *are* enterprise M.2 SSDs available.  Client-class SSDs, M.2 or 
otherwise, often lack PLP (power loss protection), so if you suffer a power 
outage you may be in a world of hurt.  They are also designed to hit consumer 
price points, not for a 24x7 duty cycle, and client-class drives are much more 
likely to fail at random under enterprise workloads.

> 
> -----Original Message-----
> From: Anthony D'Atri <anthony.da...@gmail.com> 
> Sent: March 23, 2025 22:57
> To: Alan Murrell <a...@t-net.ca>
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Question about cluster expansion
> 
> What use-case(s)?  Are your pools R3, EC? Mix?
> 
> I like to solve first for at least 9-10 nodes, but assuming that you’re using 
> replicated size=3 pools, 5 is okay.
> 
> Part of the answer is what these nodes look like.  Are they dedicated to 
> Ceph?  How much RAM and CPU?
> 
> Conventional wisdom is that when using NVMe SSDs to offload WAL+DB from HDDs 
> you want one NVMe SSD to back at most 10x HDD OSDs.  Do you have your 1TB 
> NVMe SSDs dedicating 250GB to each of the 4 HDDs?  Or do you have them sliced 
> smaller?  If you don’t have room on them for additional HDD OSDs that 
> complicates the proposition.
> 
> Sometimes people use PCIe to M.2 adapters to fit in additional NVMe drives, 
> but take care to look into PCIe bifurcation etc. when selecting a card to 
> accept more than one M.2 NVMe SSD.
> 
> Are your NVMe drives enterprise-class?
> 
>> On Mar 23, 2025, at 10:14 PM, Alan Murrell <a...@t-net.ca> wrote:
>> 
>> Hello,
>> 
>> We have a 5-node cluster that each have the following drives:
>> 
>> * 4 x 16TB HDD
>> * 4 x 2TB NVME
>> * 1 x 1TB NVME (for the WAL/DB for the HDDs)
>> 
>> The nodes don't have any more room to add more NVMEs, but they do have room 
>> to add four more HDDs.  I know adding more HDDs can make the cluster faster 
>> due to the additional IOPs.
>> 
>> So my question is this:
>> 
>> Is it better to:
>> 
>> * Add the additional drives/IOPs by adding an additional node
>> * Add the additional drives by adding the HDDs to the existing 
>> nodes
>> 
>> Or does it not really matter?  I would prefer to add the drives to the 
>> existing nodes (ultimately maxing them out)
> 
> Please share what your nodes are like to inform suggestions.  I’ve recently 
> seen a cluster deployed with 8+2 EC on only 10 nodes and inadequate CPU.  
> When things went pear-shaped it really, really wasn’t pretty.  How many 
> SAS/SATA drive bays do your nodes have for HDDs?  Like most things in tech 
> there are disagreements, but a rule of thumb is 2x vcores / threads per HDD 
> OSD, 4-6 for NVMe OSDs.  And extra for the OS, mons, mgrs, RGWs, etc.
> 
>> , but just wondering if that affects performance as much as expanding by 
>> adding additional nodes.
>> 
>> Thanks! :-)
>> 
>> Sent from my mobile device.  Please excuse brevity and ttpos.
> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
