Hello,

AD> What use-case(s)?  Are your pools R3, EC? Mix?

My use case is storage for virtual machines (Proxmox).

AD> I like to solve first for at least 9-10 nodes, but assuming that you’re 
using replicated size=3 pools, 5 is okay.

Yes, I am using replication=3

AD> Conventional wisdom is that when using NVMe SSDs to offload WAL+DB from 
HDDs you want one NVMe SSD to back at most 10x HDD OSDs.  Do you have your 1TB 
NVMe SSDs dedicating 250GB to each of the 4 HDDs?  Or do you have them sliced 
smaller?  If you don’t have room on them for additional HDD OSDs that 
complicates the proposition.

Oh, that is interesting; I thought you needed about 4% of the HDD space for 
DB/WAL, so we sliced the SSDs into 250GB partitions.
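
Just to sanity-check that 4% figure against our layout (plain arithmetic, 
nothing Ceph-specific; as I understand it the guideline is really more like 
1-4% depending on workload, so treat this as illustrative):

    # Figures from this thread; purely illustrative
    hdd_gb = 16 * 1000              # one 16 TB HDD, in decimal GB
    slice_gb = 250                  # current DB/WAL slice per HDD
    print(hdd_gb * 0.04)            # 640.0 GB -> what a strict 4% rule would want
    print(100 * slice_gb / hdd_gb)  # 1.5625 -> percent the 250 GB slices actually give

So the 250GB slices are closer to ~1.6% of each 16TB OSD than 4%, and the 
four of them already use up the whole 1TB device.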

AD> Sometimes people use PCIe to M.2 adapters to fit in additional NVMe drives, 
but take care to look into PCIe bifurcation etc. when selecting a card to 
accept more than one M.2 NVMe SSD.

Yes, we do have four very thin NVMe slots available (U.3), and we are actually 
using an adapter in one of them for the SSD that holds the WAL/DB for the HDDs. 
We were going to add another SSD for any additional HDDs we might add, as we 
didn't want to have *all* the HDDs' WAL/DB on a single SSD; our understanding 
is that if that SSD fails before we are able to replace it, we lose every OSD 
that has its DB/WAL on it.  At this point, losing four OSDs is a reasonable 
risk, but if we were to add four more HDDs to each node, we wouldn't want to 
lose eight OSDs at once.  Of course, we have monitoring on the system, so we 
will *hopefully* know when we are about to lose one of the SSDs (via the SMART 
attribute that reports how much of the rated lifetime has been used), but the 
SSD could also just fail out of the blue.
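
For reference, here is a minimal sketch of how one might map out that blast 
radius, i.e. which OSDs would go down together with a given DB/WAL SSD.  It 
assumes "ceph osd metadata" reports a "bluefs_db_devices" field (field names 
can vary between releases, so treat it as illustrative rather than exact):

    #!/usr/bin/env python3
    # Sketch: group OSDs by the device backing their BlueFS DB, so the
    # "how many OSDs do we lose if this SSD dies" question is easy to answer.
    # Assumes "ceph osd metadata" is in PATH and emits JSON that includes
    # hostname, id and bluefs_db_devices (may differ on older releases).
    import json
    import subprocess
    from collections import defaultdict

    meta = json.loads(subprocess.check_output(
        ["ceph", "osd", "metadata", "--format", "json"]))

    by_db = defaultdict(list)
    for osd in meta:
        db_dev = osd.get("bluefs_db_devices") or "(colocated / unknown)"
        by_db[(osd.get("hostname", "?"), db_dev)].append(osd["id"])

    for (host, dev), ids in sorted(by_db.items()):
        print(f"{host} {dev}: " + ", ".join(f"osd.{i}" for i in ids))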

AD> Are your NVMe drives enterprise-class?

The NVMe drives we are using for actual storage in the cluster are 
enterprise-class, but admittedly the SSDs we are using for the WAL/DB are not. 
That is actually what I meant when I said we didn't have room for more NVMe 
drives: we technically still have three of those "very thin" slots available 
(U.3 connector types, I think?), but we are currently using one of them, with 
an M.2 adapter, for the WAL/DB SSD.

-----Original Message-----
From: Anthony D'Atri <anthony.da...@gmail.com> 
Sent: March 23, 2025 22:57
To: Alan Murrell <a...@t-net.ca>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Question about cluster expansion

What use-case(s)?  Are your pools R3, EC? Mix?

I like to solve first for at least 9-10 nodes, but assuming that you’re using 
replicated size=3 pools, 5 is okay.

Part of the answer is what these nodes look like.  Are they dedicated to Ceph?  
How much RAM and CPU?

Conventional wisdom is that when using NVMe SSDs to offload WAL+DB from HDDs 
you want one NVMe SSD to back at most 10x HDD OSDs.  Do you have your 1TB NVMe 
SSDs dedicating 250GB to each of the 4 HDDs?  Or do you have them sliced 
smaller?  If you don’t have room on them for additional HDD OSDs that 
complicates the proposition.

Sometimes people use PCIe to M.2 adapters to fit in additional NVMe drives, but 
take care to look into PCIe bifurcation etc. when selecting a card to accept 
more than one M.2 NVMe SSD.

Are your NVMe drives enterprise-class?

> On Mar 23, 2025, at 10:14 PM, Alan Murrell <a...@t-net.ca> wrote:
>
> Hello,
>
> We have a 5-node cluster that each have the following drives:
>
>  * 4 x 16TB HDD
>  * 4 x 2TB NVME
>  * 1 x 1TB NVME (for the WAL/DB for the HDDs)
>
> The nodes don't have any more room to add more NVMEs, but they do have room 
> to add four more HDDs.  I know adding more HDDs can make the cluster 
> faster due to the additional IOPS.
>
> So my question is this:
>
> Is it better to:
>
>  * Add the additional drives/IOPS by adding an additional node
>  * Add the additional drives by adding the HDDs to the existing nodes
>
> Or does it not really matter?  I would prefer to add the drives to the 
> existing nodes (ultimately maxing them out)

Please share what your nodes are like to inform suggestions.  I’ve recently 
seen a cluster deployed with 8+2 EC on only 10 nodes and inadequate CPU.  When 
things went pear-shaped it really, really wasn’t pretty.  How many SAS/SATA 
drive bays do your nodes have for HDDs?  Like most things in tech there are 
disagreements, but a rule of thumb is 2x vcores / threads per HDD OSD, 4-6 for 
NVMe OSDs.  And extra for the OS, mons, mgrs, RGWs, etc.

> , but just wondering if that affects performance as much as expanding by 
> adding additional nodes.
>
> Thanks! :-)
>
> Sent from my mobile device.  Please excuse brevity and typos.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
