>> Since you’re calling them thin, I’m thinking that they’re probably
>> E3.S.  U.3 is the size of a conventional 2.5” SFF SSD or HDD.
> 
> Hrm, my terminology is probably confusing.  According to the specs of
> the servers, they are U.3 slots.  

Ah.  I forget sometimes that there are both 7mm and 15mm drive heights.

>> Understandable, but you might think in terms of percentage.  If you
>> add four HDD OSDs to each node, with 8 per NVMe offload device, that
>> device is the same overall percentage of the cluster as what you have
>> today.
> 
> But I also think of it in terms of re-setting up four OSDs as opposed
> to eight :-)

Honestly, that is one reason I recommend that people find a way to deploy 
all-NVMe chassis, which can even cost LESS than HDD.

>> so if you suffer a power outage you may be in a world of hurt.
> 
> But only if 3+ nodes lose power/get "rudely" rebooted first, correct?

If you have 5 nodes with R3 pools, losing power to 2 of them could leave data 
unavailable without manual intervention.  Loss of 3+ could result in permanent 
data loss.  Data in flight can easily be lost or corrupted.  I’ve experienced 
this myself when testing resilience by dropping power to an entire rack at 
once, which is also a terrific and terrifying way to expose flaws in expensive 
RAID HBAs, against which I’ve ranted on this list for years.
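
To put rough numbers on that, a back-of-envelope sketch in Python (my own 
illustrative model: host failure domain, size=3 / min_size=2, PGs spread 
uniformly over host triples; nothing measured on a real cluster):

    # Fraction of PGs affected when hosts in a 5-node, size=3 cluster go dark.
    # Purely combinatorial; assumes PGs land evenly on all host triples.
    from math import comb

    HOSTS, SIZE = 5, 3

    def frac_pgs(hosts_down, replicas_lost):
        """Fraction of PGs with exactly `replicas_lost` replicas on downed hosts."""
        return (comb(hosts_down, replicas_lost)
                * comb(HOSTS - hosts_down, SIZE - replicas_lost)
                / comb(HOSTS, SIZE))

    # 2 hosts down: these PGs keep only 1 replica, drop below min_size=2, go inactive.
    print(f"2 hosts down: {frac_pgs(2, 2):.0%} of PGs inactive")         # 30%
    # 3 hosts down: these PGs have no surviving replica at all.
    print(f"3 hosts down: {frac_pgs(3, 3):.0%} of PGs with zero copies")  # 10%

So under those assumptions even a two-node outage can leave a sizable chunk of 
PGs inactive until power returns or you intervene.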

> Just bringing this back to my original question: since we have the room
> to add up to four more HDDs to each of our existing 5 nodes, if we
> wanted to add an additional 20 HDDs altogether, is there any real
> performance difference between adding them to the existing nodes or by
> adding 5 more nodes?

This is Ceph, so the answer is It Depends.  If your nodes have sufficient RAM, 
CPU, and networking, there might not be a measurable difference.  More nodes 
would have the advantage of each node having a smaller blast radius in terms of 
percentages, and would also give you the potential for using more-advantageous 
EC profiles should you wish in the future.
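
To make the blast-radius and EC point concrete, a quick sketch (the drive 
counts and the "spare host" margin are just my illustrative assumptions; 
adjust for your actual layout):

    # Compare "densify the existing nodes" vs "add more nodes".  With a host
    # failure domain, an EC profile needs at least k+m hosts, and ideally a
    # spare host so recovery has somewhere to go after a failure.
    def summarize(nodes, hdds_per_node):
        blast_radius = 1 / nodes                       # capacity share of one node
        # k+m profiles (m=2 shown) that still leave one spare host for recovery
        ec_profiles = [(k, 2) for k in range(2, nodes) if k + 2 <= nodes - 1]
        print(f"{nodes} nodes x {hdds_per_node} HDDs: one node is "
              f"{blast_radius:.0%} of capacity, EC candidates {ec_profiles}")

    summarize(5, 8)    # 5 nodes: only 2+2 fits, 100% space overhead
    summarize(10, 4)   # 10 nodes: up to 7+2, ~29% space overhead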

> I could see that there might be, as by adding more nodes, the IOPs are
> spread across a bigger footprint, and less likely to saturate the
> bandwidth

Network?  Depends on your links.  It’s harder to saturate a network with HDDs 
than with SSDs, especially NVMe, but with, say, a 1GbE network without proper 
bonding your nodes could conceivably saturate the links.  Denser nodes with as 
many as 180 OSDs each (I’ve seen it proposed) or a more modest number of NVMe 
SSDs can easily saturate even faster network links, especially if their hash 
policies aren’t ideal.  Dense HDD nodes can also saturate HBAs, backplanes, and 
expanders.
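
Rough arithmetic on that, if it helps (the throughput figures are typical 
datasheet-ish numbers, not measurements from your hardware):

    # Can a node's drives outrun its network link?  Ignores protocol overhead
    # and replication traffic, which only make the picture worse.
    def check(drives, mb_s_per_drive, link_gbps):
        aggregate = drives * mb_s_per_drive            # MB/s the drives can push
        link = link_gbps * 1000 / 8                    # link capacity in MB/s
        verdict = "saturates" if aggregate > link else "fits"
        print(f"{drives} drives @ {mb_s_per_drive} MB/s vs {link_gbps} GbE: {verdict}")

    check(4, 150, 1)      # a few HDDs already swamp a single 1 GbE link
    check(4, 150, 10)     # the same drives fit comfortably in 10 GbE
    check(4, 3000, 25)    # a handful of NVMe SSDs can overrun even 25 GbE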

> , as opposed to being more concentrated, but then I am not
> 100% sure that it works that way?  Maybe it just matters more that
> there are more spinners available to increase the total IOPs?

With modern rotational media (I doubt we have drum OSDs but that’d be really 
cool) that’s often the case — IOPS limited by the interface, and by 
rotational/seek latency.
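
For spinners the ceiling is basically mechanical.  A quick illustration with 
typical 7200 rpm figures (datasheet-ish assumptions, not measurements):

    # Random service time per op is roughly seek time plus half a rotation.
    def hdd_random_iops(rpm=7200, avg_seek_ms=8.5):
        rotational_ms = 0.5 * 60_000 / rpm     # half a revolution, in ms
        return 1000 / (avg_seek_ms + rotational_ms)

    print(f"~{hdd_random_iops():.0f} random IOPS per 7200 rpm HDD")   # roughly 75-80

So aggregate IOPS scale mostly with the number of spindles, not with which 
node they happen to live in.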

A rough rule of thumb is 2 vcores / threads per HDD OSD, though a 1TB HDD OSD 
and a 30TB OSD might have different demands.  CPU-limited systems would tend to 
benefit from more nodes, as would those with limited physmem.  The latter of 
course is often easier to augment; the default osd_memory_target is 4GB, so 6GB 
per OSD + other daemons + OS overhead is a good target.  For NVMe OSDs, 
depending on who you ask, figure 4-6 vcores / threads each.
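
Plugging those rules of thumb into a quick sizing sketch (the OS and 
other-daemon overheads are my guesses; season to taste):

    # Per-node budget for HDD OSDs using ~2 vcores and ~6 GB RAM per OSD,
    # plus allowances for other daemons and the OS (illustrative values).
    def node_budget(hdd_osds, other_daemons_gb=8, os_gb=8):
        vcores = 2 * hdd_osds
        ram_gb = 6 * hdd_osds + other_daemons_gb + os_gb
        return vcores, ram_gb

    for osds in (4, 8):
        cores, ram = node_budget(osds)
        print(f"{osds} HDD OSDs/node: ~{cores} vcores, ~{ram} GB RAM")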

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
