> We’re running a 30+ node cluster (+ plan to add more) to provide S3
> services to our consumers.
These days that’s medium scale ;)

> Each node has:
> - 64 Cores, 300+ GB RAM (Monitors have less memory, we have 5 monitors
>   separated from OSD nodes)
> - 14 NVMe (14TB) or 14 HDD (10TB) disks

64 cores == 128 threads? Intel or AMD?

> - Each OSD node also hosts RGW services (pods).
> - HP/Dell/Lenovo hardware

Suggest disabling deep C-states, e.g. via the TuneD latency-performance
profile.

> - roughly 3 CPU cores and 8GB RAM are allocated per OSD sizing.

Cores or threads? With Ceph my sense is that if the CPU does
hyperthreading, it should be enabled, so that would mean 128 vcores /
threads per node. That works out to ~8 threads per OSD plus OS and other
daemons, which is ample, especially for HDD OSDs. Assuming you have
enough HDD nodes, ideally spread across multiple racks, I might consider
running more than one RGW on each of those to exploit the cores, and
labeling them so that prom, grafana, etc., if deployed, favor those
nodes.

> * OS Tuning for Ceph Nodes
> - WriteCache is disabled for NVMe disks.

On the drives themselves? I tend to think of that as more important for
HDDs.

> - No multi-socket nodes (no NUMA configuration).
> - No other workloads running on Ceph nodes.
> - Plan is to keep the default installation unless strongly advised
>   otherwise.
> - Is there anything else critical we should tune (kernel parameters,
>   etc.)?

https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/ is a good
place to start. If you’re using AMD CPUs, disable the IOMMU. Jack up
Linux sysctl and nf_conntrack parameters. With AMD CPUs, NPS settings
can still make a difference even on single-socket nodes. I suggest not
applying K8s `limits` and letting the OSDs autotune osd_memory_target,
if that’s possible with Rook.

> * Buying New NVMe Disks
> - When we order disks, we request 2 batches per disk group (e.g., 100
>   disks from Batch1, 100 from Batch2). We are looking for more
>   professional recommendations. Is it a good idea to have different
>   batches or shall we combine two different vendor disk groups?
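On the "jack up sysctl and nf_conntrack" point above, a sketch of the
kind of drop-in file I mean. The specific values are illustrative
assumptions for nodes with fast NICs, not tested recommendations; start
from the 1 TiB/s blog post and tune for your own hardware:

```shell
# /etc/sysctl.d/90-ceph.conf -- illustrative values, tune for your gear
# Larger socket buffers for heavy east-west OSD replication traffic
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# A 30+ node cluster with many OSDs and RGWs holds a LOT of connections;
# don't let conntrack start dropping them
net.netfilter.nf_conntrack_max = 1048576
```

Apply with `sysctl --system` after dropping the file in place.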
It’s a classic dream to have a different manufacturer for each failure
domain, but procurement reality often makes that infeasible. I would
suggest ensuring that you can get firmware updates either from the
drive manufacturer(s) or the chassis vendor(s). Firmware updates on
SSDs especially can be VERY important.

> * Tuning Ceph for Large Scale S3
> - What default parameters should be adjusted for a large-scale S3
>   deployment?

Run at least one RGW on every node. Set mon_max_pg_per_osd to 1000. Set
mon_target_pg_per_osd to 200 for the HDD OSDs and 300 for the NVMe
OSDs. Ensure that the RGW log, meta, and index pools use a CRUSH rule
constrained to only the ssd/nvme device class, and that the pools you
use for the HDDs likewise constrain.

> * Customer Limitations
> - In theory we don't need any limitations, but we don't want to see
>   non-sharding objects, etc. If we allow everything our customers
>   will find ways to abuse this. What parameters will protect us from
>   operational difficulties?

Pay attention to any large omaps that arise and deal with them before
they paint you into a corner. Find out your expected object size
distribution. If you have a lot of small objects, say <256KB, you might
consider multiple storage classes, e.g. an R3 default SC for small
objects, with larger objects directed to an EC and/or HDD pool.

> - Are there standard limitations we should enforce, like maximum
>   object size, maximum number of objects per bucket, etc.?

Small objects stress HDDs and are ideally placed on SSDs. Large objects
handle EC and HDDs better.

> * Inventory Management and Netbox Integration
> - Is it a good practice to integrate Netbox into Ceph operations?

Netbox is good stuff. I love it for hosts.

> - Like, should we register all disks in Netbox for easier daily
>   maintenance, adding automation to the netbox and replacing the
>   disks via inventory changes on netbox?

Interesting idea, though I’m not sure how Netbox would factor into
maintenance.
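Circling back to the parameter advice above, a sketch of it as Ceph CLI
commands. The rule and pool names are examples; also note the per-class
200/300 target split isn’t a single global knob (mon_target_pg_per_osd
is read cluster-wide by the autoscaler), so only the global form is
shown here — verify against your release:

```shell
# Raise the hard PG-per-OSD ceiling and the autoscaler target
ceph config set global mon_max_pg_per_osd 1000
ceph config set global mon_target_pg_per_osd 200

# Replicated rule constrained to the nvme device class
# ("rgw-meta-nvme" is an example name; failure domain = host)
ceph osd crush rule create-replicated rgw-meta-nvme default host nvme

# Point the RGW index pool at it (default-zone pool name shown)
ceph osd pool set default.rgw.buckets.index crush_rule rgw-meta-nvme
```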
I usually find that getting drive model/SN/firmware into Prometheus
does most of what I need.

> * Why Some Large Deployments Avoid Rook
> - I mostly see Rook deployments for Openstack HCI deployments only
>   (maybe I am wrong on this)
> - Is it about operational complexity, or is Rook seen as unsuitable
>   for very large clusters?

If you plan to use Ceph entirely *within* the K8s cluster, Rook can be
a natural fit, though beware that depending on your SDN choices you
might have MTU size issues. Rook does make it challenging to set
certain things, but this is constantly improving.

> * RocksDB Tuning and Archiving
> - How should we manage RocksDB in large-scale setups?
> - Should we periodically split/archive the DB?
> - How important is RocksDB latency monitoring?

That’s a very complex topic. My sense is to leave RocksDB tuning to the
experts within the Ceph community and roll with the defaults. It’s
possible to really shoot yourself in the foot. Are you planning to
offload HDD WAL+DB to SSDs, or leave them on the HDDs? For the most
part Ceph will do the needful re RocksDB. If you experience a massive
deletion you might benefit from some manual online compaction.

> * OS-Level Monitoring
> - Besides standard exporter logs, what else should we monitor (e.g.,
>   network latency, OSD-to-OSD throughput)?
> - We are planning to create a separate exporter from the
>   ceph_exporter to export some specific metrics (mostly disk and NW
>   related).

node_exporter gives you a lot of that out of the box. Using one or the
other SMART exporter can be additionally valuable. You can try
smartctl_exporter, or I can give you a no-strings script I whipped up.

> * Design Mistakes to Avoid
> - What are the common mistakes (e.g., settings that block easy
>   upgrades) that we should be aware of when planning a large-scale S3
>   service?
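An aside on the manual online compaction mentioned above: if you do
need it after a mass deletion, it can be triggered per OSD from the
admin socket interface, e.g.:

```shell
# Trigger an online RocksDB compaction on a single OSD (id 12 is an example)
ceph tell osd.12 compact

# Or sweep the whole cluster, one OSD at a time --
# expect elevated latency on each OSD while its compaction runs
for id in $(ceph osd ls); do
    ceph tell "osd.${id}" compact
done
```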
Using HDDs ;)

Be sure that EVERY pool uses a CRUSH rule that specifies the device
class, including the .mgr pool, which when last I ran Rook was
automatically created without one. I had to edit the CRUSH map manually
to address this, which was straightforward. Use the HDDs only for the
bucket data pool(s). EC is common; I’d start with 4,2 according to your
needs, but again, very small objects can be hotspots and are best
placed on replicated pools, especially on SSD.

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?gid=358760253#gid=358760253

> * Vendor Relations
> - Is there a resource to check disk types, endurance, wearout
>   timelines, etc.? When we order new HW the vendor doesn't provide us
>   any data.

Not sure what you mean by this. Mostly a read-intensive SSD is fine for
Ceph; paying more for mixed-use isn’t worth it. Exporters as described
above can collect information. If using Solidigm and/or Samsung SSDs,
at least, the timed workload feature of the firmware can help predict
lifetime.

> * Unsupported Ceph Versions
> - Some large deployments stick with Ceph 16.x (unsupported).
> - Why do they prefer being alone on the old version and backporting
>   features themselves? This seems very dangerous (there has to be a
>   valid reason why?)

Some organizations are using non-containerized deployments for various
reasons, and may be constrained by an organizational requirement to
stick with a really old OS.

> * Cluster Health Monitoring
> - Do large deployments rely only on HEALTH_OK?

Oh no.

> - Or do they combine metrics for a custom health status definition?

Alertmanager comes with a rich set of rules right out of the box.

> * Security Best Practices
> - What's the best approach for securing the cluster?
> - How should we handle disk encryption, key rotation
>   (frequency/method), and secure disposal?

SED / Opal are complicated; I know there are schemes out there for
managing the keys but am not personally familiar with them.
Depending on your needs you might enable OSD-level encryption.

> * Diversity of Pools
> - If our service is focused purely on S3, should we still consider
>   using other pool types (RBD, CephFS)?

Only deploy what you need.

> * Disk Size Diversity
> - Is it problematic to have OSDs with different disk sizes?

Not at all, though there are benefits to keeping each node’s aggregate
capacity for each storage class more or less comparable.

> * WAF or Smart Proxy for RGW nodes
> - Our idea is to have RGW on each of our OSD nodes (Monitor and Mgr
>   would be excluded on any operation). What would be the best idea
>   besides having an LB in front of these RGW services? How are the
>   others designing RGW? Is it safe enough to just use it directly?

You will need some manner of LB to spread workload across multiple
RGWs. You can use your own haproxy/nginx setup, appliances like F5 or
Citrix, or let Ceph / Rook deploy an ingress service.

> I would highly appreciate your answers/opinions on these topics. I
> also assume that most of these comments would come from real use
> cases and experiences.
>
> Thanks in advance,
>
> Senol Colak
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io