Hello everyone,
We’re running a 30+ node cluster (with plans to add more) to provide S3
services to our consumers.
Each node has:
- 64 cores, 300+ GB RAM (our 5 monitor nodes are separate from the OSD
nodes and have less memory)
- 14 NVMe (14TB) or 14 HDD (10TB) disks
- Each OSD node also hosts RGW services (pods).
- HP/Dell/Lenovo hardware
- Around 6 PB Raw capacity.
- Roughly 3 CPU cores and 8 GB RAM are allocated per OSD.
- We use Rook to deploy our cluster
I know that some of the answers might be long; I also have my own
experience with some of these questions (which I've included), but I
would greatly appreciate seeing the state of Ceph S3 as of 04/2025.
Here are the questions we'd love your input on:
* Performance Counters
- We want to enable performance counters to identify noisy buckets and
diagnose issues when they occur.
- Does enabling performance counters cause noticeable performance
degradation? If so, by how much?
- For now, we plan to keep them disabled (because we are not sure of the
impact).
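For context, what we're considering is a sketch like the following, assuming Reef or later (the labeled per-bucket/per-user op counters were added there; the option names and the exact daemon address for the admin socket are worth verifying against your release):

```shell
# Sketch, not a recommendation: enable RGW's labeled per-bucket and
# per-user op counters (Reef onward; verify option names for your release).
ceph config set client.rgw rgw_bucket_counters_cache true
ceph config set client.rgw rgw_user_counters_cache true
# The labeled counters then show up in the RGW daemon's `counter dump`
# output via its admin socket (daemon name here is a placeholder):
ceph daemon client.rgw.<id> counter dump
```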
* OS Tuning for Ceph Nodes
- Write cache is disabled on the NVMe disks.
- No multi-socket nodes (no NUMA configuration).
- No other workloads running on Ceph nodes.
- Plan is to keep the default installation unless strongly advised
otherwise.
- Is there anything else critical we should tune (kernel parameters,
etc.)?
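For reference, these are the kinds of kernel knobs we've seen discussed for Ceph nodes; the values below are illustrative examples, not recommendations:

```shell
# Illustrative sysctls sometimes reviewed on Ceph nodes; defaults are
# often fine, and these values are examples rather than recommendations.
sysctl -w vm.swappiness=10        # keep OSD memory resident
sysctl -w fs.aio-max-nr=1048576   # headroom for BlueStore's async I/O
sysctl -w kernel.pid_max=4194304  # OSD daemons spawn many threads
```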
* Tracking Over-utilization
- How can we detect and identify users overloading the cluster
(e.g., excessive object creation/deletion, heavy throughput)?
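The one built-in mechanism we've found so far is the RGW usage log, which records per-user/per-bucket operation and byte counts. A sketch (it is off by default and adds some write overhead, so we haven't enabled it yet):

```shell
# Enable the RGW usage log (disabled by default; small write-side cost):
ceph config set client.rgw rgw_enable_usage_log true
# Query it later, e.g. totals since April 2025, without raw log entries:
radosgw-admin usage show --start-date=2025-04-01 --show-log-entries=false
# Or a per-user view (uid is a placeholder):
radosgw-admin usage show --uid=<user-id> --show-log-entries=false
```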
* NVMe Configuration for Ceph
- Beyond disabling cache, what are the best practices around
discard/trimming operations for NVMe drives?
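For completeness, these are the BlueStore discard options we're aware of; they are off by default because behaviour varies per drive and firmware, and the option names have shifted between releases, so verify against your version and test on your exact drive models first:

```shell
# Let BlueStore issue discards itself (off by default; test carefully,
# and check the option names against your Ceph release):
ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true
```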
* Buying New NVMe Disks
- When we order disks, we request 2 batches per disk group (e.g., 100
disks from Batch1, 100 from Batch2). We are looking for more
professional recommendations: is it better to split across batches from
a single vendor, or to mix disk groups from two different vendors?
- We always check the rebranding information for NVMe disks and try to
group them by it.
* Tuning Ceph for Large Scale S3
- What default parameters should be adjusted for a large-scale S3
deployment?
* Customer Limitations
- In theory we don't need any limitations, but we don't want to see
problems like unsharded buckets, etc. If we allow everything, our
customers will find ways to abuse it. What parameters will protect us
from operational difficulties?
- Are there standard limitations we should enforce, like maximum object
size, maximum number of objects per bucket, etc.?
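The knobs we know of so far are RGW quotas and bucket index sharding; a sketch (the values are invented placeholders, and older releases may want `--max-size` in bytes rather than with a unit suffix):

```shell
# Per-user quota sketch (placeholder uid and invented limits):
radosgw-admin quota set --quota-scope=user --uid=<user-id> \
    --max-objects=10000000 --max-size=5T
radosgw-admin quota enable --quota-scope=user --uid=<user-id>
# Dynamic bucket resharding (on by default in recent releases) keeps
# index shards bounded; this shows the per-shard trigger threshold:
ceph config get client.rgw rgw_max_objs_per_shard
```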
* Inventory Management and Netbox Integration
- Is it a good practice to integrate Netbox into Ceph operations?
- For example, should we register all disks in Netbox for easier daily
maintenance, add automation around Netbox, and drive disk replacement
via inventory changes in Netbox?
* Cluster Deployment Best Practices
- Ansible, Rook, custom deployments, or manual host installations:
what's best for large clusters? This might depend on team and company
structure, but it would be nice to hear opinions.
* Why Some Large Deployments Avoid Rook
- I mostly see Rook used only for OpenStack HCI deployments (maybe I am
wrong on this).
- Is it about operational complexity, or is Rook seen as unsuitable for
very large clusters?
* Monitoring Metrics
- What are the most critical metrics for diagnosing S3 issues?
* RocksDB Tuning and Archiving
- How should we manage RocksDB in large-scale setups?
- Should we periodically split/archive the DB?
- How important is RocksDB latency monitoring?
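The commands we're aware of for inspecting and maintaining RocksDB per OSD, as a sketch:

```shell
# Inspect BlueFS/RocksDB-related counters via an OSD's admin socket:
ceph daemon osd.0 perf dump bluefs
# Manually compact one OSD's RocksDB (I/O heavy; schedule off-peak):
ceph tell osd.0 compact
```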
* OS-Level Monitoring
- Besides standard exporter logs, what else should we monitor (e.g.,
network latency, OSD-to-OSD throughput)?
- We are planning to build a separate exporter, alongside ceph_exporter,
to expose some specific metrics (mostly disk- and network-related).
* Design Mistakes to Avoid
- What are the common mistakes (e.g., settings that block easy upgrades)
that we should be aware of when planning a large-scale S3 service?
* Vendor Relations
- Is there a resource to check disk types, endurance, wear-out
timelines, etc.? When we order new hardware, the vendor doesn't provide
us with any such data.
- Maybe there is a group of Ceph users holding internal information
about different vendors and their hardware; if not, I would be happy to
start one.
- What's the best approach when dealing with hardware vendors: is buying
the cheapest or buying the most reliable the right call?
* Scheduled Operations
- Are there standard Ansible playbooks or procedures for recurring
maintenance tasks?
* Unsupported Ceph Versions
- Some large deployments stick with Ceph 16.x (unsupported).
- Why do they prefer being alone on the old version and backporting
features themselves? This seems very dangerous, so there has to be a
valid reason; what is it?
* Cluster Health Monitoring
- Do large deployments rely only on HEALTH_OK?
- Or do they combine metrics for a custom health status definition?
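One approach we're considering is combining the overall status with the individual check codes. A minimal sketch: the embedded sample stands in for real `ceph status -f json` output, and the sed parsing is illustrative only (jq would be the sane choice in practice; it also only grabs the first check code):

```shell
# Derive a custom health signal rather than trusting HEALTH_OK alone.
# Sample JSON below stands in for: ceph status -f json
status='{"health":{"status":"HEALTH_WARN","checks":{"OSD_NEARFULL":{}}}}'
# Overall status string (HEALTH_OK / HEALTH_WARN / HEALTH_ERR):
overall=$(printf '%s' "$status" | sed -n 's/.*"status":"\([A-Z_]*\)".*/\1/p')
# First individual check code (real parsing should use jq):
first_check=$(printf '%s' "$status" | sed -n 's/.*"checks":{"\([A-Z_]*\)".*/\1/p')
echo "$overall $first_check"
```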
* Security Best Practices
- What's the best approach for securing the cluster?
- How should we handle disk encryption, key rotation (frequency/method),
and secure disposal?
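For at-rest encryption specifically, the mechanism we're aware of is dmcrypt at OSD deployment time (Rook appears to expose the equivalent via `encryptedDevice: "true"` in the CephCluster storage spec; worth verifying against its docs). A sketch:

```shell
# Prepare an OSD with dmcrypt at-rest encryption (device path is a
# placeholder); the dmcrypt key is kept in the monitors' config-key store:
ceph-volume lvm prepare --bluestore --dmcrypt --data /dev/nvme0n1
```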
* Diversity of Pools
- If our service is focused purely on S3, should we still consider using
other pool types (RBD, CephFS)?
* Disk Size Diversity
- Is it problematic to have OSDs with different disk sizes?
* Multisite Clusters
- In a multisite setup, if replication is enabled for some pools, is the
performance impact localized to those pools, or does it affect the
entire cluster?
* WAF or Smart Proxy for RGW nodes
- Our idea is to run RGW on each of our OSD nodes (monitor and mgr nodes
would be excluded from any such role). What would be the best design
besides putting a load balancer in front of these RGW services? How are
others designing RGW: is it safe enough to expose it directly?
I would greatly appreciate your answers/opinions on these topics; I
assume most of these comments will come from real use cases and
experience.
Thanks in advance,
_
Senol Colak
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io