Hello everyone,

We're running a 30+ node cluster (and plan to add more) to provide S3 services to our consumers.

Each node has:
- 64 cores, 300+ GB RAM (monitors have less memory; we run 5 monitor nodes separate from the OSD nodes)
- 14 NVMe (14TB) or 14 HDD (10TB) disks
- Each OSD node also hosts RGW services (pods).
- HP/Dell/Lenovo hardware
- Around 6 PB raw capacity.
- Roughly 3 CPU cores and 8 GB RAM are allocated per OSD.
- We use Rook to deploy our cluster.

I know that some of the answers might be long. I have my own experience with some of these questions (included below), but I would highly appreciate seeing the current state of Ceph S3 as of April 2025.

Here are several questions we'd love your input on:

* Performance Counters
- We want to enable performance counters to identify noisy buckets and diagnose issues when they occur.
- Does enabling performance counters cause noticeable performance degradation? If so, by how much?
- For now we plan to keep them disabled, since we are unsure of the impact.
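If you do decide to experiment: recent releases (Reef and later) added labeled per-user/per-bucket op counters for RGW, gated behind cache options. A hedged sketch; verify the option and command names against your release's documentation before relying on them:

```
# Enable labeled per-bucket/per-user RGW op counters (Reef+; verify names):
ceph config set client.rgw rgw_bucket_counters_cache true
ceph config set client.rgw rgw_user_counters_cache true
# Then read the labeled counters from a running RGW's admin socket:
ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok counter dump
```

The caches bound memory use, so the overhead is mostly per-op counter updates; measuring the impact on one canary RGW before fleet-wide rollout seems prudent.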

* OS Tuning for Ceph Nodes
- WriteCache is disabled for NVMe disks.
- No multi-socket nodes (no NUMA configuration).
- No other workloads running on Ceph nodes.
- The plan is to keep the default installation unless strongly advised otherwise.
- Is there anything else critical we should tune (kernel parameters, etc.)?
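For reference, these are the sysctls that come up most often for all-flash Ceph nodes. The parameter names are standard Linux kernel knobs, but the values below are illustrative starting points, not tuned recommendations; benchmark before adopting any of them:

```
# /etc/sysctl.d/90-ceph.conf -- illustrative starting points only
kernel.pid_max = 4194304        # many OSD/RGW threads per node
vm.swappiness = 10              # avoid swapping OSD memory
net.core.somaxconn = 1024       # deeper accept queue for RGW
net.core.rmem_max = 16777216    # larger socket buffers for OSD replication
net.core.wmem_max = 16777216
```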

* Tracking Over-utilization
- How can we detect and identify users overloading the cluster (e.g., excessive object creation/deletion, heavy throughput)?
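One place to start: with the RGW usage log enabled (`rgw_enable_usage_log`), `radosgw-admin usage show` emits per-user aggregates that are easy to rank. A minimal Python sketch; the JSON layout below is a simplified assumption modeled on that output, so verify the field names against your release:

```python
import json

# Simplified sample modeled on `radosgw-admin usage show --format json`;
# the real JSON layout may differ by Ceph version -- check yours.
sample = json.loads("""
{
  "summary": [
    {"user": "alice", "total": {"ops": 120000, "bytes_sent": 9000000, "bytes_received": 500000000}},
    {"user": "bob",   "total": {"ops": 3000,   "bytes_sent": 100000,  "bytes_received": 2000000}},
    {"user": "carol", "total": {"ops": 45000,  "bytes_sent": 700000,  "bytes_received": 80000000}}
  ]
}
""")

def top_users(usage, key="ops", n=3):
    """Rank users by an aggregate counter to spot over-utilization."""
    rows = [(u["user"], u["total"][key]) for u in usage["summary"]]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

print(top_users(sample))                     # heaviest callers first
print(top_users(sample, key="bytes_received", n=1))  # heaviest ingest
```

The same ranking works per bucket if you feed it `radosgw-admin bucket stats` output instead; alerting on the top-N deltas between scrapes catches bursty abusers.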

* NVMe Configuration for Ceph
- Beyond disabling cache, what are the best practices around discard/trimming operations for NVMe drives?
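One note here: BlueStore consumes NVMe devices raw, so filesystem-level `fstrim` does not apply; BlueStore has its own discard support. A hedged sketch (both options exist in recent releases, but defaults and async behavior vary by version, and discard can hurt tail latency on some drives, so benchmark first):

```
# BlueStore-level discard -- off by default; test before enabling fleet-wide
ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true
```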

* Buying New NVMe Disks
- When we order disks, we request two batches per disk group (e.g., 100 disks from batch 1, 100 from batch 2).
- We are looking for more professional recommendations: is splitting across batches a good idea, or should we instead combine disk groups from two different vendors?
- We always check the rebranding information for NVMe disks and try to group them by it.

* Tuning Ceph for Large Scale S3
- What default parameters should be adjusted for a large-scale S3 deployment?

* Customer Limitations
- In theory we don't need any limitations, but we don't want to see problems such as oversized, unsharded bucket indexes. If we allow everything, our customers will find ways to abuse it.
- Which parameters will protect us from operational difficulties?
- Are there standard limitations we should enforce, such as a maximum object size or a maximum number of objects per bucket?
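For the max-objects/max-size cases, RGW already ships user- and bucket-scope quotas. A sketch for reference; `tenant1` and the numbers are placeholders:

```
# Per-user cap (placeholder uid and limits):
radosgw-admin quota set --quota-scope=user --uid=tenant1 \
    --max-objects=10000000 --max-size=5497558138880   # 5 TiB, in bytes
radosgw-admin quota enable --quota-scope=user --uid=tenant1

# Per-bucket cap on the same user, to bound index size per bucket:
radosgw-admin quota set --quota-scope=bucket --uid=tenant1 --max-objects=1000000
radosgw-admin quota enable --quota-scope=bucket --uid=tenant1
```

Bucket-scope object caps plus dynamic resharding left enabled are the usual defense against a single bucket's index growing without bound.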

* Inventory Management and Netbox Integration
- Is it good practice to integrate NetBox into Ceph operations?
- For example, should we register all disks in NetBox for easier daily maintenance, add automation on top of NetBox, and drive disk replacements via inventory changes in NetBox?

* Cluster Deployment Best Practices
- Ansible, Rook, custom deployments, or manual host installations: what works best for large clusters? The answer might depend on team and company structure, but it would be nice to hear opinions.

* Why Some Large Deployments Avoid Rook
- I mostly see Rook used for OpenStack HCI deployments (maybe I'm wrong about this).
- Is it a matter of operational complexity, or is Rook seen as unsuitable for very large clusters?

* Monitoring Metrics
- What are the most critical metrics for diagnosing S3 issues?

* RocksDB Tuning and Archiving
- How should we manage RocksDB in large-scale setups?
- Should we periodically split/archive the DB?
- How important is RocksDB latency monitoring?

* OS-Level Monitoring
- Besides the standard exporter metrics, what else should we monitor (e.g., network latency, OSD-to-OSD throughput)?
- We are planning to build a separate exporter alongside ceph_exporter to export some specific metrics (mostly disk- and network-related).

* Design Mistakes to Avoid
- What are the common mistakes (e.g., settings that block easy upgrades) that we should be aware of when planning a large-scale S3 service?

* Vendor Relations
- Is there a resource for checking disk types, endurance, wear-out timelines, etc.? When we order new hardware, the vendor doesn't provide us any such data.
- Maybe there is a group of Ceph users holding internal information about different vendors and their hardware; if not, I would be happy to start one.
- What's the best approach when dealing with hardware vendors: is buying the cheapest or buying the most reliable the right approach?

* Scheduled Operations
- Are there standard Ansible playbooks or procedures for recurring maintenance tasks?

* Unsupported Ceph Versions
- Some large deployments stick with Ceph 16.x, which is no longer supported.
- Why do they prefer being alone on an old version and backporting features themselves? This seems very dangerous, so there must be a valid reason.

* Cluster Health Monitoring
- Do large deployments rely only on HEALTH_OK?
- Or do they combine metrics for a custom health status definition?
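In my experience many operators do layer their own checks on top of `ceph status --format json` rather than trusting HEALTH_OK alone. A hedged Python sketch; the field names in the sample are assumptions modeled on that output, and the thresholds are placeholders to tune:

```python
import json

# Sample modeled loosely on `ceph status --format json`; verify the
# field names against your cluster's actual output before using this.
status = json.loads("""
{
  "health": {"status": "HEALTH_OK"},
  "pgmap": {"num_pgs": 8192, "pgs_by_state": [
      {"state_name": "active+clean", "count": 8190},
      {"state_name": "active+clean+scrubbing", "count": 2}]},
  "osdmap": {"num_osds": 420, "num_up_osds": 420, "num_in_osds": 420}
}
""")

def custom_health(s, max_down_osds=0, min_clean_ratio=0.99):
    """Combine several signals instead of trusting HEALTH_OK alone."""
    problems = []
    if s["health"]["status"] != "HEALTH_OK":
        problems.append("ceph health not OK")
    osd = s["osdmap"]
    if osd["num_osds"] - osd["num_up_osds"] > max_down_osds:
        problems.append("too many OSDs down")
    clean = sum(p["count"] for p in s["pgmap"]["pgs_by_state"]
                if p["state_name"].startswith("active+clean"))
    if clean / s["pgmap"]["num_pgs"] < min_clean_ratio:
        problems.append("too few clean PGs")
    return ("OK", []) if not problems else ("DEGRADED", problems)

print(custom_health(status))
```

Extending the same function with RGW latency and S3 error-rate metrics gives a service-level health definition rather than a storage-level one.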

* Security Best Practices
- What's the best approach for securing the cluster?
- How should we handle disk encryption, key rotation (frequency/method), and secure disposal?

* Diversity of Pools
- If our service is focused purely on S3, should we still consider using other pool types (RBD, CephFS)?

* Disk Size Diversity
- Is it problematic to have OSDs with different disk sizes?

* Multisite Clusters
- In a multisite setup, if replication is enabled for some pools, is the performance impact localized to those pools, or does it affect the entire cluster?

* WAF or Smart Proxy for RGW nodes
- Our idea is to run an RGW on each of our OSD nodes (monitor and manager nodes would be excluded from any such duty).
- Besides putting a load balancer in front of these RGW services, what would be the best design? How are others designing their RGW layer? Is it safe enough to expose RGW directly?
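For what it's worth, a common pattern is haproxy in front of the RGWs, terminating TLS and health-checking each instance. A minimal sketch; addresses, ports, certificate path, and the health-check request are all placeholders to adapt:

```
frontend rgw_front
    bind *:443 ssl crt /etc/haproxy/rgw.pem
    mode http
    default_backend rgw_back

backend rgw_back
    mode http
    balance leastconn
    option httpchk GET /
    server rgw1 10.0.0.11:7480 check
    server rgw2 10.0.0.12:7480 check
```

A proxy tier also gives you a single place for rate limiting and request logging, which helps with the abuse-detection questions above.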

I would highly appreciate your answers and opinions on these topics. I assume most of these comments will come from real use cases and experience.

Thanks in advance,
_
Senol Colak
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
