Hello everyone,
We’re running a 30+ node cluster (with plans to add more) to provide S3
services to our consumers.
Each node has:
- 64 cores, 300+ GB RAM (our 5 monitor nodes are separate from the OSD
nodes and have less memory)
- 14 NVMe (14TB) or 14 HDD (10TB) disks
- Each OSD node also hosts RGW services (pods).
- HP/Dell/Lenovo hardware
- Around 6 PB Raw capacity.
- Roughly 3 CPU cores and 8 GB RAM are allocated per OSD.
- We use Rook to deploy our cluster
I know that some of the answers might be long; I also have my own
experience with some of these questions (which I've included), but I
would greatly appreciate seeing the state of Ceph S3 as of 04/2025.
Here are the questions we'd love your input on:
* Performance Counters
- We want to enable performance counters to identify noisy buckets and
diagnose issues when they occur.
- Does enabling performance counters cause noticeable performance
degradation? If so, by how much?
- For now, we plan to keep them disabled (because we are not sure of the
impact).
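For context, what we're considering is a sketch like the following, assuming Reef or later (the labeled per-bucket/per-user op counters were added there; the option names and the exact daemon address for the admin socket are worth verifying against your release):

```shell
# Sketch, not a recommendation: enable RGW's labeled per-bucket and
# per-user op counters (Reef onward; verify option names for your release).
ceph config set client.rgw rgw_bucket_counters_cache true
ceph config set client.rgw rgw_user_counters_cache true
# The labeled counters then show up in the RGW daemon's `counter dump`
# output via its admin socket (daemon name here is a placeholder):
ceph daemon client.rgw.<id> counter dump
```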
* OS Tuning for Ceph Nodes
- Write cache is disabled on the NVMe disks.
- No multi-socket nodes (no NUMA configuration).
- No other workloads running on Ceph nodes.
- Plan is to keep the default installation unless strongly advised
otherwise.
- Is there anything else critical we should tune (kernel parameters,
etc.)?
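For reference, these are the kinds of kernel knobs we've seen discussed for Ceph nodes; the values below are illustrative examples, not recommendations:

```shell
# Illustrative sysctls sometimes reviewed on Ceph nodes; defaults are
# often fine, and these values are examples rather than recommendations.
sysctl -w vm.swappiness=10        # keep OSD memory resident
sysctl -w fs.aio-max-nr=1048576   # headroom for BlueStore's async I/O
sysctl -w kernel.pid_max=4194304  # OSD daemons spawn many threads
```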
* Tracking Over-utilization
- How can we detect and identify users overloading the cluster
(e.g., excessive object creation/deletion, heavy throughput)?
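The one built-in mechanism we've found so far is the RGW usage log, which records per-user/per-bucket operation and byte counts. A sketch (it is off by default and adds some write overhead, so we haven't enabled it yet):

```shell
# Enable the RGW usage log (disabled by default; small write-side cost):
ceph config set client.rgw rgw_enable_usage_log true
# Query it later, e.g. totals since April 2025, without raw log entries:
radosgw-admin usage show --start-date=2025-04-01 --show-log-entries=false
# Or a per-user view (uid is a placeholder):
radosgw-admin usage show --uid=<user-id> --show-log-entries=false
```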
* NVMe Configuration for Ceph
- Beyond disabling cache, what are the best practices around
discard/trimming operations for NVMe drives?
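For completeness, these are the BlueStore discard options we're aware of; they are off by default because behaviour varies per drive and firmware, and the option names have shifted between releases, so verify against your version and test on your exact drive models first:

```shell
# Let BlueStore issue discards itself (off by default; test carefully,
# and check the option names against your Ceph release):
ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true
```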
* Buying New NVMe Disks
- When we order disks, we request 2 batches per disk group (e.g., 100
disks from Batch1, 100 from Batch2). We are looking for more
professional recommendations: is it better to split across batches from
a single vendor, or to mix disk groups from two different vendors?
- We always check the rebranding information for NVMe disks and try to
group them by it.
* Tuning Ceph for Large Scale S3
- What default parameters should be adjusted for a large-scale S3
deployment?
* Customer Limitations
- In theory we don't need any limitations, but we don't want to see
problems like unsharded buckets, etc. If we allow everything, our
customers will find ways to abuse it. What parameters will protect us
from operational difficulties?
- Are there standard limitations we should enforce, like maximum object
size, maximum number of objects per bucket, etc.?
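The knobs we know of so far are RGW quotas and bucket index sharding; a sketch (the values are invented placeholders, and older releases may want `--max-size` in bytes rather than with a unit suffix):

```shell
# Per-user quota sketch (placeholder uid and invented limits):
radosgw-admin quota set --quota-scope=user --uid=<user-id> \
    --max-objects=10000000 --max-size=5T
radosgw-admin quota enable --quota-scope=user --uid=<user-id>
# Dynamic bucket resharding (on by default in recent releases) keeps
# index shards bounded; this shows the per-shard trigger threshold:
ceph config get client.rgw rgw_max_objs_per_shard
```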
* Inventory Management and Netbox Integration
- Is it a good practice to integrate Netbox into Ceph operations?
- For example, should we register all disks in Netbox for easier daily
maintenance, add automation around Netbox, and drive disk replacement
via inventory changes in Netbox?
* Cluster Deployment Best Practices
- Ansible, Rook, custom deployments, or manual host installations:
what's best for large clusters? This might depend on team and company
structure, but it would be nice to hear opinions.
* Why Some Large Deployments Avoid Rook
- I mostly see Rook used only for OpenStack HCI deployments (maybe I am
wrong on this).
- Is it about operational complexity, or is Rook seen as unsuitable for
very large clusters?
* Monitoring Metrics
- What are the most critical metrics for diagnosing S3 issues?
* RocksDB Tuning and Archiving
- How should we manage RocksDB in large-scale setups?
- Should we periodically split/archive the DB?
- How important is RocksDB latency monitoring?
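The commands we're aware of for inspecting and maintaining RocksDB per OSD, as a sketch:

```shell
# Inspect BlueFS/RocksDB-related counters via an OSD's admin socket:
ceph daemon osd.0 perf dump bluefs
# Manually compact one OSD's RocksDB (I/O heavy; schedule off-peak):
ceph tell osd.0 compact
```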
* OS-Level Monitoring
- Besides standard exporter logs, what else should we monitor (e.g.,
network latency, OSD-to-OSD throughput)?
- We are planning to build a separate exporter, alongside ceph_exporter,
to expose some specific metrics (mostly disk- and network-related).
* Design Mistakes to Avoid
- What are the common mistakes (e.g., settings that block easy upgrades)
that we should be aware of when planning a large-scale S3 service?
* Vendor Relations
- Is there a resource to check disk types, endurance, wear-out
timelines, etc.? When we order new hardware, the vendor doesn't provide
us with any such data.
- Maybe there is a group of Ceph users holding internal information
about different vendors and their hardware; if not, I would be happy to
start one.
- What's the best approach when dealing with hardware vendors: is buying
the cheapest or buying the most reliable the right call?
* Scheduled Operations
- Are there standard Ansible playbooks or procedures for recurring
maintenance tasks?
* Unsupported Ceph Versions
- Some large deployments stick with Ceph 16.x (unsupported).
- Why do they prefer being alone on the old version and backporting
features themselves? This seems very dangerous, so there has to be a
valid reason; what is it?
* Cluster Health Monitoring
- Do large deployments rely only on HEALTH_OK?
- Or do they combine metrics for a custom health status definition?
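One approach we're considering is combining the overall status with the individual check codes. A minimal sketch: the embedded sample stands in for real `ceph status -f json` output, and the sed parsing is illustrative only (jq would be the sane choice in practice; it also only grabs the first check code):

```shell
# Derive a custom health signal rather than trusting HEALTH_OK alone.
# Sample JSON below stands in for: ceph status -f json
status='{"health":{"status":"HEALTH_WARN","checks":{"OSD_NEARFULL":{}}}}'
# Overall status string (HEALTH_OK / HEALTH_WARN / HEALTH_ERR):
overall=$(printf '%s' "$status" | sed -n 's/.*"status":"\([A-Z_]*\)".*/\1/p')
# First individual check code (real parsing should use jq):
first_check=$(printf '%s' "$status" | sed -n 's/.*"checks":{"\([A-Z_]*\)".*/\1/p')
echo "$overall $first_check"
```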
* Security Best Practices
- What's the best approach for securing the cluster?
- How should we handle disk encryption, key rotation (frequency/method),
and secure disposal?
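For at-rest encryption specifically, the mechanism we're aware of is dmcrypt at OSD deployment time (Rook appears to expose the equivalent via `encryptedDevice: "true"` in the CephCluster storage spec; worth verifying against its docs). A sketch:

```shell
# Prepare an OSD with dmcrypt at-rest encryption (device path is a
# placeholder); the dmcrypt key is kept in the monitors' config-key store:
ceph-volume lvm prepare --bluestore --dmcrypt --data /dev/nvme0n1
```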
* Diversity of Pools
- If our service is focused purely on S3, should we still consider using
other pool types (RBD, CephFS)?
* Disk Size Diversity
- Is it problematic to have OSDs with different disk sizes?
* Multisite Clusters
- In a multisite setup, if replication is enabled for some pools, is the
performance impact localized to those pools, or does it affect the
entire cluster?
* WAF or Smart Proxy for RGW nodes
- Our idea is to run RGW on each of our OSD nodes (monitor and mgr nodes
would be excluded from any such role). What would be the best design
besides putting a load balancer in front of these RGW services? How are
others designing RGW: is it safe enough to expose it directly?
I would greatly appreciate your answers/opinions on these topics; I
assume most of these comments will come from real use cases and
experience.
Thanks in advance,
_
Senol Colak
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io