> We’re running a 30+ node cluster (+ plan to add more) to provide S3
> services to our consumers.
These days that’s medium scale ;)

> Each node has:
> - 64 Cores, 300+ GB RAM (Monitors have less memory, we have 5 monitors
>   separated from OSD nodes)
> - 14 NVMe (14TB) or 14 HDD (10TB) disks

64 cores == 128 threads? Intel or AMD?

> - Each OSD node also hosts RGW services (pods).
> - HP/Dell/Lenovo hardware

Suggest disabling deep C-states, e.g. via the TuneD latency-performance
profile.

> - roughly 3 CPU cores and 8GB RAM are allocated per OSD sizing.

Cores or threads? With Ceph my sense is that if the CPU does
hyperthreading, it should be enabled, so that would mean 128 vcores /
threads per node. That works out to ~8 threads per OSD plus OS and other
daemons, which is ample, especially for HDD OSDs. Assuming you have
enough HDD nodes, ideally spread across multiple racks, I might consider
running more than one RGW on each of those to exploit the cores, and
labeling them so that prom, grafana, etc., if deployed, favor those
nodes.

> * OS Tuning for Ceph Nodes
> - WriteCache is disabled for NVMe disks.

On the drives themselves? I tend to think of that as more important for
HDDs.

> - No multi-socket nodes (no NUMA configuration).
> - No other workloads running on Ceph nodes.
> - Plan is to keep the default installation unless strongly advised
>   otherwise.
> - Is there anything else critical we should tune (kernel parameters,
>   etc.)?

https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/ is a good
place to start. If you’re using AMD CPUs, disable the IOMMU. Jack up
Linux sysctl and nf_conntrack parameters. With AMD CPUs, NPS settings
can still make a difference even on single-socket nodes. I suggest not
applying K8s `limits` and letting the OSDs autotune osd_memory_target,
if that’s possible with Rook.

> * Buying New NVMe Disks
> - When we order disks, we request 2 batches per disk group (e.g., 100
>   disks from Batch1, 100 from Batch2). We are looking for more
>   professional recommendations. Is it a good idea to have different
>   batches or shall we combine two different vendor disk groups?
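On the "jack up sysctl and nf_conntrack" point above, a sketch of the
kind of drop-in file I mean. The specific values are illustrative
assumptions for nodes with fast NICs, not tested recommendations; start
from the 1 TiB/s blog post and tune for your own hardware:

```shell
# /etc/sysctl.d/90-ceph.conf -- illustrative values, tune for your gear
# Larger socket buffers for heavy east-west OSD replication traffic
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# A 30+ node cluster with many OSDs and RGWs holds a LOT of connections;
# don't let conntrack start dropping them
net.netfilter.nf_conntrack_max = 1048576
```

Apply with `sysctl --system` after dropping the file in place.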
It’s a classic dream to have a different manufacturer for each failure
domain, but procurement reality often makes that infeasible. I would
suggest ensuring that you can get firmware updates either from the
drive manufacturer(s) or the chassis vendor(s). Firmware updates on
SSDs especially can be VERY important.

> * Tuning Ceph for Large Scale S3
> - What default parameters should be adjusted for a large-scale S3
>   deployment?

Run at least one RGW on every node. Set mon_max_pg_per_osd to 1000. Set
mon_target_pg_per_osd to 200 for the HDD OSDs and 300 for the NVMe
OSDs. Ensure that the RGW log, meta, and index pools use a CRUSH rule
constrained to only the ssd/nvme device class, and that the pools you
use for the HDDs likewise constrain.

> * Customer Limitations
> - In theory we don't need any limitations, but we don't want to see
>   non-sharding objects, etc. If we allow everything our customers
>   will find ways to abuse this. What parameters will protect us from
>   operational difficulties?

Pay attention to any large omaps that arise and deal with them before
they paint you into a corner. Find out your expected object size
distribution. If you have a lot of small objects, say <256KB, you might
consider multiple storage classes, e.g. an R3 default SC for small
objects, with larger objects directed to an EC and/or HDD pool.

> - Are there standard limitations we should enforce, like maximum
>   object size, maximum number of objects per bucket, etc.?

Small objects stress HDDs and are ideally placed on SSDs. Large objects
handle EC and HDDs better.

> * Inventory Management and Netbox Integration
> - Is it a good practice to integrate Netbox into Ceph operations?

Netbox is good stuff. I love it for hosts.

> - Like, should we register all disks in Netbox for easier daily
>   maintenance, adding automation to the netbox and replacing the
>   disks via inventory changes on netbox?

Interesting idea, though I’m not sure how Netbox would factor into
maintenance.
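Circling back to the parameter advice above, a sketch of it as Ceph CLI
commands. The rule and pool names are examples; also note the per-class
200/300 target split isn’t a single global knob (mon_target_pg_per_osd
is read cluster-wide by the autoscaler), so only the global form is
shown here — verify against your release:

```shell
# Raise the hard PG-per-OSD ceiling and the autoscaler target
ceph config set global mon_max_pg_per_osd 1000
ceph config set global mon_target_pg_per_osd 200

# Replicated rule constrained to the nvme device class
# ("rgw-meta-nvme" is an example name; failure domain = host)
ceph osd crush rule create-replicated rgw-meta-nvme default host nvme

# Point the RGW index pool at it (default-zone pool name shown)
ceph osd pool set default.rgw.buckets.index crush_rule rgw-meta-nvme
```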
I usually find that getting drive model/SN/firmware into Prometheus
does most of what I need.

> * Why Some Large Deployments Avoid Rook
> - I mostly see Rook deployments for Openstack HCI deployments only
>   (maybe I am wrong on this)
> - Is it about operational complexity, or is Rook seen as unsuitable
>   for very large clusters?

If you plan to use Ceph entirely *within* the K8s cluster, Rook can be
a natural fit, though beware that depending on your SDN choices you
might have MTU size issues. Rook does make it challenging to set
certain things, but this is constantly improving.

> * RocksDB Tuning and Archiving
> - How should we manage RocksDB in large-scale setups?
> - Should we periodically split/archive the DB?
> - How important is RocksDB latency monitoring?

That’s a very complex topic. My sense is to leave RocksDB tuning to the
experts within the Ceph community and roll with the defaults. It’s
possible to really shoot yourself in the foot. Are you planning to
offload HDD WAL+DB to SSDs, or leave them on the HDDs? For the most
part Ceph will do the needful re RocksDB. If you experience a massive
deletion you might benefit from some manual online compaction.

> * OS-Level Monitoring
> - Besides standard exporter logs, what else should we monitor (e.g.,
>   network latency, OSD-to-OSD throughput)?
> - We are planning to create a separate exporter from the
>   ceph_exporter to export some specific metrics (mostly disk and NW
>   related).

node_exporter gives you a lot of that out of the box. Using one or the
other SMART exporter can be additionally valuable. You can try
smartctl_exporter, or I can give you a no-strings script I whipped up.

> * Design Mistakes to Avoid
> - What are the common mistakes (e.g., settings that block easy
>   upgrades) that we should be aware of when planning a large-scale S3
>   service?
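An aside on the manual online compaction mentioned above: if you do
need it after a mass deletion, it can be triggered per OSD from the
admin socket interface, e.g.:

```shell
# Trigger an online RocksDB compaction on a single OSD (id 12 is an example)
ceph tell osd.12 compact

# Or sweep the whole cluster, one OSD at a time --
# expect elevated latency on each OSD while its compaction runs
for id in $(ceph osd ls); do
    ceph tell "osd.${id}" compact
done
```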
Using HDDs ;)

Be sure that EVERY pool uses a CRUSH rule that specifies the device
class, including the .mgr pool, which when last I ran Rook was
automatically created without one. I had to edit the CRUSH map manually
to address this, which was straightforward. Use the HDDs only for the
bucket data pool(s). EC is common; I’d start with 4,2 according to your
needs, but again, very small objects can be hotspots and are best
placed on replicated pools, especially on SSD.

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?gid=358760253#gid=358760253

> * Vendor Relations
> - Is there a resource to check disk types, endurance, wearout
>   timelines, etc.? When we order new HW the vendor doesn't provide us
>   any data.

Not sure what you mean by this. Mostly a read-intensive SSD is fine for
Ceph; paying more for mixed-use isn’t worth it. Exporters as described
above can collect information. If using Solidigm and/or Samsung SSDs,
at least, the timed workload feature of the firmware can help predict
lifetime.

> * Unsupported Ceph Versions
> - Some large deployments stick with Ceph 16.x (unsupported).
> - Why do they prefer being alone on the old version and backporting
>   features themselves? This seems very dangerous (there has to be a
>   valid reason why?)

Some organizations are using non-containerized deployments for various
reasons, and may be constrained by an organizational requirement to
stick with a really old OS.

> * Cluster Health Monitoring
> - Do large deployments rely only on HEALTH_OK?

Oh no.

> - Or do they combine metrics for a custom health status definition?

Alertmanager comes with a rich set of rules right out of the box.

> * Security Best Practices
> - What's the best approach for securing the cluster?
> - How should we handle disk encryption, key rotation
>   (frequency/method), and secure disposal?

SED / Opal are complicated; I know there are schemes out there for
managing the keys but am not personally familiar with them.
Depending on your needs you might enable OSD-level encryption.

> * Diversity of Pools
> - If our service is focused purely on S3, should we still consider
>   using other pool types (RBD, CephFS)?

Only deploy what you need.

> * Disk Size Diversity
> - Is it problematic to have OSDs with different disk sizes?

Not at all, though there are benefits to keeping each node’s aggregate
capacity for each storage class more or less comparable.

> * WAF or Smart Proxy for RGW nodes
> - Our idea is to have RGW on each of our OSD nodes (Monitor and Mgr
>   would be excluded on any operation). What would be the best idea
>   besides having an LB in front of these RGW services? How are the
>   others designing RGW? Is it safe enough to just use it directly?

You will need some manner of LB to spread workload across multiple
RGWs. You can use your own haproxy/nginx setup, appliances like F5 or
Citrix, or let Ceph / Rook deploy an ingress service.

> I would highly appreciate your answers/opinions on these topics. I
> also assume that most of these comments would come from real use
> cases and experiences.
>
> Thanks in advance,
>
> Senol Colak
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io