Hi,

We recently experienced an incident in one of our production Ceph clusters and would greatly appreciate the community's input.
Our configuration uses erasure coding with a 4+2 profile for the main S3 data pools. During the incident, two storage hosts, each containing several OSDs, became unavailable simultaneously. As a result, a large number of PGs went inactive because the number of surviving fragments fell below the minimum required to keep them active under the EC 4+2 profile. This triggered a cascading failure in the RGW service layer:

• Incoming S3 requests continued flowing into HAProxy.
• RGW attempted to process requests that required I/O to inactive PGs.
• The stalled I/O caused queues and memory usage on the RGWs to grow without bound.
• Multiple RGW processes were terminated by the OOM killer.
• The RGW daemons repeatedly restarted and were again selected by HAProxy.
• The cycle continued, leading to severe service disruption.

This behavior persisted until the failed hosts were recovered and the PGs returned to an active+clean state.

One interesting observation: a separate set of RGW daemons dedicated to internal consumers continued operating normally. Their traffic load was significantly lower, which prevented queue saturation and memory exhaustion.

We are seeking guidance on two key questions:

1. In an EC 4+2 pool design, is such a cascading RGW failure inevitable once enough hosts fail to make PGs unavailable, or are there established best practices to prevent this behavior?
2. Given that the internal RGWs remained healthy, is isolating customers by assigning separate RGW groups to high-traffic and low-traffic workloads considered a recommended architectural approach?

Thank you in advance for any advice and shared experiences.

Best regards,
Ramin
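
P.S. For context, the data pool's EC profile was created roughly along these lines (reconstructed from memory; the profile name, the pool name, and the host failure domain are simplifications on my part, not copy-pasted from the cluster):

    # 4 data + 2 coding shards, one shard per host
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    # with k=4 m=2 the usual default min_size is k+1 = 5, so a PG that has lost
    # two shards stops serving I/O even though the data is still reconstructable
    ceph osd pool get default.rgw.buckets.data min_size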
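
For the front end, HAProxy balances the public S3 traffic across the customer-facing RGWs with basic checks, roughly like this (simplified; hostnames and ports are placeholders):

    backend rgw_public
        balance roundrobin
        # a plain TCP check only verifies that the port accepts connections, so an
        # RGW that is stuck waiting on inactive PGs (or freshly restarted after an
        # OOM kill) still looks healthy and keeps receiving requests
        server rgw1 rgw1.example.internal:8080 check
        server rgw2 rgw2.example.internal:8080 check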
