Hi

We recently experienced an incident in one of our production Ceph clusters
and would greatly appreciate the community’s input.

Our configuration uses Erasure Coding with a 4+2 profile for the main S3
data pools. During the incident, two storage hosts, each containing several
OSDs, became unavailable simultaneously. As a result, a large number of PGs
went inactive because too few shards per PG remained available to satisfy
the EC 4+2 profile, i.e. the affected PGs fell below the pool's min_size.
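
For context, the EC profile and data pool behind this were created roughly
like this (profile name, pool name, and PG counts are illustrative):

    # EC profile with 4 data and 2 coding shards, host as the failure domain
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host

    # RGW data pool backed by that profile
    ceph osd pool create default.rgw.buckets.data 1024 1024 erasure ec-4-2

Note that recent Ceph releases default an EC pool's min_size to k+1 (5 here),
so a PG with only 4 surviving shards stops serving I/O even though its data
is still reconstructable.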

This triggered a cascading failure in the RGW service layer:

• Incoming S3 requests continued flowing into HAProxy.
• RGW attempted to process requests that required I/O to inactive PGs.
• I/O stalls caused excessive queue buildup and memory usage on RGWs.
• Multiple RGW processes were terminated by the OOM Killer.
• RGW daemons repeatedly restarted and were again selected by HAProxy (see
the HAProxy sketch after this list).
• The cycle continued, leading to severe service disruption.
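
For reference, the HAProxy backend in front of the affected RGWs looks
roughly like this (addresses, ports, and check thresholds are illustrative,
and the health-check path assumes RGW's default Swift API prefix):

    backend rgw_external
        balance roundrobin
        # HTTP health check against each RGW rather than a bare TCP check;
        # the path assumes the default Swift API prefix
        option httpchk GET /swift/healthcheck
        http-check expect status 200
        # take a server out after 3 failed checks, re-add it after 2 successes
        default-server check inter 2s fall 3 rise 2
        server rgw1 10.0.0.11:8080
        server rgw2 10.0.0.12:8080

With checks like these, a freshly restarted RGW is marked up again within a
few seconds and immediately receives external traffic, which is how the
restart loop kept feeding itself.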

This behavior continued until the failed hosts were recovered and the PGs
returned to an active+clean state.

One interesting observation was that a separate set of RGW daemons
dedicated only to internal consumers continued operating normally. Their
traffic load was significantly lower, which prevented queue saturation and
memory exhaustion.
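
For what it is worth, in cephadm terms such a split between an external and
an internal RGW set can be expressed roughly like this (service ids,
placement labels, counts, and ports are illustrative):

    service_type: rgw
    service_id: external
    placement:
      label: rgw-external
      count: 4
    spec:
      rgw_frontend_port: 8080
    ---
    service_type: rgw
    service_id: internal
    placement:
      label: rgw-internal
      count: 2
    spec:
      rgw_frontend_port: 8081

with the load balancer pointing external S3 clients only at the first group
and internal consumers only at the second.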

We are seeking guidance on two key questions:

1. In an EC 4+2 pool design, is such a cascading RGW failure considered
inevitable if enough hosts fail to cause PG unavailability, or are there
established best practices to prevent this behavior?
2. Based on our observation that internal RGWs remained healthy, is
isolating customers by assigning separate RGWs to high-traffic and
low-traffic groups considered a recommended architectural approach?

Thank you in advance for any advice and shared experiences.

Best regards,
Ramin
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
