Hi,

First, you could have prevented the inactive PGs by temporarily reducing min_size to 4 (the default for a 4+2 profile is k + 1 = 5). But don't leave min_size at k permanently, that's too dangerous.
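For example, assuming your data pool is named default.rgw.buckets.data (adjust to your actual pool name):

  # check the current value (5 for a default 4+2 profile)
  ceph osd pool get default.rgw.buckets.data min_size

  # temporarily allow IO with only k shards available
  ceph osd pool set default.rgw.buckets.data min_size 4

  # revert as soon as recovery has finished
  ceph osd pool set default.rgw.buckets.data min_size 5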

1. In an EC 4+2 pool design, is such a cascading RGW failure considered
inevitable if enough hosts fail to cause PG unavailability, or are there
established best practices to prevent this behavior?

I am not surprised about the RGW failure, although we haven't seen that happen on any cluster yet. How many OSD hosts do you have in total? You could add more nodes and use a different EC profile to be able to sustain the loss of two hosts without service interruption. But it really depends on your actual resiliency requirements. Now that you have lost two hosts and had a service disruption, it might be a good idea to reconsider.
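For example, with at least seven OSD hosts a 4+3 profile would keep min_size = k + 1 satisfied even with two hosts down. Note that you can't change the profile of an existing pool, you would have to create a new pool with the new profile and migrate the data (profile and pool names below are just placeholders):

  # tolerate two host failures without dropping below min_size = 5
  ceph osd erasure-code-profile set ec43 k=4 m=3 crush-failure-domain=host
  ceph osd pool create <new-data-pool> erasure ec43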

2. Based on our observation that internal RGWs remained healthy, is
isolating customers by assigning separate RGWs to high-traffic and
low-traffic groups considered a recommended architectural approach?

I don't think it makes a difference here; inactive PGs stall IO either way. Maybe there just wasn't enough client traffic to overload the internal RGWs, which made them look "healthy"? But from a general perspective, it can make sense to dedicate RGWs to specific tasks.
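If you deploy with cephadm, separating them could look something like this (host names, labels and ports are made up for the example), and you would then point separate HAProxy backends at each group:

  # label the hosts that should run each RGW group
  ceph orch host label add host1 rgw-public
  ceph orch host label add host4 rgw-internal

  # one RGW service per group, each on its own port
  ceph orch apply rgw public --placement="label:rgw-public" --port=8080
  ceph orch apply rgw internal --placement="label:rgw-internal" --port=8081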

Regards,
Eugen

Quoting Ramin Najjarbashi <[email protected]>:

Hi

We recently experienced an incident in one of our production Ceph clusters
and would greatly appreciate the community’s input.

Our configuration uses Erasure Coding with a 4+2 profile for the main S3
data pools. During the incident, two storage hosts that contained several
OSDs each became unavailable simultaneously. As a result, a large number of
PGs entered an inactive state because the number of surviving fragments per
PG fell below the pool's min_size under the EC 4+2 profile.

This triggered a cascading failure in the RGW service layer:

• Incoming S3 requests continued flowing into HAProxy.
• RGW attempted to process requests that required I/O to inactive PGs.
• I/O stalls caused excessive queue buildup and memory usage on RGWs.
• Multiple RGW processes were terminated by the OOM Killer.
• RGW daemons repeatedly restarted and were again selected by HAProxy.
• The cycle continued, leading to severe service disruption.

This behavior continued until the failed hosts were recovered and PGs
regained a clean and active state.

One interesting observation was that a separate set of RGW daemons
dedicated only to internal consumers continued operating normally. Their
traffic load was significantly lower, which prevented queue saturation and
memory exhaustion.

We are seeking guidance on two key questions:

1. In an EC 4+2 pool design, is such a cascading RGW failure considered
inevitable if enough hosts fail to cause PG unavailability, or are there
established best practices to prevent this behavior?
2. Based on our observation that internal RGWs remained healthy, is
isolating customers by assigning separate RGWs to high-traffic and
low-traffic groups considered a recommended architectural approach?

Thank you in advance for any advice and shared experiences.

Best regards,
Ramin
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]

