Hello Eugen,

Thank you very much for your detailed explanation and helpful guidance.

Your point about temporarily lowering min_size to 4 in order to avoid inactive PGs is well taken. We fully agree that keeping min_size at k would introduce unacceptable risk, so this would only be applied as a controlled, short-term measure during a failure event (a rough sketch of what we have in mind is included below).

Regarding resiliency, we currently have limited hardware capacity, which prevents us from adding more OSD nodes or adjusting the EC profile immediately. However, this incident has made it clear that we need to revisit our failure-domain design and EC configuration in the near future.

Your clarification about RGW segmentation was also very useful. It makes sense that the internal RGWs only appeared healthy because of their lower traffic, not because of any architectural isolation. We will consider that observation more carefully going forward.
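For reference, the kind of controlled, temporary adjustment we have in mind is roughly the sketch below. It simply shells out to the ceph CLI; the pool list, thresholds and helper functions are placeholders for our environment rather than a tested procedure, so please treat it as an illustration only.

#!/usr/bin/env python3
"""Rough sketch of a controlled, temporary min_size adjustment for an
EC 4+2 pool during a two-host failure. Pool names are placeholders."""

import subprocess

# Hypothetical pool list -- replace with the actual EC data pools.
EC_POOLS = ["default.rgw.buckets.data"]

NORMAL_MIN_SIZE = 5    # k + 1 for a 4+2 profile (the safe default)
DEGRADED_MIN_SIZE = 4  # k; acceptable only briefly, during the failure event


def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def lower_min_size_temporarily() -> None:
    """Let IO continue with only k shards while two hosts are down."""
    for pool in EC_POOLS:
        ceph("osd", "pool", "set", pool, "min_size", str(DEGRADED_MIN_SIZE))
        print(f"{pool}: min_size lowered to {DEGRADED_MIN_SIZE} (temporary!)")


def restore_min_size() -> None:
    """Revert to k + 1 as soon as the failed hosts are back and the PGs
    are active+clean again."""
    for pool in EC_POOLS:
        ceph("osd", "pool", "set", pool, "min_size", str(NORMAL_MIN_SIZE))
        print(f"{pool}: min_size restored to {NORMAL_MIN_SIZE}")


if __name__ == "__main__":
    # Check cluster state before touching anything.
    print(ceph("health", "detail"))
    lower_min_size_temporarily()
    # ...later, once the failed hosts are recovered:
    # restore_min_size()

Longer term, as you suggest, a profile with m >= 3 (for example k=4, m=3 with crush-failure-domain=host) would keep min_size above k even with two hosts down, although as far as we understand that means creating a new pool and migrating the data, since k and m cannot be changed on an existing EC pool.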
Thank you again for taking the time to share your experience. We truly appreciate the community's support and will incorporate these recommendations into our next mitigation plan.

Best regards,
Ramin

On Tue, Oct 28, 2025 at 2:55 PM Eugen Block <[email protected]> wrote:
> Hi,
>
> first, you could have prevented inactive PGs by temporarily reducing
> min_size to 4 (default is k + 1). But don't leave min_size at k,
> that's too dangerous.
>
> > 1. In an EC 4+2 pool design, is such a cascading RGW failure considered
> > inevitable if enough hosts fail to cause PG unavailability, or are there
> > established best practices to prevent this behavior?
>
> I am not surprised about the RGW failure, although we haven't seen
> that happening on any cluster yet. How many OSD hosts do you have in
> total? You could add more nodes and use a different EC profile to be
> able to sustain the loss of two hosts without service interruption.
> But it really depends on your actual resiliency requirements. Now that
> you lost two and had a service disruption, it might be a good idea to
> reconsider.
>
> > 2. Based on our observation that internal RGWs remained healthy, is
> > isolating customers by assigning separate RGWs to high-traffic and
> > low-traffic groups considered a recommended architectural approach?
>
> I don't think it makes a difference, inactive PGs stall IO. Maybe
> there just wasn't enough client traffic to overload the internal RGWs,
> which looks as if they were "healthy"? But from a general perspective,
> it can make sense to dedicate RGWs to specific tasks.
>
> Regards,
> Eugen
>
> Zitat von Ramin Najjarbashi <[email protected]>:
>
> > Hi
> >
> > We recently experienced an incident in one of our production Ceph clusters
> > and would greatly appreciate the community’s input.
> >
> > Our configuration uses Erasure Coding with a 4+2 profile for the main S3
> > data pools. During the incident, two storage hosts that contained several
> > OSDs each became unavailable simultaneously. As a result, a large number of
> > PGs entered an inactive state because the number of remaining fragments
> > was insufficient for reconstruction under the EC 4+2 profile.
> >
> > This triggered a cascading failure in the RGW service layer:
> >
> > • Incoming S3 requests continued flowing into HAProxy.
> > • RGW attempted to process requests that required I/O to inactive PGs.
> > • I/O stalls caused excessive queue buildup and memory usage on RGWs.
> > • Multiple RGW processes were terminated by the OOM Killer.
> > • RGW daemons repeatedly restarted and were again selected by HAProxy.
> > • The cycle continued, leading to severe service disruption.
> >
> > This behavior continued until the failed hosts were recovered and PGs
> > regained a clean and active state.
> >
> > One interesting observation was that a separate set of RGW daemons
> > dedicated only to internal consumers continued operating normally. Their
> > traffic load was significantly lower, which prevented queue saturation
> > and memory exhaustion.
> >
> > We are seeking guidance on two key questions:
> >
> > 1. In an EC 4+2 pool design, is such a cascading RGW failure considered
> > inevitable if enough hosts fail to cause PG unavailability, or are there
> > established best practices to prevent this behavior?
> > 2. Based on our observation that internal RGWs remained healthy, is
> > isolating customers by assigning separate RGWs to high-traffic and
> > low-traffic groups considered a recommended architectural approach?
> >
> > Thank you in advance for any advice and shared experiences.
> >
> > Best regards,
> > Ramin
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
