Sounds like the confusion is indeed stemming from anchoring on behaviors of 502 and 504 as established prior art while not applying the same initial decision factors to 503.
> If we don't trust 5xx error codes, then wouldn't it make the most sense to
> throw CommitStateUnknown for any 5xx response?

I think this is the crux of the issue, and I agree with this statement entirely. It appears most prudent to treat any 5xx response as something that requires higher-level reconciliation of state before attempting any retries, so in general:

1. Any low-level client retries should not happen in response to 5xx errors.
2. These should propagate out as CommitStateUnknown consistently.

I think there's some variation in how folks are interpreting the meaning of being unable to "trust 5xx error codes", though. Certainly, if 503 is more universally understood to mean a *guarantee* that the backend *did not* and *will not* process the contents of the request, then we're talking about "not trusting that the server returned the correct 5xx error code". On the other hand, if the 503 RFC is understood to *not make guarantees* about whether the backend did or did not start processing the contents, then we're talking about "don't trust your assumptions about server state after receiving a 503".

Based on just a cursory search across different service domains, it looks like 503 does not generally guarantee that no processing occurred. For example:

1. At least one managed service observes the Istio layer returning 503s during a maintenance event when the underlying k8s application has not yet produced a response at all.
2. https://www.hostwinds.com/blog/503-error-common-causes-and-how-to-fix-it lists "3. Backend Fetch Failed", "4. Connection Timeout", and "7. Out of Server Resources" as 503 causes, all of which are semantically closer to a 502 than a 429 (where 429 is more similar to the listed "Server Overloaded" or "Server Under Maintenance" items).
3. https://support.google.com/business/thread/10553916/started-getting-this-error-today-googlejsonresponseexception-503-service-unavailable?hl=en shows some APIs explicitly stating 503 as a "partially processed" state:

   "code" : 503,
   "errors" : [ {
     "domain" : "global",
     "message" : "The service encountered a transient error. Part of the request may have succeeeded. Refer to details for which fields failed.",
     "reason" : "backendError"
   } ],

My two cents would be to go ahead with consistent handling of 500, 502, 503, and 504 at least. 501 and 505 might be universally understood to be fast rejections, but I don't feel strongly about those.
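To make that concrete, here is a rough sketch of the classification I have in mind. It is illustrative only (the class and method names are made up, not Iceberg APIs), and it deliberately ignores where the HTTP retry layer ends and where the catalog's commit handling begins:

// Illustrative sketch only: hypothetical class and method names, not actual Iceberg APIs.
public final class CommitStatusPolicy {

  public enum Outcome {
    RETRY,                // safe to retry at the HTTP client layer
    FAIL_FAST,            // definitive rejection, surface the error as-is
    COMMIT_STATE_UNKNOWN  // requires higher-level state reconciliation first
  }

  private CommitStatusPolicy() {
  }

  // Classify the final HTTP status observed for a commit request.
  public static Outcome classify(int status) {
    if (status == 429) {
      // Throttling: the server is saying it did not take the request, so a
      // low-level retry is safe.
      return Outcome.RETRY;
    }

    if (status >= 500 && status < 600) {
      // 500 / 502 / 503 / 504 (and any other 5xx): no guarantee about whether
      // the backend started or finished processing the commit, so do not
      // retry at this layer and propagate CommitStateUnknown instead.
      // (501 / 505 could arguably be FAIL_FAST, per the caveat above.)
      return Outcome.COMMIT_STATE_UNKNOWN;
    }

    // Remaining 4xx codes (e.g. 409 conflict) are definitive answers from the
    // server and should fail fast.
    return Outcome.FAIL_FAST;
  }
}

The point is just that the decision is made once, on the final status code, instead of being split between retry heuristics and commit error handling.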
On Wed, Jun 18, 2025 at 6:03 PM Prashant Singh <prashant010...@gmail.com> wrote:

> EDIT: corrected some typos:
>
> Hey Ryan,
> My reasoning was the following: we retry on 502 / 504 as well here <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/ExponentialHttpRequestRetryStrategy.java#L86>, we have a spec definition here <https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1070> stating that the commit state can be unknown, and so does the RFC <https://datatracker.ietf.org/doc/html/rfc9110#name-502-bad-gateway>, since something went wrong in the middle of processing the request. So when we retry on 502 and 504 we can conflict with ourselves: when we receive a 429 / 409, we don't know whether it is because we retried on a 502 and then got the 429 / 409, or because something else happened. So I thought it's better to throw commit state unknown if we know the commit has been retried.
>
> The Iceberg user report that accompanied this involved a 504: slack thread - https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>
> We have similar handling for Glue, as we did here https://github.com/apache/iceberg/pull/7198, since there are internal HTTP retries inside the Glue SDK.
>
> And hence we did it this way. I do agree that wrapping 429 / 503 is unnecessary and that we can miss the opportunity to clean up, but it doesn't hurt; the alternative is to capture all the status codes we received and inspect the permutations and combinations, which I wanted to avoid. Plus, I admit there was a bug in our service, which we are chasing and fixing as well, where it throws 503 instead of 502... but even with that fixed, the problem will still exist if retries are not handled properly.
>
> That was my thought process while doing this, and I'm happy to incorporate your feedback.
>
> Maybe another alternative is that we take 502 / 504 out of the retry code and just retry on 429 / 503, and clean up only when 409 or 503 or 501 is the final status code? The rest are all unknown.
>
> Best,
> Prashant Singh
>
> On Wed, Jun 18, 2025 at 4:21 PM Prashant Singh <prashant010...@gmail.com> wrote:
>
>> On Wed, Jun 18, 2025 at 12:22 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> Are we confident that the REST client fix is the right path?
>>>
>>> If I understand correctly, the issue is that there is a load balancer that is converting a 500 response into a 503 that the client will automatically retry, even though the 500 response is not retry-able. Then the retry results in a 409 because the commit conflicts with itself.
>>>
>>> I'm fairly concerned about a fix like this that operates on the assumption that we cannot trust the 5xx error code that was sent to the client. If we don't trust 5xx error codes, then wouldn't it make the most sense to throw CommitStateUnknown for any 5xx response?
>>>
>>> On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> Sorry I didn't get to reply here. I think the fix Ajantha is contributing is extremely important, but it is probably out of scope for a patch release. Because it will take a bit of manual intervention to fix after jumping to the next version, I think we should save this for 1.10.0, which also should come out pretty soon.
>>>>
>>>> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>
>>>>> Thank you for confirming over slack, Ajantha. I also double-checked with Russell offline; this seems to be a behaviour change which can't go in a patch release, so maybe 1.10 then. So I think we should be good for now. That being said, I will start working on getting an RC out with just this commit <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246> on top of 1.9.1, and will start a thread for that accordingly!
>>>>>
>>>>> Best,
>>>>> Prashant Singh
>>>>>
>>>>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>
>>>>>> Hey Ajantha,
>>>>>>
>>>>>> I was going to wait 2-3 days before cutting an RC anyway :), to see if anyone has an objection or some more *critical* changes to get in. Thank you for the heads up!
>>>>>>
>>>>>> Best,
>>>>>> Prashant Singh
>>>>>>
>>>>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>>> I have a PR that needs to be included in 1.9.2 as well!
>>>>>>>
>>>>>>> I was about to start a release discussion for 1.9.2 tomorrow.
>>>>>>>
>>>>>>> It is from an oversight during the partition stats refactoring (in 1.9.0). We missed that field ids are tracked in the spec during the refactoring! Because of it, schema field ids are not as per the partition stats spec. The solution is to implicitly recompute stats when old stats are detected with an invalid field id. Including the fix in 1.9.2 can reduce the impact of the issue, as new stats will have valid schema ids and corrupted stats will also be handled. The feature is not widely integrated yet and engine integrations are in progress for 1.10.0. But we still want to reduce the impact of the issue. I have chatted with Peter and Russell about this.
>>>>>>>
>>>>>>> The PR will be shared tomorrow.
>>>>>>>
>>>>>>> Just giving a heads up to wait a day or two for a release cut.
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 for a 1.9.2 release
>>>>>>>>
>>>>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Kevin,
>>>>>>>>> This goes well before 1.8; if you look at the issue my PR refers to, it was reported on Iceberg 1.7, and it has been there since the beginning of the IRC client. We were having similar debates on whether we should patch all the releases, but I think that requires a wider discussion; 1.9.2 sounds like the immediate next step!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Prashant Singh
>>>>>>>>>
>>>>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Prashant,
>>>>>>>>>>
>>>>>>>>>> This sounds like a good reason to do a patch release. I'm +1.
>>>>>>>>>>
>>>>>>>>>> Do you know if this is also affecting other minor versions? Do we also need to patch 1.8.x?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kevin Liu
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey all,
>>>>>>>>>>>
>>>>>>>>>>> A couple of users have recently reported a *critical table corruption issue* when using the Iceberg REST catalog client. This issue has existed from the beginning and can result in *committed snapshot data being incorrectly deleted*, effectively rendering the table unusable (i.e., corrupted). The root cause is *self-conflict during internal HTTP client retries*.
>>>>>>>>>>>
>>>>>>>>>>> More context can be found here:
>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/12818#issue-3000325526
>>>>>>>>>>> [2] https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>>>>>>
>>>>>>>>>>> We've seen this issue hit by users in the wider community and also internally on our end. I've submitted a fix here (now merged): https://github.com/apache/iceberg/pull/12818
>>>>>>>>>>>
>>>>>>>>>>> Given the *severity* and the *broad version impact*, I'd like to propose cutting a *1.9.2 patch release* to include this fix.
>>>>>>>>>>>
>>>>>>>>>>> I'm also happy to *volunteer as the release manager* for this release.
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to hearing the community's thoughts.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Prashant Singh