Re: [DISCUSS] Mark 503 error code as non-retryable for updateTable

Daniel Weeks Thu, 14 Aug 2025 10:00:50 -0700

I would agree with making 503 non-retryable by default.  Though I don't
necessarily agree with the spec change in the PR.  I feel like the only
thing that needs to be updated is the component definition in the spec to
say that clients may only retry if a `Retry-After` header is present.
Otherwise, they should treat it as non-retryable.


I believe we only included that language due to services like S3 using "503
Service Unavailable" as opposed to the more appropriate "429 Too Many
Requests" for throttling, but I don't think 503 is a good practice to
propagate for this scenario.  We don't call out 429 in the spec, but most
clients support this by default and I believe the reference implementation
does as well.

-Dan




On Mon, Aug 11, 2025 at 5:26 PM huaxin gao <[email protected]> wrote:

> I'm in favor of marking 503 as non-retryable for updateTable to avoid
> possible data corruption. Even though 503 can sometimes mean throttling, I
> think treating it as CommitStateUnknown is the safer choice for commit
> operations. We can look into better handling for throttling cases, like
> using 429, in the future.
>
> On Tue, Aug 5, 2025 at 11:00 AM Steve Loughran <[email protected]>
> wrote:
>
>>  503 is  throttling on many (but not all) of the AWS services (s3,
>> lambda, redshift..) - though not dynamodb or others that return 400 + some
>> message.
>>
>> If aws were to deploy an endpoint, they'd have to decide what to do. or
>> even better, the spec could say "there is a specific error code for
>> throttling, "429: Too Many Requests" to be returned in such a situation.
>>
>> On Sat, 2 Aug 2025 at 03:32, Dennis Huo <[email protected]> wrote:
>>
>>> PR LGTM, thanks for following up on this!
>>>
>>> Reiterating my comments in the PR, IMO especially now that Russell's
>>> https://github.com/apache/iceberg/pull/13449 is merged so that all the
>>> read-only methods can still perform low-level retries, this change makes
>>> sense as a safer default for commit behaviors.
>>>
>>> I could imagine a few potential follow-up features if people want to
>>> achieve better availability in situations where more out-of-band knowledge
>>> of specific server behaviors can be used on a case-by-case basis:
>>>
>>> 1. If anyone feels very strongly that in a given service environment
>>> some particular 5xx code really is safely retriable without reconciliation,
>>> we could consider extracting a runtime config that lists the non-retriable
>>> HTTP codes instead of only having the hard-coded switch statement coming at
>>> the cost of some code complexity
>>> 2. We could make it easier to inject a
>>> dynamically-classloaded HttpRequestRetryStrategy in the HttpClient (
>>> https://github.com/apache/iceberg/blob/a8fdb23682a7c083ea6ff9873f1531dd9d465aa7/core/src/main/java/org/apache/iceberg/rest/HTTPClient.java#L127)
>>> - it seems there's already some precedent for injectable helpers here such
>>> as rest.client.tls.configurer-impl -- custom impls could either be as
>>> simple as using a different set of 5xx non-retriable codes, or as complex
>>> as cooperating with other customized client-side logic to use an end-to-end
>>> idempotency request-uuid to distinguish safe retries
>>> 3. We could consider letting other response headers such as
>>> "Retry-After" (
>>> https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Retry-After)
>>> be a heuristic to interpret the *intent* of the server to communicate a
>>> "graceful unavailability" that is safe to retry, vs an unexpected failure
>>> where state is unknown
>>>
>>> In any case though, it does seem prudent to start with the safe default.
>>>
>>> On Wed, Jul 23, 2025 at 1:22 PM Prashant Singh <[email protected]>
>>> wrote:
>>>
>>>> Hey All,
>>>> Presently we treat 503 status code in spec here <http:///> as if
>>>> the request was not tried at all. While doing 1.9.2 it was brought up that
>>>> this exception might not be true and some of the very common services like
>>>> Istio and Envoy were brought up as examples.
>>>>
>>>> As a result I wanted to formally bring up disabling retries on 503 on
>>>> updateTable and treating that as CommitStateUnknown similar to what we do
>>>> for 500, 502, 504. The rationale for this being 503 doesn't necessarily
>>>> mean the request was not tried at all and systems like ENVOY can return 503
>>>> in the middle of processing.
>>>>
>>>> Adding detailed analysis of my colleague Dennis here :
>>>> https://lists.apache.org/thread/h8xowcntomh7r5woz24cv27v3o8r9wgd
>>>>
>>>> It seemed like most people participating in the thread above agreed,
>>>> starting this thread to see if there are objections to it in a wider forum.
>>>>
>>>> Meanwhile, I have filed a PR for the same
>>>> [1] https://github.com/apache/iceberg/pull/13619
>>>>
>>>> Please let us know your thoughts and provide your feedback !
>>>>
>>>> Best,
>>>> Prashant Singh
>>>>
>>>>
>>>>
>>>>

Re: [DISCUSS] Mark 503 error code as non-retryable for updateTable

Reply via email to