I'm in favor of marking 503 as non-retryable for updateTable to avoid possible data corruption. Even though 503 can sometimes mean throttling, I think treating it as CommitStateUnknown is the safer choice for commit operations. We can look into better handling for throttling cases, like using 429, in the future.
On Tue, Aug 5, 2025 at 11:00 AM Steve Loughran <ste...@cloudera.com.invalid> wrote: > 503 is throttling on many (but not all) of the AWS services (s3, lambda, > redshift..) - though not dynamodb or others that return 400 + some message. > > If aws were to deploy an endpoint, they'd have to decide what to do. or > even better, the spec could say "there is a specific error code for > throttling, "429: Too Many Requests" to be returned in such a situation. > > On Sat, 2 Aug 2025 at 03:32, Dennis Huo <huoi...@gmail.com> wrote: > >> PR LGTM, thanks for following up on this! >> >> Reiterating my comments in the PR, IMO especially now that Russell's >> https://github.com/apache/iceberg/pull/13449 is merged so that all the >> read-only methods can still perform low-level retries, this change makes >> sense as a safer default for commit behaviors. >> >> I could imagine a few potential follow-up features if people want to >> achieve better availability in situations where more out-of-band knowledge >> of specific server behaviors can be used on a case-by-case basis: >> >> 1. If anyone feels very strongly that in a given service environment some >> particular 5xx code really is safely retriable without reconciliation, we >> could consider extracting a runtime config that lists the non-retriable >> HTTP codes instead of only having the hard-coded switch statement coming at >> the cost of some code complexity >> 2. We could make it easier to inject a >> dynamically-classloaded HttpRequestRetryStrategy in the HttpClient ( >> https://github.com/apache/iceberg/blob/a8fdb23682a7c083ea6ff9873f1531dd9d465aa7/core/src/main/java/org/apache/iceberg/rest/HTTPClient.java#L127) >> - it seems there's already some precedent for injectable helpers here such >> as rest.client.tls.configurer-impl -- custom impls could either be as >> simple as using a different set of 5xx non-retriable codes, or as complex >> as cooperating with other customized client-side logic to use an end-to-end >> idempotency request-uuid to distinguish safe retries >> 3. We could consider letting other response headers such as "Retry-After" >> ( >> https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Retry-After) >> be a heuristic to interpret the *intent* of the server to communicate a >> "graceful unavailability" that is safe to retry, vs an unexpected failure >> where state is unknown >> >> In any case though, it does seem prudent to start with the safe default. >> >> On Wed, Jul 23, 2025 at 1:22 PM Prashant Singh <prashant010...@gmail.com> >> wrote: >> >>> Hey All, >>> Presently we treat 503 status code in spec here <http:///> as if >>> the request was not tried at all. While doing 1.9.2 it was brought up that >>> this exception might not be true and some of the very common services like >>> Istio and Envoy were brought up as examples. >>> >>> As a result I wanted to formally bring up disabling retries on 503 on >>> updateTable and treating that as CommitStateUnknown similar to what we do >>> for 500, 502, 504. The rationale for this being 503 doesn't necessarily >>> mean the request was not tried at all and systems like ENVOY can return 503 >>> in the middle of processing. >>> >>> Adding detailed analysis of my colleague Dennis here : >>> https://lists.apache.org/thread/h8xowcntomh7r5woz24cv27v3o8r9wgd >>> >>> It seemed like most people participating in the thread above agreed, >>> starting this thread to see if there are objections to it in a wider forum. >>> >>> Meanwhile, I have filed a PR for the same >>> [1] https://github.com/apache/iceberg/pull/13619 >>> >>> Please let us know your thoughts and provide your feedback ! >>> >>> Best, >>> Prashant Singh >>> >>> >>> >>>