Are we confident that the REST client fix is the right path?

If I understand correctly, the issue is that there is a load balancer that
is converting a 500 response to a 503 that the client will automatically
retry, even though the 500 response is not retryable. Then the retry
results in a 409 because the commit conflicts with itself.

I'm fairly concerned about a fix like this that is operating on the
assumption that we cannot trust the 5xx error code that was sent to the
client. If we don't trust 5xx error codes, then wouldn't it make the most
sense to throw CommitStateUnknown for any 5xx response?
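
To make sure we are talking about the same failure mode, here is a toy
sketch in Python of how I understand it. Everything in it is hypothetical
and for illustration only: FakeCatalog, load_balancer, and both commit
helpers are made-up names, not the Iceberg REST client, and the exception
classes are stand-ins for the real ones. It assumes the server applies the
commit and then fails with a 500, the load balancer rewrites that to a 503,
and the client retries on 503.

```python
class CommitStateUnknown(Exception):
    """Hypothetical stand-in: the commit may or may not have been applied."""


class CommitFailed(Exception):
    """Hypothetical stand-in: the commit definitely failed (409 conflict)."""


class FakeCatalog:
    """Toy server that applies a commit, then fails while responding."""

    def __init__(self):
        self.version = 0

    def commit(self, base_version):
        if base_version != self.version:
            return 409  # base is stale: conflict
        self.version += 1
        return 500  # commit was applied, but the response is an error


def load_balancer(status):
    # Assumed behavior from this thread: a 500 is rewritten to a 503.
    return 503 if status == 500 else status


def commit_with_retry(catalog, base_version, max_attempts=3):
    """Current behavior as I understand it: retry automatically on 503."""
    for _ in range(max_attempts):
        status = load_balancer(catalog.commit(base_version))
        if status == 503:
            continue  # retried, although the first attempt already applied
        if status == 409:
            raise CommitFailed("conflict; caller may clean up 'failed' commit")
        if 500 <= status < 600:
            raise CommitStateUnknown(f"server returned {status}")
        return status
    raise CommitStateUnknown("retries exhausted")


def commit_any_5xx_unknown(catalog, base_version):
    """Alternative raised above: never trust a 5xx on a commit."""
    status = load_balancer(catalog.commit(base_version))
    if 500 <= status < 600:
        raise CommitStateUnknown(f"server returned {status}")
    if status == 409:
        raise CommitFailed("conflict")
    return status
```

With commit_with_retry, the second attempt conflicts with the first and
surfaces as a definite failure even though the commit went through. With
commit_any_5xx_unknown, the same sequence surfaces as an unknown state, so
the caller has no reason to delete anything.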

On Tue, Jun 17, 2025 at 1:00 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Sorry I didn't get to reply here. I think the fix Ajantha is contributing
> is extremely important but probably out of scope for a patch release.
> Because it will take a bit of manual intervention to fix after jumping to
> the next version, I think we should save this for 1.10.0, which should also
> come out pretty soon.
>
> On Tue, Jun 17, 2025 at 1:16 PM Prashant Singh <prashant010...@gmail.com>
> wrote:
>
>> Thank you for confirming over Slack, Ajantha. I also double-checked with
>> Russell offline, and this seems to be a behaviour change that can't go in a
>> patch release, so maybe 1.10 then. I think we should be good for now.
>> That being said, I will start working on getting an RC out with just this
>> commit
>> <https://github.com/singhpk234/iceberg/commit/14ba08e672df5061f8ac0978ea81f10535da2246>
>> on top of 1.9.1, and will start a thread for that accordingly!
>>
>> Best,
>> Prashant Singh
>>
>> On Tue, Jun 17, 2025 at 8:44 AM Prashant Singh <prashant010...@gmail.com>
>> wrote:
>>
>>> Hey Ajantha,
>>>
>>> I was going to wait for 2-3 days before cutting an RC anyway :), to see
>>> if anyone has an objection or some more *critical* changes to get in.
>>> Thank you for the heads up!
>>>
>>> Best,
>>> Prashant Singh
>>>
>>>
>>>
>>>
>>> On Mon, Jun 16, 2025 at 11:02 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> I have a PR that needs to be included for 1.9.2 as well!
>>>>
>>>> I was about to start a release discussion for 1.9.2 tomorrow.
>>>>
>>>> It is from an oversight during partition stats refactoring (in 1.9.0).
>>>> We missed that field ids are tracked in spec during refactoring! Because of
>>>> it, schema field ids are not as per partition stats spec. Solution is to
>>>> implicitly recompute stats when old stats are detected with invalid field
>>>> id. Including the fix in 1.9.2 can reduce the issue impact as new stats
>>>> will have valid schema ids and also handle corrupted stats. Feature is not
>>>> widely integrated yet and engine integrations are in progress for 1.10.0.
>>>> But we still want to reduce the impact of the issue. I have chatted with
>>>> Peter and Russell about this.
>>>>
>>>> PR will be shared tomorrow.
>>>>
>>>> Just giving heads up to wait a day or two for a release cut.
>>>>
>>>> - Ajantha
>>>>
>>>> On Mon, Jun 16, 2025 at 11:25 PM Steven Wu <stevenz...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 for a 1.9.2 release
>>>>>
>>>>> On Mon, Jun 16, 2025 at 10:53 AM Prashant Singh <
>>>>> prashant010...@gmail.com> wrote:
>>>>>
>>>>>> Hey Kevin,
>>>>>> This goes back well before 1.8. The issue that my PR refers to was
>>>>>> reported against Iceberg 1.7, and it has been there since the
>>>>>> beginning of the IRC client.
>>>>>> We had similar debates on whether we should patch all the
>>>>>> releases, but I think that requires wider discussion; 1.9.2
>>>>>> sounds like the immediate next step!
>>>>>>
>>>>>> Best,
>>>>>> Prashant Singh
>>>>>>
>>>>>> On Mon, Jun 16, 2025 at 10:42 AM Kevin Liu <kevinjq...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Prashant,
>>>>>>>
>>>>>>> This sounds like a good reason to do a patch release. I'm +1
>>>>>>>
>>>>>>> Do you know if this is also affecting other minor versions? Do we
>>>>>>> also need to patch 1.8.x?
>>>>>>>
>>>>>>> Best,
>>>>>>> Kevin Liu
>>>>>>>
>>>>>>> On Mon, Jun 16, 2025 at 10:30 AM Prashant Singh <
>>>>>>> prashant010...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey all,
>>>>>>>>
>>>>>>>> A couple of users have recently reported a *critical table
>>>>>>>> corruption issue* when using the Iceberg REST catalog client. This
>>>>>>>> issue has existed from the beginning and can result in *committed
>>>>>>>> snapshot data being incorrectly deleted*, effectively rendering
>>>>>>>> the table unusable (i.e., corrupted). The root cause is *self-conflict
>>>>>>>> during internal HTTP client retries*.
>>>>>>>>
>>>>>>>> More context can be found here:
>>>>>>>> [1] https://github.com/apache/iceberg/pull/12818#issue-3000325526
>>>>>>>> [2]
>>>>>>>> https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1747992294134219
>>>>>>>>
>>>>>>>> We’ve seen this issue hit by users in the wider community and also
>>>>>>>> internally on our end. I’ve submitted a fix here (now merged) :
>>>>>>>> https://github.com/apache/iceberg/pull/12818
>>>>>>>>
>>>>>>>> Given the *severity* and the *broad version impact*, I’d like to
>>>>>>>> propose cutting a *1.9.2 patch release* to include this fix.
>>>>>>>>
>>>>>>>> I'm also happy to *volunteer as the release manager* for this
>>>>>>>> release.
>>>>>>>>
>>>>>>>> Looking forward to hearing the community's thoughts.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Prashant Singh
>>>>>>>>
>>>>>>>
