Re: [DISCUSS] Refreshing storage credentials for staged table creation

Maninder Parmar Thu, 26 Mar 2026 15:19:28 -0700

Hello everyone!

I have completed the POC for storage credential refresh, as discussed in
the proposal
<https://docs.google.com/document/d/1R1K6X7qYqvIFkPG3m1neV5Mvy8rwWJvhSFr8DgJgQ-E/edit?tab=t.0>
during
the community syncs. Please review and provide feedback on PR #15280
<https://github.com/apache/iceberg/pull/15280/changes>. The POC is scoped
to S3 storage for now, but it covers all the critical aspects of
credential refresh, including:
1. Interface/class changes required to accommodate the storage refresh token
and to validate key flows (credential refresh, loadTable, and
serialization across different boundaries).
2. Test primitives such as InMemoryRestCatalog that can support staged table
creation with credential refresh, plus E2E validation with Spark.
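
For readers who have not opened the PR, here is a rough sketch of the
refresh flow described in the requirements quoted further down this thread
(per-credential granularity, no server-side state). It is not code from the
PR. The refresh_token field, the "expires-at-ms" property name, and the
refresh callback (a stand-in for a call to the loadCredentials endpoint) are
all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Placeholder property name, assumed for this sketch only.
EXPIRES_KEY = "expires-at-ms"


@dataclass(frozen=True)
class StorageCredential:
    prefix: str                           # storage location prefix the credential covers
    config: Dict[str, str]                # credential properties, including expiry
    refresh_token: Optional[str] = None   # hypothetical opaque token issued by the catalog


def select_credential(
    creds: List[StorageCredential], location: str
) -> Optional[StorageCredential]:
    """Return the credential with the longest prefix matching the location."""
    matches = [c for c in creds if location.startswith(c.prefix)]
    return max(matches, key=lambda c: len(c.prefix), default=None)


def refresh_if_expiring(
    cred: StorageCredential,
    now_ms: int,
    refresh: Callable[[StorageCredential], StorageCredential],
    margin_ms: int = 300_000,
) -> StorageCredential:
    """Refresh a single credential if it expires within margin_ms.

    Each credential is handled on its own, so one expiring prefix does not
    force a refresh of the others. The client only passes the token back;
    it keeps no session with the server.
    """
    expires_ms = int(cred.config.get(EXPIRES_KEY, "0"))
    if cred.refresh_token is None or expires_ms - now_ms > margin_ms:
        return cred  # not refreshable, or still valid long enough
    return refresh(cred)
```

A caller would pick the credential for a file path with select_credential
and pass it through refresh_if_expiring before each batch of writes.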


Looking forward to feedback!

Thanks,
Maninder



On Wed, Mar 18, 2026 at 11:35 AM Maninder Parmar <
[email protected]> wrote:

> Thanks for the feedback during the community sync!
>
> To summarize the key discussions and decisions:
>
>    - Omit the requirement to support credential refresh on the loadTable
>    API. The storage credentials should be refreshable only on the
>    loadCredentials endpoint.
>    - The spec will now move towards the prototype phase to ensure
>    downstream implementation risks are minimized. In particular:
>       - Understanding the implications of using the config map within
>       StorageCredentials versus creating a separate typed field
>       - Prototyping will be limited to S3, which is the most complex
>       storage provider implementation and has the required
>       VendedCredentialProvider implementation that should be extended
>    - There was also a discussion about sending fresh storage
>    credentials as part of the commit table API. It is out of scope for this
>    effort, and Daniel Weeks will send a PR for it.
>
>
> On Tue, Mar 17, 2026 at 5:41 PM Maninder Parmar <
> [email protected]> wrote:
>
>> Hello community!
>>
>> I have updated the proposal
>> <https://docs.google.com/document/d/1R1K6X7qYqvIFkPG3m1neV5Mvy8rwWJvhSFr8DgJgQ-E/edit?tab=t.0>
>>  and
>> the PR #15280 <https://github.com/apache/iceberg/pull/15280> based on
>> the feedback during the last catalog community sync. This ensures that all
>> the requirements surfaced so far are being handled in the new proposal.
>> Please take some time to provide feedback.
>>
>> To summarize in the thread, the key requirements for the proposal to
>> satisfy are the following:
>>
>>    - *Generalizability*: The storage refresh mechanism should NOT be a
>>    staging-only concept; instead it should integrate with the existing
>>    StorageCredential mechanism and be reusable across any credential vending
>>    scenario (staged tables, committed tables, scan planning, etc.).
>>    - *Works without loadCredentials API support*: Not all catalog
>>    implementations support the loadCredentials endpoint. The storage refresh
>>    mechanism must work with the existing loadTable endpoint.
>>    - *No server-side state requirement*: The spec should not mandate
>>    maintaining server-side state.
>>    - *Per-credential refresh granularity*: Each storage credential
>>    should be independently refreshable. StorageCredentials allows specifying
>>    a set of locations, and each of them should be refreshable independently.
>>
>>
>> Thanks,
>> Maninder
>>
>> On Wed, Feb 25, 2026 at 5:33 PM Maninder Parmar <
>> [email protected]> wrote:
>>
>>> Hi community,
>>>
>>> Thanks for the input during the catalog sync! I want to summarize the
>>> decisions and direction that were agreed on during the sync.
>>>
>>> *Direction*
>>> - We'll introduce a storage-refresh-token concept that integrates with
>>> the existing StorageCredential mechanism rather than being a
>>> staging-specific construct. This keeps the design reusable across different
>>> APIs going forward.
>>> - We agreed not to model this after the planId-based credential vending
>>> used in scan planning. The community is open to refactoring planId
>>> credential refresh to use the storage credential refresh token pattern in
>>> the future.
>>>
>>> *Discarded approaches*
>>> 1. table-uuid as the identifier - overloads a spec-level identifier for
>>> a purpose it wasn't designed for
>>> 2. Server-side state / sessions - adds operational complexity and some
>>> existing catalog implementations assume stateless staged table creation
>>> 3. Overloading OAuth scopes - conflates storage credential refresh with
>>> the OAuth layer
>>>
>>> I will share an updated design doc and spec PR reflecting this direction.
>>>
>>> On Tue, Feb 10, 2026 at 11:14 AM Maninder Parmar <
>>> [email protected]> wrote:
>>>
>>>> Thanks for reviewing the proposal Huaxin!
>>>>
>>>> *"Since stagingSession is in the URL and may show up in logs, should it
>>>> be treated as a secret token (hard to guess, short expiry)?"*
>>>> No, stagingSession is not a secret; it is just an identifier for the
>>>> session. It is up to the catalog server implementation to decide whether
>>>> only the user who was issued the stagingSession, or any user with the
>>>> stagingSession, may call commit on the table. It can use existing
>>>> authentication mechanisms to enforce those constraints.
>>>>
>>>> *"If it leaks, can someone else use it, or is it restricted to the same
>>>> user/job that created the staged table?"*
>>>> Since it's not a secret but merely an identifier (just like planId),
>>>> a leak should not pose a risk. It's up to the catalog server
>>>> implementation whether to restrict use to the same user/job.
>>>>
>>>>
>>>> *"What happens if a CTAS job crashes or is cancelled after staging?
>>>> Does the stagingSession expire automatically, and is there a way to clean
>>>> up/abort the staged create?"*
>>>> The lifecycle of a stagingSession is up to the catalog server. Multiple
>>>> strategies could be used here, such as automatically expiring the session
>>>> after a few hours if no updateTable call was made for it, or expiring
>>>> active sessions when one of them is committed.
>>>> No additional API surface would be exposed to clients to manage the
>>>> session lifecycle; it is the responsibility of the catalog server.
>>>>
>>>> Let me know if you have follow up questions.
>>>>
>>>>
>>>> On Mon, Feb 9, 2026 at 7:07 PM huaxin gao <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Maninder,
>>>>>
>>>>> Thanks for the proposal! It sounds like a good direction to me.
>>>>> Returning a stagingSession from stage-create and then reusing it for
>>>>> loadCredentials/loadTable feels consistent with the existing planId
>>>>> pattern, and it fixes a real CTAS problem.
>>>>>
>>>>> A few questions:
>>>>>
>>>>> Since stagingSession is in the URL and may show up in logs, should it
>>>>> be treated as a secret token (hard to guess, short expiry)?
>>>>>
>>>>> If it leaks, can someone else use it, or is it restricted to the same
>>>>> user/job that created the staged table?
>>>>>
>>>>> What happens if a CTAS job crashes or is cancelled after staging? Does
>>>>> the stagingSession expire automatically, and is there a way to clean
>>>>> up/abort the staged create?
>>>>>
>>>>> Would love to hear your thoughts on these.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Huaxin
>>>>>
>>>>> On Mon, Feb 9, 2026 at 4:30 PM Maninder Parmar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello Iceberg community!
>>>>>>
>>>>>> I wanted to discuss the proposal for refreshing storage credentials
>>>>>> for staged table creation. Iceberg tables can be created either via a
>>>>>> single-step creation flow or a two-step staged creation flow, which is
>>>>>> used for implementing CTAS (CREATE TABLE AS SELECT) statements.
>>>>>> Currently, it's not possible to refresh the credentials for staged
>>>>>> tables, since they are not committed to the catalog and hence not
>>>>>> visible to the loadTable or credential endpoints.
>>>>>> There has been prior discussion
>>>>>> <https://lists.apache.org/thread/q5n355d89nxbhywtlv3qhq7dchbyb67d> where
>>>>>> community members have expressed the need to support this scenario.
>>>>>>
>>>>>> I have started a proposal
>>>>>> <https://docs.google.com/document/d/1R1K6X7qYqvIFkPG3m1neV5Mvy8rwWJvhSFr8DgJgQ-E/edit?tab=t.0>
>>>>>> to flesh out the details of supporting this scenario, building on the
>>>>>> precedent of credential vending support for scan planning.
>>>>>> The OpenAPI changes can be seen in PR #15280
>>>>>> <https://github.com/apache/iceberg/pull/15280>.
>>>>>>
>>>>>> Looking forward to your feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> Maninder
>>>>>>
>>>>>>  Proposal: Credential Refresh for Staged Table Creation
>>>>>> <https://drive.google.com/open?id=1R1K6X7qYqvIFkPG3m1neV5Mvy8rwWJvhSFr8DgJgQ-E>
>>>>>>
>>>>>
