Re: Dedicated sync for Iceberg Index Support

Péter Váry Wed, 25 Feb 2026 01:05:25 -0800

Dan kindly set up a dedicated public Slack channel (*#indexes*) for the
Secondary Index discussion.
You can find it here: https://apache-iceberg.slack.com/archives/C0AFDSU3EUU
Feel free to join if you’d like to participate in the discussion or simply
follow along.


Thanks,
Peter

Péter Váry <[email protected]> wrote (on Tue, Feb 24, 2026, 12:52):

> We had an extended discussion on Slack with Dan, Steven, and Yufei about
> where index metadata should live: in particular, whether it should be
> stored directly in the table metadata or maintained in a dedicated index
> catalog. I tried to capture this discussion in the Layout
> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4oz3yd6ngr3>
> section of the document.
>
> Once the decision is made, this section can be shortened, but for now it
> is intentionally more detailed so that everyone can see the arguments that
> were discussed and so that those who could not participate synchronously
> can still follow and provide feedback offline.
>
> In short, we are currently *leaning toward storing index metadata in its
> own catalog*, while allowing REST catalogs to expose a composite endpoint
> that returns both table and index metadata in a single round trip. This is
> similar in spirit to the universal load endpoint discussed in the context
> of materialized view loading.
>
> Thanks,
> Peter
>
> Péter Váry <[email protected]> wrote (on Thu, Feb 19, 2026, 14:06):
>
>> Thanks Huaxin for posting the recording and the meeting notes.
>>
>> I used this time to also address the questions collected during the sync:
>>
>>    - Collected some representative use cases. See the example use-cases
>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d>
>>    paragraph. Anyone should feel free to suggest their own.
>>    - Collected my thoughts about the writer requirements. See the writer
>>    requirements
>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1>
>>    paragraph.
>>    - Centralized the index maintenance related parts. See the index
>>    maintenance
>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q>
>>    paragraph.
>>
>> It might be a bit premature, but I created a PR
>> <https://github.com/apache/iceberg/pull/15101> with the proposed index
>> catalog related changes, so those who are more code-oriented can take
>> a look at it too.
>>
>> huaxin gao <[email protected]> wrote (on Thu, Feb 19, 2026, 5:34):
>>
>>> Hi Everyone,
>>>
>>> Here are the recording and notes from the Iceberg Index Support Sync on
>>> 2/11.
>>>
>>> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>
>>> Notes:
>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>
>>> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>>>
>>> Since the sync, I updated the Bloom skipping index proposal
>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>> to address the discussion questions, specifically:
>>>
>>>
>>>    - Performance justification: when this helps (high-cardinality = /
>>>    IN, many data files, high object-store latency) and how it differs from
>>>    Parquet row-group Bloom filters (which still require opening the data
>>>    file).
>>>    - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
>>>    file size), the planning cost trade-off (driver index reads vs executor
>>>    file opens), and mitigations via caching.
>>>    - Lifecycle / maintenance: incremental production as new data files
>>>    arrive, behavior when the index is missing/behind, and
>>>    sharding/compaction plus cleanup to avoid accumulating too many small
>>>    Puffin files over time.
>>>    - Writer expectations: inline (optional) vs asynchronous (primary)
>>>    index creation.
>>>
>>> I also implemented a Spark 4.1 POC
>>> <https://github.com/apache/iceberg/pull/15311> and a local benchmark to
>>> quantify both the pruning impact (plannedFiles → afterBloom) and the index
>>> read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point
>>> predicates on high-cardinality columns. Please take a look and let me know
>>> if you have any questions or feedback.
>>>
>>> Thanks,
>>>
>>> Huaxin
>>>
>>> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]>
>>> wrote:
>>>
>>>> Reminder for tomorrow's sync on Iceberg Index Support.
>>>>
>>>> Wednesday: Feb. 11 9:00 – 10:00am
>>>> Time zone: America/Los_Angeles
>>>> Google Meet joining info
>>>> Video call link: meet.google.com/nsp-ctyr-khk
>>>> Design docs:
>>>>
>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>
>>>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>
>>>> Thanks,
>>>> Huaxin
>>>>
>>>>
>>>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Huaxin and Steven for organizing this. Looking forward to
>>>>> meeting you all next week!
>>>>>
>>>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>
>>>>>> We set up the dev calendar event with a new Google Meet link. Please
>>>>>> ignore the link from Huaxin's original email.
>>>>>>
>>>>>> The dev calendar has the correct info (including the new meeting
>>>>>> link).
>>>>>>
>>>>>> Iceberg Index Support Sync
>>>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>> Time zone: America/Los_Angeles
>>>>>> Google Meet joining info
>>>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>
>>>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry, I meant PST (not EST) :)
>>>>>>> Looking forward to the discussion!
>>>>>>>
>>>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Huaxin,
>>>>>>>>
>>>>>>>> Thanks for starting the sync!
>>>>>>>>
>>>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>> not EST. Maybe it's a typo?
>>>>>>>> Otherwise, looking forward to the discussion!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
>>>>>>>>> support. Here is the existing discussion thread:
>>>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>>>
>>>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>>
>>>>>>>>>    - Peter's proposal
>>>>>>>>>    <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>>>    (overall index support)
>>>>>>>>>    - My proposal
>>>>>>>>>    <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>>>    (bloom filter skipping index)
>>>>>>>>>
>>>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST,
>>>>>>>>> starting next Wednesday (2/11). After FileFormat sync finishes, we
>>>>>>>>> plan to use that slot and switch to every other Monday, 9 AM to 10 AM
>>>>>>>>> EST.
>>>>>>>>>
>>>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Huaxin
>>>>>>>>>
>>>>>>>>
