Re: [DISCUSS] Secondary Indexes (Phase 1): Bloom filter skipping index (Puffin, snapshot-scoped)

huaxin gao Mon, 12 Jan 2026 17:13:47 -0800

Hi Peter,

Thanks for the clarification. I will align the secondary index proposal
accordingly.


Looking forward to the collaboration!

Best,
Huaxin

On Mon, Jan 12, 2026 at 2:54 AM Péter Váry <[email protected]>
wrote:

> Cool!
> Happy to collaborate on this!
>
> > keep only minimal snapshot references in table metadata and move the
> > richer index definition and lifecycle into catalog‑managed index metadata
> > exposed via the REST APIs.
>
> In my second iteration, I moved the snapshot references into the index
> metadata [1]. This allows the query engine to fetch indexes in parallel
> with the table metadata using *catalog.listIndexes*, where each returned
> *BaseIndex* already includes the available table snapshots.
> With that information, the engine can immediately determine whether a
> given index is applicable for the query by checking the index type, index
> columns, and the associated table snapshots.
> If the engine decides to use a particular index, it can then retrieve the
> corresponding DetailedIndex, which contains all additional details required
> by the engine.
> For Bloom filter indexes specifically, the *IndexSnapshots* could store
> the correct Puffin file path for each table snapshot in their snapshot
> properties.
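
The lookup described in the quoted paragraph above can be sketched roughly
as follows. This is only an illustration under stated assumptions: the
Python classes `BaseIndex` and `IndexSnapshot`, the function
`find_puffin_path`, the "bloom-filter" type string, and the "puffin-path"
property key are hypothetical names borrowed from the terms used in this
thread, not an existing Iceberg API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical stand-ins for the BaseIndex / IndexSnapshots concepts
# mentioned in the thread; none of these names come from a real library.


@dataclass(frozen=True)
class IndexSnapshot:
    table_snapshot_id: int
    # Assumed: free-form snapshot properties, one of which holds the
    # Puffin file path for this table snapshot.
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class BaseIndex:
    name: str
    index_type: str
    columns: Tuple[str, ...]
    snapshots: Tuple[IndexSnapshot, ...]


def find_puffin_path(
    indexes: Iterable[BaseIndex],
    column: str,
    table_snapshot_id: int,
    index_type: str = "bloom-filter",  # assumed type identifier
    path_key: str = "puffin-path",     # assumed property key
) -> Optional[str]:
    """Return the Puffin path of the first applicable index, or None.

    An index is applicable when its type matches, it covers the column,
    and it has an index snapshot for the requested table snapshot.
    """
    for index in indexes:
        if index.index_type != index_type or column not in index.columns:
            continue
        for snap in index.snapshots:
            if snap.table_snapshot_id == table_snapshot_id:
                path = snap.properties.get(path_key)
                if path:
                    return path
    return None
```

Under these assumptions, the engine would call something like
`find_puffin_path(listed_indexes, "user_id", current_snapshot_id)` on the
result of the index listing, and fetch the detailed index only when a path
is returned. A `None` result means planning continues without the index.
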
>
> [1] - Iceberg indexes / Index Metadata / Snapshot -
> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>
> huaxin gao <[email protected]> wrote (on Mon, Jan 12, 2026, 2:27):
>
>> Hi Peter,
>>
>>
>> Thanks a lot for sharing the proposal in [1] and for the detailed design.
>> The catalog‑managed index framework there looks like a better long‑term
>> direction than keeping full index definitions in table metadata.
>>
>>
>> The current Bloom‑filter draft describes indexes in table metadata so
>> that planners can discover them during planning and map table snapshots
>> to the Puffin files that hold the Bloom filters. That wiring can be
>> switched easily to the catalog‑based model in [1]: keep only minimal
>> snapshot references in table metadata and move the richer index
>> definition and lifecycle into catalog‑managed index metadata exposed via
>> the REST APIs. In that model, the Bloom‑filter file‑skipping index would
>> be one concrete `IndexType` whose data lives in Puffin files, with engines
>> discovering and loading it through the catalog (`listIndexes`,
>> `loadIndex`, etc.).
>>
>>
>> I agree that the Bloom‑filter index would be an excellent candidate and a
>> very good fit as the first index type to implement in this framework. I
>> will update the proposal to follow the catalog‑based approach.
>>
>>
>> Best,
>>
>> Huaxin
>>
>> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
>> wrote:
>>
>>> Hi Huaxin,
>>>
>>> This is a very interesting topic. We’re also working on an index
>>> proposal [1] that aligns closely with yours in many areas. In an earlier
>>> iteration, I considered adding index metadata directly to the table
>>> metadata as well. After some back-and-forth, we ultimately moved to a
>>> different approach, where the catalog exposes an API to fetch the indexes
>>> for a given table.
>>>
>>> This has several advantages—for example, it avoids increasing the size
>>> of the table metadata and is more consistent with existing practices where
>>> UDFs, views, and materialized views each have their own specifications and
>>> metadata.
>>>
>>> After reading your proposal, I think the bloom filter index would be an
>>> excellent candidate and a very good fit as a first index type to implement,
>>> helping us evaluate the viability of the metadata approach.
>>>
>>> Please take a look and let me know what you think.
>>> Thanks,
>>> Peter
>>>
>>> [1] -
>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>>
>>>
>>> huaxin gao <[email protected]> wrote (on Thu, Jan 8, 2026, 17:27):
>>>
>>>> Hi Iceberg community,
>>>>
>>>> I’d like to request feedback on a proposal
>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>>> to introduce secondary indexes to Apache Iceberg with a narrow, incremental
>>>> scope.
>>>>
>>>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>>>> stored in Puffin and referenced from table metadata so query engines can
>>>> use them during planning to prune data files. Indexes are advisory-only and
>>>> snapshot-scoped. The proposal is fully backward compatible: engines that
>>>> don’t understand the new metadata fields ignore them.
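
The file-skipping step described in the quoted paragraph above can be
illustrated with a toy Bloom filter. This is a minimal sketch under stated
assumptions: the `BloomFilter` class, the SHA-256-based hashing, the filter
sizes, and the `prune_files` helper are illustrative choices only. They do
not reflect the Puffin blob layout or the filter encoding in the proposal.

```python
import hashlib
from typing import Dict, Iterator, List, Sequence


class BloomFilter:
    """Toy Bloom filter; sizing and hashing are illustrative assumptions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str) -> Iterator[int]:
        # Derive one bit position per seed from a seeded SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def prune_files(
    data_files: Sequence[str],
    filters: Dict[str, BloomFilter],
    value: str,
) -> List[str]:
    """Keep files that may contain `value` for an equality predicate.

    Files without a filter are always kept, which mirrors the
    advisory-only behavior: a missing index never changes query results.
    """
    kept = []
    for path in data_files:
        bloom = filters.get(path)
        if bloom is None or bloom.might_contain(value):
            kept.append(path)
    return kept
```

Because a Bloom filter has no false negatives, a file that really contains
the value is never pruned. False positives only cost an unnecessary file
read, so the index can stay advisory without affecting correctness.
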
>>>>
>>>> I’d appreciate any feedback, questions, or concerns on the overall
>>>> direction and design.
>>>>
>>>> Best,
>>>>
>>>> Huaxin
>>>>
>>>
