Re: [DISCUSS] Secondary Indexes (Phase 1): Bloom filter skipping index (Puffin, snapshot-scoped)

Péter Váry Tue, 13 Jan 2026 05:35:15 -0800

Hi Vaibhav,

We currently have the generic Index proposal, which outlines how Iceberg
indexes can be stored and accessed by query engines. It defines the
structure and handling of index metadata. To properly validate the design,
we are proposing to implement a few concrete index types. This will help us
identify gaps and refine the overall approach.


In the documents, we outlined four initial index types:

   - Bloom filter index – covered in Huaxin’s document
   - B‑Tree index – backed by a materialized view
   - Full‑text index – backed by a materialized view
   - IVF index – backed by a materialized view

The advantage of these index types is that many of their underlying
components already exist, or are covered by other ongoing proposals. This
means we can implement them—and even use them—with relatively low effort.

Additional index types can be introduced later by the community. Once the
index metadata model is in place, adding new index implementations becomes
straightforward.

We don’t yet have exact timelines for the roadmap. Our first step is to
build community consensus around the proposal; implementation can begin
once we have alignment.

I hope this clarifies things. If you have any further questions, please let
me know.

Thanks,
Peter

On Tue, Jan 13, 2026 at 12:23, Vaibhav Kumar <[email protected]> wrote:

> Hi Peter/Huaxin,
>
> This is a very interesting topic—thank you for sharing all the
> documentation. I have a few questions I hope you can clarify:
>
> Does this mean that the three types of indexes—B-Tree, Full-Text, and
> IVF—can all be addressed through the use of materialized views? Or are
> there scenarios where dedicated index structures are still necessary? (I am
> referring to this doc:
> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0>)
>
> I’m also interested in the current roadmap for secondary indexes. Are
> there any concrete plans or timelines for their introduction in upcoming
> releases? Additionally, is there a draft or active pull request for this
> feature? I am happy to collaborate on this topic.
>
> Thank you in advance for your insights!
>
> Regards,
> Vaibhav
>
>
> On Tue, Jan 13, 2026 at 6:43 AM huaxin gao <[email protected]> wrote:
>
>> Hi Peter,
>>
>> Thanks for the clarification. I will align the secondary index proposal
>> accordingly.
>>
>> Looking forward to the collaboration!
>>
>> Best,
>> Huaxin
>>
>> On Mon, Jan 12, 2026 at 2:54 AM Péter Váry <[email protected]>
>> wrote:
>>
>>> Cool!
>>> Happy to collaborate on this!
>>>
>>> > keep only minimal snapshot references in table metadata and move the
>>> > richer index definition and lifecycle into catalog‑managed index metadata
>>> > exposed via the REST APIs.
>>>
>>> In my second iteration, I moved the snapshot references into the index
>>> metadata [1]. This allows the query engine to fetch indexes in parallel
>>> with the table metadata using *catalog.listIndexes*, where each
>>> returned *BaseIndex* already includes the available table snapshots.
>>> With that information, the engine can immediately determine whether a
>>> given index is applicable for the query by checking the index type, index
>>> columns, and the associated table snapshots.
>>> If the engine decides to use a particular index, it can then retrieve
>>> the corresponding DetailedIndex, which contains all additional details
>>> required by the engine.
>>> For Bloom filter indexes specifically, the *IndexSnapshots* could store
>>> the correct Puffin file path for each table snapshot in their snapshot
>>> properties.
>>>
>>> [1] - Iceberg indexes / Index Metadata / Snapshot -
>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy
>>>
>>> On Mon, Jan 12, 2026 at 2:27, huaxin gao <[email protected]> wrote:
>>>
>>>> Hi Peter,
>>>>
>>>>
>>>> Thanks a lot for sharing the proposal in [1] and for the detailed
>>>> design. The catalog‑managed index framework there looks like a better
>>>> long‑term direction than keeping full index definitions in table metadata.
>>>>
>>>>
>>>> The current Bloom‑filter draft describes indexes in table metadata so
>>>> planners can discover them during planning and map table snapshots to
>>>> Puffin files with Bloom filters, but that wiring can be changed easily to
>>>> the catalog‑based model in [1]: keep only minimal snapshot references in
>>>> table metadata and move the richer index definition and lifecycle into
>>>> catalog‑managed index metadata exposed via the REST APIs. In that model,
>>>> the Bloom‑filter file‑skipping index would be one concrete `IndexType`
>>>> whose data lives in Puffin files, with engines discovering and loading it
>>>> through the catalog (`listIndexes`, `loadIndex`, etc.).
>>>>
>>>>
>>>> I agree that the Bloom‑filter index would be an excellent candidate and
>>>> a very good fit as the first index type to implement in this framework.
>>>> I will update the proposal to follow the catalog‑based approach.
>>>>
>>>>
>>>> Best,
>>>>
>>>> Huaxin
>>>>
>>>> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Huaxin,
>>>>>
>>>>> This is a very interesting topic. We’re also working on an index
>>>>> proposal [1] that aligns closely with yours in many areas. In an earlier
>>>>> iteration, I considered adding index metadata directly to the table
>>>>> metadata as well. After some back-and-forth, we ultimately moved to a
>>>>> different approach, where the catalog exposes an API to fetch the indexes
>>>>> for a given table.
>>>>>
>>>>> This has several advantages—for example, it avoids increasing the size
>>>>> of the table metadata and is more consistent with existing practices where
>>>>> UDFs, views, and materialized views each have their own specifications and
>>>>> metadata.
>>>>>
>>>>> After reading your proposal, I think the bloom filter index would be
>>>>> an excellent candidate and a very good fit as a first index type to
>>>>> implement, helping us evaluate the viability of the metadata approach.
>>>>>
>>>>> Please take a look and let me know what you think.
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> [1] -
>>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>>>>
>>>>>
>>>>> On Thu, Jan 8, 2026 at 17:27, huaxin gao <[email protected]> wrote:
>>>>>
>>>>>> Hi Iceberg community,
>>>>>>
>>>>>> I’d like to request feedback on a proposal
>>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>>>>> to introduce secondary indexes to Apache Iceberg with a narrow,
>>>>>> incremental scope.
>>>>>>
>>>>>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>>>>>> stored in Puffin and referenced from table metadata so query engines can
>>>>>> use them during planning to prune data files. Indexes are advisory-only
>>>>>> and snapshot-scoped. The proposal is fully backward compatible: engines
>>>>>> that don’t understand the new metadata fields ignore them.
>>>>>>
>>>>>> I’d appreciate any feedback, questions, or concerns on the overall
>>>>>> direction and design.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Huaxin
>>>>>>
>>>>>
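
As an illustration of the file-skipping idea in the original proposal, the
sketch below shows how a per-column Bloom filter could let a planner drop data
files that definitely do not contain a value. It is not the Puffin blob format
or the proposal's implementation: the `BloomFilter` class, the SHA-256-based
hashing, the filter sizes, and `prune_files` are all assumptions made for
illustration. Files without a filter are kept, which mirrors the advisory-only
behavior described in the proposal.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class BloomFilter:
    """Toy Bloom filter (hypothetical; not the Puffin on-disk layout)."""
    num_bits: int = 1024
    num_hashes: int = 3
    bits: int = 0  # bitset packed into a single Python int

    def _positions(self, value: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def prune_files(filters: Dict[str, Optional[BloomFilter]], value: str) -> List[str]:
    """Return the data files that still need to be scanned for `value`.

    A file with no filter (None) is always kept: the index is advisory,
    so a missing filter must never cause a file to be skipped.
    """
    return [
        path
        for path, bloom in filters.items()
        if bloom is None or bloom.might_contain(value)
    ]
```

A Bloom filter can report false positives but never false negatives, so this
kind of pruning may scan a file unnecessarily but should not skip a file that
actually holds the value.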

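The discovery flow discussed in this thread (list the indexes for a table,
check index type, columns, and covered table snapshots, then load the detailed
index and read the Puffin path from its snapshot properties) can be sketched
roughly as follows. This is a minimal in-memory sketch under stated
assumptions, not the proposed API: the class shapes, the Python method names
`list_indexes` and `load_index` (standing in for the `listIndexes` and
`loadIndex` calls mentioned above), the `"bloom-filter"` type string, and the
`"puffin-path"` property key are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BaseIndex:
    """Lightweight listing entry (fields are assumptions)."""
    name: str
    index_type: str
    columns: List[str]
    table_snapshot_ids: List[int]  # table snapshots this index covers


@dataclass
class DetailedIndex:
    """Full index metadata, fetched only once an index is selected."""
    base: BaseIndex
    # table snapshot id -> properties of the matching index snapshot
    snapshot_properties: Dict[int, Dict[str, str]] = field(default_factory=dict)


class InMemoryCatalog:
    """Stand-in for a catalog; a real one would serve these over REST."""

    def __init__(self, indexes_by_table: Dict[str, List[DetailedIndex]]):
        self._indexes = indexes_by_table

    def list_indexes(self, table: str) -> List[BaseIndex]:
        return [d.base for d in self._indexes.get(table, [])]

    def load_index(self, table: str, name: str) -> DetailedIndex:
        for detailed in self._indexes.get(table, []):
            if detailed.base.name == name:
                return detailed
        raise KeyError(name)


def find_puffin_path(catalog: InMemoryCatalog, table: str, column: str,
                     snapshot_id: int) -> Optional[str]:
    """Return the Puffin path of an applicable Bloom filter index, if any."""
    for base in catalog.list_indexes(table):
        applicable = (
            base.index_type == "bloom-filter"      # hypothetical type name
            and column in base.columns
            and snapshot_id in base.table_snapshot_ids
        )
        if not applicable:
            continue
        # Only now pay for the detailed metadata lookup.
        detailed = catalog.load_index(table, base.name)
        props = detailed.snapshot_properties.get(snapshot_id, {})
        return props.get("puffin-path")            # hypothetical property key
    return None
```

For example, with one Bloom filter index on column `user_id` covering table
snapshot 10, `find_puffin_path(catalog, "db.events", "user_id", 10)` returns
the stored path, while a lookup for an uncovered snapshot or a different
column returns `None` and the engine falls back to normal planning.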