Cool! Happy to collaborate on this!

> keep only minimal snapshot references in table metadata and move the richer index definition and lifecycle into catalog‑managed index metadata exposed via the REST APIs.

In my second iteration, I moved the snapshot references into the index metadata [1]. This allows the query engine to fetch indexes in parallel with the table metadata using *catalog.listIndexes*, where each returned *BaseIndex* already includes the available table snapshots. With that information, the engine can immediately determine whether a given index is applicable to the query by checking the index type, index columns, and the associated table snapshots. If the engine decides to use a particular index, it can then retrieve the corresponding *DetailedIndex*, which contains the additional details the engine needs.

For Bloom filter indexes specifically, the *IndexSnapshots* could store the correct Puffin file path for each table snapshot in their snapshot properties.

[1] - Iceberg indexes / Index Metadata / Snapshot - https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.r3lv3a6k06hy

huaxin gao <[email protected]> wrote (on Mon, Jan 12, 2026, 2:27):

> Hi Peter,
>
> Thanks a lot for sharing the proposal in [1] and for the detailed design.
> The catalog‑managed index framework there looks like a better long‑term
> direction than keeping full index definitions in table metadata.
>
> The current Bloom‑filter draft describes indexes in table metadata so
> planners can discover them during planning and map table snapshots to
> Puffin files with Bloom filters, but that wiring can be changed easily to
> the catalog‑based model in [1]: keep only minimal snapshot references in
> table metadata and move the richer index definition and lifecycle into
> catalog‑managed index metadata exposed via the REST APIs. In that model,
> the Bloom‑filter file‑skipping index would be one concrete `IndexType`
> whose data lives in Puffin files, with engines discovering and loading it
> through the catalog (`listIndexes`, `loadIndex`, etc.).
>
> Agree that the Bloom‑filter index would be an excellent candidate and a
> very good fit as the first index type to implement in this framework, and
> the proposal will be updated to follow the catalog‑based approach.
>
> Best,
>
> Huaxin
>
> On Fri, Jan 9, 2026 at 11:59 AM Péter Váry <[email protected]> wrote:
>
>> Hi Huaxin,
>>
>> This is a very interesting topic. We’re also working on an index proposal
>> [1] that aligns closely with yours in many areas. In an earlier iteration,
>> I considered adding index metadata directly to the table metadata as well.
>> After some back-and-forth, we ultimately moved to a different approach,
>> where the catalog exposes an API to fetch the indexes for a given table.
>>
>> This has several advantages—for example, it avoids increasing the size of
>> the table metadata and is more consistent with existing practices where
>> UDFs, views, and materialized views each have their own specifications and
>> metadata.
>>
>> After reading your proposal, I think the bloom filter index would be an
>> excellent candidate and a very good fit as a first index type to implement,
>> helping us evaluate the viability of the metadata approach.
>>
>> Please take a look and let me know what you think.
>> Thanks,
>> Peter
>>
>> [1] -
>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0
>>
>> huaxin gao <[email protected]> wrote (on Thu, Jan 8, 2026, 17:27):
>>
>>> Hi Iceberg community,
>>>
>>> I’d like to request feedback on a proposal
>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0>
>>> to introduce secondary indexes to Apache Iceberg with a narrow, incremental
>>> scope.
>>>
>>> Phase 1 adds file-skipping indexes based on per-column Bloom filters,
>>> stored in Puffin and referenced from table metadata so query engines can
>>> use them during planning to prune data files. Indexes are advisory-only and
>>> snapshot-scoped.
>>> The proposal is fully backward compatible: engines that
>>> don’t understand the new metadata fields ignore them.
>>>
>>> I’d appreciate any feedback, questions, or concerns on the overall
>>> direction and design.
>>>
>>> Best,
>>>
>>> Huaxin
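
For illustration only, the lookup flow described at the top of this message (list the indexes, check type, columns and table snapshot, then load the detailed metadata and read the Puffin path from the snapshot properties) could look roughly like the sketch below. It is not taken from the proposal: the in-memory catalog, the field names, the `"bloom-filter"` type string, the `"puffin-path"` property key and the example path are all assumptions made up for this example, and `list_indexes` / `load_index` are Python stand-ins for the `listIndexes` / `loadIndex` calls mentioned in the thread.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseIndex:
    """Hypothetical lightweight entry, as returned when listing indexes."""
    name: str
    index_type: str                      # assumed type string, e.g. "bloom-filter"
    columns: tuple[str, ...]             # indexed columns
    table_snapshot_ids: frozenset[int]   # table snapshots the index covers


@dataclass(frozen=True)
class DetailedIndex:
    """Hypothetical full metadata, loaded only for an index the engine picks."""
    base: BaseIndex
    # table snapshot id -> properties of the matching index snapshot
    snapshot_properties: dict[int, dict[str, str]]


class InMemoryIndexCatalog:
    """Toy stand-in for a catalog that exposes listIndexes / loadIndex."""

    def __init__(self, indexes_by_table: dict[str, list[DetailedIndex]]):
        self._by_table = indexes_by_table

    def list_indexes(self, table: str) -> list[BaseIndex]:
        return [d.base for d in self._by_table.get(table, [])]

    def load_index(self, table: str, name: str) -> DetailedIndex:
        for detailed in self._by_table.get(table, []):
            if detailed.base.name == name:
                return detailed
        raise KeyError(f"no index {name!r} on table {table!r}")


def find_puffin_path(catalog: InMemoryIndexCatalog, table: str,
                     snapshot_id: int, column: str) -> Optional[str]:
    """Return the Puffin path of an applicable Bloom filter index, if any."""
    for idx in catalog.list_indexes(table):
        # Applicability is decided from the lightweight entry alone.
        applicable = (
            idx.index_type == "bloom-filter"
            and column in idx.columns
            and snapshot_id in idx.table_snapshot_ids
        )
        if not applicable:
            continue
        # Only now pay for the detailed metadata.
        detailed = catalog.load_index(table, idx.name)
        path = detailed.snapshot_properties.get(snapshot_id, {}).get("puffin-path")
        if path is not None:
            return path
    return None


# Made-up example data: one Bloom filter index on `user_id` covering snapshot 42.
bloom = BaseIndex("user_id_bloom", "bloom-filter", ("user_id",), frozenset({42}))
catalog = InMemoryIndexCatalog({
    "db.events": [
        DetailedIndex(
            bloom,
            {42: {"puffin-path": "s3://example-bucket/indexes/user_id-42.puffin"}},
        )
    ]
})

print(find_puffin_path(catalog, "db.events", 42, "user_id"))
```

The sketch runs sequentially; fetching the index list in parallel with the table metadata, as suggested above, is left out to keep it short.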
