Re: Re: Dedicated sync for Iceberg Index Support

Péter Váry Wed, 18 Mar 2026 23:42:46 -0700

Key takeaways from the general index discussion at the March 16 meeting.
Thanks to everyone who participated! The recording is available here:
https://www.youtube.com/watch?v=btmjhtRWUCE


   - Q: Do we need to tie index types to the algorithms used to access them?
   - A: From a specification perspective, the goal is to define the
   storage-level data layout so it can be shared across engines. Engines are
   free to interpret and use the data as they see fit, but the on-disk data
   layout itself must be strictly defined and interoperable.

   - Q: Should we introduce an additional abstraction layer (e.g., Vector
   Index) with sub-types such as IVF and DiskANN?
   - A: This is possible if we decide it is beneficial. I explored
   potential naming, but it is not yet clear how such a layer would be used in
   practice.
   *Question to Yingyi Bu*: could you provide examples where this
   additional layer would be useful? Should this abstraction be defined at the
   spec level, or is it better handled at the engine level?
   My initial idea was that users would create a generic Vector Index and
   let the engine choose the concrete implementation. However, this would
   limit user control, and users will likely need to specify the exact index
   representation, which implies they must be aware of the available
   representations.



   - Q: Do we want to allow extensibility for index types?
   - A: Yes. The intent is to support a small set of well-defined index
   types while allowing experimentation with new ones. If a new index type
   proves broadly useful, a follow-up proposal can standardize it and
   incorporate it into the spec.



   - Q: Do we allow multiple versions of an index for the same table
   snapshot?
   - A: Yes. Older index versions must be retained for readers that have
   already started using them, while new readers should automatically use the
   latest available version.
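
A minimal sketch of the versioning rule above, assuming a hypothetical
in-memory registry (`IndexVersions` and its method names are illustrative
only and are not part of any proposal or of the Iceberg API). New readers
resolve to the latest version, and an older version is only eligible for
cleanup once no reader still has it pinned:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IndexVersions:
    """Hypothetical registry of index versions for one table snapshot."""

    versions: list[int] = field(default_factory=list)     # kept sorted ascending
    pinned: dict[str, int] = field(default_factory=dict)  # reader id -> version

    def publish(self, version: int) -> None:
        """Register a newly built index version."""
        self.versions.append(version)
        self.versions.sort()

    def open_reader(self, reader_id: str) -> int:
        """New readers always start on the latest available version."""
        if not self.versions:
            raise LookupError("no index version available")
        latest = self.versions[-1]
        self.pinned[reader_id] = latest
        return latest

    def close_reader(self, reader_id: str) -> None:
        """Release the version this reader was using."""
        self.pinned.pop(reader_id, None)

    def expire(self) -> list[int]:
        """Remove versions that are neither the latest nor pinned by a reader."""
        if not self.versions:
            return []
        keep = set(self.pinned.values()) | {self.versions[-1]}
        removed = [v for v in self.versions if v not in keep]
        self.versions = [v for v in self.versions if v in keep]
        return removed
```

In this sketch, a reader opened before a new version is published keeps
its old version alive until `close_reader` is called; only then does
`expire` drop it.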



   - Q: Do we need to use materialized views for these indexes?
   - A: No. These indexes are primarily examples, and different types may
   require different storage methods. However, the Primary Key, Containing,
   and parts of the IVF indexes can be structured as Iceberg tables. This
   allows engines to read them natively; in some cases, Iceberg planners can
   automatically redirect queries to the index table without engine
   modifications. Furthermore, index maintenance for these tables can leverage
   existing materialized view maintenance workflows. Other index types may
   instead rely on Puffin files or alternative storage approaches.



   - Q: How should index metadata be accessed? Should we add explicit
   pointers for the indexes in the table metadata?
   - A: We did not have sufficient time to fully explore and conclude this
   topic.
   *Question for Yufei Gu*: Did I understand correctly that your main
   concern stems from endpoint resolution from a REST Catalog perspective?
   Specifically, if indexes are exposed under a URI such as
   v1/{prefix}/namespaces/{namespace}/tables/{table}/indexes/{index}, would
   this make it more difficult for the REST Catalog to resolve and route
   requests to the appropriate endpoint?
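
For concreteness, the URI shape in question can be sketched as follows.
This is only an illustration of the path template quoted above; the
`index_path` helper is hypothetical, the endpoint is a proposal under
discussion rather than an existing REST Catalog route, and the namespace
is treated as a single opaque string for simplicity:

```python
from urllib.parse import quote


def index_path(prefix: str, namespace: str, table: str, index: str) -> str:
    """Build the proposed index path, percent-encoding every segment.

    Hypothetical helper: the template comes from the discussion above,
    not from a published specification.
    """
    segments = [quote(s, safe="") for s in (prefix, namespace, table, index)]
    return "v1/{}/namespaces/{}/tables/{}/indexes/{}".format(*segments)


print(index_path("prod", "db", "events", "idx_user"))
```

Because the index name is the last path segment under the table, a
router would have to distinguish `.../tables/{table}` from
`.../tables/{table}/indexes/{index}` when resolving the request.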


Suhas Jayaram Subramanya via dev <[email protected]> wrote
(on Fri, Mar 13, 2026, 23:32):

> Hi everyone,
>
> Here's a proposal for native Vector Index support in Iceberg tables --
> https://docs.google.com/document/d/1KL4qLOwdqnhOcqTc0EjO1O16NV3M3c-gZCEINDWw4lA/edit?usp=sharing
>
> We've been working on this proposal with Peter internally at Microsoft and
> he suggested we post it here to bring this to the community's attention,
> ahead of the next Secondary Index Sync.
>
>
>
>
>
> Thanks,
>
> Suhas
>
> On 2026/02/19 04:34:34 huaxin gao wrote:
> > Hi Everyone,
> >
> > Here are the recording and notes from the Iceberg Index Support Sync on
> > 2/11.
> >
> > Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
> >
> > Notes:
> > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
> >
> > The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
> >
> > Since the sync, I updated the Bloom skipping index proposal
> > <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
> > to address the discussion questions, specifically:
> >
> >
> > - Performance justification: when this helps (high-cardinality = / IN,
> > many data files, high object-store latency) and how it differs from
> > Parquet row-group Bloom filters (which still require opening the data
> > file).
> > - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
> > file size), the planning cost trade-off (driver index reads vs executor
> > file opens), and mitigations via caching.
> > - Lifecycle / maintenance: incremental production as new data files
> > arrive, behavior when the index is missing/behind, and
> > sharding/compaction plus cleanup to avoid accumulating too many small
> > Puffin files over time.
> > - Writer expectations: inline (optional) vs asynchronous (primary) index
> > creation.
> >
> > I also implemented a Spark 4.1 POC
> > <https://github.com/apache/iceberg/pull/15311> and a local benchmark to
> > quantify both the pruning impact (plannedFiles → afterBloom) and the
> > index read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point
> > predicates on high-cardinality columns. Please take a look and let me
> > know if you have any questions or feedback.
> >
> > Thanks,
> >
> > Huaxin
> >
> > On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:
> >
> > > Reminder for tomorrow's sync on Iceberg Index Support.
> > >
> > > Wednesday: Feb. 11 9:00 – 10:00am
> > > Time zone: America/Los_Angeles
> > > Google Meet joining info
> > > Video call link: meet.google.com/nsp-ctyr-khk
> > > Design doc:
> > >
> > > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
> > >
> > > https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
> > >
> > > Thanks,
> > > Huaxin
> > >
> > >
> > > On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]>
> > > wrote:
> > >
> > >> Thanks Huaxin and Steven for organizing this. Looking forward to
> > >> meeting you all next week!
> > >>
> > >> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
> > >>
> > >>> We set up the dev calendar event with a new Google Meet link. Please
> > >>> ignore the link from Huaxin's original email.
> > >>>
> > >>> The dev calendar has the correct info (including the new meeting
> > >>> link)
> > >>>
> > >>> Iceberg Index Support Sync
> > >>> Wednesday, February 11 · 9:00 – 10:00am
> > >>> Time zone: America/Los_Angeles
> > >>> Google Meet joining info
> > >>> Video call link: https://meet.google.com/nsp-ctyr-khk
> > >>>
> > >>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> Sorry, I meant PST (not EST) :)
> > >>>> Looking forward to the discussion!
> > >>>>
> > >>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Huaxin,
> > >>>>>
> > >>>>> Thanks for starting the sync!
> > >>>>>
> > >>>>> The meeting seems to be 9-10AM PST on the dev events calendar
> > >>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
> > >>>>> not EST. Maybe it's a typo?
> > >>>>> Otherwise, looking forward to the discussion!
> > >>>>>
> > >>>>> Best,
> > >>>>> Shawn
> > >>>>>
> > >>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi all,
> > >>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
> > >>>>>> support.
> > >>>>>> Here is the existing discussion thread:
> > >>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
> > >>>>>>
> > >>>>>> To ground the discussion, here are the two proposals:
> > >>>>>>
> > >>>>>> - Peter's proposal
> > >>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
> > >>>>>> (overall index support)
> > >>>>>> - My proposal
> > >>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
> > >>>>>> (bloom filter skipping index)
> > >>>>>>
> > >>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST,
> > >>>>>> starting next Wednesday (2/11). After FileFormat sync finishes, we
> > >>>>>> plan to use that slot and switch to every other Monday, 9 AM to
> > >>>>>> 10 AM EST.
> > >>>>>>
> > >>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Huaxin
> > >>>>>>
> > >>>>>
> >
>
>
>
