Hi all, I built a standalone PoC to validate that the basic index structure works: that we can build a PK index, convert equality deletes to position deletes through it, and have every converted delete land on the correct live row. I ran it up to *100M keys*.
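For readers who want the conversion step in miniature, here is a toy Python sketch. It is not the PoC code: a plain dict stands in for the PK index, the names `RowLocation`, `build_index`, `convert_equality_deletes`, and `lands_on_correct_row` are hypothetical, and it assumes each key has at most one live row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowLocation:
    """Physical location of a live row: data file plus row position."""
    file_path: str
    row_position: int


def build_index(data_files):
    """Build a key -> RowLocation map from {file_path: [key, ...]}."""
    index = {}
    for file_path, keys in data_files.items():
        for row_position, key in enumerate(keys):
            index[key] = RowLocation(file_path, row_position)
    return index


def convert_equality_deletes(index, deleted_keys):
    """Turn key-based deletes into position deletes by probing the index.

    Keys with no live row are returned separately instead of being guessed.
    """
    position_deletes, unresolved = [], []
    for key in deleted_keys:
        location = index.pop(key, None)  # a deleted key is no longer live
        if location is None:
            unresolved.append(key)
        else:
            position_deletes.append((key, location))
    return position_deletes, unresolved


def lands_on_correct_row(data_files, position_deletes):
    """Check that every position delete points at the row holding its key."""
    return all(
        data_files[loc.file_path][loc.row_position] == key
        for key, loc in position_deletes
    )


# Toy data: two data files, keys listed in row order (illustrative only).
data_files = {"data-0.parquet": [10, 11, 12], "data-1.parquet": [20, 21]}
index = build_index(data_files)
position_deletes, unresolved = convert_equality_deletes(index, [11, 20, 99])
all_correct = lands_on_correct_row(data_files, position_deletes)
```

The last check mirrors the correctness criterion above: a converted delete counts only if the row at its (file, position) really holds the deleted key.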
*Headline: the structure works.* The index builds over up to 100M keys, the eq-delete → position-delete conversion resolved correctly at *every* size (100% of converted deletes mapped to the right live row), and the resulting position deletes are *~8× cheaper to apply* at query time than the equality deletes they replace.

Beyond correctness, the run also shows how the index's *maintenance* cost scales. It compares copy-on-write (COW, rewrite touched leaves) with an append/merge (MOR) option under a realistic mixed CDC checkpoint (1,000 inserts + 500 updates + 500 deletes), measured as local wall-clock time per checkpoint:

keys    EQ baseline  INDEX (COW)  % of 60s (COW)  INDEX (MOR)  % of 60s (MOR)    correct
5M      6 ms         6.7s         11.2%           2.2s         3.7%              PASS
20M     8 ms         24.2s        40.4%           6.4s         10.6%             PASS
50M     7 ms         51.6s        86.1%           12.2s        20.4%             PASS
*100M*  6 ms         *75.0s*      125% (BEHIND)   *16.9s*      28.2% (keeps up)  PASS

COW maintenance crosses the 60 s checkpoint around 100M keys (75 s/cycle, 125%); MOR stays at ~28% and keeps pace; the equality-delete baseline is ~6 ms and flat.

So the structure works, but *COW alone can't sustain scattered CDC at hundreds of millions of keys on a single writer*. It's worth allowing a merge-on-read / update-file maintenance option alongside COW (or sharding the index across parallel writers).

*Full write-up, all tables, and the in-region reality-check:* <https://docs.google.com/document/d/1G3zxbW8X0eU3UrouslZfp42bBc9CvgJGnJyDONCB4PU/edit?tab=t.0>

Feedback welcome, especially on the spec direction (whether to allow a merge-on-read / update-file maintenance option alongside COW) and on the read-side modeling.

Thanks,
Huaxin

On Tue, Jun 9, 2026 at 5:45 PM huaxin gao <[email protected]> wrote:

> Sorry, we've skipped posting a few of the dedicated index-sync summaries to the mailing list; you can find those in the Google doc
> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.8041k7j2n7y3>
> and the Slack channel. Here's yesterday's summary:
>
> *Decided*
>
>    - Index vs. table (what we agreed):
>       - Reuse table implementation/library code and a near-identical spec; the commit path will be custom regardless, so reuse isn't the deciding factor.
>       - An index is not a table from a user/API view: loading or writing an index as a table must fail (it would violate index invariants).
>       - The spec forbids most table behaviors: no overlapping files, one mandatory transform sort order, no column updates, no partition spec.
>    - Delete vectors: reuse Iceberg's existing DV; benchmarks showed no new delete format is worth introducing.
>    - Incremental updates: start with copy-on-write only (no update files). For object-store-sized leaves, a full leaf rewrite is about as cheap as maintaining an overlay update file + DV, so we'll skip the MOR machinery for now and add it later only if benchmarks prove we need it (likely just the very-large-leaf case).
>    - Validate the spec first: build a quick, hand-wired prototype (Parquet files structured per the spec) and benchmark it at real scales before formalizing.
>
> *Leaning, not final*
>
>    - Indexes are likely separate catalog objects, linked from the table by storing just an identifier (like materialized views) and not visible in LIST TABLES.
>    - We'll need a commit path for indexes, but simpler than tables (no stage-create).
>
> *Still open*
>
>    - Permissions model: separate vs. inherited (action: look at what real DBs do for index permissions).
>    - REST/catalog RPC design: minimize round-trips; index metadata ideally returned with LOAD TABLE. Catalog RPC cost may dominate Parquet IO, so this needs real design.
>    - Scale modeling: target rows-per-leaf vs. leaf size vs. metadata-file count.
>    - DDL-on-index semantics (reuse table schema-update actions or separate).
>
> Thanks,
> Huaxin
>
> On Wed, Apr 22, 2026 at 8:47 AM Péter Váry <[email protected]> wrote:
>
>> Hi All,
>>
>> TL;DR
>> We still need to validate with ADLS and S3, but based on the local tests, the MPHF approach looks more promising if we can tolerate larger files and longer index maintenance times.
>>
>> Details:
>> Here are the results from the local experiments on my Mac. I removed unnecessary statistics from the Parquet files and tested different row group sizes:
>>
>>    - For an index file with 1M records, a row group size of 5,000 appears to be the sweet spot.
>>    - For 10M records, 10,000 rows per row group works best.
>>
>> If you have additional ideas for optimizing Parquet-based indexes, I'd be very interested to hear them.
>> The test code is available on this branch:
>> https://github.com/pvary/iceberg/tree/leaf_bench
>>
>> Best results:
>> *1M records/file*
>>
>>    - Parquet - 5,000 rows/row group
>>       - Read: 1191 µs - 1 file open, 3 seeks, 123 KB read per lookup
>>       - Write: 1.7 s, 15 MB
>>    - MPHF
>>       - Read: 202 µs - 1 file open, 1 seek, 282 KB read per lookup
>>       - Write: 0.8 s, 34 MB
>>
>> *10M records/file*
>>
>>    - Parquet - 10,000 rows/row group
>>       - Read: 4168 µs - 1 file open, 3 seeks, 395 KB read per lookup
>>       - Write: 19.5 s, 144 MB
>>    - MPHF
>>       - Read: 1086 µs - 1 file open, 1 seek, 2.8 MB (2812 KB) read per lookup
>>       - Write: 6.5 s, 353 MB
>>
>> Below are the full results.
>>
>> Benchmark                                    (indexType)    (keyType)  (numRows)  Mode    Cnt             Score             Error  Units
>> InvertedIndexBenchmark.lookup                PARQUET_1000   LONG         1000000    ss  10000          3285.284           ± 5.138  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_1000   LONG         1000000    ss  10000    2522168989.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_1000   LONG         1000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_1000   LONG         1000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                PARQUET_1000   LONG        10000000    ss  10000         35449.614          ± 34.673  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_1000   LONG        10000000    ss  10000   24302649201.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_1000   LONG        10000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_1000   LONG        10000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                PARQUET_5000   LONG         1000000    ss  10000          1191.959           ± 4.169  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_5000   LONG         1000000    ss  10000    1230877229.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_5000   LONG         1000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_5000   LONG         1000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                PARQUET_5000   LONG        10000000    ss  10000          7236.447          ± 10.374  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_5000   LONG        10000000    ss  10000    5650715973.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_5000   LONG        10000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_5000   LONG        10000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                PARQUET_10000  LONG         1000000    ss  10000          1349.946           ± 7.834  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_10000  LONG         1000000    ss  10000    1730219377.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_10000  LONG         1000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_10000  LONG         1000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                PARQUET_10000  LONG        10000000    ss  10000          4168.635          ± 11.051  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_10000  LONG        10000000    ss  10000    3946341532.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_10000  LONG        10000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_10000  LONG        10000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                PARQUET_50000  LONG         1000000    ss  10000          4736.466          ± 38.179  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_50000  LONG         1000000    ss  10000    7427413541.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_50000  LONG         1000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_50000  LONG         1000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                PARQUET_50000  LONG        10000000    ss  10000          4979.031          ± 34.708  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_50000  LONG        10000000    ss  10000    7694887636.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_50000  LONG        10000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_50000  LONG        10000000    ss  10000         30000.000                    #
>> InvertedIndexBenchmark.lookup                MPHF           LONG         1000000    ss  10000           202.571           ± 2.336  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      MPHF           LONG         1000000    ss  10000    2821570000.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    MPHF           LONG         1000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          MPHF           LONG         1000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup                MPHF           LONG        10000000    ss  10000          1086.957           ± 4.524  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      MPHF           LONG        10000000    ss  10000   28119460000.000                    #
>> InvertedIndexBenchmark.lookup:openStreams    MPHF           LONG        10000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.lookup:seeks          MPHF           LONG        10000000    ss  10000         10000.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_1000   LONG         1000000    ss      3       1720731.014      ± 876636.004  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_1000   LONG         1000000    ss      3      46453317.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_1000   LONG        10000000    ss      3      18547947.876    ± 12258125.307  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_1000   LONG        10000000    ss      3     452655675.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_5000   LONG         1000000    ss      3       1718345.583     ± 1103928.016  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_5000   LONG         1000000    ss      3      44845788.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_5000   LONG        10000000    ss      3      18604229.931     ± 2668361.915  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_5000   LONG        10000000    ss      3     435388818.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_10000  LONG         1000000    ss      3       1761555.389      ± 535857.675  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_10000  LONG         1000000    ss      3      44536635.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_10000  LONG        10000000    ss      3      19501588.264     ± 2130054.558  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_10000  LONG        10000000    ss      3     433189623.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_50000  LONG         1000000    ss      3       1936624.889     ± 6601363.985  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_50000  LONG         1000000    ss      3      44264655.000                    #
>> InvertedIndexBenchmark.write                 PARQUET_50000  LONG        10000000    ss      3      20471742.278    ± 10705206.310  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_50000  LONG        10000000    ss      3     431311305.000                    #
>> InvertedIndexBenchmark.write                 MPHF           LONG         1000000    ss      3        896573.958     ± 1408024.851  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  MPHF           LONG         1000000    ss      3     102846369.000                    #
>> InvertedIndexBenchmark.write                 MPHF           LONG        10000000    ss      3       6509348.875    ± 15519975.479  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  MPHF           LONG        10000000    ss      3    1058435733.000                    #
>>
>> On Tue, Apr 21, 2026 at 8:53 PM huaxin gao <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> In recent secondary index sync meetings, the discussion converged on the need to define what an index is from first principles before settling on physical layout.
>>>
>>> To address that, Peter and I have drafted a requirements document for a key lookup index (renamed from "primary key index" to avoid implying uniqueness enforcement). The goal is to nail down one well-scoped index type first.
>>>
>>> Doc: Key Lookup Index Requirements
>>> <https://docs.google.com/document/d/1e0zxK-jA0LBDq8YQlQgFipTHelDFiga8lCkgDTmYub8/edit?tab=t.0#heading=h.8shrgabvl19>
>>>
>>> It covers requirements, three design options (manifest + sorted Parquet, hash + sorted Parquet, hash + MPHF) and open questions. We will add preliminary benchmark results shortly.
>>>
>>> Feedback welcome: inline in the doc, on this thread, or at the next index sync.
>>>
>>> Thanks,
>>>
>>> Huaxin
>>>
>>> On Mon, Apr 13, 2026 at 7:22 AM Steven Wu <[email protected]> wrote:
>>>
>>>> Do we need the special index identifier that was originally proposed? A generic CatalogObjectIdentifier (with namespace and name) would be consistent with all object types in the catalog. I have a discussion thread on the generic identifier topic: [DISCUSS] REST Spec: generic CatalogObjectIdentifier.
>>>>
>>>> Should we add an indexes array field to table metadata? It would contain only a list of index object identifiers, not any index metadata, which should live in the index objects. Yufei was trying to bring this up at the end of the first sync, but we didn't get enough time to really discuss it. It would be great to discuss this as the first agenda item today.
>>>>
>>>> On Mon, Apr 13, 2026 at 3:17 AM Péter Váry <[email protected]> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We had several engaging discussions at the Iceberg Summit, and it was great to finally catch up with many of you in person. We truly missed those who couldn't attend; hopefully we'll all meet again at the next summit.
>>>>>
>>>>> To keep the conversation going, Huaxin and I have put together the agenda for our next meeting. As a reminder, we'll meet on *April 13th, 9:00–10:00 AM* PDT (6:00–7:00 PM CEST).
>>>>>
>>>>> Proposed agenda:
>>>>>
>>>>>    - Continue first-principles index design discussion from Mar 30
>>>>>       - *Index Ownership and Write Responsibility*
>>>>>          - Should writers be allowed to update indexes, or
>>>>>          - Should all index writes be handled exclusively by the Index Maintenance process?
>>>>>          - If writers can update indexes, then we need to define what guarantees are required (compaction, file splitting, layout expectations).
>>>>>          - If only Index Maintenance updates indexes, then we only need to define what observable properties should be exposed to consumers, like:
>>>>>             - Expected max files for a single key
>>>>>             - Current max files for a single key
>>>>>             - Deletes allowed/present
>>>>>             - Sorted by
>>>>>             - Partitioned by
>>>>>       - *Specification Scope: What Belongs in the Spec?*
>>>>>          - Related to the ownership question above
>>>>>          - Light spec: Just define that the index table should be optimized for retrieval by key columns and that the index columns should be contained in the table. This could give us more flexibility if better organization methods come up, or
>>>>>          - Detailed spec: We could define the max number of files per index to read for a single key, or even the partitioning and the exact sort order. This could allow more use-cases for a given index, like joins or cardinality estimations.
>>>>>          - I would go for light spec for the main types (PK, Containing), with only the Index Maintenance processes updating the Indexes, as for many use-cases the details are not important, and writers will very rarely update the Indexes themselves.
>>>>>       - *Logical Placement of Indexes*
>>>>>          - Index as a child object of an Iceberg Table, or
>>>>>          - Index as a first-class entity under /namespace/indexes/{index}
>>>>>             - Based on the discussions at the summit we are leaning in this direction. This means the index id must be unique within the namespace, but it helps the catalog implementations quite a bit.
>>>>>       - *Physical Placement of Index Data*
>>>>>          - I don't think we should specify this. We should have a base location for the index, but can rely on the catalog implementations to decide on their own, like they do with tables, views, and UDFs.
>>>>>    - *Iceberg Reader Based indexes* (Containing indexes and potentially PK indexes). These are the indexes which could be read by the existing Iceberg readers. We might decide to store the PK index similarly to an Iceberg Table and treat it as a reader based index.
>>>>>       - What are the table properties/features exposed to the readers?
>>>>>          - Maybe just some behavioral descriptors for the optimizer to decide if the index could be used or should be skipped, like:
>>>>>             - Expected max files for a single key
>>>>>             - Max files for a single key
>>>>>             - Deletes allowed/present
>>>>>             - Sorted by
>>>>>             - Partitioned by
>>>>>          - The Tasks when reading the index based on the filters and projection
>>>>>       - What are the table properties/features exposed to the Index Maintenance? I think this could be internal to the Index Maintenance process and might not be exposed through the spec. The Index Maintenance process could handle this as a standard Iceberg Table and could be based on the Table Maintenance process, but there might be some totally different processes.
>>>>>       - It should be possible to add properties to an index defined by the Index Maintenance process which could be used and updated in the next Index Maintenance run.
>>>>>    - *PK index storage format benchmark results*
>>>>>       - Flat Parquet (baseline)
>>>>>       - BTree with Parquet leaves
>>>>>       - Vortex
>>>>>    - *Open items / next steps*
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Mon, Mar 23, 2026 at 3:03 AM huaxin gao <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone, I wanted to share an update on the primary key index work.
>>>>>> Since there are still open questions on whether bloom filter indexes fit in the secondary index framework or should be treated as extended stats, I've shifted focus to the primary key index since it's a clearer fit for the framework.
>>>>>> I've put together a proposal for a primary key reverse-lookup index that maps each key to its physical location (file_path, row_position). It enables:
>>>>>>
>>>>>>    - Scan-time file pruning for point lookups
>>>>>>    - Converting key-based deletes into position deletes (eliminating equality deletes for Flink CDC)
>>>>>>    - Accelerating Spark MERGE INTO by replacing full-table joins with direct file lookups
>>>>>>
>>>>>> Proposal:
>>>>>> https://docs.google.com/document/d/1HuhCZ0n2FqDh8yqQb9oEj1CPM5yXpEsMPGZno2aSf8E/edit?tab=t.0#heading=h.tbevg4q0m9
>>>>>> Feedback welcome!
>>>>>> Thanks,
>>>>>> Huaxin
>>>>>>
>>>>>> On Wed, Mar 18, 2026 at 11:42 PM Péter Váry <[email protected]> wrote:
>>>>>>
>>>>>>> Key takeaways from the general index discussion at the Mar 16 meeting.
>>>>>>> Thanks to everyone who participated!
>>>>>>> The recording is available here: https://www.youtube.com/watch?v=btmjhtRWUCE
>>>>>>>
>>>>>>>    - Q: Do we need to tie index types to the algorithms used to access them?
>>>>>>>    - A: From a specification perspective, the goal is to define the storage-level data layout so it can be shared across engines. Engines are free to interpret and use the data as they see fit, but the on-disk data layout itself must be strictly defined and interoperable.
>>>>>>>
>>>>>>>    - Q: Should we introduce an additional abstraction layer (e.g., Vector Index) with sub-types such as IVF and DiskANN?
>>>>>>>    - A: This is possible if we decide it is beneficial. I explored potential naming, but it is not yet clear how such a layer would be used in practice.
>>>>>>>    *Question to Yingyi Bu*: could you provide examples where this additional layer would be useful? Should this abstraction be defined at the spec level, or is it better handled at the engine level?
>>>>>>>    My initial idea was that users would create a generic Vector Index and let the engine choose the concrete implementation. However, this would limit user control, and users likely need to specify the exact index representation, which implies they must be aware of the available representations.
>>>>>>>
>>>>>>>    - Q: Do we want to allow extensibility for index types?
>>>>>>>    - A: Yes. The intent is to support a small set of well-defined index types while allowing experimentation with new ones. If a new index type proves broadly useful, a follow-up proposal can standardize it and incorporate it into the spec.
>>>>>>>
>>>>>>>    - Q: Do we allow multiple versions of an index for the same table snapshot?
>>>>>>>    - A: Yes. Older index versions must be retained for readers that have already started using them, while new readers should automatically use the latest available version.
>>>>>>>
>>>>>>>    - Q: Do we need to use materialized views for these indexes?
>>>>>>>    - A: No. These indexes are primarily examples, and different types may require different storage methods. However, the Primary Key, Containing, and parts of the IVF indexes can be structured as Iceberg tables. This allows engines to read them natively; in some cases, Iceberg planners can automatically redirect queries to the index table without engine modifications. Furthermore, index maintenance for these tables can leverage existing materialized view maintenance workflows. Other index types may instead rely on Puffin files or alternative storage approaches.
>>>>>>>
>>>>>>>    - Q: How should index metadata be accessed? Should we add explicit pointers for the indexes in the table metadata?
>>>>>>>    - A: We did not have sufficient time to fully explore and conclude this topic.
>>>>>>>    *Question for Yufei Gu*: Did I understand correctly that your main concern stems from endpoint resolution from a REST Catalog perspective? Specifically, if indexes are exposed under a URI such as v1/{prefix}/namespaces/{namespace}/tables/{table}/indexes/{index}, would this make it more difficult for the REST Catalog to resolve and route requests to the appropriate endpoint?
>>>>>>>
>>>>>>> On Fri, Mar 13, 2026 at 11:32 PM Suhas Jayaram Subramanya via dev <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Here's a proposal for native Vector Index support in Iceberg tables:
>>>>>>>> https://docs.google.com/document/d/1KL4qLOwdqnhOcqTc0EjO1O16NV3M3c-gZCEINDWw4lA/edit?usp=sharing
>>>>>>>>
>>>>>>>> We've been working on this proposal with Peter internally at Microsoft, and he suggested we post it here to bring it to the community's attention ahead of the next Secondary Index Sync.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Suhas
>>>>>>>>
>>>>>>>> On 2026/02/19 04:34:34 huaxin gao wrote:
>>>>>>>> > Hi Everyone,
>>>>>>>> >
>>>>>>>> > Here are the recording and notes from the Iceberg Index Support Sync on 2/11.
>>>>>>>> >
>>>>>>>> > Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>>>>>> >
>>>>>>>> > Notes:
>>>>>>>> > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>>>>>> >
>>>>>>>> > The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>>>>>>>> >
>>>>>>>> > Since the sync, I updated the Bloom skipping index proposal
>>>>>>>> > <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>>>>>>> > to address the discussion questions, specifically:
>>>>>>>> >
>>>>>>>> >    - Performance justification: when this helps (high-cardinality = / IN, many data files, high object-store latency) and how it differs from Parquet row-group Bloom filters (which still require opening the data file).
>>>>>>>> >    - Cost / scalability: rough sizing (Bloom blob size per file, Puffin file size), the planning cost trade-off (driver index reads vs executor file opens), and mitigations via caching.
>>>>>>>> >    - Lifecycle / maintenance: incremental production as new data files arrive, behavior when the index is missing/behind, and sharding/compaction plus cleanup to avoid accumulating too many small Puffin files over time.
>>>>>>>> >    - Writer expectations: inline (optional) vs asynchronous (primary) index creation.
>>>>>>>> >
>>>>>>>> > I also implemented a Spark 4.1 POC <https://github.com/apache/iceberg/pull/15311> and a local benchmark to quantify both the pruning impact (plannedFiles → afterBloom) and the index read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point predicates on high-cardinality columns. Please take a look and let me know if you have any questions or feedback.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Huaxin
>>>>>>>> >
>>>>>>>> > On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > > Reminder for tomorrow's sync on Iceberg Index Support.
>>>>>>>> > >
>>>>>>>> > > Wednesday: Feb. 11 9:00 – 10:00am
>>>>>>>> > > Time zone: America/Los_Angeles
>>>>>>>> > > Google Meet joining info
>>>>>>>> > > Video call link: meet.google.com/nsp-ctyr-khk
>>>>>>>> > > Design doc:
>>>>>>>> > >
>>>>>>>> > > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>>>>> > >
>>>>>>>> > > https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>>>>> > >
>>>>>>>> > > Thanks,
>>>>>>>> > > Huaxin
>>>>>>>> > >
>>>>>>>> > > On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]> wrote:
>>>>>>>> > >
>>>>>>>> > >> Thanks Huaxin and Steven for organizing this. Looking forward to meeting you all next week!
>>>>>>>> > >>
>>>>>>>> > >> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>>>> > >>
>>>>>>>> > >>> We set up the dev calendar event with a new google meet link. Please ignore the link from Huaxin's original email.
>>>>>>>> > >>>
>>>>>>>> > >>> The dev calendar has the correct info (including the new meeting link)
>>>>>>>> > >>>
>>>>>>>> > >>> Iceberg Index Support Sync
>>>>>>>> > >>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>>>> > >>> Time zone: America/Los_Angeles
>>>>>>>> > >>> Google Meet joining info
>>>>>>>> > >>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>>> > >>>
>>>>>>>> > >>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]> wrote:
>>>>>>>> > >>>
>>>>>>>> > >>>> Sorry, I meant PST (not EST) :)
>>>>>>>> > >>>> Looking forward to the discussion!
>>>>>>>> > >>>>
>>>>>>>> > >>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]> wrote:
>>>>>>>> > >>>>
>>>>>>>> > >>>>> Hi Huaxin,
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> Thanks for starting the sync!
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>> > >>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>> > >>>>> not EST. Maybe it's a typo?
>>>>>>>> > >>>>> Otherwise, looking forward to the discussion!
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> Best,
>>>>>>>> > >>>>> Shawn
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]> wrote:
>>>>>>>> > >>>>>
>>>>>>>> > >>>>>> Hi all,
>>>>>>>> > >>>>>> I'd like to start a dedicated sync to discuss Iceberg Index support.
>>>>>>>> > >>>>>> Here is the existing discussion thread:
>>>>>>>> > >>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> To ground the discussion, here are the two proposals:
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>>    - Peter's proposal
>>>>>>>> > >>>>>>      <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>> > >>>>>>      (overall index support)
>>>>>>>> > >>>>>>    - My proposal
>>>>>>>> > >>>>>>      <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>> > >>>>>>      (bloom filter skipping index)
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST, starting next Wednesday (2/11). After FileFormat sync finishes, we plan to use that slot and switch to every other Monday, 9 AM to 10 AM EST.
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> Thanks,
>>>>>>>> > >>>>>> Huaxin
