Re: Re: Dedicated sync for Iceberg Index Support

huaxin gao Sat, 20 Jun 2026 11:12:54 -0700

Hi all,

I built a standalone PoC to validate that the basic index structure works:
that we can build a PK index, convert equality deletes to position deletes
through it, and have every converted delete land on the correct live row. I
ran it up to *100M keys*.


*Headline: the structure works.* The index builds over up to 100M keys, the
eq-delete → position-delete conversion resolved correctly at *every* size
(100% of converted deletes mapped to the right live row), and the resulting
position deletes are *~8× cheaper to apply* at query time than the equality
deletes they replace.
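
For readers who want the conversion step spelled out, here is a minimal sketch of the idea, not the PoC code itself. It assumes integer keys, at most one live row per key, and a plain in-memory dict standing in for the index; `RowLocation`, `build_pk_index`, and `convert_equality_deletes` are hypothetical names used only for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class RowLocation(NamedTuple):
    """Physical location of one live row: data file plus row position."""
    file_path: str
    row_position: int


def build_pk_index(data_files: Dict[str, List[int]]) -> Dict[int, RowLocation]:
    """Map every key to the (file, position) of its live row.

    Assumes each key is live in at most one place (toy input, hypothetical).
    """
    index: Dict[int, RowLocation] = {}
    for file_path, keys in data_files.items():
        for row_position, key in enumerate(keys):
            index[key] = RowLocation(file_path, row_position)
    return index


def convert_equality_deletes(
    index: Dict[int, RowLocation], delete_keys: Iterable[int]
) -> Tuple[List[RowLocation], List[int]]:
    """Resolve key-based deletes to position deletes through the index.

    Returns the position deletes plus any keys that had no live row.
    Resolved keys are dropped from the index because they are no longer live.
    """
    position_deletes: List[RowLocation] = []
    unmatched: List[int] = []
    for key in delete_keys:
        location = index.pop(key, None)
        if location is None:
            unmatched.append(key)
        else:
            position_deletes.append(location)
    return position_deletes, unmatched


# Toy data: two data files holding five live keys.
data_files = {"data-a.parquet": [10, 11, 12], "data-b.parquet": [20, 21]}
index = build_pk_index(data_files)
position_deletes, unmatched = convert_equality_deletes(index, [11, 20, 99])

# Correctness check in the spirit of the PoC: every converted delete must
# point at the row that actually holds the deleted key.
for location, key in zip(position_deletes, [11, 20]):
    assert data_files[location.file_path][location.row_position] == key
```

The final loop mirrors the "every converted delete lands on the correct live row" check at toy scale; a real index would live in files rather than in a dict.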

Beyond correctness, the run also shows how the index’s *maintenance* cost
scales, comparing copy-on-write (COW, rewrite touched leaves) vs an
append/merge (MOR) option, under a realistic mixed CDC checkpoint (1,000
insert + 500 update + 500 delete), local wall-clock:

keys    EQ baseline  INDEX (COW)  % of 60s (COW)  INDEX (MOR)  % of 60s (MOR)    correct
5M      6 ms         6.7 s        11.2%           2.2 s        3.7%              PASS
20M     8 ms         24.2 s       40.4%           6.4 s        10.6%             PASS
50M     7 ms         51.6 s       86.1%           12.2 s       20.4%             PASS
*100M*  6 ms         *75.0 s*     125% (BEHIND)   *16.9 s*     28.2% (keeps up)  PASS

COW maintenance crosses the 60 s checkpoint around 100M (75 s/cycle, 125%);
MOR stays at ~28% and keeps pace; the equality-delete baseline is ~6 ms and
flat. So the structure works, but *COW alone can’t sustain scattered CDC at
hundreds of millions of keys on a single writer*. It’s worth allowing a
merge-on-read / update-file maintenance option alongside COW (or sharding
the index across parallel writers).
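
The "% of 60s" figures are simply maintenance time divided by the checkpoint interval. A small sketch of that arithmetic, using the rounded timings from the table above (recomputed shares can differ from the table by about 0.1 percentage points, presumably because the table was derived from unrounded timings); `budget_share` and `keeps_up` are hypothetical helper names:

```python
CHECKPOINT_SECONDS = 60.0

# (keys, COW seconds, MOR seconds), rounded values copied from the table.
MEASUREMENTS = [
    ("5M", 6.7, 2.2),
    ("20M", 24.2, 6.4),
    ("50M", 51.6, 12.2),
    ("100M", 75.0, 16.9),
]


def budget_share(seconds: float, checkpoint: float = CHECKPOINT_SECONDS) -> float:
    """Percentage of one checkpoint interval consumed by index maintenance."""
    return 100.0 * seconds / checkpoint


def keeps_up(seconds: float, checkpoint: float = CHECKPOINT_SECONDS) -> bool:
    """A single writer keeps pace only if maintenance fits in one interval."""
    return seconds <= checkpoint


for keys, cow, mor in MEASUREMENTS:
    print(
        f"{keys:>5}: COW {budget_share(cow):6.1f}% "
        f"({'keeps up' if keeps_up(cow) else 'BEHIND'}), "
        f"MOR {budget_share(mor):6.1f}% "
        f"({'keeps up' if keeps_up(mor) else 'BEHIND'})"
    )
```

Anything above 100% means the writer falls further behind with every checkpoint, which is the case for COW at 100M keys.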

*Full write-up, all tables, and the in-region reality-check:* link
<https://docs.google.com/document/d/1G3zxbW8X0eU3UrouslZfp42bBc9CvgJGnJyDONCB4PU/edit?tab=t.0>

Feedback welcome, especially on the spec direction (whether to allow a
merge-on-read / update-file maintenance option alongside COW) and on the
read-side modeling.

Thanks,
Huaxin

On Tue, Jun 9, 2026 at 5:45 PM huaxin gao <[email protected]> wrote:

> Sorry, we've skipped posting a few of the dedicated index-sync summaries
> to the mailing list; you can find those in the Google doc
> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.8041k7j2n7y3>
> and the Slack channel. Here's yesterday's summary:
>
> *Decided*
>
>    - Index vs. table (what we agreed):
>       - Reuse table implementation/library code and a near-identical spec.
>       The commit path will be custom regardless, so reuse isn't the deciding
>       factor.
>       - An index is not a table from a user/API view: loading or writing an
>       index as a table must fail (it would violate index invariants).
>       - The spec forbids most table behaviors: no overlapping files, one
>       mandatory transform sort order, no column updates, no partition spec.
>    - Delete vectors: reuse Iceberg's existing DV. Benchmarks showed no new
>    delete format is worth introducing.
>    - Incremental updates: start with copy-on-write only (no update files).
>    For object-store-sized leaves, a full leaf rewrite is about as cheap as
>    maintaining an overlay update file + DV, so we'll skip the MOR machinery
>    for now and add it later only if benchmarks prove we need it (likely just
>    the very-large-leaf case).
>    - Validate the spec first: build a quick, hand-wired prototype (Parquet
>    files structured per the spec) and benchmark it at realistic scales before
>    formalizing.
>
> *Leaning, not final*
>
>    - Indexes are likely separate catalog objects, linked from the table by
>    storing just an identifier (like materialized views) and not visible in
>    LIST TABLES.
>    - We'll need a commit path for indexes, but simpler than tables (no
>    stage-create).
>
> *Still open*
>
>    - Permissions model: separate vs. inherited (action: look at what real
>    DBs do for index permissions).
>    - REST/catalog RPC design: minimize round-trips; index metadata ideally
>    returned with LOAD TABLE. Catalog RPC cost may dominate Parquet IO, so
>    this needs real design.
>    - Scale modeling: target rows-per-leaf vs. leaf size vs. metadata-file
>    count.
>    - DDL-on-index semantics (reuse table schema-update actions or separate).
>
>
> Thanks,
> Huaxin
>
> On Wed, Apr 22, 2026 at 8:47 AM Péter Váry <[email protected]>
> wrote:
>
>> Hi All,
>>
>> TL;DR
>> We still need to validate with ADLS and S3, but based on the local tests,
>> the MPHF approach looks more promising if we can tolerate larger files and
>> longer index maintenance times.
>>
>> Details:
>> Here are the results from the local experiments on my Mac. I removed
>> unnecessary statistics from the Parquet files and tested different row
>> group sizes:
>>
>>    - For an index file with 1M records, a row group size of 5,000
>>    appears to be the sweet spot.
>>    - For 10M records, 10,000 rows per row group works best.
>>
>> If you have additional ideas for optimizing Parquet-based indexes, I’d be
>> very interested to hear them.
>> The test code is available on this branch:
>> https://github.com/pvary/iceberg/tree/leaf_bench
>>
>> Best results:
>> *1m records/file*
>>
>>    - Parquet - 5000 row/RowGroup
>>       - Read: 1191 µs - 1 file open, 3 seek, 123KB read per lookup
>>       - Write: 1.7 s, 15 MB
>>    - MPHF
>>       - Read: 202 µs - 1 file open, 1 seek,  282KB read per lookup
>>       - Write: 0.8 s, 34 MB
>>
>> *10m records/file*
>>
>>    - Parquet - 10000 row/RowGroup
>>       - Read: 4168 µs - 1 file open, 3 seek, 395KB read per lookup
>>       - Write: 19.5 s, 144 MB
>>    - MPHF
>>       - Read: 1086 µs - 1 file open, 1 seek,  2.8 MB (2812KB) read per
>>       lookup
>>       - Write: 6.5 s, 353 MB
>>
>> Below are the full results.
>>
>> Benchmark                                      (indexType)  (keyType)  (numRows)  Mode    Cnt            Score          Error  Units
>> InvertedIndexBenchmark.lookup                 PARQUET_1000       LONG    1000000    ss  10000         3285.284 ±        5.138  us/op
>> InvertedIndexBenchmark.lookup:bytesRead       PARQUET_1000       LONG    1000000    ss  10000   2522168989.000                 #
>> InvertedIndexBenchmark.lookup:openStreams     PARQUET_1000       LONG    1000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks           PARQUET_1000       LONG    1000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                 PARQUET_1000       LONG   10000000    ss  10000        35449.614 ±       34.673  us/op
>> InvertedIndexBenchmark.lookup:bytesRead       PARQUET_1000       LONG   10000000    ss  10000  24302649201.000                 #
>> InvertedIndexBenchmark.lookup:openStreams     PARQUET_1000       LONG   10000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks           PARQUET_1000       LONG   10000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                 PARQUET_5000       LONG    1000000    ss  10000         1191.959 ±        4.169  us/op
>> InvertedIndexBenchmark.lookup:bytesRead       PARQUET_5000       LONG    1000000    ss  10000   1230877229.000                 #
>> InvertedIndexBenchmark.lookup:openStreams     PARQUET_5000       LONG    1000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks           PARQUET_5000       LONG    1000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                 PARQUET_5000       LONG   10000000    ss  10000         7236.447 ±       10.374  us/op
>> InvertedIndexBenchmark.lookup:bytesRead       PARQUET_5000       LONG   10000000    ss  10000   5650715973.000                 #
>> InvertedIndexBenchmark.lookup:openStreams     PARQUET_5000       LONG   10000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks           PARQUET_5000       LONG   10000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                PARQUET_10000       LONG    1000000    ss  10000         1349.946 ±        7.834  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_10000       LONG    1000000    ss  10000   1730219377.000                 #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_10000       LONG    1000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_10000       LONG    1000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                PARQUET_10000       LONG   10000000    ss  10000         4168.635 ±       11.051  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_10000       LONG   10000000    ss  10000   3946341532.000                 #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_10000       LONG   10000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_10000       LONG   10000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                PARQUET_50000       LONG    1000000    ss  10000         4736.466 ±       38.179  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_50000       LONG    1000000    ss  10000   7427413541.000                 #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_50000       LONG    1000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_50000       LONG    1000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                PARQUET_50000       LONG   10000000    ss  10000         4979.031 ±       34.708  us/op
>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_50000       LONG   10000000    ss  10000   7694887636.000                 #
>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_50000       LONG   10000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks          PARQUET_50000       LONG   10000000    ss  10000        30000.000                 #
>> InvertedIndexBenchmark.lookup                         MPHF       LONG    1000000    ss  10000          202.571 ±        2.336  us/op
>> InvertedIndexBenchmark.lookup:bytesRead               MPHF       LONG    1000000    ss  10000   2821570000.000                 #
>> InvertedIndexBenchmark.lookup:openStreams             MPHF       LONG    1000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks                   MPHF       LONG    1000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup                         MPHF       LONG   10000000    ss  10000         1086.957 ±        4.524  us/op
>> InvertedIndexBenchmark.lookup:bytesRead               MPHF       LONG   10000000    ss  10000  28119460000.000                 #
>> InvertedIndexBenchmark.lookup:openStreams             MPHF       LONG   10000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.lookup:seeks                   MPHF       LONG   10000000    ss  10000        10000.000                 #
>> InvertedIndexBenchmark.write                  PARQUET_1000       LONG    1000000    ss      3      1720731.014 ±   876636.004  us/op
>> InvertedIndexBenchmark.write:indexFileBytes   PARQUET_1000       LONG    1000000    ss      3     46453317.000                 #
>> InvertedIndexBenchmark.write                  PARQUET_1000       LONG   10000000    ss      3     18547947.876 ± 12258125.307  us/op
>> InvertedIndexBenchmark.write:indexFileBytes   PARQUET_1000       LONG   10000000    ss      3    452655675.000                 #
>> InvertedIndexBenchmark.write                  PARQUET_5000       LONG    1000000    ss      3      1718345.583 ±  1103928.016  us/op
>> InvertedIndexBenchmark.write:indexFileBytes   PARQUET_5000       LONG    1000000    ss      3     44845788.000                 #
>> InvertedIndexBenchmark.write                  PARQUET_5000       LONG   10000000    ss      3     18604229.931 ±  2668361.915  us/op
>> InvertedIndexBenchmark.write:indexFileBytes   PARQUET_5000       LONG   10000000    ss      3    435388818.000                 #
>> InvertedIndexBenchmark.write                 PARQUET_10000       LONG    1000000    ss      3      1761555.389 ±   535857.675  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_10000       LONG    1000000    ss      3     44536635.000                 #
>> InvertedIndexBenchmark.write                 PARQUET_10000       LONG   10000000    ss      3     19501588.264 ±  2130054.558  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_10000       LONG   10000000    ss      3    433189623.000                 #
>> InvertedIndexBenchmark.write                 PARQUET_50000       LONG    1000000    ss      3      1936624.889 ±  6601363.985  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_50000       LONG    1000000    ss      3     44264655.000                 #
>> InvertedIndexBenchmark.write                 PARQUET_50000       LONG   10000000    ss      3     20471742.278 ± 10705206.310  us/op
>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_50000       LONG   10000000    ss      3    431311305.000                 #
>> InvertedIndexBenchmark.write                          MPHF       LONG    1000000    ss      3       896573.958 ±  1408024.851  us/op
>> InvertedIndexBenchmark.write:indexFileBytes           MPHF       LONG    1000000    ss      3    102846369.000                 #
>> InvertedIndexBenchmark.write                          MPHF       LONG   10000000    ss      3      6509348.875 ± 15519975.479  us/op
>> InvertedIndexBenchmark.write:indexFileBytes           MPHF       LONG   10000000    ss      3   1058435733.000                 #
>>
>> huaxin gao <[email protected]> wrote (on Tue, Apr 21, 2026,
>> 20:53):
>>
>>> Hi all,
>>>
>>> In recent secondary index sync meetings, the discussion converged on the
>>> need to define what an index is from first principles before settling on
>>> physical layout.
>>>
>>> To address that, Peter and I have drafted a requirements document for a
>>> key lookup index (renamed from "primary key index" to avoid implying
>>> uniqueness enforcement). The goal is to nail down one well-scoped index
>>> type first.
>>>
>>> Doc: Key Lookup Index Requirements
>>> <https://docs.google.com/document/d/1e0zxK-jA0LBDq8YQlQgFipTHelDFiga8lCkgDTmYub8/edit?tab=t.0#heading=h.8shrgabvl19>
>>>
>>> It covers requirements, three design options (manifest + sorted Parquet,
>>> hash + sorted Parquet, hash + MPHF) and open questions. We will add
>>> preliminary benchmark results shortly.
>>>
>>> Feedback welcome — inline in the doc, on this thread, or at the next
>>> index sync.
>>>
>>> Thanks,
>>>
>>> Huaxin
>>>
>>> On Mon, Apr 13, 2026 at 7:22 AM Steven Wu <[email protected]> wrote:
>>>
>>>> Do we need the special index identifier that was originally proposed? A
>>>> generic CatalogObjectIdentifier (with namespace and name) would be
>>>> consistent with all object types in the catalog. I have a discussion thread
>>>> on the generic identifier topic: [DISCUSS] REST Spec: generic
>>>> CatalogObjectIdentifier.
>>>>
>>>> Should we add an indexes array field to table metadata? It only
>>>> contains a list of index object identifiers. It doesn't contain any index
>>>> metadata which should live in the index objects. Yufei was trying to bring
>>>> this up at the end of the first sync. But we didn't get enough time to
>>>> really discuss it. It will be great to discuss this as the first agenda
>>>> item today.
>>>>
>>>> On Mon, Apr 13, 2026 at 3:17 AM Péter Váry <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We had several engaging discussions at the Iceberg Summit, and it was
>>>>> great to finally catch up with many of you in person. We truly missed
>>>>> those who couldn’t attend; hopefully we’ll all meet again at the next
>>>>> summit.
>>>>>
>>>>> To keep the conversation going, Huaxin and I have put together the
>>>>> agenda for our next meeting. As a reminder, we’ll meet on *April 13th,
>>>>> 9:00–10:00 AM* PDT (6:00–7:00 PM CEST).
>>>>>
>>>>> Proposed agenda:
>>>>>
>>>>>    - Continue first-principles index design discussion from Mar 30
>>>>>       - *Index Ownership and Write Responsibility*
>>>>>          - Should writers be allowed to update indexes, or
>>>>>          - Should all index writes be handled exclusively by the
>>>>>          Index Maintenance process?
>>>>>          - If writers can update indexes, then we need to define what
>>>>>          guarantees are required (compaction, file splitting, layout
>>>>>          expectations).
>>>>>          - If only Index Maintenance updates indexes, then we only
>>>>>          need to define what observable properties should be exposed to
>>>>>          consumers, like:
>>>>>             - Expected max files for a single key
>>>>>             - Current max files for a single key
>>>>>             - Deletes allowed/present
>>>>>             - Sorted by
>>>>>             - Partitioned by
>>>>>       - *Specification Scope: What Belongs in the Spec?*
>>>>>          - Related to the ownership question above
>>>>>          - Light spec: Just define that the index table should be
>>>>>          optimized for retrieval by key columns and the index columns
>>>>>          should be contained in the table. This could give us more
>>>>>          flexibility if better organization methods come up, or
>>>>>          - Detailed spec: We could define the max number of files per
>>>>>          index to read for a single key, or even the partitioning and the
>>>>>          exact sort order. This could allow more use-cases for a given
>>>>>          index, like joins or cardinality estimations.
>>>>>          - I would go for light spec for the main types (PK,
>>>>>          Containing), and only the Index Maintenance processes should
>>>>>          update the Indexes, as for many use-cases the details are not
>>>>>          important, and writers will very rarely update the Indexes
>>>>>          themselves.
>>>>>       - *Logical Placement of Indexes*
>>>>>          - Index as a child object of an Iceberg Table, or
>>>>>          - Index as a first‑class entity under
>>>>>          /namespace/indexes/{index}
>>>>>          - Based on the discussions at the summit, we are leaning in
>>>>>          this direction. This means the index id should be unique in the
>>>>>          namespace, but it helps the catalog implementations quite a bit.
>>>>>       - *Physical Placement of Index Data*
>>>>>          - I don’t think we should specify this. We should have a
>>>>>          base location for the index, but can rely on the catalog
>>>>>          implementations to decide on their own, like they do with the
>>>>>          tables, views, udfs.
>>>>>       - *Iceberg Reader Based indexes* (Containing indexes and
>>>>>       potentially PK indexes). These are the indexes which could be read
>>>>>       by the existing Iceberg readers. We might decide to store the PK
>>>>>       index similarly to an Iceberg Table and treat it as a reader based
>>>>>       index.
>>>>>          - What are the table properties/features exposed to the
>>>>>          readers?
>>>>>             - Maybe just some behavioral descriptors for the
>>>>>             optimizer to decide if the index could be used or should be
>>>>>             skipped, like:
>>>>>                - Expected max files for a single key
>>>>>                - Max files for a single key
>>>>>                - Deletes allowed/present
>>>>>                - Sorted by
>>>>>                - Partitioned by
>>>>>             - The Tasks when reading the index based on the filters
>>>>>             and projection
>>>>>          - What are the table properties/features exposed to the
>>>>>          Index Maintenance? I think this could be internal to the Index
>>>>>          Maintenance process and might not be exposed through the spec.
>>>>>          The Index Maintenance process could handle this as a standard
>>>>>          Iceberg Table and could be based on the Table Maintenance
>>>>>          process, but there might be some totally different processes.
>>>>>       - It should be possible to add properties to an index defined
>>>>>       by the Index Maintenance process which could be used and updated in
>>>>>       the next Index Maintenance run.
>>>>>    - *PK index storage format benchmark results*
>>>>>       - Flat Parquet (baseline)
>>>>>       - BTree with Parquet leaves
>>>>>       - Vortex
>>>>>    - *Open items / next steps*
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> huaxin gao <[email protected]> wrote (on Mon, Mar 23, 2026,
>>>>> 3:03):
>>>>>
>>>>>> Hi everyone, I wanted to share an update on the primary key index
>>>>>> work.
>>>>>> Since there are still open questions on whether bloom filter indexes
>>>>>> fit in the secondary index framework or should be treated as extended
>>>>>> stats, I've shifted focus to the primary key index since it's a clearer
>>>>>> fit for the framework.
>>>>>> I've put together a proposal for a primary key reverse-lookup index
>>>>>> that maps each key to its physical location (file_path, row_position). It
>>>>>> enables:
>>>>>>
>>>>>>    - Scan-time file pruning for point lookups
>>>>>>    - Converting key-based deletes into position deletes (eliminating
>>>>>>    equality deletes for Flink CDC)
>>>>>>    - Accelerating Spark MERGE INTO by replacing full-table joins
>>>>>>    with direct file lookups
>>>>>>
>>>>>> Proposal:
>>>>>> https://docs.google.com/document/d/1HuhCZ0n2FqDh8yqQb9oEj1CPM5yXpEsMPGZno2aSf8E/edit?tab=t.0#heading=h.tbevg4q0m9
>>>>>> Feedback welcome!
>>>>>> Thanks,
>>>>>> Huaxin
>>>>>>
>>>>>> On Wed, Mar 18, 2026 at 11:42 PM Péter Váry <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Key takeaways from the general index discussion at the March 16
>>>>>>> meeting.
>>>>>>> Thanks to everyone who participated! The recording is available
>>>>>>> here: https://www.youtube.com/watch?v=btmjhtRWUCE
>>>>>>>
>>>>>>>    - Q: Do we need to tie index types to the algorithms used to
>>>>>>>    access them?
>>>>>>>    - A: From a specification perspective, the goal is to define the
>>>>>>>    storage-level data layout so it can be shared across engines.
>>>>>>>    Engines are free to interpret and use the data as they see fit, but
>>>>>>>    the on-disk data layout itself must be strictly defined and
>>>>>>>    interoperable.
>>>>>>>
>>>>>>>    - Q: Should we introduce an additional abstraction layer (e.g.,
>>>>>>>    Vector Index) with sub-types such as IVF and DiskANN?
>>>>>>>    - A: This is possible if we decide it is beneficial. I explored
>>>>>>>    potential naming, but it is not yet clear how such a layer would be
>>>>>>>    used in practice.
>>>>>>>    *Question to Yingyi Bu*: could you provide examples where this
>>>>>>>    additional layer would be useful? Should this abstraction be defined
>>>>>>>    at the spec level, or is it better handled at the engine level?
>>>>>>>    My initial idea was that users would create a generic Vector
>>>>>>>    Index and let the engine choose the concrete implementation.
>>>>>>>    However, this would limit user control, and users likely need to
>>>>>>>    specify the exact index representation, which implies they must be
>>>>>>>    aware of the available representations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Q: Do we want to allow extensibility for index types?
>>>>>>>    - A: Yes. The intent is to support a small set of well-defined
>>>>>>>    index types while allowing experimentation with new ones. If a new
>>>>>>>    index type proves broadly useful, a follow-up proposal can
>>>>>>>    standardize it and incorporate it into the spec.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Q: Do we allow multiple versions of an index for the same
>>>>>>>    table snapshot?
>>>>>>>    - A: Yes. Older index versions must be retained for readers that
>>>>>>>    have already started using them, while new readers should
>>>>>>>    automatically use the latest available version.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Q: Do we need to use materialized views for these indexes?
>>>>>>>    - A: No. These indexes are primarily examples, and different
>>>>>>>    types may require different storage methods. However, the Primary
>>>>>>>    Key, Containing, and parts of the IVF indexes can be structured as
>>>>>>>    Iceberg tables. This allows engines to read them natively; in some
>>>>>>>    cases, Iceberg planners can automatically redirect queries to the
>>>>>>>    index table without engine modifications. Furthermore, index
>>>>>>>    maintenance for these tables can leverage existing materialized view
>>>>>>>    maintenance workflows. Other index types may instead rely on Puffin
>>>>>>>    files or alternative storage approaches.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Q: How should index metadata be accessed? Should we add
>>>>>>>    explicit pointers for the indexes in the table metadata?
>>>>>>>    - A: We did not have sufficient time to fully explore and
>>>>>>>    conclude this topic.
>>>>>>>    *Question for Yufei Gu*: Did I understand correctly that your
>>>>>>>    main concern stems from endpoint resolution from a REST Catalog
>>>>>>>    perspective? Specifically, if indexes are exposed under a URI such as
>>>>>>>    v1/{prefix}/namespaces/{namespace}/tables/{table}/indexes/{index},
>>>>>>>    would this make it more difficult for the REST Catalog to resolve and
>>>>>>>    route requests to the appropriate endpoint?
>>>>>>>
>>>>>>>
>>>>>>> Suhas Jayaram Subramanya via dev <[email protected]> wrote
>>>>>>> (on Fri, Mar 13, 2026, 23:32):
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Here's a proposal for native Vector Index support in Iceberg tables
>>>>>>>> --
>>>>>>>> https://docs.google.com/document/d/1KL4qLOwdqnhOcqTc0EjO1O16NV3M3c-gZCEINDWw4lA/edit?usp=sharing
>>>>>>>>
>>>>>>>> We've been working on this proposal with Peter internally at
>>>>>>>> Microsoft and he suggested we post it here to bring this to the
>>>>>>>> community's attention, ahead of the next Secondary Index Sync.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Suhas
>>>>>>>>
>>>>>>>> On 2026/02/19 04:34:34 huaxin gao wrote:
>>>>>>>> > Hi Everyone,
>>>>>>>> >
>>>>>>>> > Here are the recording and notes from the Iceberg Index Support
>>>>>>>> > Sync on 2/11.
>>>>>>>> >
>>>>>>>> > Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>>>>>> >
>>>>>>>> > Notes:
>>>>>>>> > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>>>>>> >
>>>>>>>> > The meeting will move to biweekly, Mondays 9–10am PST, starting
>>>>>>>> March 2.
>>>>>>>> >
>>>>>>>> > Since the sync, I updated the Bloom skipping index proposal
>>>>>>>> > <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>>>>>>> > to address the discussion questions, specifically:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > - Performance justification: when this helps (high-cardinality = / IN,
>>>>>>>> > many data files, high object-store latency) and how it differs from
>>>>>>>> > Parquet row-group Bloom filters (which still require opening the data
>>>>>>>> > file).
>>>>>>>> > - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
>>>>>>>> > file size), the planning cost trade-off (driver index reads vs executor
>>>>>>>> > file opens), and mitigations via caching.
>>>>>>>> > - Lifecycle / maintenance: incremental production as new data files
>>>>>>>> > arrive, behavior when the index is missing/behind, and
>>>>>>>> > sharding/compaction plus cleanup to avoid accumulating too many small
>>>>>>>> > Puffin files over time.
>>>>>>>> > - Writer expectations: inline (optional) vs asynchronous (primary)
>>>>>>>> > index creation.
>>>>>>>> >
>>>>>>>> > I also implemented a Spark 4.1 POC
>>>>>>>> > <https://github.com/apache/iceberg/pull/15311> and a local benchmark to
>>>>>>>> > quantify both the pruning impact (plannedFiles → afterBloom) and the
>>>>>>>> > index read overhead (statsFiles, statsBytes, bloomPayloadBytes) for
>>>>>>>> > point predicates on high-cardinality columns. Please take a look and
>>>>>>>> > let me know if you have any questions or feedback.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Huaxin
>>>>>>>> >
>>>>>>>> > On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > > Reminder for tomorrow's sync on Iceberg Index Support.
>>>>>>>> > >
>>>>>>>> > > Wednesday: Feb. 11 9:00 – 10:00am
>>>>>>>> > > Time zone: America/Los_Angeles
>>>>>>>> > > Google Meet joining info
>>>>>>>> > > Video call link: meet.google.com/nsp-ctyr-khk
>>>>>>>> > > Design doc:
>>>>>>>> > >
>>>>>>>> > > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>>>>> > >
>>>>>>>> > > https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>>>>> > >
>>>>>>>> > > Thanks,
>>>>>>>> > > Huaxin
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]>
>>>>>>>> > > wrote:
>>>>>>>> > >
>>>>>>>> > >> Thanks Huaxin and Steven for organizing this. Looking forward to
>>>>>>>> > >> meeting you all next week!
>>>>>>>> > >>
>>>>>>>> > >> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>>>> > >>
>>>>>>>> > >>> We set up the dev calendar event with a new Google Meet link.
>>>>>>>> > >>> Please ignore the link from Huaxin's original email.
>>>>>>>> > >>>
>>>>>>>> > >>> The dev calendar has the correct info (including the new meeting
>>>>>>>> > >>> link).
>>>>>>>> > >>>
>>>>>>>> > >>> Iceberg Index Support Sync
>>>>>>>> > >>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>>>> > >>> Time zone: America/Los_Angeles
>>>>>>>> > >>> Google Meet joining info
>>>>>>>> > >>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>>> > >>>
>>>>>>>> > >>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
>>>>>>>> > >>> wrote:
>>>>>>>> > >>>
>>>>>>>> > >>>> Sorry, I meant PST (not EST) :)
>>>>>>>> > >>>> Looking forward to the discussion!
>>>>>>>> > >>>>
>>>>>>>> > >>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
>>>>>>>> > >>>> wrote:
>>>>>>>> > >>>>
>>>>>>>> > >>>>> Hi Huaxin,
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> Thanks for starting the sync!
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>> > >>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>> > >>>>> not EST. Maybe it's a typo?
>>>>>>>> > >>>>> Otherwise, looking forward to the discussion!
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> Best,
>>>>>>>> > >>>>> Shawn
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
>>>>>>>> > >>>>> wrote:
>>>>>>>> > >>>>>
>>>>>>>> > >>>>>> Hi all,
>>>>>>>> > >>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
>>>>>>>> > >>>>>> support. Here is the existing discussion thread:
>>>>>>>> > >>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> To ground the discussion, here are the two proposals:
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> - Peter's proposal
>>>>>>>> > >>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>> > >>>>>> (overall index support)
>>>>>>>> > >>>>>> - My proposal
>>>>>>>> > >>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>> > >>>>>> (bloom filter skipping index)
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST,
>>>>>>>> > >>>>>> starting next Wednesday (2/11). After FileFormat sync finishes,
>>>>>>>> > >>>>>> we plan to use that slot and switch to every other Monday,
>>>>>>>> > >>>>>> 9 AM to 10 AM EST.
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> Thanks,
>>>>>>>> > >>>>>> Huaxin
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
