Dan kindly set up a dedicated public Slack channel (*#indexes*) for the Secondary Index discussion.
You can find it here: https://apache-iceberg.slack.com/archives/C0AFDSU3EUU

Feel free to join if you’d like to participate in the discussion or simply follow along.

Thanks,
Peter

On Tue, Feb 24, 2026 at 12:52, Péter Váry <[email protected]> wrote:

> We had an extended discussion on Slack with Dan, Steven, and Yufei about
> where index metadata should live: in particular, whether it should be
> stored directly in the table metadata or maintained in a dedicated index
> catalog. I tried to capture this discussion in the Layout
> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4oz3yd6ngr3>
> section of the document.
>
> Once the decision is made, this section can be shortened, but for now it
> is intentionally more detailed so that everyone can see the arguments that
> were discussed and so that those who could not participate synchronously
> can still follow and provide feedback offline.
>
> In short, we are currently *leaning toward storing index metadata in its
> own catalog*, while allowing REST catalogs to expose a composite endpoint
> that returns both table and index metadata in a single round trip. This is
> similar in spirit to the universal load endpoint discussed in the context
> of materialized view loading.
>
> Thanks,
> Peter
>
> On Thu, Feb 19, 2026 at 14:06, Péter Váry <[email protected]> wrote:
>
>> Thanks Huaxin for posting the recording and the meeting notes.
>>
>> I used this time to also address the questions collected during the sync:
>>
>> - Collected some representative use cases. See the example use-cases
>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.i4gt8za99j9d>
>> paragraph. Anyone should feel free to suggest their own.
>> - Collected my thoughts about the writer requirements. See the writer
>> requirements
>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.4b1p8r8nmfg1>
>> paragraph.
>> - Centralized the index maintenance related parts. See the index
>> maintenance
>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.0#heading=h.hw2nt44i0k8q>
>> paragraph.
>>
>> It might be a bit premature, but I created a PR
>> <https://github.com/apache/iceberg/pull/15101> with the proposed index
>> catalog related changes, so that those who are more code-oriented can
>> take a look at it too.
>>
>> On Thu, Feb 19, 2026 at 5:34, huaxin gao <[email protected]> wrote:
>>
>>> Hi Everyone,
>>>
>>> Here are the recording and notes from the Iceberg Index Support Sync on
>>> 2/11.
>>>
>>> Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>
>>> Notes:
>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>
>>> The meeting will move to biweekly, Mondays 9–10am PST, starting March 2.
>>>
>>> Since the sync, I updated the Bloom skipping index proposal
>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>> to address the discussion questions, specifically:
>>>
>>> - Performance justification: when this helps (high-cardinality = /
>>> IN, many data files, high object-store latency) and how it differs from
>>> Parquet row-group Bloom filters (which still require opening the data
>>> file).
>>> - Cost / scalability: rough sizing (Bloom blob size per file, Puffin
>>> file size), the planning cost trade-off (driver index reads vs executor
>>> file opens), and mitigations via caching.
>>> - Lifecycle / maintenance: incremental production as new data files
>>> arrive, behavior when the index is missing/behind, and
>>> sharding/compaction plus cleanup to avoid accumulating too many small
>>> Puffin files over time.
>>> - Writer expectations: inline (optional) vs asynchronous (primary)
>>> index creation.
>>>
>>> I also implemented a Spark 4.1 POC
>>> <https://github.com/apache/iceberg/pull/15311> and a local benchmark to
>>> quantify both the pruning impact (plannedFiles → afterBloom) and the index
>>> read overhead (statsFiles, statsBytes, bloomPayloadBytes) for point
>>> predicates on high-cardinality columns. Please take a look and let me know
>>> if you have any questions or feedback.
>>>
>>> Thanks,
>>>
>>> Huaxin
>>>
>>> On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]>
>>> wrote:
>>>
>>>> Reminder for tomorrow's sync on Iceberg Index Support.
>>>>
>>>> Wednesday: Feb. 11 9:00 – 10:00am
>>>> Time zone: America/Los_Angeles
>>>> Google Meet joining info
>>>> Video call link: meet.google.com/nsp-ctyr-khk
>>>> Design doc:
>>>>
>>>> https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>
>>>> https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>
>>>> Thanks,
>>>> Huaxin
>>>>
>>>>
>>>> On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Huaxin and Steven for organizing this. Looking forward to
>>>>> meeting you all next week!
>>>>>
>>>>> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>
>>>>>> We set up the dev calendar event with a new Google Meet link. Please
>>>>>> ignore the link from Huaxin's original email.
>>>>>>
>>>>>> The dev calendar has the correct info (including the new meeting
>>>>>> link).
>>>>>>
>>>>>> Iceberg Index Support Sync
>>>>>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>> Time zone: America/Los_Angeles
>>>>>> Google Meet joining info
>>>>>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>
>>>>>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry, I meant PST (not EST) :)
>>>>>>> Looking forward to the discussion!
>>>>>>>
>>>>>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Huaxin,
>>>>>>>>
>>>>>>>> Thanks for starting the sync!
>>>>>>>>
>>>>>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>> not EST. Maybe it's a typo?
>>>>>>>> Otherwise, looking forward to the discussion!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
>>>>>>>>> support. Here is the existing discussion thread:
>>>>>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>>>
>>>>>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>>
>>>>>>>>> - Peter's proposal
>>>>>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>>> (overall index support)
>>>>>>>>> - My proposal
>>>>>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>>> (bloom filter skipping index)
>>>>>>>>>
>>>>>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST,
>>>>>>>>> starting next Wednesday (2/11). After FileFormat sync finishes, we
>>>>>>>>> plan to use that slot and switch to every other Monday, 9 AM to
>>>>>>>>> 10 AM EST.
>>>>>>>>>
>>>>>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Huaxin
>>>>>>>>>
