I fully agree with Anton and Steven that we need benchmarks before choosing any direction.

I ran some preliminary column-stitching benchmarks last summer:
- Results are available in the doc: https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww
- Code is here: https://github.com/apache/iceberg/pull/13306

I've summarized the most relevant results at the end of this email. They show roughly a 10% slowdown on the read path with column stitching in comparable scenarios when using local SSDs. I expect that in real deployments the metadata read cost will mostly be driven by blob I/O (assuming no caching). If blob access becomes the dominant factor in read latency, multithreaded fetching should be able to absorb the overhead introduced by column stitching, resulting in latency similar to the single-file layout (unless I/O is already the bottleneck).

We should definitely rerun the benchmarks once we have a clearer understanding of the intended usage patterns.

Thanks,
Peter

The relevant(ish) results are for 100 columns, split into 2 families of 50 columns each, read locally:

Baseline (single file, single-threaded):
MultiThreadedParquetBenchmark.read  100  0  false  ss  20  3.739 ± 0.096  s/op

Column stitching, single-threaded:
MultiThreadedParquetBenchmark.read  100  2  false  ss  20  4.036 ± 0.082  s/op

Column stitching, multithreaded:
MultiThreadedParquetBenchmark.read  100  2  true   ss  20  4.063 ± 0.080  s/op
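A minimal sketch of the multithreaded fetch pattern Peter describes, in which all column-family files for a piece of metadata are requested concurrently so that total latency approaches a single blob round trip. The BlobStore interface and class names are hypothetical, not Iceberg or Parquet APIs:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical sketch: fetch all column-family files of a manifest in
    // parallel so blob latency is paid roughly once, not once per family.
    public class ParallelFamilyFetch {

      interface BlobStore {
        byte[] read(String path); // blocking object-store read
      }

      static byte[][] fetchFamilies(BlobStore store, List<String> familyPaths) {
        ExecutorService pool = Executors.newFixedThreadPool(familyPaths.size());
        try {
          List<CompletableFuture<byte[]>> futures = familyPaths.stream()
              .map(path -> CompletableFuture.supplyAsync(() -> store.read(path), pool))
              .toList();
          // Wall time is ~max(fetch) instead of sum(fetch), which is why the
          // stitching overhead can hide behind blob I/O in real deployments.
          return futures.stream().map(CompletableFuture::join).toArray(byte[][]::new);
        } finally {
          pool.shutdown();
        }
      }
    }

Stitching the fetched families back into rows is then CPU work that can overlap with the I/O.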
On Tue, Feb 3, 2026 at 23:27 Steven Wu <[email protected]> wrote:

> I agree with Anton in this comment thread <https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o/edit?disco=AAAByzDx21w> that we probably need to run benchmarks for a few common scenarios to guide this decision. We need to write down detailed plans for those scenarios and what we are measuring. Ideally, we also want to measure using the V4 metadata structures (Parquet manifest files, column stats structs, the adaptive tree). There are PoC PRs available for column stats, Parquet manifests, and the root manifest; it would probably be tricky to piece them together to run the benchmark given their PoC status. We also need the column-stitching capability on the read path to test the column file approach.

> On Tue, Feb 3, 2026 at 1:53 PM Anoop Johnson <[email protected]> wrote:

>> I'm in favor of co-located DV metadata with a column file override, and not doing affiliated/unaffiliated delete manifests. This is conceptually similar to strictly affiliated delete manifests with positional joins, and it will halve the number of I/Os when there is no DV column override. It is simpler to implement and will speed up reads.

>> Unaffiliated DV manifests are flexible for writers. They reduce the chance of physical conflicts when concurrent large/random deletes change DVs on different files in the same manifest. But the flexibility comes at a read-time cost: if the number of unaffiliated DVs exceeds a threshold, it could cause driver OOMs or require a distributed join to pair up DVs with data files. With co-located metadata, manifest DVs can reduce the chance of conflicts up to a certain write size.

>> I assume we will still support unaffiliated manifests for equality deletes, but perhaps we can restrict them to equality deletes only.

>> -Anoop

>> On Mon, Feb 2, 2026 at 4:27 PM Anton Okolnychyi <[email protected]> wrote:

>>> I added the approach with column files to the doc.

>>> To sum up, separate data and delete manifests with affinity would perform roughly on par with co-located DV metadata (a.k.a. direct assignment) if we add support for column files for when we need to replace most or all DVs (use case 1). That said, support for direct assignment with inline metadata DVs can help us avoid unaffiliated delete manifests when we need to replace only a few DVs (use case 2).

>>> So the key question is whether we want to allow unaffiliated delete manifests with DVs. If we don't, then we would likely want co-located DV metadata, and we must support efficient column updates so that we don't regress compared to V2 and V3 for large MERGE jobs that modify a small set of records in most files.

>>> On Mon, Feb 2, 2026 at 13:20 Anton Okolnychyi <[email protected]> wrote:

>>>> Anoop, correct: if we keep data and delete manifests separate, there is a better way to combine the entries, and we should NOT rely on the referenced data file path. Reconciling by implicit position will reduce the size of the DV entry (no need to store the referenced data file path) and will improve planning performance (no equals/hashCode on the path).

>>>> Steven, I agree. Most notes in the doc pre-date the discussions we had on column updates. You are right: given that we are gravitating towards a native way to handle column updates, it seems logical to use the same approach for replacing DVs, since they're essentially column updates. Let me add one more approach to the doc based on what Anurag and Peter have so far.

>>>> On Sun, Feb 1, 2026 at 20:59 Steven Wu <[email protected]> wrote:

>>>>> Anton, thanks for raising this. I agree this deserves another look. I added a comment in your doc that we can potentially apply the column update proposal for data file updates to manifest file updates as well, to co-locate the data DVs and the data manifest files. Data DVs can be a separate column in the data manifest file and be updated separately in a column file. This is the same as the coalesced positional join that Anoop mentioned.

>>>>> On Sun, Feb 1, 2026 at 4:14 PM Anoop Johnson <[email protected]> wrote:

>>>>>> Thank you for raising this, Anton. I had a similar observation while prototyping <https://github.com/apache/iceberg/pull/14533> the adaptive metadata tree. The overhead of a path-based hash join of a data manifest with the affiliated delete manifest is high: my estimate was that the join adds about 5-10% overhead, and the hash table build/probe alone takes about 5 ms for manifests with 25K entries. There are engines that can do vectorized hash joins to lower this, but the overhead and complexity of a SIMD-friendly hash join are non-trivial.

>>>>>> An alternative to relying on the external file feature in Parquet is to make affiliated manifests order-preserving: i.e., DVs in an affiliated delete manifest must appear at the same position as the corresponding data file in the data manifest the delete manifest is affiliated with. If a data file does not have a DV, the DV manifest must store a NULL. This would allow us to do positional joins, which are much faster. If we wanted, we could even have multiple affiliated DV manifests for a data manifest, with the reader doing a COALESCE-style positional join (i.e., picking the first non-null value as the DV). This puts the sorting responsibility on the writers, but it might be a reasonable tradeoff.

>>>>>> Also, the options don't necessarily have to be mutually exclusive. We could still allow affiliated DVs to be "folded" into the data manifest (e.g., by background optimization jobs or the writer itself). That might be the optimal choice for read-heavy tables because it will halve the number of I/Os readers have to make.

>>>>>> Best,
>>>>>> Anoop
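A minimal sketch of the COALESCE-style positional join described above, assuming DV manifests that are position-aligned with the data manifest; the types are hypothetical, not the Iceberg manifest format:

    // Hypothetical sketch of an order-preserving, COALESCE-style positional
    // join: each affiliated DV manifest stores one slot per data manifest
    // entry, with null where a data file has no DV.
    public class PositionalDvJoin {

      record DeleteVector(byte[] serializedBitmap) {}

      // dvManifests are ordered newest-first; the first non-null DV wins.
      // No hash table and no path equals/hashCode, just array indexing.
      static DeleteVector[] coalesce(int dataEntryCount, DeleteVector[][] dvManifests) {
        DeleteVector[] resolved = new DeleteVector[dataEntryCount];
        for (int pos = 0; pos < dataEntryCount; pos++) {
          for (DeleteVector[] manifest : dvManifests) {
            if (manifest[pos] != null) {
              resolved[pos] = manifest[pos];
              break; // like SQL COALESCE: first non-null value
            }
          }
        }
        return resolved;
      }
    }

Compared to the path-based hash join, the per-entry cost is a constant-time array access, which is where the estimated 5-10% planning overhead would be recovered.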
>>>>>> On Fri, Jan 30, 2026 at 6:03 PM Anton Okolnychyi <[email protected]> wrote:

>>>>>>> I had a chance to catch up on some of the V4 discussions. Given that we are getting rid of the manifest list and switching to Parquet, I wanted to re-evaluate the possibility of direct DV assignment, which we discarded in V3 to avoid regressions. I have put together my thoughts in a doc [1].

>>>>>>> TL;DR:

>>>>>>> - I think the current V4 proposal, which keeps data and delete manifests separate but introduces affinity, is a solid choice for cases where we need to replace DVs in many or most files. I outlined an approach with column-split Parquet files, but it doesn't improve the performance and takes a dependency on a portion of the Parquet spec that is not really implemented.
>>>>>>> - Pushing unaffiliated DVs directly into the root to replace a small set of DVs is going to be fast on write, but it does require resolving where those DVs apply at read time. Using inline metadata DVs with column-split Parquet files is a little more promising in this case, as it allows us to avoid unaffiliated DVs. That said, it again relies on something Parquet doesn't implement right now, requires changing maintenance operations, and yields minimal benefits.

>>>>>>> All in all, the V4 proposal seems like a strict improvement over V3, but I insist that we reconsider the usage of the referenced data file path when resolving DVs to data files.

>>>>>>> [1] - https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o

>>>>>>> - Anton
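For contrast, a simplified illustration of the path-keyed resolution that Anton suggests reconsidering: every unaffiliated DV carries the path of the data file it applies to, and planning must build a hash index over those paths. The types are hypothetical; this is not the Iceberg implementation:

    import java.util.HashMap;
    import java.util.Map;

    // Simplified illustration of V3-style DV resolution by referenced path.
    public class PathBasedDvResolution {

      record DvEntry(String referencedDataFile, byte[] bitmap) {}

      static Map<String, DvEntry> index(Iterable<DvEntry> unaffiliatedDvs) {
        Map<String, DvEntry> byPath = new HashMap<>();
        for (DvEntry dv : unaffiliatedDvs) {
          // Hashing and comparing long path strings is exactly the
          // equals/hashCode cost called out earlier in the thread;
          // positional reconciliation avoids it entirely.
          byPath.put(dv.referencedDataFile(), dv);
        }
        return byPath;
      }
    }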
>>>>>>> On Sat, Nov 22, 2025 at 13:37 Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>> Hey all,

>>>>>>>> Here is the meeting recording <https://drive.google.com/file/d/1lG9sM-JTwqcIgk7JsAryXXCc1vMnstJs/view?usp=sharing> and the generated meeting summary <https://docs.google.com/document/d/1e50p8TXL2e3CnUwKMOvm8F4s2PeVMiKWHPxhxOW1fIM/edit?usp=sharing>. Thanks all for attending yesterday!

>>>>>>>> On Thu, Nov 20, 2025 at 8:49 AM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>> Hey folks,

>>>>>>>>> I was out for some time, but I have set up a sync for tomorrow at 9am PST. For this discussion, I think it would be great to focus on the manifest DV representation, factoring in the analyses of bitmap representation storage footprints, and on the entry structure, considering how we want to approach change detection. If there are other topics that people want to highlight, please do bring those up as well!

>>>>>>>>> I also recognize that this is fairly short-notice scheduling, so please do reach out to me if this time is difficult to work with; next week is the Thanksgiving holiday here, and since people will be travelling/out, I figured I'd try to schedule before then.

>>>>>>>>> Thanks,
>>>>>>>>> Amogh Jahagirdar

>>>>>>>>> On Fri, Oct 17, 2025 at 9:03 AM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>> Hey folks,

>>>>>>>>>> Sorry for the delay; here's the recording link <https://drive.google.com/file/d/1YOmPROXjAKYAWAcYxqAFHdADbqELVVf2/view> from last week's discussion.

>>>>>>>>>> Thanks,
>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>> On Fri, Oct 10, 2025 at 9:44 AM Péter Váry <[email protected]> wrote:

>>>>>>>>>>> Same here. Please record if you can.
>>>>>>>>>>> Thanks, Peter

>>>>>>>>>>> On Fri, Oct 10, 2025, 17:39 Fokko Driesprong <[email protected]> wrote:

>>>>>>>>>>>> Hey Amogh,

>>>>>>>>>>>> Thanks for the write-up. Unfortunately, I won't be able to attend. Will it be recorded? Thanks!

>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>> Fokko

>>>>>>>>>>>> On Tue, Oct 7, 2025 at 20:36 Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>> Hey all,

>>>>>>>>>>>>> I've set up time this Friday at 9am PST for another sync on single-file commits. In terms of what would be great to focus on for the discussion:

>>>>>>>>>>>>> 1. Whether or not it makes sense to eliminate the partition tuple and instead represent it via lower/upper bounds. As a reminder, one of the goals is to avoid tying a partition spec to a manifest; in the root we can have a mix of files spanning different partition specs, and even in leaf manifests avoiding this coupling can enable more desirable clustering of metadata. In the vast majority of cases, we could leverage the property that a file is effectively partitioned on a given field if its lower and upper bounds are equal (see the sketch after this message). The nuance here is the particular case of identity-partitioned string/binary columns, whose stats can be truncated. One approach is to require that writers must not produce truncated stats for identity-partitioned columns. It's also important to keep in mind that all of this is just for the purpose of reconstructing the partition tuple, which is only required during equality delete matching. Another area we need to cover as part of this is exact bounds in stats. There are other options here as well, such as making all new equality deletes in V4 global and matching based on bounds instead, or keeping the tuple but basing each tuple on a union schema of all partition specs. I am adding a separate appendix section outlining the span of options and the different tradeoffs. Once we get this to a more conclusive state, I'll move a summarized version to the main doc.

>>>>>>>>>>>>> 2. @[email protected] <[email protected]> has updated the doc with a section <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn> on how we can do change detection from the root in a variety of write scenarios. I've reviewed it, and it covers the cases I would expect. It would be good for folks to take a look and give feedback before we discuss. Thank you Steven for adding that section and all the diagrams.

>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Amogh Jahagirdar
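A sketch of the bound-equality check from point 1 above, with hypothetical stats types; the actual V4 stats encoding is still being designed:

    import java.nio.ByteBuffer;
    import java.util.Objects;

    // Hypothetical sketch: reconstruct an identity partition value from
    // column stats. A file is effectively partitioned on a field when its
    // lower and upper bounds are equal; truncated string/binary bounds
    // would break this, hence the proposed no-truncation rule.
    public class PartitionFromStats {

      record ColumnBounds(ByteBuffer lower, ByteBuffer upper, boolean truncated) {}

      static ByteBuffer inferIdentityPartitionValue(ColumnBounds bounds) {
        if (bounds.truncated()) {
          return null; // bounds are inexact; the value cannot be reconstructed
        }
        if (Objects.equals(bounds.lower(), bounds.upper())) {
          return bounds.lower(); // one value across the whole file
        }
        return null; // file spans multiple values for this field
      }
    }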
>>>>>>>>>>>>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>> Hey folks, just following up on last Friday's discussion with a summary and some next steps:

>>>>>>>>>>>>>> 1.) For the various change detection cases, we concluded it's best to go through those offline in the doc, since it's hard to verify all that correctness in a large meeting setting.
>>>>>>>>>>>>>> 2.) We mostly discussed eliminating the partition tuple. In the original proposal, I was mostly aiming for the ability to reconstruct the tuple from the stats for the purpose of equality delete matching (a file is partitioned if the lower and upper bounds are equal); there's some nuance in how we need to handle identity partition values, since for string/binary they cannot be truncated. Another potential option is to treat all equality deletes as effectively global and narrow their application based on the stats values. This may require defining tight bounds. I'm still collecting my thoughts on this one.

>>>>>>>>>>>>>> Thanks folks! Please also let me know if any of the following links are inaccessible for any reason.

>>>>>>>>>>>>>> Meeting recording link: https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view

>>>>>>>>>>>>>> Meeting summary: https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU

>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>> Update: I moved the discussion time to this Friday at 9 am PST, since I found out that quite a few folks involved in the proposals will be out next week, and I know some folks will also be out the week after that.

>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Amogh J

>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>> Hey folks, sorry for the late follow-up here.

>>>>>>>>>>>>>>>> Thanks @Kevin Liu <[email protected]> for sharing the recording link of the previous discussion! I've set up another sync for next Tuesday, 09/16, at 9am PST. This time I've set it up from my corporate email so we can get recordings and transcriptions (and I've made sure to keep the meeting invite open so we don't have to manually let people in).

>>>>>>>>>>>>>>>> In terms of next steps, the areas I think would be good to focus on for establishing consensus:

>>>>>>>>>>>>>>>> 1. How do we model the manifest entry structure so that changes to manifest DVs can be obtained easily from the root? There are a few options here; the most promising approach is to keep an additional DV which encodes the diff, i.e. the additional positions that have been removed from a leaf manifest (see the sketch after this message).

>>>>>>>>>>>>>>>> 2. Modeling partition transforms via expressions and establishing a unified table ID space, so that we can simplify how partition tuples may be represented via stats and also have a way in the future to store stats on any derived column. I have a short proposal <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0> for this that probably still needs some tightening up on the expression modeling itself (and some prototyping), but the general idea of establishing a unified table ID space is covered. All feedback welcome!

>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Amogh Jahagirdar
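An illustrative sketch of the diff-DV idea in point 1, using RoaringBitmap (the library Iceberg's DVs already build on); the spec-level encoding is exactly what is still being decided:

    import org.roaringbitmap.RoaringBitmap;

    // Illustrative only: keep a small bitmap of positions removed since the
    // previous snapshot instead of rewriting the full DV in place.
    public class DiffDv {

      // Newly removed positions = current deletions minus previous deletions.
      static RoaringBitmap diff(RoaringBitmap current, RoaringBitmap previous) {
        return RoaringBitmap.andNot(current, previous);
      }

      // A reader reconstructs the effective DV by OR-ing the diff onto the base.
      static RoaringBitmap apply(RoaringBitmap base, RoaringBitmap diff) {
        return RoaringBitmap.or(base, diff);
      }
    }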
>>>>>>>>>>>>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu <[email protected]> wrote:

>>>>>>>>>>>>>>>>> Thanks Amogh. Looks like the recording for last week's sync is available on YouTube. Here's the link: https://www.youtube.com/watch?v=uWm-p--8oVQ

>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Kevin Liu

>>>>>>>>>>>>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>> Hey folks,

>>>>>>>>>>>>>>>>>> Just following up on this to give the community a sense of where we're at and my proposed next steps.

>>>>>>>>>>>>>>>>>> I've been editing and merging the contents of our proposal into the proposal <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw> from Russell and others. For any future comments on docs, please comment on the linked proposal. I've also marked this on our doc in red text so it's clear that the other proposal is the source of truth for comments.

>>>>>>>>>>>>>>>>>> In terms of next steps:

>>>>>>>>>>>>>>>>>> 1. An important design decision point is around inline manifest DVs, external manifest DVs, or enabling both. I'm working on measuring different approaches for representing the compressed DVs, since that will inform how many entries can reasonably fit in a small root manifest; from that we can derive implications for different write patterns and determine the right approach for storing these manifest DVs (see the sketch after this message).

>>>>>>>>>>>>>>>>>> 2. Another key point is determining if/how we can reasonably enable V4 to represent changes in the root manifest, so that readers can effectively infer file-level changes from the root alone.

>>>>>>>>>>>>>>>>>> 3. One aspect of the proposal is getting away from the partition tuple requirement in the root, which currently forces an association between a partition spec and a manifest. These aspects can be modeled essentially as column stats, which gives a lot of flexibility in the organization of the manifest. There are important details around field ID spaces here, which tie into how the stats are structured. What we're proposing is a unified expression ID space that could also benefit us for storing things like virtual columns down the line. I go into this in the proposal, but I'm working on separating out the appropriate parts so that the original proposal can mostly just focus on the organization of the content metadata tree and not on how we want to solve this particular ID space problem.

>>>>>>>>>>>>>>>>>> 4. I'm planning on scheduling a recurring community sync starting next Tuesday at 9am PST, every 2 weeks. If I get feedback from folks that this time will never work, I can certainly adjust. For some reason, I don't have the ability to add to the Iceberg Dev calendar, so I'll figure that out and update the thread when the event is scheduled.

>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
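In the spirit of the measurement described in point 1, a quick, illustrative probe of serialized DV size versus delete density (not the project's benchmark code):

    import java.util.Random;
    import org.roaringbitmap.RoaringBitmap;

    // Rough probe: how large is a compressed DV at various delete densities?
    // The answer bounds how many inline DV entries fit in a small root manifest.
    public class DvSizeProbe {
      public static void main(String[] args) {
        int positions = 1_000_000; // assumed positions per manifest
        Random random = new Random(42);
        for (double density : new double[] {0.0001, 0.001, 0.01, 0.1}) {
          RoaringBitmap dv = new RoaringBitmap();
          for (int pos = 0; pos < positions; pos++) {
            if (random.nextDouble() < density) {
              dv.add(pos);
            }
          }
          dv.runOptimize(); // use run-length containers where they help
          System.out.printf("density=%.4f -> %d bytes%n", density, dv.serializedSizeInBytes());
        }
      }
    }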
>>>>>>>>>>>>>>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>> I think this is a great way forward; starting out with this much parallel development shows that we have a lot of consensus already :)

>>>>>>>>>>>>>>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>> Hey folks, just following up on this. It looks like our proposal and the proposal that @Russell Spitzer <[email protected]> shared are pretty well aligned. I was just chatting with Russell about this, and we think it'd be best to combine both proposals into a singular large effort. I can also set up a focused community discussion on this (similar to what we're doing on the other V4 proposals) starting sometime next week, just to get things moving, if that works for people.

>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>> Hey Russell,

>>>>>>>>>>>>>>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, Anoop, and I) have also been working on a proposal for an adaptive metadata tree structure as part of enabling more efficient one-file commits. From a read of the summary, it's great to see that we're thinking along the same lines about how to tackle this fundamental area!

>>>>>>>>>>>>>>>>>>>>> Here is our proposal: https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0

>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar

>>>>>>>>>>>>>>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>> Hey y'all!

>>>>>>>>>>>>>>>>>>>>>> We (Yi Fang, Steven Wu, and myself) wanted to share some of the thoughts we had on how one-file commits could work in Iceberg. This is pretty much just a high-level overview of the concepts we think we need and how Iceberg would behave. We haven't gone very far into the actual implementation and the changes that would need to occur in the SDK to make this happen.

>>>>>>>>>>>>>>>>>>>>>> The high-level summary is (modeled in the sketch after this message):

>>>>>>>>>>>>>>>>>>>>>> - Manifest lists are out
>>>>>>>>>>>>>>>>>>>>>> - Root manifests take their place
>>>>>>>>>>>>>>>>>>>>>> - A root manifest can have data manifests, delete manifests, manifest delete vectors, data delete vectors, and data files
>>>>>>>>>>>>>>>>>>>>>> - Manifest delete vectors allow for modifying a manifest without deleting it entirely
>>>>>>>>>>>>>>>>>>>>>> - Data files let you append without writing an intermediary manifest
>>>>>>>>>>>>>>>>>>>>>> - Having child data and delete manifests still lets you scale

>>>>>>>>>>>>>>>>>>>>>> Please take a look if you like: https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0

>>>>>>>>>>>>>>>>>>>>>> I'm excited to see what other proposals and ideas are floating around the community,
>>>>>>>>>>>>>>>>>>>>>> Russ
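A purely illustrative Java model of the contents Russell lists; the names and fields are hypothetical and are not the proposed schema:

    // Illustrative only: one root-manifest entry type that can point at a
    // child manifest, carry an inline data file, or soft-delete parts of an
    // existing manifest via a DV.
    public class RootManifestModel {

      enum ContentType {
        DATA_MANIFEST,   // child manifest listing data files
        DELETE_MANIFEST, // child manifest listing delete files
        MANIFEST_DV,     // soft-deletes entries of an existing manifest
        DATA_DV,         // delete vector applying to a single data file
        DATA_FILE        // appended directly, no intermediary manifest
      }

      record RootEntry(
          ContentType type,
          String path,      // the file this entry points at
          String appliesTo, // for *_DV entries: the target file; otherwise null
          long recordCount) {}
    }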
>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>> Very excited about the idea!

>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>> I'm very interested in this initiative. Micah Kornfield and I presented <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> on high-throughput ingestion for Iceberg tables at the 2024 Iceberg Summit, which leveraged Google infrastructure like Colossus for efficient appends.

>>>>>>>>>>>>>>>>>>>>>>>> This new proposal is particularly exciting because it offers significant advancements in commit latency and metadata storage footprint. Furthermore, a consistent manifest structure promises to simplify the design and codebase, which is a major benefit.

>>>>>>>>>>>>>>>>>>>>>>>> A related idea I've been exploring is having a loose affinity between data and delete manifests. While the current separation of data and delete manifests in Iceberg is valuable for avoiding data file rewrites (and stats updates) when deletes change, it does necessitate a join operation during reads. I'd be keen to discuss approaches that could potentially reduce this read-side cost while retaining the benefits of separate manifests.

>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>> Anoop
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,

>>>>>>>>>>>>>>>>>>>>>>>>> I am new to the Iceberg community but would love to participate in these discussions on reducing the number of file writes, especially for small writes/commits.

>>>>>>>>>>>>>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>>>>>>>>>>>>> -Jagdeep

>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>> We have been hitting all the metadata problems you mentioned, Ryan. I'm on board to help however I can to improve this area.

>>>>>>>>>>>>>>>>>>>>>>>>>> ~ Anurag Mantripragada

>>>>>>>>>>>>>>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>> I am interested in this idea and looking forward to collaborating.

>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> Huang-Hsiang

>>>>>>>>>>>>>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,

>>>>>>>>>>>>>>>>>>>>>>>>>> I am interested in contributing to this effort.

>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> Namratha

>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for kicking this thread off, Ryan. I'm interested in helping out here! I've been working on a proposal in this area, and it would be great to collaborate with different folks and exchange ideas, since I think a lot of people are interested in solving this problem.

>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue <[email protected]> wrote:

>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,

>>>>>>>>>>>>>>>>>>>>>>>>>>>> Like Russell's recent note, I'm starting a thread to connect those of us who are interested in the idea of changing Iceberg's metadata in v4 so that, in most cases, committing a change only requires writing one additional metadata file.

>>>>>>>>>>>>>>>>>>>>>>>>>>>> *Idea: One-file commits*

>>>>>>>>>>>>>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires writing at least one manifest and a new manifest list to produce a new snapshot. The goal of this work is to allow more flexibility by letting the manifest list layer store data and delete files. As a result, only one file write would be needed before committing the new snapshot. In addition, this work will also try to explore:

>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Avoiding small manifests that must be read in parallel and later compacted (metadata maintenance changes)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Extending metadata skipping to use aggregated column ranges that are compatible with geospatial data (manifest metadata)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Using soft deletes to avoid rewriting existing manifests (metadata DVs)

>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you're interested in these problems, please reply!

>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ryan

>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>> John Zhuge
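A schematic sketch of the one-file commit flow Ryan describes; every type here is hypothetical, since the concrete design is what this thread went on to work out:

    import java.util.List;

    // Illustrative only: a snapshot is produced by writing a single root
    // manifest that mixes reused child manifests, inline appended data
    // files, and carried-forward metadata DVs, then atomically committing it.
    public class OneFileCommit {

      record RootManifest(
          List<String> childManifests,  // untouched manifests, reused by path
          List<String> inlineDataFiles, // appended files, no new leaf manifest
          List<byte[]> manifestDvs) {}  // soft deletes against child manifests

      interface Catalog {
        void commitSnapshot(String rootManifestPath); // atomic root swap
      }

      static void append(Catalog catalog, RootManifest previous, List<String> newFiles) {
        RootManifest next = new RootManifest(
            previous.childManifests(),
            newFiles,
            previous.manifestDvs());
        String path = write(next); // the ONE new metadata file for this commit
        catalog.commitSnapshot(path);
      }

      static String write(RootManifest manifest) {
        // Stub: serialize `manifest` to a new root file and return its path.
        return "s3://bucket/metadata/root-" + System.nanoTime() + ".parquet";
      }
    }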
