I'm in favor of co-located DV metadata with column file override and not doing affiliated/unaffiliated delete manifests. This is conceptually similar to strictly affiliated delete manifests with positional joins, and will halve the number of I/Os when there is no DV column override. It is simpler to implement and will speed up reads.
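The positional pairing discussed in this thread (and the COALESCED positional join variant Anoop describes later) can be sketched roughly as follows. This is an illustrative Python sketch only, using plain dicts and lists as stand-ins for manifest entries and DV columns; none of these names or structures come from Iceberg itself.

```python
# Illustrative sketch only (not Iceberg library code): three ways to pair
# DV entries with data-file entries. All names and structures here are
# hypothetical, chosen to show the idea rather than any real manifest format.
from typing import Optional, Sequence


def path_hash_join(data_entries: Sequence[dict],
                   dv_entries: Sequence[dict]) -> list:
    """Path-based join: build a hash table keyed on the referenced data
    file path, then probe it once per data entry. This is the read-time
    cost the thread wants to avoid (path storage plus equals/hashCode)."""
    by_path = {dv["referenced_path"]: dv for dv in dv_entries}
    return [(entry, by_path.get(entry["path"])) for entry in data_entries]


def positional_join(data_entries: Sequence[dict],
                    dv_column: Sequence[Optional[dict]]) -> list:
    """Order-preserving join: the DV at position i applies to data entry i,
    and None marks 'no DV'. No referenced path is stored and no hash table
    is built."""
    assert len(data_entries) == len(dv_column)
    return list(zip(data_entries, dv_column))


def coalesced_positional_join(data_entries: Sequence[dict],
                              dv_columns: Sequence[Sequence[Optional[dict]]]) -> list:
    """COALESCE across multiple affiliated DV manifests (ordered newest
    first): the first non-null DV at each position wins."""
    return [
        (entry, next((col[i] for col in dv_columns if col[i] is not None), None))
        for i, entry in enumerate(data_entries)
    ]
```

The positional variants never store or hash the referenced data file path; the order-preserving invariant (DV i applies to data entry i, with None where there is no DV) is what makes the zip/COALESCE pairing valid, and with DV columns ordered newest-first a replaced DV simply shadows the older one at the same position.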
Unaffiliated DV manifests are flexible for writers. They reduce the chance of physical conflicts when there are concurrent large/random deletes that change DVs on different files in the same manifest. But the flexibility comes at a read-time cost. If the number of unaffiliated DVs exceeds a threshold, it could cause driver OOMs or require a distributed join to pair up DVs with data files. With co-located metadata, manifest DVs can reduce the chance of conflicts up to a certain write size. I assume we will still support unaffiliated manifests for equality deletes, but perhaps we can restrict them to equality deletes only. -Anoop On Mon, Feb 2, 2026 at 4:27 PM Anton Okolnychyi <[email protected]> wrote: > I added the approach with column files to the doc. > > To sum up, separate data and delete manifests with affinity > would perform roughly on par with co-located DV metadata (a.k.a. direct > assignment) if we add support for column files when we need to replace most > or all DVs (use case 1). That said, the support for direct assignment with > in-line metadata DVs can help us avoid unaffiliated delete manifests when > we need to replace a few DVs (use case 2). > > So the key question is whether we want to allow unaffiliated delete > manifests with DVs... If we don't, then we would likely want to have > co-located DV metadata and must support efficient column updates so as not to > regress compared to V2 and V3 for large MERGE jobs that modify a small set > of records for most files. > > On Mon, Feb 2, 2026 at 1:20 PM Anton Okolnychyi <[email protected]> wrote: > >> Anoop, correct, if we keep data and delete manifests separate, there is a >> better way to combine the entries and we should NOT rely on the referenced >> data file path. Reconciling by implicit position will reduce the size of >> the DV entry (no need to store the referenced data file path) and will >> improve the planning performance (no equals/hashCode on the path). >> >> Steven, I agree. 
Most notes in the doc pre-date discussions we had on >> column updates. You are right, given that we are gravitating towards a >> native way to handle column updates, it seems logical to use the same >> approach for replacing DVs, since they’re essentially column updates. Let >> me add one more approach to the doc based on what Anurag and Peter have so >> far. >> >> On Sun, Feb 1, 2026 at 8:59 PM Steven Wu <[email protected]> wrote: >> >>> Anton, thanks for raising this. I agree this deserves another look. I >>> added a comment in your doc that we can potentially apply the column update >>> proposal for data file updates to manifest file updates as well, to >>> colocate the data DV and data manifest files. Data DVs can be a >>> separate column in the data manifest file and updated separately in a >>> column file. This is the same as the coalesced positional join that Anoop >>> mentioned. >>> >>> On Sun, Feb 1, 2026 at 4:14 PM Anoop Johnson <[email protected]> wrote: >>> >>>> Thank you for raising this, Anton. I had a similar observation while >>>> prototyping <https://github.com/apache/iceberg/pull/14533> the >>>> adaptive metadata tree. The overhead of doing a path-based hash join of a >>>> data manifest with the affiliated delete manifest is high: my estimate was >>>> that the join adds about 5-10% overhead. The hash table build/probe alone >>>> takes about 5 ms for manifests with 25K entries. There are engines with >>>> vectorized hash joins that can lower this, but the overhead and >>>> complexity of a SIMD-friendly hash join is non-trivial. >>>> >>>> An alternative to relying on the external file feature in Parquet is >>>> to make affiliated manifests order-preserving: i.e., DVs in an affiliated >>>> delete manifest must appear in the same position as the corresponding data >>>> file in the data manifest the delete manifest is affiliated with. If a data >>>> file does not have a DV, the DV manifest must store a NULL. 
This would >>>> allow us to do positional joins, which are much faster. If we wanted, we >>>> could even have multiple affiliated DV manifests for a data manifest and >>>> the reader would do a COALESCED positional join (i.e. pick the first >>>> non-null value as the DV). It puts the sorting responsibility on the >>>> writers, but it might be a reasonable tradeoff. >>>> >>>> Also, the options don't necessarily have to be mutually exclusive. We >>>> could still allow affiliated DVs to be "folded" into the data manifest (e.g. by >>>> background optimization jobs or the writer itself). That might be the >>>> optimal choice for read-heavy tables because it will halve the number of >>>> I/Os readers have to make. >>>> >>>> Best, >>>> Anoop >>>> >>>> >>>> On Fri, Jan 30, 2026 at 6:03 PM Anton Okolnychyi <[email protected]> >>>> wrote: >>>> >>>>> I had a chance to catch up on some of the V4 discussions. Given that >>>>> we are getting rid of the manifest list and switching to Parquet, I wanted >>>>> to re-evaluate the possibility of direct DV assignment that we discarded >>>>> in >>>>> V3 to avoid regressions. I have put together my thoughts in a doc [1]. >>>>> >>>>> TL;DR: >>>>> >>>>> - I think the current V4 proposal that keeps data and delete manifests >>>>> separate but introduces affinity is a solid choice for cases when we need >>>>> to replace DVs in many or most files. I outlined an approach with >>>>> column-split Parquet files, but it doesn't improve performance and >>>>> takes >>>>> a dependency on a portion of the Parquet spec that is not really >>>>> implemented. >>>>> - Pushing unaffiliated DVs directly into the root to replace a small >>>>> set of DVs is going to be fast on write but does require resolving where >>>>> those DVs apply at read time. Using inline metadata DVs with column-split >>>>> Parquet files is a little more promising in this case as it allows us to >>>>> avoid >>>>> unaffiliated DVs. 
That said, it again relies on something Parquet doesn't >>>>> implement right now, requires changing maintenance operations, and yields >>>>> minimal benefits. >>>>> >>>>> All in all, the V4 proposal seems like a strict improvement over V3, >>>>> but I insist that we reconsider the usage of the referenced data file path >>>>> when >>>>> resolving DVs to data files. >>>>> >>>>> [1] - >>>>> https://docs.google.com/document/d/1jZy4g6UDi3hdblpkSzDnqgzgATFKFoMaHmt4nNH8M7o >>>>> >>>>> - Anton >>>>> >>>>> On Sat, Nov 22, 2025 at 1:37 PM Amogh Jahagirdar <[email protected]> wrote: >>>>> >>>>>> Hey all, >>>>>> >>>>>> Here is the meeting recording >>>>>> <https://drive.google.com/file/d/1lG9sM-JTwqcIgk7JsAryXXCc1vMnstJs/view?usp=sharing> >>>>>> and generated meeting summary >>>>>> <https://docs.google.com/document/d/1e50p8TXL2e3CnUwKMOvm8F4s2PeVMiKWHPxhxOW1fIM/edit?usp=sharing>. >>>>>> Thanks all for attending yesterday! >>>>>> >>>>>> On Thu, Nov 20, 2025 at 8:49 AM Amogh Jahagirdar <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hey folks, >>>>>>> >>>>>>> I was out for some time, but set up a sync for tomorrow at 9am PST. >>>>>>> For this discussion, I do think it would be great to focus on the >>>>>>> manifest >>>>>>> DV representation, factoring in analyses on bitmap representation >>>>>>> storage >>>>>>> footprints, and the entry structure considering how we want to approach >>>>>>> change detection. If there are other topics that people want to >>>>>>> highlight, >>>>>>> please do bring those up as well! >>>>>>> >>>>>>> I also recognize that the scheduling is a bit short notice, so please >>>>>>> do reach out to me if this time is difficult to work with; next week is >>>>>>> the >>>>>>> Thanksgiving holidays here, and since people will be travelling/out I >>>>>>> figured I'd try to schedule before then. 
>>>>>>> >>>>>>> Thanks, >>>>>>> Amogh Jahagirdar >>>>>>> >>>>>>> >>>>>>> On Fri, Oct 17, 2025 at 9:03 AM Amogh Jahagirdar <[email protected]> >>>>>>> wrote: >>>>>>>> Hey folks, >>>>>>>> >>>>>>>> Sorry for the delay, here's the recording link >>>>>>>> <https://drive.google.com/file/d/1YOmPROXjAKYAWAcYxqAFHdADbqELVVf2/view> >>>>>>>> from >>>>>>>> last week's discussion. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Amogh Jahagirdar >>>>>>>> >>>>>>>> On Fri, Oct 10, 2025 at 9:44 AM Péter Váry < >>>>>>>> [email protected]> wrote: >>>>>>>>> Same here. >>>>>>>>> Please record if you can. >>>>>>>>> Thanks, Peter >>>>>>>>> >>>>>>>>> On Fri, Oct 10, 2025, 17:39 Fokko Driesprong <[email protected]> >>>>>>>>> wrote: >>>>>>>>>> Hey Amogh, >>>>>>>>>> >>>>>>>>>> Thanks for the write-up. Unfortunately, I won’t be able to >>>>>>>>>> attend. Will it be recorded? Thanks! >>>>>>>>>> >>>>>>>>>> Kind regards, >>>>>>>>>> Fokko >>>>>>>>>> >>>>>>>>>> On Tue, Oct 7, 2025 at 8:36 PM Amogh Jahagirdar < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Hey all, >>>>>>>>>>> >>>>>>>>>>> I've set up time this Friday at 9am PST for another sync on >>>>>>>>>>> single file commits. In terms of what would be great to focus on >>>>>>>>>>> for the >>>>>>>>>>> discussion: >>>>>>>>>>> >>>>>>>>>>> 1. Whether it makes sense or not to eliminate the tuple, and >>>>>>>>>>> instead represent the tuple via lower/upper boundaries. As a >>>>>>>>>>> reminder, >>>>>>>>>>> one of the goals is to avoid tying a partition spec to a manifest; >>>>>>>>>>> in the >>>>>>>>>>> root we can have a mix of files spanning different partition specs, >>>>>>>>>>> and >>>>>>>>>>> even in leaf manifests avoiding this coupling can enable more >>>>>>>>>>> desirable clustering of metadata. >>>>>>>>>>> In the vast majority of cases, we could leverage the property >>>>>>>>>>> that a file is effectively partitioned if the lower/upper for a >>>>>>>>>>> given field >>>>>>>>>>> is equal. 
The nuance here is with the particular case of >>>>>>>>>>> identity partitioned string/binary columns which can be truncated >>>>>>>>>>> in stats. >>>>>>>>>>> One approach is to require that writers must not produce truncated >>>>>>>>>>> stats >>>>>>>>>>> for identity partitioned columns. It's also important to keep in >>>>>>>>>>> mind that >>>>>>>>>>> all of this is just for the purpose of reconstructing the partition >>>>>>>>>>> tuple, >>>>>>>>>>> which is only required during equality delete matching. Another >>>>>>>>>>> area we >>>>>>>>>>> need to cover as part of this is exact bounds on stats. There >>>>>>>>>>> are other >>>>>>>>>>> options here as well, such as making all new equality deletes in V4 >>>>>>>>>>> global and instead match based on bounds, or keeping the tuple but >>>>>>>>>>> each >>>>>>>>>>> tuple is effectively based on a union schema of all partition >>>>>>>>>>> specs. I am >>>>>>>>>>> adding a separate appendix section outlining the span of options >>>>>>>>>>> here and >>>>>>>>>>> the different tradeoffs. >>>>>>>>>>> Once we get this to a more conclusive state, I'll move a >>>>>>>>>>> summarized version to the main doc. >>>>>>>>>>> >>>>>>>>>>> 2. @[email protected] <[email protected]> has updated the >>>>>>>>>>> doc with a section >>>>>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.rrpksmp8zkb#heading=h.qau0y5xkh9mn> >>>>>>>>>>> on >>>>>>>>>>> how we can do change detection from the root in a variety of write >>>>>>>>>>> scenarios. I've done a review on it, and it covers the cases I would >>>>>>>>>>> expect. It'd be good for folks to take a look and please give >>>>>>>>>>> feedback >>>>>>>>>>> before we discuss. Thank you Steven for adding that section and all >>>>>>>>>>> the >>>>>>>>>>> diagrams. 
>>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>> >>>>>>>>>>> On Thu, Sep 18, 2025 at 3:19 PM Amogh Jahagirdar < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hey folks, just following up from the discussion last Friday >>>>>>>>>>>> with a summary and some next steps: >>>>>>>>>>>> >>>>>>>>>>>> 1.) For the various change detection cases, we concluded it's >>>>>>>>>>>> best just to go through those in an offline manner on the doc >>>>>>>>>>>> since it's >>>>>>>>>>>> hard to verify all that correctness in a large meeting setting. >>>>>>>>>>>> 2.) We mostly discussed eliminating the partition tuple. On the >>>>>>>>>>>> original proposal, I was mostly aiming for the ability to >>>>>>>>>>>> reconstruct >>>>>>>>>>>> the tuple from the stats for the purpose of equality delete >>>>>>>>>>>> matching (a >>>>>>>>>>>> file is partitioned if the lower and upper bounds are equal); >>>>>>>>>>>> there's some >>>>>>>>>>>> nuance in how we need to handle identity partition values since for >>>>>>>>>>>> string/binary they cannot be truncated. Another potential option >>>>>>>>>>>> is to >>>>>>>>>>>> treat all equality deletes as effectively global and narrow their >>>>>>>>>>>> application based on the stats values. This may require defining >>>>>>>>>>>> tight >>>>>>>>>>>> bounds. I'm still collecting my thoughts on this one. >>>>>>>>>>>> >>>>>>>>>>>> Thanks folks! Please also let me know if any of the following >>>>>>>>>>>> links are inaccessible for any reason. 
>>>>>>>>>>>> >>>>>>>>>>>> Meeting recording link: >>>>>>>>>>>> https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view >>>>>>>>>>>> >>>>>>>>>>>> Meeting summary: >>>>>>>>>>>> https://docs.google.com/document/d/131N0CDpzZczURxitN0HGS7dTqRxQT_YS9jMECkGGvQU >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Sep 8, 2025 at 3:40 PM Amogh Jahagirdar < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Update: I moved the discussion time to this Friday at 9 am PST >>>>>>>>>>>>> since I found out that quite a few folks involved in the >>>>>>>>>>>>> proposals will be >>>>>>>>>>>>> out next week, and I know some folks will be out the >>>>>>>>>>>>> week after >>>>>>>>>>>>> that. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Amogh J >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:57 AM Amogh Jahagirdar < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hey folks, sorry for the late follow-up here, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks @Kevin Liu <[email protected]> for sharing the >>>>>>>>>>>>>> recording link of the previous discussion! I've set up another >>>>>>>>>>>>>> sync for >>>>>>>>>>>>>> next Tuesday 09/16 at 9am PST. This time I've set it up from my >>>>>>>>>>>>>> corporate >>>>>>>>>>>>>> email so we can get recordings and transcriptions (and I've made >>>>>>>>>>>>>> sure to >>>>>>>>>>>>>> keep the meeting invite open so we don't have to manually let >>>>>>>>>>>>>> people in). >>>>>>>>>>>>>> >>>>>>>>>>>>>> In terms of next steps, here are the areas I think would be good >>>>>>>>>>>>>> to focus on for establishing consensus: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1. How do we model the manifest entry structure so that >>>>>>>>>>>>>> changes to manifest DVs can be obtained easily from the root? 
There are a >>>>>>>>>>>>>> few options here; the most promising approach is to keep an >>>>>>>>>>>>>> additional DV >>>>>>>>>>>>>> that encodes the diff, i.e., the additional positions that have been >>>>>>>>>>>>>> removed from >>>>>>>>>>>>>> a leaf manifest. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2. Modeling partition transforms via expressions and >>>>>>>>>>>>>> establishing a unified table ID space so that we can simplify >>>>>>>>>>>>>> how partition >>>>>>>>>>>>>> tuples may be represented via stats and also have a way in the >>>>>>>>>>>>>> future to >>>>>>>>>>>>>> store stats on any derived column. I have a short proposal >>>>>>>>>>>>>> <https://docs.google.com/document/d/1oV8dapKVzB4pZy5pKHUCj5j9i2_1p37BJSeT7hyKPpg/edit?tab=t.0> >>>>>>>>>>>>>> for >>>>>>>>>>>>>> this that probably still needs some tightening up on the >>>>>>>>>>>>>> expression >>>>>>>>>>>>>> modeling itself (and some prototyping), but the general idea for >>>>>>>>>>>>>> establishing a unified table ID space is covered. All feedback >>>>>>>>>>>>>> welcome! >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Aug 25, 2025 at 1:34 PM Kevin Liu < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks Amogh. Looks like the recording for last week's sync >>>>>>>>>>>>>>> is available on YouTube. Here's the link: >>>>>>>>>>>>>>> https://www.youtube.com/watch?v=uWm-p--8oVQ >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> Kevin Liu >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Aug 12, 2025 at 9:10 PM Amogh Jahagirdar < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Just following up on this to give the community an update on where >>>>>>>>>>>>>>>> we're at and my proposed next steps. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I've been editing and merging the contents from our >>>>>>>>>>>>>>>> proposal into the proposal >>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw> >>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>> Russell and others. For any future comments on docs, please >>>>>>>>>>>>>>>> comment on the >>>>>>>>>>>>>>>> linked proposal. I've also marked it on our doc in red text so >>>>>>>>>>>>>>>> it's clear >>>>>>>>>>>>>>>> that the other proposal is the source of truth for >>>>>>>>>>>>>>>> comments. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In terms of next steps, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 1. An important design decision is whether to use inline >>>>>>>>>>>>>>>> manifest DVs, external manifest DVs, or both. I'm >>>>>>>>>>>>>>>> working on >>>>>>>>>>>>>>>> measuring different approaches for the compressed >>>>>>>>>>>>>>>> DV >>>>>>>>>>>>>>>> representation since that will inform how many entries can >>>>>>>>>>>>>>>> reasonably fit >>>>>>>>>>>>>>>> in a small root manifest; from that we can derive implications >>>>>>>>>>>>>>>> on different >>>>>>>>>>>>>>>> write patterns and determine the right approach for storing >>>>>>>>>>>>>>>> these manifest >>>>>>>>>>>>>>>> DVs. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2. Another key point is around determining if/how we can >>>>>>>>>>>>>>>> reasonably enable V4 to represent changes in the root manifest >>>>>>>>>>>>>>>> so that >>>>>>>>>>>>>>>> readers can effectively just infer file-level changes from the >>>>>>>>>>>>>>>> root. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 3. One of the aspects of the proposal is getting away from >>>>>>>>>>>>>>>> the partition tuple requirement in the root, which currently forces >>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>> association between a partition spec and a manifest. 
These >>>>>>>>>>>>>>>> aspects can >>>>>>>>>>>>>>>> essentially be modeled as column stats, which gives a lot of >>>>>>>>>>>>>>>> flexibility in >>>>>>>>>>>>>>>> how the manifest is organized. There are important details >>>>>>>>>>>>>>>> around field >>>>>>>>>>>>>>>> ID spaces here which tie into how the stats are structured. >>>>>>>>>>>>>>>> What we're >>>>>>>>>>>>>>>> proposing here is to have a unified expression ID space that >>>>>>>>>>>>>>>> could also >>>>>>>>>>>>>>>> benefit us for storing things like virtual columns down the >>>>>>>>>>>>>>>> line. I go into >>>>>>>>>>>>>>>> this in the proposal, but I'm working on separating the >>>>>>>>>>>>>>>> appropriate parts so >>>>>>>>>>>>>>>> that the original proposal can mostly just focus on the >>>>>>>>>>>>>>>> organization of the >>>>>>>>>>>>>>>> content metadata tree and not how we want to solve this >>>>>>>>>>>>>>>> particular ID space >>>>>>>>>>>>>>>> problem. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 4. I'm planning on scheduling a recurring community sync >>>>>>>>>>>>>>>> starting next Tuesday at 9am PST, every 2 weeks. If I get >>>>>>>>>>>>>>>> feedback from >>>>>>>>>>>>>>>> folks that this time will never work, I can certainly adjust. >>>>>>>>>>>>>>>> For some >>>>>>>>>>>>>>>> reason, I don't have the ability to add to the Iceberg Dev >>>>>>>>>>>>>>>> calendar, so >>>>>>>>>>>>>>>> I'll figure that out and update the thread when the event is >>>>>>>>>>>>>>>> scheduled. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Jul 22, 2025 at 11:47 AM Russell Spitzer < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I think this is a great way forward, starting out with >>>>>>>>>>>>>>>>> this much parallel development shows that we have a lot of >>>>>>>>>>>>>>>>> consensus >>>>>>>>>>>>>>>>> already :) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Jul 22, 2025 at 12:42 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hey folks, just following up on this. It looks like our >>>>>>>>>>>>>>>>>> proposal and the proposal that @Russell Spitzer >>>>>>>>>>>>>>>>>> <[email protected]> shared are pretty aligned. I >>>>>>>>>>>>>>>>>> was just chatting with Russell about this, and we think it'd >>>>>>>>>>>>>>>>>> be best to >>>>>>>>>>>>>>>>>> combine both proposals and have a singular large effort on >>>>>>>>>>>>>>>>>> this. I can also >>>>>>>>>>>>>>>>>> set up a focused community discussion (similar to what we're >>>>>>>>>>>>>>>>>> doing on the >>>>>>>>>>>>>>>>>> other V4 proposals) on this starting sometime next week just >>>>>>>>>>>>>>>>>> to get things >>>>>>>>>>>>>>>>>> moving, if that works for people. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Mon, Jul 14, 2025 at 9:48 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hey Russell, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks for sharing the proposal! A few of us (Ryan, Dan, >>>>>>>>>>>>>>>>>>> Anoop and I) have also been working on a proposal for an >>>>>>>>>>>>>>>>>>> adaptive metadata >>>>>>>>>>>>>>>>>>> tree structure as part of enabling more efficient one file >>>>>>>>>>>>>>>>>>> commits. 
From a >>>>>>>>>>>>>>>>>>> read of the summary, it's great to see that we're thinking >>>>>>>>>>>>>>>>>>> along the same >>>>>>>>>>>>>>>>>>> lines about how to tackle this fundamental area! >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Here is our proposal: >>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1q2asTpq471pltOTC6AsTLQIQcgEsh0AvEhRWnCcvZn0 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Mon, Jul 14, 2025 at 8:08 PM Russell Spitzer < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hey y'all! >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> We (Yi Fang, Steven Wu and myself) wanted to share some >>>>>>>>>>>>>>>>>>>> of the thoughts we had on how one-file commits could >>>>>>>>>>>>>>>>>>>> work in Iceberg. This is pretty >>>>>>>>>>>>>>>>>>>> much just a high-level overview of the concepts we >>>>>>>>>>>>>>>>>>>> think we need and how Iceberg would behave. >>>>>>>>>>>>>>>>>>>> We haven't gone very far into the actual implementation >>>>>>>>>>>>>>>>>>>> and changes that would need to occur in the >>>>>>>>>>>>>>>>>>>> SDK to make this happen. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The high-level summary is: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Manifest Lists are out >>>>>>>>>>>>>>>>>>>> Root Manifests take their place >>>>>>>>>>>>>>>>>>>> A Root manifest can have data manifests, delete >>>>>>>>>>>>>>>>>>>> manifests, manifest delete vectors, data delete vectors >>>>>>>>>>>>>>>>>>>> and data files >>>>>>>>>>>>>>>>>>>> Manifest delete vectors allow for modifying a >>>>>>>>>>>>>>>>>>>> manifest without deleting it entirely >>>>>>>>>>>>>>>>>>>> Data files let you append without writing an >>>>>>>>>>>>>>>>>>>> intermediary manifest >>>>>>>>>>>>>>>>>>>> Having child data and delete manifests lets you still >>>>>>>>>>>>>>>>>>>> scale >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Please take a look if you like, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I'm excited to see what other proposals and ideas are >>>>>>>>>>>>>>>>>>>> floating around the community, >>>>>>>>>>>>>>>>>>>> Russ >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 6:29 PM John Zhuge < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Very excited about the idea! >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Wed, Jul 2, 2025 at 1:17 PM Anoop Johnson < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I'm very interested in this initiative. Micah >>>>>>>>>>>>>>>>>>>>>> Kornfield and I presented >>>>>>>>>>>>>>>>>>>>>> <https://youtu.be/4d4nqKkANdM?si=9TXgaUIXbq-l8idi&t=1405> >>>>>>>>>>>>>>>>>>>>>> on high-throughput ingestion for Iceberg tables at the >>>>>>>>>>>>>>>>>>>>>> 2024 Iceberg Summit, >>>>>>>>>>>>>>>>>>>>>> which leveraged Google infrastructure like Colossus for >>>>>>>>>>>>>>>>>>>>>> efficient appends. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> This new proposal is particularly exciting because it >>>>>>>>>>>>>>>>>>>>>> offers significant advancements in commit latency and >>>>>>>>>>>>>>>>>>>>>> metadata storage >>>>>>>>>>>>>>>>>>>>>> footprint. Furthermore, a consistent manifest structure >>>>>>>>>>>>>>>>>>>>>> promises to >>>>>>>>>>>>>>>>>>>>>> simplify the design and codebase, which is a major >>>>>>>>>>>>>>>>>>>>>> benefit. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> A related idea I've been exploring is having a loose >>>>>>>>>>>>>>>>>>>>>> affinity between data and delete manifests. While the >>>>>>>>>>>>>>>>>>>>>> current separation of >>>>>>>>>>>>>>>>>>>>>> data and delete manifests in Iceberg is valuable for >>>>>>>>>>>>>>>>>>>>>> avoiding data file >>>>>>>>>>>>>>>>>>>>>> rewrites (and stats updates) when deletes change, it >>>>>>>>>>>>>>>>>>>>>> does necessitate a >>>>>>>>>>>>>>>>>>>>>> join operation during reads. I'd be keen to discuss >>>>>>>>>>>>>>>>>>>>>> approaches that could >>>>>>>>>>>>>>>>>>>>>> potentially reduce this read-side cost while retaining >>>>>>>>>>>>>>>>>>>>>> the benefits of >>>>>>>>>>>>>>>>>>>>>> separate manifests. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>> Anoop >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 13, 2025 at 11:06 AM Jagdeep Sidhu < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I am new to the Iceberg community but would love to >>>>>>>>>>>>>>>>>>>>>>> participate in these discussions to reduce the number >>>>>>>>>>>>>>>>>>>>>>> of file writes, >>>>>>>>>>>>>>>>>>>>>>> especially for small writes/commits. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thank you! 
>>>>>>>>>>>>>>>>>>>>>>> -Jagdeep >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 5, 2025 at 4:02 PM Anurag Mantripragada >>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> We have been hitting all the metadata problems you >>>>>>>>>>>>>>>>>>>>>>>> mentioned, Ryan. I’m on-board to help however I can to >>>>>>>>>>>>>>>>>>>>>>>> improve this area. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> ~ Anurag Mantripragada >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Jun 3, 2025, at 2:22 AM, Huang-Hsiang Cheng >>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> I am interested in this idea and looking forward to >>>>>>>>>>>>>>>>>>>>>>>> collaboration. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>> Huang-Hsiang >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Jun 2, 2025, at 10:14 AM, namratha mk < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> I am interested in contributing to this effort. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>> Namratha >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 1:36 PM Amogh Jahagirdar < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks for kicking this thread off Ryan, I'm >>>>>>>>>>>>>>>>>>>>>>>>> interested in helping out here! I've been working on >>>>>>>>>>>>>>>>>>>>>>>>> a proposal in this >>>>>>>>>>>>>>>>>>>>>>>>> area and it would be great to collaborate with >>>>>>>>>>>>>>>>>>>>>>>>> different folks and exchange >>>>>>>>>>>>>>>>>>>>>>>>> ideas here, since I think a lot of people are >>>>>>>>>>>>>>>>>>>>>>>>> interested in solving this >>>>>>>>>>>>>>>>>>>>>>>>> problem. 
>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>> Amogh Jahagirdar >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 29, 2025 at 2:25 PM Ryan Blue < >>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Like Russell’s recent note, I’m starting a thread >>>>>>>>>>>>>>>>>>>>>>>>>> to connect those of us who are interested in the >>>>>>>>>>>>>>>>>>>>>>>>>> idea of changing >>>>>>>>>>>>>>>>>>>>>>>>>> Iceberg’s metadata in v4 so that in most cases >>>>>>>>>>>>>>>>>>>>>>>>>> committing a change only >>>>>>>>>>>>>>>>>>>>>>>>>> requires writing one additional metadata file. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> *Idea: One-file commits* >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> The current Iceberg metadata structure requires >>>>>>>>>>>>>>>>>>>>>>>>>> writing at least one manifest and a new manifest >>>>>>>>>>>>>>>>>>>>>>>>>> list to produce a new >>>>>>>>>>>>>>>>>>>>>>>>>> snapshot. The goal of this work is to add >>>>>>>>>>>>>>>>>>>>>>>>>> flexibility by allowing >>>>>>>>>>>>>>>>>>>>>>>>>> the manifest list layer to store data and delete >>>>>>>>>>>>>>>>>>>>>>>>>> files. As a result, only >>>>>>>>>>>>>>>>>>>>>>>>>> one file write would be needed before committing the >>>>>>>>>>>>>>>>>>>>>>>>>> new snapshot. 
In >>>>>>>>>>>>>>>>>>>>>>>>>> addition, this work will also try to explore: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> - Avoiding small manifests that must be read >>>>>>>>>>>>>>>>>>>>>>>>>> in parallel and later compacted (metadata >>>>>>>>>>>>>>>>>>>>>>>>>> maintenance changes) >>>>>>>>>>>>>>>>>>>>>>>>>> - Extending metadata skipping to use aggregated >>>>>>>>>>>>>>>>>>>>>>>>>> column ranges that are compatible with geospatial >>>>>>>>>>>>>>>>>>>>>>>>>> data (manifest metadata) >>>>>>>>>>>>>>>>>>>>>>>>>> - Using soft deletes to avoid rewriting >>>>>>>>>>>>>>>>>>>>>>>>>> existing manifests (metadata DVs) >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> If you’re interested in these problems, please >>>>>>>>>>>>>>>>>>>>>>>>>> reply! >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> John Zhuge >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>
