Re: [Discuss] Column Update File Representation

Andrei Tserakhau via dev Thu, 21 May 2026 09:42:05 -0700

One thing to add here:

We're scoping internal work on our side to validate the dense-
column-updates direction against Delta -- design alignment plus
re-verifying the older measurements under current builds. Marco
Kroll is driving that effort and is the right person from our
side to engage here on implementation realities going forward.


Marco -- want to drop into the discussion? Your read on the
representation choice from a Delta writer-implementation
perspective would be useful as the validation work progresses.

(The crossover numbers Marco shared on the parent thread came
from the same older measurement; treat as recollection pending
re-verification.)

On iceberg-go, echoing Matt's read on iceberg-rust: neither
option is a blocker, but Option 2 is materially more work given
where arrow-go's compute surface sits today. Not a hard
constraint, but rather one more thumb on the Option 1 scale.

Andrei

On Thu, May 21, 2026 at 6:07 PM Andrei Tserakhau <
[email protected]> wrote:

> Hi Peter, all,
>
> Four things from the PoC, bigger ones first.
> Nothing blocking TBH, but some stuff need attention.
>
> == Row-group alignment within a file ==
>
> Echoing Matt's read -- this is the real practical floor. Our PoC
> forced 1:1 base->column-file at the file level (disabling
> adaptive file sizing on the writer side), but within a file the
> row group boundaries drifted between base and column file. We
> took the concat-refill cost at every drifting boundary and moved
> on; never pursued the deeper fix. The explicit "force short read
> at row count N" API Matt described in arrow-rs would let
> stitchers re-align without copies, which feels like the right
> upstream ask if iceberg-rust pushes Option 1 to production.
>
> For Java/Spark the same problem exists but vectorized batch sizes
> hide most of it because Spark already does the equivalent of
> concat-refill at batch boundaries. The Rust path is genuinely
> harder here, not just due to scatter.
>
> == Stats correction needs explicit ownership in the spec ==
>
> Gabor's null_count adjustment is right, but the spec should also
> state WHICH stats the writer must adjust before publishing to
> manifests vs which engines can leave raw and reconcile at read
> time. The set we saw matter:
>
>   - null_count: must be adjusted (filler NULLs inflate it)
>   - value_count: if left raw, CBO cardinality estimates overcount
>     by the delete count -- affects join strategy choices
>   - min / max: stays correct, no adjustment needed
>   - IS NULL fast paths: engines that prune via stats over-match;
>     correct results post-DV-filter but wasted I/O
>
> Leaving any of these undefined is a recipe for engine divergence
> at predicate pushdown time -- two implementations will make
> different join-plan decisions on the same table
>
> == Compaction timing ==
>
> We didn't reach compaction in the PoC. The mechanics (merge
> column files into the base) are clear enough, but the timing
> policy -- every snapshot? read-pressure-driven? manual via
> RewriteDataFiles with a column-file-aware strategy? -- was never
> settled. Probably fine to leave the timing to engines, but worth
> describing the merge mechanics in the spec so engines don't
> diverge there.
>
> == Schema evolution ==
>
> Worth calling out so this doesn't get relitigated later: Iceberg's
> stable field_ids handle column rename and add cleanly because
> column files reference field_ids, not names. Type widening
> reduces to the same promotion logic the reader already applies
> to the base file. We had to work around this on Delta (names not
> IDs), which limited what we could test, but the Iceberg design
> gets it for free.
>
> Andrei
>
> On Thu, May 21, 2026 at 3:47 PM Péter Váry <[email protected]>
> wrote:
>
>> Thanks everyone for taking a look!
>> I could echo the concerns above, so I won’t duplicate them.
>> @Andrei: given your prior experience, were there any blockers we haven’t
>> identified yet?
>>
>>
>> Matt Butrovich <[email protected]> ezt írta (időpont: 2026. máj. 20.,
>> Sze, 21:51):
>>
>>> Thanks for writing this up, Anurag, and everyone else for all the
>>> analysis and discussion so far!
>>>
>>> Re: "I did not find scatter in arrow-rs." I think this could be overcome
>>> with take() and materialization, but it could be expensive. Based on what
>>> we've seen with joins these scatter codepaths are often the most expensive
>>> part because they can be extremely cache-unfriendly, especially with larger
>>> data types.
>>>
>>> Discussion seems to be leaning toward Option 1, so I went through a
>>> thought experiment of what this might look like in iceberg-rust, assuming
>>> continuing to use arrow-rs over Parquet data files:
>>>
>>> 1. Plan-time: FileScanTask carries the column_files list for the base
>>> file. Each entry is opaque to scan planning beyond its file_path and
>>> field_ids.
>>> 2. File open: for each task, open the base file plus one reader per
>>> referenced column-update file, in parallel, reusing the existing
>>> ArrowFileReader/AsyncFileReader plumbing.
>>> 3. Per-file projection: partition the requested field_ids by their
>>> source file. Each reader's ProjectionMask covers only the field_ids it
>>> owns. Field_ids overridden by an update file are excluded from the base
>>> reader's mask, so Parquet does not fetch or decode their column chunks. The
>>> base file is read as if those columns were not projected.
>>> 4. Coordinated row selection: the deletion vector and any
>>> predicate-driven row selection are translated to a single file-global
>>> RowSelection and applied identically to every reader. The same set of rows
>>> is therefore selected from every file.
>>> 5. Stitch stage: assemble each output batch by selecting columns from
>>> the per-file batches according to field_id ownership. With aligned row
>>> counts this is an Arc-level column swap; near row group boundaries readers
>>> diverge and re-alignment requires occasional buffer copies (see Risks).
>>> 6. Downstream: the existing RecordBatchTransformer operates on the
>>> stitched batch unchanged.
>>>
>>> Some care needs to be done with predicate handling:
>>> - Predicates over field_ids owned by exactly one file are pushed into
>>> that file's RowFilter.
>>> - Predicates spanning multiple files cannot be pushed; they evaluate
>>> post-stitch via arrow::compute::filter.
>>> - Row group statistics pruning is per-file but the decision is global.
>>> Each file evaluates its own predicates against its own row group stats; the
>>> resulting "could match" row ranges are intersected with each other and with
>>> the DV to produce a single file-global RowSelection shared by all readers.
>>> Each reader then derives its own with_row_groups list from that selection.
>>> Row group boundaries differ between files but row counts stay aligned
>>> because every reader applies the same selection.
>>>
>>> It seems like every primitive we would need is in arrow-rs and the
>>> iceberg-rust reader pipeline already.
>>>
>>> I have one concern wrt row groups, but this mostly comes down to
>>> implementation. Just documenting for posterity/if I have to implement this:
>>>
>>> Parquet row groups are byte-sized and differ across files, and
>>> arrow-rs's batch boundaries follow row group boundaries, so stitching
>>> across misaligned files requires periodic buffer copies (concat-refill at
>>> each row group boundary) to keep readers in sync. An upstream arrow-rs API
>>> that allows the caller to request a short read at a chosen row count would
>>> let the stitcher force re-alignment without copies; absent that, the concat
>>> cost is the practical floor.
>>>
>>> All of that said, neither option 1 or option 2 seem impossible in
>>> iceberg-rust/arrow-rs, so I don't think there are blockers from that
>>> standpoint.
>>>
>>> -Matt
>>>
>>> On Wed, May 20, 2026 at 11:38 AM Gábor Kaszab <[email protected]>
>>> wrote:
>>>
>>>> Minor addition to the storage overhead with writing NULLs as filling
>>>> values:
>>>> If we assume Parquet V2 and we write the _pos to the update file, with
>>>> 20% of the rows deleted will totally amortize the storage overhead of
>>>> writing NULLs. The reason is that with the "positional aligned" updates we
>>>> have a complete sequence of the _pos that is efficiently compressed with
>>>> delta encoding, while with the "applied deletes" approach there are holes
>>>> in the _pos sequence, hence delta encoding is slightly less efficient.
>>>>
>>>> Just some number to visualize with 20% deletion ratio:
>>>> 1 int col _pos + 1 int
>>>> Omit deleted rows 9735745 10111232
>>>> Null deleted rows 10189522 10205497
>>>> Overhead 4,66% 0,93%
>>>>
>>>> Gabor
>>>>
>>>>
>>>> Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj.
>>>> 20., Sze, 16:50):
>>>>
>>>>> Hey Iceberg Community,
>>>>>
>>>>> Thanks Anurag for starting a dedicated thread on this! Just a couple
>>>>> of thoughts from my side:
>>>>>
>>>>> *Storage overhead:*
>>>>> *TLDR:* Regardless which way we go, storage overhead shouldn't be an
>>>>> issue.
>>>>>
>>>>> *Details:*
>>>>> I made a couple of measurements recently on the storage size of the
>>>>> update files. For numbers, see the "Update file size measurements" tab in 
>>>>> this
>>>>> doc
>>>>> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing>
>>>>> .
>>>>>
>>>>> 1) Writing _pos column:
>>>>>
>>>>>    - File formats could hide the overhead for the _pos column, e.g.
>>>>>    Parquet V2 has negligible (<0,5%) overhead
>>>>>    - I recall there was an argument that having _pos even for
>>>>>    pos-aligned updates might help us debug writer issues to see what is
>>>>>    missing/icorrect in the update file
>>>>>    - On the write path we already read and sort by _pos so there
>>>>>    should be no overhead either
>>>>>
>>>>> In summary, I slightly lean towards always having _pos in the updates
>>>>> regardless what representation we choose.
>>>>> If we can agree on this, the relevant pros and cons around _pos are
>>>>> obsolete.
>>>>>
>>>>> 2) Filling rows/values for deleted rows:
>>>>>
>>>>>    - I measured noticeable overhead when filling fields for deleted
>>>>>    rows with NULLs. As always the overhead depends on many stuff
>>>>>    - Deleted 5% of rows => writing NULLs has 2-3% storage overhead
>>>>>       - Deleted 20% of rows => writing NULLs has 4-4.6% overhead
>>>>>    - Since the consensus is constantly eliminating deletes, I think
>>>>>    this overhead is acceptable
>>>>>
>>>>> *Factors to consider to choose representation*
>>>>> As described above, the storage size shouldn't be a factor here, I'd
>>>>> not consider the presence of _pos as pro or con.
>>>>>
>>>>> If we scratch the storage size related ones, here is what we have left:
>>>>>
>>>>> *1) Accuracy of stats*
>>>>> By filling missing rows with auxiliary values the stats can go off
>>>>>
>>>>>    - I'm not that worried about Parquet footer stats, they represent
>>>>>    the file itself, not the logical deletes on top
>>>>>    - If we went for NULLs as filling values, we could set each field
>>>>>    in a deleted row to NULL (not the entire row). As a result the 
>>>>> column-level
>>>>>    null count in the table metadata can be off because of the filling 
>>>>> values.
>>>>>    - Technically, if we want to, we can correct these stats if we
>>>>>    collected the number of deleted rows and then adjust the null count by 
>>>>> this
>>>>>    number
>>>>>    - In general, deletes make stats off anyway regardless of column
>>>>>    updates (probably not the null count but the avg size, though)
>>>>>
>>>>> I don't think that this should be a decisive factor when choosing
>>>>> representation.
>>>>>
>>>>> *2) Complexity in general*
>>>>>     a) Read - stitching
>>>>>
>>>>>    - Positional alignment approach is straightforward for stitching
>>>>>    both in vectorized and row readers
>>>>>       - Stitch before applying deletes
>>>>>    - Applied deletes approach seems more complicated initially
>>>>>       - Scattering of update rows is required
>>>>>       - Might not be supported for all the language implementations
>>>>>       ATM. Thanks Anurag for taking a look!
>>>>>
>>>>>     b) Write
>>>>>
>>>>>    - Applied deletes approach is straightforward
>>>>>    - Positional alignment approach has one major complexity:
>>>>>       - when trailing rows are deleted, the writer currently has no
>>>>>       information how many rows to fill
>>>>>       - In the PoC we broadcast a "file path to row count" so that
>>>>>       the writers can now if trailing rows have to be filled with nulls, 
>>>>> and with
>>>>>       how many (comparing to the _pos column)
>>>>>       - In theory we could simply not fill trailing deleted rows, but
>>>>>       then we have a hybrid approach between positional alignment and 
>>>>> applied
>>>>>       deletes. Probably, we don't want this complexity in the spec
>>>>>
>>>>> *3) Read and write performance*
>>>>>
>>>>>    - I'm not expecting any difference in write perf
>>>>>    - Read could have a toll on the "applied deletes" approach due to
>>>>>    scattering. *@pvary* might have some more insights here.
>>>>>
>>>>> *Summary*
>>>>> I hope this summarizes all we have discussed from a different angle
>>>>> and might narrow down the areas to look for. I think the two main 
>>>>> questions
>>>>> to sort out to make a decision are:
>>>>>     1) Can we find a way to take care of trailing deletes in
>>>>> "positional aligned" approach (or are we fine not filling trailing 
>>>>> deletes)
>>>>>     2) What is the cost of scattering the update rows in the "applied
>>>>> deletes" approach
>>>>>         2/b) Is scattering feasible on all language implementations
>>>>>
>>>>> Best Regards,
>>>>> Gabor
>>>>>
>>>>>
>>>>> Anurag Mantripragada <[email protected]> ezt írta
>>>>> (időpont: 2026. máj. 20., Sze, 2:37):
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Following up on the column updates design
>>>>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
>>>>>>  and
>>>>>> the original discussion thread
>>>>>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>,
>>>>>> I'd like to start a focused discussion on how column update files should
>>>>>> represent rows when deletion vectors (DVs) are present.
>>>>>>
>>>>>> *Context*
>>>>>>
>>>>>> We've reached consensus on using a dense representation for column
>>>>>> update files. When a column is updated, the column file contains values 
>>>>>> for
>>>>>> all rows including unchanged rows. This avoids complex merge logic on the
>>>>>> write path when successive updates target overlapping fields.
>>>>>>
>>>>>> The open question is: what should the column file contain at
>>>>>> positions where the base file has deleted rows? There are two options.
>>>>>>
>>>>>> *Option 1*: Positional Alignment (row count matches base file)
>>>>>>
>>>>>> The column file has exactly base_file.record_count rows. Row N in
>>>>>> the column file corresponds to row N in the base file. Deleted positions
>>>>>> contain filler values (e.g., NULLs).
>>>>>>
>>>>>> Pros*:*
>>>>>>
>>>>>>    - Stitching is a zero-copy column swap in Arrow
>>>>>>    - Works identically in every Arrow implementation (Java, Rust,
>>>>>>    Python, C++)
>>>>>>    - No _pos column needed
>>>>>>    - Engines apply their existing DV filter to both base and column
>>>>>>    file
>>>>>>
>>>>>> Cons*:*
>>>>>>
>>>>>>    - Filler values at deleted positions skew Parquet footer
>>>>>>    statistics (null_count, avg_length)
>>>>>>    - Writes slightly more data than necessary (filler values for
>>>>>>    deleted rows)
>>>>>>    - Writer must know base_file.record_count to pad trailing
>>>>>>    deletions (base file metadata already available during write planning)
>>>>>>
>>>>>> *Option 2*: Applied Deletes (row count = live rows only)
>>>>>>
>>>>>> The column file contains only live rows (after applying DVs). A _pos 
>>>>>> column
>>>>>> maps each row back to its ordinal position in the base file.
>>>>>>
>>>>>> Pros*:*
>>>>>>
>>>>>>    - Only stores valid rows in column update files.
>>>>>>    - Parquet footer statistics are accurate (no skew from NULLs at
>>>>>>    deleted positions)
>>>>>>    - Slightly smaller file (no filler bytes)
>>>>>>
>>>>>> Cons*:*
>>>>>>
>>>>>>    - _pos adds storage overhead (Encoding must be left to the file
>>>>>>    format)
>>>>>>    - Stitching requires a scatter operation to allocate a new array
>>>>>>    and place values at the correct positions
>>>>>>    - It's not zero-copy in Arrow and requires manipulation.
>>>>>>    - As it stands today this might be  harder for non-Java engines
>>>>>>    (see section below)
>>>>>>
>>>>>> I investigated how three Iceberg implementations handle vectorized
>>>>>> reading and what column stitching would require in each. The key
>>>>>> architectural difference is how they expose Arrow memory:
>>>>>>
>>>>>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can
>>>>>> override getInt(rowId)to redirect reads without copying data. This
>>>>>> makes scatter operations appear "free" via virtual dispatch. My POC uses
>>>>>> this approach.
>>>>>>
>>>>>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way
>>>>>> to override what array[i] returns. PyArrow has take() (gather) but lacks
>>>>>> a scatter() primitive (in the  version we use).
>>>>>>
>>>>>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not
>>>>>> trait objects). Int32Array::value(i) is a direct memory offset. Must
>>>>>> materialize new arrays via ArrayBuilder for any non-trivial column
>>>>>> manipulation.
>>>>>>
>>>>>> TL;DR: If we choose Option 2 (applied deletes), engines need a
>>>>>> scatter method to stitch column files. I found the following
>>>>>> implementations in Arrow which can be used to stitch.
>>>>>>
>>>>>>
>>>>>>    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow
>>>>>>    20.0.0)
>>>>>>
>>>>>>    - Python <https://github.com/apache/arrow/pull/48267> (Since
>>>>>>    Arrow 23.0.0)
>>>>>>    - I did not find scatter in arrow-rs.
>>>>>>
>>>>>> I'm still researching these options and would love to hear from
>>>>>> everyone.
>>>>>>
>>>>>> Thanks,
>>>>>> Anurag
>>>>>>
>>>>>>

Re: [Discuss] Column Update File Representation

Reply via email to