Re: [Discuss] Column Update File Representation

Andrei Tserakhau via dev Thu, 21 May 2026 09:07:44 -0700

Hi Peter, all,

Four things from the PoC, bigger ones first.
Nothing blocking TBH, but some stuff need attention.


== Row-group alignment within a file ==

Echoing Matt's read -- this is the real practical floor. Our PoC
forced 1:1 base->column-file at the file level (disabling
adaptive file sizing on the writer side), but within a file the
row group boundaries drifted between base and column file. We
took the concat-refill cost at every drifting boundary and moved
on; never pursued the deeper fix. The explicit "force short read
at row count N" API Matt described in arrow-rs would let
stitchers re-align without copies, which feels like the right
upstream ask if iceberg-rust pushes Option 1 to production.

For Java/Spark the same problem exists but vectorized batch sizes
hide most of it because Spark already does the equivalent of
concat-refill at batch boundaries. The Rust path is genuinely
harder here, not just due to scatter.

== Stats correction needs explicit ownership in the spec ==

Gabor's null_count adjustment is right, but the spec should also
state WHICH stats the writer must adjust before publishing to
manifests vs which engines can leave raw and reconcile at read
time. The set we saw matter:

  - null_count: must be adjusted (filler NULLs inflate it)
  - value_count: if left raw, CBO cardinality estimates overcount
    by the delete count -- affects join strategy choices
  - min / max: stays correct, no adjustment needed
  - IS NULL fast paths: engines that prune via stats over-match;
    correct results post-DV-filter but wasted I/O

Leaving any of these undefined is a recipe for engine divergence
at predicate pushdown time -- two implementations will make
different join-plan decisions on the same table

== Compaction timing ==

We didn't reach compaction in the PoC. The mechanics (merge
column files into the base) are clear enough, but the timing
policy -- every snapshot? read-pressure-driven? manual via
RewriteDataFiles with a column-file-aware strategy? -- was never
settled. Probably fine to leave the timing to engines, but worth
describing the merge mechanics in the spec so engines don't
diverge there.

== Schema evolution ==

Worth calling out so this doesn't get relitigated later: Iceberg's
stable field_ids handle column rename and add cleanly because
column files reference field_ids, not names. Type widening
reduces to the same promotion logic the reader already applies
to the base file. We had to work around this on Delta (names not
IDs), which limited what we could test, but the Iceberg design
gets it for free.

Andrei

On Thu, May 21, 2026 at 3:47 PM Péter Váry <[email protected]>
wrote:

> Thanks everyone for taking a look!
> I could echo the concerns above, so I won’t duplicate them.
> @Andrei: given your prior experience, were there any blockers we haven’t
> identified yet?
>
>
> Matt Butrovich <[email protected]> ezt írta (időpont: 2026. máj. 20.,
> Sze, 21:51):
>
>> Thanks for writing this up, Anurag, and everyone else for all the
>> analysis and discussion so far!
>>
>> Re: "I did not find scatter in arrow-rs." I think this could be overcome
>> with take() and materialization, but it could be expensive. Based on what
>> we've seen with joins these scatter codepaths are often the most expensive
>> part because they can be extremely cache-unfriendly, especially with larger
>> data types.
>>
>> Discussion seems to be leaning toward Option 1, so I went through a
>> thought experiment of what this might look like in iceberg-rust, assuming
>> continuing to use arrow-rs over Parquet data files:
>>
>> 1. Plan-time: FileScanTask carries the column_files list for the base
>> file. Each entry is opaque to scan planning beyond its file_path and
>> field_ids.
>> 2. File open: for each task, open the base file plus one reader per
>> referenced column-update file, in parallel, reusing the existing
>> ArrowFileReader/AsyncFileReader plumbing.
>> 3. Per-file projection: partition the requested field_ids by their source
>> file. Each reader's ProjectionMask covers only the field_ids it owns.
>> Field_ids overridden by an update file are excluded from the base reader's
>> mask, so Parquet does not fetch or decode their column chunks. The base
>> file is read as if those columns were not projected.
>> 4. Coordinated row selection: the deletion vector and any
>> predicate-driven row selection are translated to a single file-global
>> RowSelection and applied identically to every reader. The same set of rows
>> is therefore selected from every file.
>> 5. Stitch stage: assemble each output batch by selecting columns from the
>> per-file batches according to field_id ownership. With aligned row counts
>> this is an Arc-level column swap; near row group boundaries readers diverge
>> and re-alignment requires occasional buffer copies (see Risks).
>> 6. Downstream: the existing RecordBatchTransformer operates on the
>> stitched batch unchanged.
>>
>> Some care needs to be done with predicate handling:
>> - Predicates over field_ids owned by exactly one file are pushed into
>> that file's RowFilter.
>> - Predicates spanning multiple files cannot be pushed; they evaluate
>> post-stitch via arrow::compute::filter.
>> - Row group statistics pruning is per-file but the decision is global.
>> Each file evaluates its own predicates against its own row group stats; the
>> resulting "could match" row ranges are intersected with each other and with
>> the DV to produce a single file-global RowSelection shared by all readers.
>> Each reader then derives its own with_row_groups list from that selection.
>> Row group boundaries differ between files but row counts stay aligned
>> because every reader applies the same selection.
>>
>> It seems like every primitive we would need is in arrow-rs and the
>> iceberg-rust reader pipeline already.
>>
>> I have one concern wrt row groups, but this mostly comes down to
>> implementation. Just documenting for posterity/if I have to implement this:
>>
>> Parquet row groups are byte-sized and differ across files, and arrow-rs's
>> batch boundaries follow row group boundaries, so stitching across
>> misaligned files requires periodic buffer copies (concat-refill at each row
>> group boundary) to keep readers in sync. An upstream arrow-rs API that
>> allows the caller to request a short read at a chosen row count would let
>> the stitcher force re-alignment without copies; absent that, the concat
>> cost is the practical floor.
>>
>> All of that said, neither option 1 or option 2 seem impossible in
>> iceberg-rust/arrow-rs, so I don't think there are blockers from that
>> standpoint.
>>
>> -Matt
>>
>> On Wed, May 20, 2026 at 11:38 AM Gábor Kaszab <[email protected]>
>> wrote:
>>
>>> Minor addition to the storage overhead with writing NULLs as filling
>>> values:
>>> If we assume Parquet V2 and we write the _pos to the update file, with
>>> 20% of the rows deleted will totally amortize the storage overhead of
>>> writing NULLs. The reason is that with the "positional aligned" updates we
>>> have a complete sequence of the _pos that is efficiently compressed with
>>> delta encoding, while with the "applied deletes" approach there are holes
>>> in the _pos sequence, hence delta encoding is slightly less efficient.
>>>
>>> Just some number to visualize with 20% deletion ratio:
>>> 1 int col _pos + 1 int
>>> Omit deleted rows 9735745 10111232
>>> Null deleted rows 10189522 10205497
>>> Overhead 4,66% 0,93%
>>>
>>> Gabor
>>>
>>>
>>> Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj.
>>> 20., Sze, 16:50):
>>>
>>>> Hey Iceberg Community,
>>>>
>>>> Thanks Anurag for starting a dedicated thread on this! Just a couple of
>>>> thoughts from my side:
>>>>
>>>> *Storage overhead:*
>>>> *TLDR:* Regardless which way we go, storage overhead shouldn't be an
>>>> issue.
>>>>
>>>> *Details:*
>>>> I made a couple of measurements recently on the storage size of the
>>>> update files. For numbers, see the "Update file size measurements" tab in 
>>>> this
>>>> doc
>>>> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing>
>>>> .
>>>>
>>>> 1) Writing _pos column:
>>>>
>>>>    - File formats could hide the overhead for the _pos column, e.g.
>>>>    Parquet V2 has negligible (<0,5%) overhead
>>>>    - I recall there was an argument that having _pos even for
>>>>    pos-aligned updates might help us debug writer issues to see what is
>>>>    missing/icorrect in the update file
>>>>    - On the write path we already read and sort by _pos so there
>>>>    should be no overhead either
>>>>
>>>> In summary, I slightly lean towards always having _pos in the updates
>>>> regardless what representation we choose.
>>>> If we can agree on this, the relevant pros and cons around _pos are
>>>> obsolete.
>>>>
>>>> 2) Filling rows/values for deleted rows:
>>>>
>>>>    - I measured noticeable overhead when filling fields for deleted
>>>>    rows with NULLs. As always the overhead depends on many stuff
>>>>    - Deleted 5% of rows => writing NULLs has 2-3% storage overhead
>>>>       - Deleted 20% of rows => writing NULLs has 4-4.6% overhead
>>>>    - Since the consensus is constantly eliminating deletes, I think
>>>>    this overhead is acceptable
>>>>
>>>> *Factors to consider to choose representation*
>>>> As described above, the storage size shouldn't be a factor here, I'd
>>>> not consider the presence of _pos as pro or con.
>>>>
>>>> If we scratch the storage size related ones, here is what we have left:
>>>>
>>>> *1) Accuracy of stats*
>>>> By filling missing rows with auxiliary values the stats can go off
>>>>
>>>>    - I'm not that worried about Parquet footer stats, they represent
>>>>    the file itself, not the logical deletes on top
>>>>    - If we went for NULLs as filling values, we could set each field
>>>>    in a deleted row to NULL (not the entire row). As a result the 
>>>> column-level
>>>>    null count in the table metadata can be off because of the filling 
>>>> values.
>>>>    - Technically, if we want to, we can correct these stats if we
>>>>    collected the number of deleted rows and then adjust the null count by 
>>>> this
>>>>    number
>>>>    - In general, deletes make stats off anyway regardless of column
>>>>    updates (probably not the null count but the avg size, though)
>>>>
>>>> I don't think that this should be a decisive factor when choosing
>>>> representation.
>>>>
>>>> *2) Complexity in general*
>>>>     a) Read - stitching
>>>>
>>>>    - Positional alignment approach is straightforward for stitching
>>>>    both in vectorized and row readers
>>>>       - Stitch before applying deletes
>>>>    - Applied deletes approach seems more complicated initially
>>>>       - Scattering of update rows is required
>>>>       - Might not be supported for all the language implementations
>>>>       ATM. Thanks Anurag for taking a look!
>>>>
>>>>     b) Write
>>>>
>>>>    - Applied deletes approach is straightforward
>>>>    - Positional alignment approach has one major complexity:
>>>>       - when trailing rows are deleted, the writer currently has no
>>>>       information how many rows to fill
>>>>       - In the PoC we broadcast a "file path to row count" so that the
>>>>       writers can now if trailing rows have to be filled with nulls, and 
>>>> with how
>>>>       many (comparing to the _pos column)
>>>>       - In theory we could simply not fill trailing deleted rows, but
>>>>       then we have a hybrid approach between positional alignment and 
>>>> applied
>>>>       deletes. Probably, we don't want this complexity in the spec
>>>>
>>>> *3) Read and write performance*
>>>>
>>>>    - I'm not expecting any difference in write perf
>>>>    - Read could have a toll on the "applied deletes" approach due to
>>>>    scattering. *@pvary* might have some more insights here.
>>>>
>>>> *Summary*
>>>> I hope this summarizes all we have discussed from a different angle and
>>>> might narrow down the areas to look for. I think the two main questions to
>>>> sort out to make a decision are:
>>>>     1) Can we find a way to take care of trailing deletes in
>>>> "positional aligned" approach (or are we fine not filling trailing deletes)
>>>>     2) What is the cost of scattering the update rows in the "applied
>>>> deletes" approach
>>>>         2/b) Is scattering feasible on all language implementations
>>>>
>>>> Best Regards,
>>>> Gabor
>>>>
>>>>
>>>> Anurag Mantripragada <[email protected]> ezt írta
>>>> (időpont: 2026. máj. 20., Sze, 2:37):
>>>>
>>>>> Hi all,
>>>>>
>>>>> Following up on the column updates design
>>>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
>>>>>  and
>>>>> the original discussion thread
>>>>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>,
>>>>> I'd like to start a focused discussion on how column update files should
>>>>> represent rows when deletion vectors (DVs) are present.
>>>>>
>>>>> *Context*
>>>>>
>>>>> We've reached consensus on using a dense representation for column
>>>>> update files. When a column is updated, the column file contains values 
>>>>> for
>>>>> all rows including unchanged rows. This avoids complex merge logic on the
>>>>> write path when successive updates target overlapping fields.
>>>>>
>>>>> The open question is: what should the column file contain at
>>>>> positions where the base file has deleted rows? There are two options.
>>>>>
>>>>> *Option 1*: Positional Alignment (row count matches base file)
>>>>>
>>>>> The column file has exactly base_file.record_count rows. Row N in the
>>>>> column file corresponds to row N in the base file. Deleted positions
>>>>> contain filler values (e.g., NULLs).
>>>>>
>>>>> Pros*:*
>>>>>
>>>>>    - Stitching is a zero-copy column swap in Arrow
>>>>>    - Works identically in every Arrow implementation (Java, Rust,
>>>>>    Python, C++)
>>>>>    - No _pos column needed
>>>>>    - Engines apply their existing DV filter to both base and column
>>>>>    file
>>>>>
>>>>> Cons*:*
>>>>>
>>>>>    - Filler values at deleted positions skew Parquet footer
>>>>>    statistics (null_count, avg_length)
>>>>>    - Writes slightly more data than necessary (filler values for
>>>>>    deleted rows)
>>>>>    - Writer must know base_file.record_count to pad trailing
>>>>>    deletions (base file metadata already available during write planning)
>>>>>
>>>>> *Option 2*: Applied Deletes (row count = live rows only)
>>>>>
>>>>> The column file contains only live rows (after applying DVs). A _pos 
>>>>> column
>>>>> maps each row back to its ordinal position in the base file.
>>>>>
>>>>> Pros*:*
>>>>>
>>>>>    - Only stores valid rows in column update files.
>>>>>    - Parquet footer statistics are accurate (no skew from NULLs at
>>>>>    deleted positions)
>>>>>    - Slightly smaller file (no filler bytes)
>>>>>
>>>>> Cons*:*
>>>>>
>>>>>    - _pos adds storage overhead (Encoding must be left to the file
>>>>>    format)
>>>>>    - Stitching requires a scatter operation to allocate a new array
>>>>>    and place values at the correct positions
>>>>>    - It's not zero-copy in Arrow and requires manipulation.
>>>>>    - As it stands today this might be  harder for non-Java engines
>>>>>    (see section below)
>>>>>
>>>>> I investigated how three Iceberg implementations handle vectorized
>>>>> reading and what column stitching would require in each. The key
>>>>> architectural difference is how they expose Arrow memory:
>>>>>
>>>>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can
>>>>> override getInt(rowId)to redirect reads without copying data. This
>>>>> makes scatter operations appear "free" via virtual dispatch. My POC uses
>>>>> this approach.
>>>>>
>>>>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way
>>>>> to override what array[i] returns. PyArrow has take() (gather) but lacks
>>>>> a scatter() primitive (in the  version we use).
>>>>>
>>>>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not
>>>>> trait objects). Int32Array::value(i) is a direct memory offset. Must
>>>>> materialize new arrays via ArrayBuilder for any non-trivial column
>>>>> manipulation.
>>>>>
>>>>> TL;DR: If we choose Option 2 (applied deletes), engines need a
>>>>> scatter method to stitch column files. I found the following
>>>>> implementations in Arrow which can be used to stitch.
>>>>>
>>>>>
>>>>>    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow
>>>>>    20.0.0)
>>>>>
>>>>>    - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
>>>>>    23.0.0)
>>>>>    - I did not find scatter in arrow-rs.
>>>>>
>>>>> I'm still researching these options and would love to hear from
>>>>> everyone.
>>>>>
>>>>> Thanks,
>>>>> Anurag
>>>>>
>>>>>

Re: [Discuss] Column Update File Representation

Reply via email to