Thanks everyone for taking a look!
I could echo the concerns above, so I won’t duplicate them.
@Andrei: given your prior experience, were there any blockers we haven’t
identified yet?


Matt Butrovich <[email protected]> ezt írta (időpont: 2026. máj. 20.,
Sze, 21:51):

> Thanks for writing this up, Anurag, and everyone else for all the analysis
> and discussion so far!
>
> Re: "I did not find scatter in arrow-rs." I think this could be overcome
> with take() and materialization, but it could be expensive. Based on what
> we've seen with joins these scatter codepaths are often the most expensive
> part because they can be extremely cache-unfriendly, especially with larger
> data types.
>
> Discussion seems to be leaning toward Option 1, so I went through a
> thought experiment of what this might look like in iceberg-rust, assuming
> continuing to use arrow-rs over Parquet data files:
>
> 1. Plan-time: FileScanTask carries the column_files list for the base
> file. Each entry is opaque to scan planning beyond its file_path and
> field_ids.
> 2. File open: for each task, open the base file plus one reader per
> referenced column-update file, in parallel, reusing the existing
> ArrowFileReader/AsyncFileReader plumbing.
> 3. Per-file projection: partition the requested field_ids by their source
> file. Each reader's ProjectionMask covers only the field_ids it owns.
> Field_ids overridden by an update file are excluded from the base reader's
> mask, so Parquet does not fetch or decode their column chunks. The base
> file is read as if those columns were not projected.
> 4. Coordinated row selection: the deletion vector and any predicate-driven
> row selection are translated to a single file-global RowSelection and
> applied identically to every reader. The same set of rows is therefore
> selected from every file.
> 5. Stitch stage: assemble each output batch by selecting columns from the
> per-file batches according to field_id ownership. With aligned row counts
> this is an Arc-level column swap; near row group boundaries readers diverge
> and re-alignment requires occasional buffer copies (see Risks).
> 6. Downstream: the existing RecordBatchTransformer operates on the
> stitched batch unchanged.
>
> Some care needs to be done with predicate handling:
> - Predicates over field_ids owned by exactly one file are pushed into that
> file's RowFilter.
> - Predicates spanning multiple files cannot be pushed; they evaluate
> post-stitch via arrow::compute::filter.
> - Row group statistics pruning is per-file but the decision is global.
> Each file evaluates its own predicates against its own row group stats; the
> resulting "could match" row ranges are intersected with each other and with
> the DV to produce a single file-global RowSelection shared by all readers.
> Each reader then derives its own with_row_groups list from that selection.
> Row group boundaries differ between files but row counts stay aligned
> because every reader applies the same selection.
>
> It seems like every primitive we would need is in arrow-rs and the
> iceberg-rust reader pipeline already.
>
> I have one concern wrt row groups, but this mostly comes down to
> implementation. Just documenting for posterity/if I have to implement this:
>
> Parquet row groups are byte-sized and differ across files, and arrow-rs's
> batch boundaries follow row group boundaries, so stitching across
> misaligned files requires periodic buffer copies (concat-refill at each row
> group boundary) to keep readers in sync. An upstream arrow-rs API that
> allows the caller to request a short read at a chosen row count would let
> the stitcher force re-alignment without copies; absent that, the concat
> cost is the practical floor.
>
> All of that said, neither option 1 or option 2 seem impossible in
> iceberg-rust/arrow-rs, so I don't think there are blockers from that
> standpoint.
>
> -Matt
>
> On Wed, May 20, 2026 at 11:38 AM Gábor Kaszab <[email protected]>
> wrote:
>
>> Minor addition to the storage overhead with writing NULLs as filling
>> values:
>> If we assume Parquet V2 and we write the _pos to the update file, with
>> 20% of the rows deleted will totally amortize the storage overhead of
>> writing NULLs. The reason is that with the "positional aligned" updates we
>> have a complete sequence of the _pos that is efficiently compressed with
>> delta encoding, while with the "applied deletes" approach there are holes
>> in the _pos sequence, hence delta encoding is slightly less efficient.
>>
>> Just some number to visualize with 20% deletion ratio:
>> 1 int col _pos + 1 int
>> Omit deleted rows 9735745 10111232
>> Null deleted rows 10189522 10205497
>> Overhead 4,66% 0,93%
>>
>> Gabor
>>
>>
>> Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj. 20.,
>> Sze, 16:50):
>>
>>> Hey Iceberg Community,
>>>
>>> Thanks Anurag for starting a dedicated thread on this! Just a couple of
>>> thoughts from my side:
>>>
>>> *Storage overhead:*
>>> *TLDR:* Regardless which way we go, storage overhead shouldn't be an
>>> issue.
>>>
>>> *Details:*
>>> I made a couple of measurements recently on the storage size of the
>>> update files. For numbers, see the "Update file size measurements" tab in 
>>> this
>>> doc
>>> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing>
>>> .
>>>
>>> 1) Writing _pos column:
>>>
>>>    - File formats could hide the overhead for the _pos column, e.g.
>>>    Parquet V2 has negligible (<0,5%) overhead
>>>    - I recall there was an argument that having _pos even for
>>>    pos-aligned updates might help us debug writer issues to see what is
>>>    missing/icorrect in the update file
>>>    - On the write path we already read and sort by _pos so there should
>>>    be no overhead either
>>>
>>> In summary, I slightly lean towards always having _pos in the updates
>>> regardless what representation we choose.
>>> If we can agree on this, the relevant pros and cons around _pos are
>>> obsolete.
>>>
>>> 2) Filling rows/values for deleted rows:
>>>
>>>    - I measured noticeable overhead when filling fields for deleted
>>>    rows with NULLs. As always the overhead depends on many stuff
>>>    - Deleted 5% of rows => writing NULLs has 2-3% storage overhead
>>>       - Deleted 20% of rows => writing NULLs has 4-4.6% overhead
>>>    - Since the consensus is constantly eliminating deletes, I think
>>>    this overhead is acceptable
>>>
>>> *Factors to consider to choose representation*
>>> As described above, the storage size shouldn't be a factor here, I'd not
>>> consider the presence of _pos as pro or con.
>>>
>>> If we scratch the storage size related ones, here is what we have left:
>>>
>>> *1) Accuracy of stats*
>>> By filling missing rows with auxiliary values the stats can go off
>>>
>>>    - I'm not that worried about Parquet footer stats, they represent
>>>    the file itself, not the logical deletes on top
>>>    - If we went for NULLs as filling values, we could set each field in
>>>    a deleted row to NULL (not the entire row). As a result the column-level
>>>    null count in the table metadata can be off because of the filling 
>>> values.
>>>    - Technically, if we want to, we can correct these stats if we
>>>    collected the number of deleted rows and then adjust the null count by 
>>> this
>>>    number
>>>    - In general, deletes make stats off anyway regardless of column
>>>    updates (probably not the null count but the avg size, though)
>>>
>>> I don't think that this should be a decisive factor when choosing
>>> representation.
>>>
>>> *2) Complexity in general*
>>>     a) Read - stitching
>>>
>>>    - Positional alignment approach is straightforward for stitching
>>>    both in vectorized and row readers
>>>       - Stitch before applying deletes
>>>    - Applied deletes approach seems more complicated initially
>>>       - Scattering of update rows is required
>>>       - Might not be supported for all the language implementations
>>>       ATM. Thanks Anurag for taking a look!
>>>
>>>     b) Write
>>>
>>>    - Applied deletes approach is straightforward
>>>    - Positional alignment approach has one major complexity:
>>>       - when trailing rows are deleted, the writer currently has no
>>>       information how many rows to fill
>>>       - In the PoC we broadcast a "file path to row count" so that the
>>>       writers can now if trailing rows have to be filled with nulls, and 
>>> with how
>>>       many (comparing to the _pos column)
>>>       - In theory we could simply not fill trailing deleted rows, but
>>>       then we have a hybrid approach between positional alignment and 
>>> applied
>>>       deletes. Probably, we don't want this complexity in the spec
>>>
>>> *3) Read and write performance*
>>>
>>>    - I'm not expecting any difference in write perf
>>>    - Read could have a toll on the "applied deletes" approach due to
>>>    scattering. *@pvary* might have some more insights here.
>>>
>>> *Summary*
>>> I hope this summarizes all we have discussed from a different angle and
>>> might narrow down the areas to look for. I think the two main questions to
>>> sort out to make a decision are:
>>>     1) Can we find a way to take care of trailing deletes in "positional
>>> aligned" approach (or are we fine not filling trailing deletes)
>>>     2) What is the cost of scattering the update rows in the "applied
>>> deletes" approach
>>>         2/b) Is scattering feasible on all language implementations
>>>
>>> Best Regards,
>>> Gabor
>>>
>>>
>>> Anurag Mantripragada <[email protected]> ezt írta
>>> (időpont: 2026. máj. 20., Sze, 2:37):
>>>
>>>> Hi all,
>>>>
>>>> Following up on the column updates design
>>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
>>>>  and
>>>> the original discussion thread
>>>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>,
>>>> I'd like to start a focused discussion on how column update files should
>>>> represent rows when deletion vectors (DVs) are present.
>>>>
>>>> *Context*
>>>>
>>>> We've reached consensus on using a dense representation for column
>>>> update files. When a column is updated, the column file contains values for
>>>> all rows including unchanged rows. This avoids complex merge logic on the
>>>> write path when successive updates target overlapping fields.
>>>>
>>>> The open question is: what should the column file contain at positions
>>>> where the base file has deleted rows? There are two options.
>>>>
>>>> *Option 1*: Positional Alignment (row count matches base file)
>>>>
>>>> The column file has exactly base_file.record_count rows. Row N in the
>>>> column file corresponds to row N in the base file. Deleted positions
>>>> contain filler values (e.g., NULLs).
>>>>
>>>> Pros*:*
>>>>
>>>>    - Stitching is a zero-copy column swap in Arrow
>>>>    - Works identically in every Arrow implementation (Java, Rust,
>>>>    Python, C++)
>>>>    - No _pos column needed
>>>>    - Engines apply their existing DV filter to both base and column
>>>>    file
>>>>
>>>> Cons*:*
>>>>
>>>>    - Filler values at deleted positions skew Parquet footer statistics
>>>>    (null_count, avg_length)
>>>>    - Writes slightly more data than necessary (filler values for
>>>>    deleted rows)
>>>>    - Writer must know base_file.record_count to pad trailing deletions
>>>>    (base file metadata already available during write planning)
>>>>
>>>> *Option 2*: Applied Deletes (row count = live rows only)
>>>>
>>>> The column file contains only live rows (after applying DVs). A _pos column
>>>> maps each row back to its ordinal position in the base file.
>>>>
>>>> Pros*:*
>>>>
>>>>    - Only stores valid rows in column update files.
>>>>    - Parquet footer statistics are accurate (no skew from NULLs at
>>>>    deleted positions)
>>>>    - Slightly smaller file (no filler bytes)
>>>>
>>>> Cons*:*
>>>>
>>>>    - _pos adds storage overhead (Encoding must be left to the file
>>>>    format)
>>>>    - Stitching requires a scatter operation to allocate a new array
>>>>    and place values at the correct positions
>>>>    - It's not zero-copy in Arrow and requires manipulation.
>>>>    - As it stands today this might be  harder for non-Java engines
>>>>    (see section below)
>>>>
>>>> I investigated how three Iceberg implementations handle vectorized
>>>> reading and what column stitching would require in each. The key
>>>> architectural difference is how they expose Arrow memory:
>>>>
>>>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can
>>>> override getInt(rowId)to redirect reads without copying data. This
>>>> makes scatter operations appear "free" via virtual dispatch. My POC uses
>>>> this approach.
>>>>
>>>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way
>>>> to override what array[i] returns. PyArrow has take() (gather) but lacks
>>>> a scatter() primitive (in the  version we use).
>>>>
>>>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not
>>>> trait objects). Int32Array::value(i) is a direct memory offset. Must
>>>> materialize new arrays via ArrayBuilder for any non-trivial column
>>>> manipulation.
>>>>
>>>> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter
>>>> method to stitch column files. I found the following implementations
>>>> in Arrow which can be used to stitch.
>>>>
>>>>
>>>>    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow
>>>>    20.0.0)
>>>>
>>>>    - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
>>>>    23.0.0)
>>>>    - I did not find scatter in arrow-rs.
>>>>
>>>> I'm still researching these options and would love to hear from
>>>> everyone.
>>>>
>>>> Thanks,
>>>> Anurag
>>>>
>>>>

Reply via email to