Thanks everyone for taking a look! I could echo the concerns above, so I won’t duplicate them. @Andrei: given your prior experience, were there any blockers we haven’t identified yet?
Matt Butrovich <[email protected]> ezt írta (időpont: 2026. máj. 20., Sze, 21:51): > Thanks for writing this up, Anurag, and everyone else for all the analysis > and discussion so far! > > Re: "I did not find scatter in arrow-rs." I think this could be overcome > with take() and materialization, but it could be expensive. Based on what > we've seen with joins these scatter codepaths are often the most expensive > part because they can be extremely cache-unfriendly, especially with larger > data types. > > Discussion seems to be leaning toward Option 1, so I went through a > thought experiment of what this might look like in iceberg-rust, assuming > continuing to use arrow-rs over Parquet data files: > > 1. Plan-time: FileScanTask carries the column_files list for the base > file. Each entry is opaque to scan planning beyond its file_path and > field_ids. > 2. File open: for each task, open the base file plus one reader per > referenced column-update file, in parallel, reusing the existing > ArrowFileReader/AsyncFileReader plumbing. > 3. Per-file projection: partition the requested field_ids by their source > file. Each reader's ProjectionMask covers only the field_ids it owns. > Field_ids overridden by an update file are excluded from the base reader's > mask, so Parquet does not fetch or decode their column chunks. The base > file is read as if those columns were not projected. > 4. Coordinated row selection: the deletion vector and any predicate-driven > row selection are translated to a single file-global RowSelection and > applied identically to every reader. The same set of rows is therefore > selected from every file. > 5. Stitch stage: assemble each output batch by selecting columns from the > per-file batches according to field_id ownership. With aligned row counts > this is an Arc-level column swap; near row group boundaries readers diverge > and re-alignment requires occasional buffer copies (see Risks). > 6. Downstream: the existing RecordBatchTransformer operates on the > stitched batch unchanged. > > Some care needs to be done with predicate handling: > - Predicates over field_ids owned by exactly one file are pushed into that > file's RowFilter. > - Predicates spanning multiple files cannot be pushed; they evaluate > post-stitch via arrow::compute::filter. > - Row group statistics pruning is per-file but the decision is global. > Each file evaluates its own predicates against its own row group stats; the > resulting "could match" row ranges are intersected with each other and with > the DV to produce a single file-global RowSelection shared by all readers. > Each reader then derives its own with_row_groups list from that selection. > Row group boundaries differ between files but row counts stay aligned > because every reader applies the same selection. > > It seems like every primitive we would need is in arrow-rs and the > iceberg-rust reader pipeline already. > > I have one concern wrt row groups, but this mostly comes down to > implementation. Just documenting for posterity/if I have to implement this: > > Parquet row groups are byte-sized and differ across files, and arrow-rs's > batch boundaries follow row group boundaries, so stitching across > misaligned files requires periodic buffer copies (concat-refill at each row > group boundary) to keep readers in sync. An upstream arrow-rs API that > allows the caller to request a short read at a chosen row count would let > the stitcher force re-alignment without copies; absent that, the concat > cost is the practical floor. > > All of that said, neither option 1 or option 2 seem impossible in > iceberg-rust/arrow-rs, so I don't think there are blockers from that > standpoint. > > -Matt > > On Wed, May 20, 2026 at 11:38 AM Gábor Kaszab <[email protected]> > wrote: > >> Minor addition to the storage overhead with writing NULLs as filling >> values: >> If we assume Parquet V2 and we write the _pos to the update file, with >> 20% of the rows deleted will totally amortize the storage overhead of >> writing NULLs. The reason is that with the "positional aligned" updates we >> have a complete sequence of the _pos that is efficiently compressed with >> delta encoding, while with the "applied deletes" approach there are holes >> in the _pos sequence, hence delta encoding is slightly less efficient. >> >> Just some number to visualize with 20% deletion ratio: >> 1 int col _pos + 1 int >> Omit deleted rows 9735745 10111232 >> Null deleted rows 10189522 10205497 >> Overhead 4,66% 0,93% >> >> Gabor >> >> >> Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj. 20., >> Sze, 16:50): >> >>> Hey Iceberg Community, >>> >>> Thanks Anurag for starting a dedicated thread on this! Just a couple of >>> thoughts from my side: >>> >>> *Storage overhead:* >>> *TLDR:* Regardless which way we go, storage overhead shouldn't be an >>> issue. >>> >>> *Details:* >>> I made a couple of measurements recently on the storage size of the >>> update files. For numbers, see the "Update file size measurements" tab in >>> this >>> doc >>> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing> >>> . >>> >>> 1) Writing _pos column: >>> >>> - File formats could hide the overhead for the _pos column, e.g. >>> Parquet V2 has negligible (<0,5%) overhead >>> - I recall there was an argument that having _pos even for >>> pos-aligned updates might help us debug writer issues to see what is >>> missing/icorrect in the update file >>> - On the write path we already read and sort by _pos so there should >>> be no overhead either >>> >>> In summary, I slightly lean towards always having _pos in the updates >>> regardless what representation we choose. >>> If we can agree on this, the relevant pros and cons around _pos are >>> obsolete. >>> >>> 2) Filling rows/values for deleted rows: >>> >>> - I measured noticeable overhead when filling fields for deleted >>> rows with NULLs. As always the overhead depends on many stuff >>> - Deleted 5% of rows => writing NULLs has 2-3% storage overhead >>> - Deleted 20% of rows => writing NULLs has 4-4.6% overhead >>> - Since the consensus is constantly eliminating deletes, I think >>> this overhead is acceptable >>> >>> *Factors to consider to choose representation* >>> As described above, the storage size shouldn't be a factor here, I'd not >>> consider the presence of _pos as pro or con. >>> >>> If we scratch the storage size related ones, here is what we have left: >>> >>> *1) Accuracy of stats* >>> By filling missing rows with auxiliary values the stats can go off >>> >>> - I'm not that worried about Parquet footer stats, they represent >>> the file itself, not the logical deletes on top >>> - If we went for NULLs as filling values, we could set each field in >>> a deleted row to NULL (not the entire row). As a result the column-level >>> null count in the table metadata can be off because of the filling >>> values. >>> - Technically, if we want to, we can correct these stats if we >>> collected the number of deleted rows and then adjust the null count by >>> this >>> number >>> - In general, deletes make stats off anyway regardless of column >>> updates (probably not the null count but the avg size, though) >>> >>> I don't think that this should be a decisive factor when choosing >>> representation. >>> >>> *2) Complexity in general* >>> a) Read - stitching >>> >>> - Positional alignment approach is straightforward for stitching >>> both in vectorized and row readers >>> - Stitch before applying deletes >>> - Applied deletes approach seems more complicated initially >>> - Scattering of update rows is required >>> - Might not be supported for all the language implementations >>> ATM. Thanks Anurag for taking a look! >>> >>> b) Write >>> >>> - Applied deletes approach is straightforward >>> - Positional alignment approach has one major complexity: >>> - when trailing rows are deleted, the writer currently has no >>> information how many rows to fill >>> - In the PoC we broadcast a "file path to row count" so that the >>> writers can now if trailing rows have to be filled with nulls, and >>> with how >>> many (comparing to the _pos column) >>> - In theory we could simply not fill trailing deleted rows, but >>> then we have a hybrid approach between positional alignment and >>> applied >>> deletes. Probably, we don't want this complexity in the spec >>> >>> *3) Read and write performance* >>> >>> - I'm not expecting any difference in write perf >>> - Read could have a toll on the "applied deletes" approach due to >>> scattering. *@pvary* might have some more insights here. >>> >>> *Summary* >>> I hope this summarizes all we have discussed from a different angle and >>> might narrow down the areas to look for. I think the two main questions to >>> sort out to make a decision are: >>> 1) Can we find a way to take care of trailing deletes in "positional >>> aligned" approach (or are we fine not filling trailing deletes) >>> 2) What is the cost of scattering the update rows in the "applied >>> deletes" approach >>> 2/b) Is scattering feasible on all language implementations >>> >>> Best Regards, >>> Gabor >>> >>> >>> Anurag Mantripragada <[email protected]> ezt írta >>> (időpont: 2026. máj. 20., Sze, 2:37): >>> >>>> Hi all, >>>> >>>> Following up on the column updates design >>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65> >>>> and >>>> the original discussion thread >>>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, >>>> I'd like to start a focused discussion on how column update files should >>>> represent rows when deletion vectors (DVs) are present. >>>> >>>> *Context* >>>> >>>> We've reached consensus on using a dense representation for column >>>> update files. When a column is updated, the column file contains values for >>>> all rows including unchanged rows. This avoids complex merge logic on the >>>> write path when successive updates target overlapping fields. >>>> >>>> The open question is: what should the column file contain at positions >>>> where the base file has deleted rows? There are two options. >>>> >>>> *Option 1*: Positional Alignment (row count matches base file) >>>> >>>> The column file has exactly base_file.record_count rows. Row N in the >>>> column file corresponds to row N in the base file. Deleted positions >>>> contain filler values (e.g., NULLs). >>>> >>>> Pros*:* >>>> >>>> - Stitching is a zero-copy column swap in Arrow >>>> - Works identically in every Arrow implementation (Java, Rust, >>>> Python, C++) >>>> - No _pos column needed >>>> - Engines apply their existing DV filter to both base and column >>>> file >>>> >>>> Cons*:* >>>> >>>> - Filler values at deleted positions skew Parquet footer statistics >>>> (null_count, avg_length) >>>> - Writes slightly more data than necessary (filler values for >>>> deleted rows) >>>> - Writer must know base_file.record_count to pad trailing deletions >>>> (base file metadata already available during write planning) >>>> >>>> *Option 2*: Applied Deletes (row count = live rows only) >>>> >>>> The column file contains only live rows (after applying DVs). A _pos column >>>> maps each row back to its ordinal position in the base file. >>>> >>>> Pros*:* >>>> >>>> - Only stores valid rows in column update files. >>>> - Parquet footer statistics are accurate (no skew from NULLs at >>>> deleted positions) >>>> - Slightly smaller file (no filler bytes) >>>> >>>> Cons*:* >>>> >>>> - _pos adds storage overhead (Encoding must be left to the file >>>> format) >>>> - Stitching requires a scatter operation to allocate a new array >>>> and place values at the correct positions >>>> - It's not zero-copy in Arrow and requires manipulation. >>>> - As it stands today this might be harder for non-Java engines >>>> (see section below) >>>> >>>> I investigated how three Iceberg implementations handle vectorized >>>> reading and what column stitching would require in each. The key >>>> architectural difference is how they expose Arrow memory: >>>> >>>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can >>>> override getInt(rowId)to redirect reads without copying data. This >>>> makes scatter operations appear "free" via virtual dispatch. My POC uses >>>> this approach. >>>> >>>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way >>>> to override what array[i] returns. PyArrow has take() (gather) but lacks >>>> a scatter() primitive (in the version we use). >>>> >>>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not >>>> trait objects). Int32Array::value(i) is a direct memory offset. Must >>>> materialize new arrays via ArrayBuilder for any non-trivial column >>>> manipulation. >>>> >>>> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter >>>> method to stitch column files. I found the following implementations >>>> in Arrow which can be used to stitch. >>>> >>>> >>>> - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow >>>> 20.0.0) >>>> >>>> - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow >>>> 23.0.0) >>>> - I did not find scatter in arrow-rs. >>>> >>>> I'm still researching these options and would love to hear from >>>> everyone. >>>> >>>> Thanks, >>>> Anurag >>>> >>>>
