Thanks Gabor! Well explained.
On Mon, May 11, 2026 at 10:11 PM Gábor Kaszab <[email protected]> wrote:
>
> Hey Gang Wu,
>
> Thanks for taking a look!
> The biggest consideration in choosing the dense representation over sparse is that on the write path sparse could become complex if we want to merge the existing update for a particular field ID with the newly changed fields (consider a partial update for [col_1, col_2] first and then [col_2, col_3] second). This would push complexity onto the engine/writer, which we wanted to avoid.
>
> The dense representation might be less performant on the write side compared to sparse, but it is far simpler, and according to our measurements dense is still very performant compared to the existing MoR, even for tables that are not that wide. We already have a PoC for the dense representation that seems similar to a "projected copy-on-write". File path is something we need to read anyway on the write path to be able to provide a 1-1 mapping between base files per update operation.
> The potential need for distributing the row count is something we ran into too. It's needed when there are deleted rows at the end of the base file and we want to fill the end of the update file with auxiliary values. Note, even though we went for the dense representation for partial updates, we could still choose to represent the deleted rows in a sparse manner, meaning we simply don't write them to the update file. I'd personally like to find a way to have positionally aligned update files with auxiliary values, TBH. For that we either need to distribute the per-base-file row count to know whether we have to fill values up to the end of the update file, or we could leave the end of the update file unfilled and handle this on the reader side when stitching to the base file.
>
> I hope I covered your concerns, but feel free to explain more if I haven't!
>
> Thanks,
> Gabor
>
> On Sat, May 9, 2026 at 10:32 AM Gang Wu <[email protected]> wrote:
>>
>> I agree that writing dense update files improves read performance. The main point of introducing column update files is to accelerate the write job. Have we considered the complexity of an SQL column update job plan if we use dense update files? For example, we must either read updated columns from the data file to fetch before-update values or carry extra metadata for every data file (e.g. file path, total row count) to restore benign values of deleted rows.
>>
>> I'm not objecting to dense files, I just want to raise some concerns that we have discussed internally.
>>
>> On Fri, May 8, 2026 at 9:31 PM Péter Váry <[email protected]> wrote:
>>>
>>> One scenario where storing only plain values clearly pays off is vectorized reader stitching.
>>>
>>> If both the base file (baseBatch) and the update file (updateBatch) return batches of equal size, stitching is trivial: we can construct a merged batch (mergedBatch) by simply concatenating the column vectors from both inputs, creating a new ColumnarBatch on top of them.
>>>
>>> /**
>>>  * Merges the columns from {@code baseBatch} and {@code updateBatch} into a new {@link
>>>  * ColumnarBatch} that contains all columns from both inputs, in order (base columns first,
>>>  * followed by update columns). Both batches must have the same number of rows.
>>>  *
>>>  * @param baseBatch the base batch whose columns appear first
>>>  * @param updateBatch the update batch whose columns are appended after the base columns
>>>  * @return a new {@link ColumnarBatch} containing all columns from both input batches
>>>  */
>>> static ColumnarBatch mergeBatches(ColumnarBatch baseBatch, ColumnarBatch updateBatch) {
>>>   Preconditions.checkArgument(baseBatch != null, "Invalid base batch: null");
>>>   Preconditions.checkArgument(updateBatch != null, "Invalid update batch: null");
>>>   Preconditions.checkArgument(
>>>       baseBatch.numRows() == updateBatch.numRows(),
>>>       "Cannot merge batches with different row counts: %s != %s",
>>>       baseBatch.numRows(),
>>>       updateBatch.numRows());
>>>
>>>   int baseCols = baseBatch.numCols();
>>>   int updateCols = updateBatch.numCols();
>>>   ColumnVector[] merged = new ColumnVector[baseCols + updateCols];
>>>   for (int i = 0; i < baseCols; i += 1) {
>>>     merged[i] = baseBatch.column(i);
>>>   }
>>>
>>>   for (int i = 0; i < updateCols; i += 1) {
>>>     merged[baseCols + i] = updateBatch.column(i);
>>>   }
>>>
>>>   return new ColumnarBatch(baseBatch.numRows(), merged);
>>> }
>>>
>>> The situation becomes more complex if updateBatch does not contain a row for every base row. In that case, we must either:
>>>
>>> - perform a sparse to dense conversion, or
>>> - fall back to row-by-row stitching
>>>
>>> The first option is preferable but introduces overhead for vectors with missing rows. The second is significantly worse, as it breaks vectorization and severely degrades reader performance.
>>>
>>> A representation where every row is materialized in the update file avoids these issues entirely, allowing us to keep stitching fully vectorized. For that reason, I strongly prefer this approach for vectorized reads.
>>>
>>> On Thu, May 7, 2026 at 8:32 PM Russell Spitzer <[email protected]> wrote:
>>>>
>>>> I don't think we need to force anything, but I think engines should use delta encoding. The spec really just needs to describe something technically feasible; it's up to engines to actually implement that in a useful way.
>>>>
>>>> Here is a quick example of two files with a single column containing values from 0->999,999
>>>>
>>>> Delta encoding is 43 kb
>>>> Plain encoding is 3910 kb
>>>>
>>>> On Thu, May 7, 2026 at 8:07 AM Gábor Kaszab <[email protected]> wrote:
>>>>>
>>>>> Hey All,
>>>>>
>>>>> Thank you for the constructive participation in the latest sync! Let me sum up the key highlights. I found the recording is already uploaded here.
>>>>>
>>>>> Decided: What update file representation to use for partial updates?
>>>>>
>>>>> - We have been discussing this for a while, but it seems we reached consensus here
>>>>> - Options were:
>>>>>   - Sparse representation: write only what we update
>>>>>   - "Complete" / "Dense" / "Projected copy-on-write" representation: write entire columns including values that are unchanged
>>>>> - Decision: After a long discussion we had consensus to go with the "Complete" approach due to the simplicity it provides compared to the complexity of the other approach, especially on the write path.
>>>>>
>>>>> Open: What update file representation to use if we have deletes?
>>>>>
>>>>> - Going with the "Complete" representation for partial updates, it is still a question how to represent update files when some rows are deleted/invalidated in the base file with DVs.
>>>>> - Options are:
>>>>>   - Positional matching update files (same number of rows as in base file, in same order).
>>>>>     - We have to fill the deleted rows with some values
>>>>>     - Options to fill these rows:
>>>>>       - NULLs: In Parquet we can't set an entire row to NULL, but we can set each field to NULL.
>>>>>       - Same value as in base file
>>>>>       - Last valid value (backfill when deleted from beginning): similar to what Lance does
>>>>>     - Pros:
>>>>>       - Stitching with the base file seems more straightforward.
>>>>>       - We might not even need to write the 'position' column to the update file
>>>>>     - Cons:
>>>>>       - We write values just to discard them later
>>>>>       - Extra complexity on the write path
>>>>>       - Auxiliary values might throw the stats off (num nulls, avg length, value count)
>>>>>       - Effect on storage size?
>>>>>       - Some values might make some operations invalid, like dividing by zero, etc.
>>>>>   - Don't write such rows to the update file
>>>>>     - Update files wouldn't be positionally aligned with base files
>>>>>     - Pros:
>>>>>       - Everything in the update file is valid
>>>>>       - No stats thrown off
>>>>>       - No extra complexity on the write path
>>>>>     - Cons:
>>>>>       - Position column is a must in the update file
>>>>>       - Seems more complicated to stitch rows with the base file (especially in vectorized read)
>>>>> - I don't think we reached an entire consensus on this front.
>>>>> - Do we want to allow both? Engines could decide which one they write.
>>>>>
>>>>> Open: Is it necessary to write position into the update file?
>>>>>
>>>>> - In general I think we assume that we write the position into the update file
>>>>> - However, we can exercise the idea: what if we don't?
>>>>>   - See "Complete" representation with filler values => we might not need pos
>>>>> - Ran some experiments on this:
>>>>>   - Wrote positions [1 .. 2.5M] into the update files
>>>>>   - By default Iceberg uses Plain values + ZSTD for positions
>>>>>   - This brings an extra 2.5MB storage size per update file
>>>>>   - If we update an INT col that compresses well, then the update file size is mainly taken by the position column
>>>>>   - Update 1 INT col that compresses well: 200KB without pos => 2.8MB with pos
>>>>>   - Update 1 INT col that doesn't compress well: 6.7MB without pos => 9.2MB with pos
>>>>> - I think the questions here are:
>>>>>   - Can we enforce writing the update file with Parquet V2 to use delta encoding?
>>>>>   - Can we restrict column updates to readers supporting Parquet V2?
>>>>>   - Should we allow Parquet V1 update files, and are we OK with the storage overhead then?
>>>>>
>>>>> See you in the discussion!
>>>>> Cheers,
>>>>> Gabor
>>>>>
>>>>> On Sat, Apr 25, 2026 at 1:52 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Thanks for attending the efficient column updates sync on Tuesday. You can find the recording here: Link to Recording.
>>>>>>
>>>>>> Key discussion points from the meeting:
>>>>>>
>>>>>> - Change Detection: We discussed scenarios for change detection. In both sparse and full column file representations, adding a last_updated_sequence_number to each row would allow readers to differentiate newly updated rows from unchanged rows carried over from the base file.
>>>>>> - Full vs. Sparse Column File Representation: The choice between full and sparse column files is deferred pending further benchmarks. Gabor and I will continue the POCs on both representations.
>>>>>> - Read Performance: We concluded that column updates are primarily intended to optimize writes by avoiding full row rewrites, rather than to improve read performance. Any potential read gains from parallel reads will likely be superseded by the upcoming vectorized I/O feature.
>>>>>> - Spark SPIP: I submitted an SPIP in Spark to get feedback on write schema narrowing for column updates.
>>>>>>
>>>>>> Thanks,
>>>>>> Anurag
>>>>>>
>>>>>> On Mon, Mar 30, 2026 at 11:12 AM Gábor Kaszab <[email protected]> wrote:
>>>>>>>
>>>>>>> Steven, these are the observations wrt the multithreaded reading:
>>>>>>> - Multithreading here is based on column splits: read the base file and the update files in parallel
>>>>>>> - There is usually a bigger "base" file reader and a number of additional readers for the smaller update files
>>>>>>> - The tests simulate a full scan of all the columns, no projection
>>>>>>> - When running the reads in parallel, the base file reading will dominate the runtime, while the reading of the update files is way faster
>>>>>>> - The more columns are read from update files, the fewer columns we have to read from the base file. This results in faster reading of the base file
>>>>>>> - When reading a high number of columns from update files, the base file reading is still the dominant factor. => the faster the base file reading finishes, the faster the entire read process is
>>>>>>>
>>>>>>> Some additional info on further measurements:
>>>>>>> - In an isolated environment, e.g. when running benchmarks on the readers, multithreading brings a read performance boost
>>>>>>> - Also in an isolated environment, the more columns are updated, the faster the total reading gets (because as described above, it reduces the time spent on the dominant read of the base file)
>>>>>>> - However, when putting this multithreaded read approach together with Spark, no read gain shows unfortunately.
>>>>>>>   In fact there is performance degradation
>>>>>>> - Our initial analysis is that Spark already reads file splits (by row groups) assigned to cores, and there are easily more read splits than cores
>>>>>>> - CPU seems already saturated even when processing the split reads
>>>>>>> - When adding another dimension of parallelization via multithreaded column split reading, there is no extra CPU capacity to use. This seems to bring overhead but no gain, hence the degradation
>>>>>>> - The measurement was executed on a local machine with 14 cores and 5 data files (22 row group splits). My theory is that we'd have the same situation in a prod-like setup too, because it's not that difficult to get CPU saturation there either. Back-of-the-envelope calculation: 1000-node cluster, 32 cores per node => 32,000 row group split capacity => with Iceberg default settings (4 row groups per file) that is a capacity of 8,000 files, which is not an extreme number
>>>>>>> - Will follow up on this once we have even more understanding
>>>>>>>
>>>>>>> Hope this makes sense! Let me know if there are further questions!
>>>>>>> Gabor
>>>>>>>
>>>>>>> On Mon, Mar 30, 2026 at 7:37 PM Steven Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Peter, why does the run time decrease with more updated columns for the multi-threaded benchmark? It seems counterintuitive.
>>>>>>>>
>>>>>>>> On Thu, Mar 26, 2026 at 5:16 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> Tuesday evenings are not the best for me, but Gabor was there and he presented the work he has done well.
>>>>>>>>>
>>>>>>>>> I have run tests for doing parallel reads using multiple threads.
>>>>>>>>>
>>>>>>>>> TL;DR:
>>>>>>>>> We can see an increase in single-threaded read time as integer columns are updated one-by-one in a 100‑column table. In contrast, with multi‑threaded reads, the runtime slightly decreases, as performance is dominated by the base file, from which fewer and fewer columns are read.
>>>>>>>>>
>>>>>>>>> In detail:
>>>>>>>>>
>>>>>>>>> Baseline - reading the same data with the original reader:
>>>>>>>>> Benchmark                                           (multiThreaded)  (updatedColumns)  Mode  Cnt  Score   Error  Units
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                 0    ss   20  4.320 ± 0.020   s/op
>>>>>>>>>
>>>>>>>>> Single threaded reading results:
>>>>>>>>> Benchmark                                           (multiThreaded)  (updatedColumns)  Mode  Cnt  Score   Error  Units
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                 1    ss   20  4.547 ± 0.031   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                 2    ss   20  4.606 ± 0.066   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                 3    ss   20  5.063 ± 0.039   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                 4    ss   20  5.159 ± 0.019   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                 5    ss   20  5.839 ± 0.064   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                10    ss   20  6.760 ± 0.032   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                40    ss   20  6.691 ± 0.093   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates            false                80    ss   20  7.434 ± 0.157   s/op
>>>>>>>>>
>>>>>>>>> Multi threaded reading results:
>>>>>>>>> Benchmark                                           (multiThreaded)  (updatedColumns)  Mode  Cnt  Score   Error  Units
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                 1    ss   20  4.394 ± 0.033   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                 2    ss   20  4.367 ± 0.048   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                 3    ss   20  4.332 ± 0.017   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                 4    ss   20  4.353 ± 0.024   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                 5    ss   20  4.285 ± 0.019   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                10    ss   20  4.285 ± 0.030   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                40    ss   20  3.529 ± 0.023   s/op
>>>>>>>>> ColumnUpdateParquetBenchmark.readWithColumnUpdates             true                80    ss   20  3.526 ± 0.025   s/op
>>>>>>>>>
>>>>>>>>> The tests are exercising FormatModelRegistry.readBuilder directly, so the integrated test results might differ a bit.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 24, 2026 at 7:10 PM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Thank you for attending today's sync. You can find the recording of the session here.
>>>>>>>>>>
>>>>>>>>>> Highlights
>>>>>>>>>>
>>>>>>>>>> - We presented initial results using a fully row-aligned column update file representation.
>>>>>>>>>> - We discussed how change detection will work alongside column updates.
>>>>>>>>>>
>>>>>>>>>> Next Steps
>>>>>>>>>>
>>>>>>>>>> - Extend the POC to include sparse representation with read-time null alignment to compare write and read performance against the row-aligned approach.
>>>>>>>>>> - Run the POC on larger tables within cloud storage.
>>>>>>>>>> - Enhance the POC with parallel file reads (base and column updates) to achieve performance gains.
>>>>>>>>>> - Update the design document with a section on change detection, ensuring alignment with ongoing V4 single-commit design discussions.
>>>>>>>>>> - Make this sync a bi-weekly recurring meeting.
>>>>>>>>>>
>>>>>>>>>> ~ Anurag
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 20, 2026 at 9:26 AM Anurag Mantripragada <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Gabor,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing the POC results. This is exciting!
>>>>>>>>>>>
>>>>>>>>>>> As Gabor mentioned, we are continuing to iterate while taking initial measurements. Our next steps involve testing on larger workloads and cloud storage. I look forward to discussing this further during our next sync.
>>>>>>>>>>>
>>>>>>>>>>> @Péter Váry - Thanks for your findings, we are proceeding with full column updates in the initial POC
>>>>>>>>>>> @Gang Wu, I will update the design document to address your comments before our meeting on Tuesday.
>>>>>>>>>>>
>>>>>>>>>>> Efficient column updates sync
>>>>>>>>>>> Tuesday, March 24 · 9:00 – 10:00am
>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>> Video call link: https://meet.google.com/nif-qtvi-uzm
>>>>>>>>>>>
>>>>>>>>>>> Happy weekend!
>>>>>>>>>>> ~ Anurag
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 20, 2026 at 7:59 AM Gábor Kaszab <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hey All,
>>>>>>>>>>>>
>>>>>>>>>>>> I just wanted to share some results wrt the PoC I'm working on with Anurag and Peter. We got to a point where we have a working end-to-end PoC where we can issue a full column update from Spark SQL and then also read a table with column updates stitching the rows together. Obviously there are some cut corners, and some of the design decisions aren't concluded, but this PoC is nice to see if there are any missing building blocks and also gives us the chance to run initial performance measurements.
>>>>>>>>>>>>
>>>>>>>>>>>> TL;DR:
>>>>>>>>>>>>
>>>>>>>>>>>> - Only full column updates are covered
>>>>>>>>>>>> - Initial measurements on the write path show pretty nice improvements compared to MoR and CoW
>>>>>>>>>>>> - Storage cost improvement is obvious, and the PoC also showed it
>>>>>>>>>>>> - There is a cost on the read side that is proportional to the number of update files being stitched together. Seems reasonable, but we are working on further improvements
>>>>>>>>>>>>
>>>>>>>>>>>> Details about the PoC:
>>>>>>>>>>>>
>>>>>>>>>>>> - Only full column updates are allowed:
>>>>>>>>>>>>   - UPDATE tbl SET col_x = <some_input>, col_y = <some_other_input>;
>>>>>>>>>>>>   - No WHERE filter
>>>>>>>>>>>> - To easily trigger column update writes, a new "write.update.mode"="column-update" is introduced for the sake of testing
>>>>>>>>>>>> - Required Spark changes on the write/UPDATE side:
>>>>>>>>>>>>   - Project the scan to produce only the rows required for the update
>>>>>>>>>>>>   - Project the writer to write only the updated columns
>>>>>>>>>>>> - Content of update files:
>>>>>>>>>>>>   - Updated columns only
>>>>>>>>>>>>   - No _pos or other auxiliary columns
>>>>>>>>>>>> - Row count of update files matches row count of base file
>>>>>>>>>>>>   - In case of deleted rows, fields are filled with NULLs when writing
>>>>>>>>>>>>   - For this, use the _pos from the base file to see the need to fill rows
>>>>>>>>>>>>   - If deleted from the end, a "file path to row count" mapping is distributed so that writers can fill trailing rows
>>>>>>>>>>>> - Row reader is covered, vectorized reader is not
>>>>>>>>>>>> - Only Parquet readers are covered
>>>>>>>>>>>> - ColumnSplitReadBuilder that:
>>>>>>>>>>>>   - Holds different readers for different column splits: base file + update files
>>>>>>>>>>>>   - Projection could rule out readers if they are no longer needed for the set of fields queried
>>>>>>>>>>>>   - Split reading can jump to the beginning of the split of the base file, while we have to iterate to the required position on the update files' readers
>>>>>>>>>>>>   - Filtering is not yet implemented, will work on it
>>>>>>>>>>>>
>>>>>>>>>>>> The PoC code is here:
>>>>>>>>>>>>
>>>>>>>>>>>> - Spark
>>>>>>>>>>>> - Iceberg
>>>>>>>>>>>>
>>>>>>>>>>>> Details about the measurements:
>>>>>>>>>>>> - Numbers collected in this doc. First tab for single writes, second tab for reads
>>>>>>>>>>>> - Local measurements using Spark with HadoopCatalog and local FS as storage
>>>>>>>>>>>> - Test table is 5 data files, each ~500MB and 2.5M rows. Table has 100 cols (80 ints, 20 strings).
>>>>>>>>>>>>
>>>>>>>>>>>> Write:
>>>>>>>>>>>>
>>>>>>>>>>>> - After initial measurements we enhanced the scan to project to the required fields only (plus _file, _pos and _partition metadata cols)
>>>>>>>>>>>> - Updating a single col runs in a fraction of the time compared to CoW/MoR (1.5s vs 20s)
>>>>>>>>>>>> - Increasing the number of read and written columns increased the required time proportionally
>>>>>>>>>>>> - Reading 20 cols as input and writing 20 cols to the update files is still significantly faster than CoW/MoR (5s vs 20s)
>>>>>>>>>>>> - Going with the default settings, base files have 4 row groups while the update files have a single row group each
>>>>>>>>>>>>
>>>>>>>>>>>> Read:
>>>>>>>>>>>>
>>>>>>>>>>>> - For 1 column update file there is 10% overhead compared to CoW
>>>>>>>>>>>> - For 10 column update files there is 15% overhead compared to CoW
>>>>>>>>>>>> - MoR goes mad, no point in using it for comparison: with each full update we multiply the amount of data read and DVs applied. Isn't this something we can improve?
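A minimal sketch of the NULL-filling step from the PoC details above, assuming the writer sees the surviving rows in ascending `_pos` order and learns the base file's total row count from the distributed "file path to row count" mapping. `DenseUpdateFiller` and `fill` are hypothetical names used for illustration only, not part of the PoC code, and a single nullable INT column stands in for the updated columns:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: pads an update column so it stays positionally aligned with the base file. */
final class DenseUpdateFiller {
  private DenseUpdateFiller() {}

  /**
   * @param livePositions ascending _pos values of the rows that were not deleted
   * @param liveValues new column value for each live row, in the same order as livePositions
   * @param baseRowCount total row count of the base file, deleted rows included
   * @return one value per base-file row; deleted rows are filled with null
   */
  static List<Integer> fill(long[] livePositions, Integer[] liveValues, long baseRowCount) {
    if (livePositions.length != liveValues.length) {
      throw new IllegalArgumentException("positions and values must have the same length");
    }
    List<Integer> out = new ArrayList<>();
    long next = 0; // next base-file position that still needs a value
    for (int i = 0; i < livePositions.length; i++) {
      long pos = livePositions[i];
      if (pos < next || pos >= baseRowCount) {
        throw new IllegalArgumentException("position out of order or out of range: " + pos);
      }
      // A gap between the expected and the actual position means those rows were deleted.
      for (; next < pos; next++) {
        out.add(null);
      }
      out.add(liveValues[i]);
      next = pos + 1;
    }
    // Trailing deletes never show up in the scan; only the base row count reveals them.
    for (; next < baseRowCount; next++) {
      out.add(null);
    }
    return out;
  }
}
```

For example, live positions 0, 1 and 4 in a 6-row base file yield six output values, with nulls at positions 2, 3 and 5. Position 5 is the trailing-delete case that needs the row count.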
>>>>>>>>>>>> Next experiment: row group aligned update files with the base files
>>>>>>>>>>>>
>>>>>>>>>>>> - Expectation is that storage cost would be slightly higher due to somewhat worse compression because of multiple row groups in one update file
>>>>>>>>>>>> - For split reading, update readers won't have to iterate to the first row of the split
>>>>>>>>>>>>
>>>>>>>>>>>> Another experiment: Produce rows in a multithreaded fashion for the split readers
>>>>>>>>>>>>
>>>>>>>>>>>> We'll continue the work on the PoC and will execute further measurements, I just wanted to share the state before heading off for the weekend. Probably we can discuss further details on the sync next week.
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Gabor
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 19, 2026 at 9:59 AM Gang Wu <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with Peter that it would be too complicated to maintain LUSN for the sparse format, so +1 on full column update.
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW, I have some questions about cases where the columns to update have schema evolutions or are nested types. I have left some inline comments in the design doc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 10, 2026 at 5:48 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Short summary:
>>>>>>>>>>>>>> Surfacing changed rows ultimately requires a merged _last_updated_sequence_number (LUSN) column for both partial and full updates, which significantly reduces the benefits of partial updates and adds complexity. Given this, moving forward with full column updates seems reasonable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Details:
>>>>>>>>>>>>>> For partial updates, we can assume that update files contain only changed rows, but if we rely on that, it needs to be explicitly guaranteed. For full updates, the engine can compute changed rows directly since original values are available.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The requirement to identify updated rows could be satisfied using the _last_updated_sequence_number column. Every update must correctly advance LUSN for updated rows, even when consecutive updates touch disjoint row sets. As a result, each update must carry forward a merged LUSN view, regardless of whether the update is partial or full.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With partial updates, this leaves us with three options:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Store LUSN in a separate file (extra file per write),
>>>>>>>>>>>>>> 2. Rewrite a larger cell matrix covering all rows changed since the base file (more read and write I/O, but no extra file),
>>>>>>>>>>>>>> 3. Resolve LUSN at read time by merging multiple files (added read complexity and file access).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Option (2) seems the most reasonable, but it further reduces the gains of partial column updates and increases writer complexity.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Overall, once LUSN handling is factored in, the advantages of partial updates shrink considerably, while the complexity grows.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 6, 2026 at 2:12 AM Steven Wu <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I watched the recording and had the same question regarding the sparse representation that Peter mentioned.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> DVs seem like a good fit for updating columns for a small percentage of rows in a file. Full column update is good for updating a high percentage of rows. If we are going with this high-level guideline, what's the need for the sparse representation? The discussion brought up one reason for the sparse representation: handling deleted rows (via DV). E.g., a data file has 100 rows and 10 rows have been deleted via DV. How should we generate the column file: 90 or 100 rows? The former would require sparse representation for tracking row positions in the column file. The latter would require solving these two problems.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) What column values should be filled for deleted rows? Null or default values seem good candidates. In the end, those column values don't really matter. Null may mess up the null_value_count stats. However, it is possible for the writer to keep the null_value_count correct in this case. Besides, we already have imprecise column stats (min, max, null_count) after applying a DV to a data file. Extra data files can be scanned and rows will be ignored via residual filter. There is no impact on query correctness.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) How to detect deleted rows? I remember Russell briefly mentioned one idea in the first meeting using the `pos` metadata column (ordinal position within a file). If the records are processed in the same order, we can use the gap to detect deleted rows. E.g., the last processed row position is 10, and the current row position is 13. Then we know rows 11 and 12 have been deleted and null values can be persisted into the column file for the deleted rows.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 5, 2026 at 5:57 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had to drop off after the first half hour, but I watched the recording afterward and discussed the topic in depth with Gábor.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> TL;DR
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - My intuition is that the volume of updated data is usually small compared to the original file. Even for a table with, say, 1000 columns, updating only a few columns typically produces relatively little data, even if those columns are rewritten in full. As a result, overall cost is often dominated more by file access and seek overhead than by the actual amount of data read or written. This suggests we should favor a simpler solution and support only full column updates.
>>>>>>>>>>>>>>>> - Predicate pushdown does not work well with partial updates, whereas with some effort it can be made to work with full column updates.
>>>>>>>>>>>>>>>> - If we do want to support partial updates, I agree that sparse update files make sense. However, if we decide not to support partial updates, I think we should revisit the decision to use a custom encoding for update files. In that case, update files will typically contain very few deleted rows, which invalidates several assumptions behind sparse encodings. In this scenario, we could relatively cheaply add the `_file`, `_pos`, and `_deleted` columns to the read query, use that information to write out the results, and delegate the encoding to Parquet. Parquet already provides efficient encodings for columns that are not extremely sparse, and it would be difficult to outperform that with a custom solution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In detail
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What we gain with partial updates compared to full column updates is mostly on the write path
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - We don’t need to read unchanged column values.
>>>>>>>>>>>>>>>> - We may not need to touch the original data file at all.
>>>>>>>>>>>>>>>> - We don’t need to write unchanged values (although we still need to create a new file).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What we lose is mostly on the read path
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - We need to read the column from the original file.
>>>>>>>>>>>>>>>> - We also need to read the update file (even if it’s small, it’s still an additional file access).
>>>>>>>>>>>>>>>> - Predicate pushdown does not work on the updated column; filters must be applied manually. Predicate pushdown can only be applied to the update file itself.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Edge cases
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Partial updates shine when updates do not require reading from the original table at all.
>>>>>>>>>>>>>>>> - Full updates are best when reads only need to touch the newly written data and can completely ignore the original file.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Typical case comparison >>>>>>>>>>>>>>>> In practice, both approaches look quite similar in terms of >>>>>>>>>>>>>>>> file access: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Reads: original file + new data file in both cases. >>>>>>>>>>>>>>>> Writes: read the original file (and any existing update file) >>>>>>>>>>>>>>>> and write new data in both cases. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The main differences are: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> With partial column updates, we read and write less data >>>>>>>>>>>>>>>> during updates (only the changed cells). >>>>>>>>>>>>>>>> With full column updates, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Reads are cheaper because data is already merged into a single >>>>>>>>>>>>>>>> file and we don’t need to read old column data. >>>>>>>>>>>>>>>> Predicate pushdown can work, although we still need to combine >>>>>>>>>>>>>>>> with columns from the base file. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Overall, the key difference is the amount of column data read >>>>>>>>>>>>>>>> and written, not full file sizes. At that point, file access >>>>>>>>>>>>>>>> patterns and seek overhead tend to dominate the cost rather >>>>>>>>>>>>>>>> than raw I/O volume. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Anurag Mantripragada <[email protected]> ezt írta >>>>>>>>>>>>>>>> (időpont: 2026. márc. 5., Cs, 3:35): >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi everyone! >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for joining the sync today. Sorry, Google cut us off >>>>>>>>>>>>>>>>> while Gabor was explaining his POC work. We can discuss that >>>>>>>>>>>>>>>>> in the next meeting. Here is the recording. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Meeting notes: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Partial updates >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We could potentially support partial updates if the writer >>>>>>>>>>>>>>>>> could merge all the existing updates COW style into a new >>>>>>>>>>>>>>>>> column file. 
We could potentially explore this, but the >>>>>>>>>>>>>>>>> general consensus was to favor a single mechanism for >>>>>>>>>>>>>>>>> updates, whether partial or not. This requires some more >>>>>>>>>>>>>>>>> thought and we can iterate over it. >>>>>>>>>>>>>>>>> This remains an open question until we consider all the >>>>>>>>>>>>>>>>> synchronous writing cases. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Column File Row Alignment >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We generally agreed on using sparse Parquet files to store >>>>>>>>>>>>>>>>> updates. Each update file contains only the modified values >>>>>>>>>>>>>>>>> and their corresponding row positions from the base file. >>>>>>>>>>>>>>>>> Rationale: This avoids the stats-corruption risk of full, >>>>>>>>>>>>>>>>> padded files (which would require filling non-updated rows >>>>>>>>>>>>>>>>> with arbitrary values) and the Parquet limitation against >>>>>>>>>>>>>>>>> top-level nulls. >>>>>>>>>>>>>>>>> Read Path: Readers will materialize the sparse updates into a >>>>>>>>>>>>>>>>> full buffer with nulls, then efficiently merge by position. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Single Update File Per Column >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> To simplify reads, each base file can have only one active >>>>>>>>>>>>>>>>> update file per column. >>>>>>>>>>>>>>>>> Subsequent updates must rewrite the existing update file, >>>>>>>>>>>>>>>>> synchronously applying all prior changes. >>>>>>>>>>>>>>>>> This avoids the complexity of merging multiple update files >>>>>>>>>>>>>>>>> during the read path. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Other open questions >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Change Detection: We need to think more about how change >>>>>>>>>>>>>>>>> detection would work with the synchronous update case. The V4 >>>>>>>>>>>>>>>>> spec is undergoing revisions to support other use cases, and >>>>>>>>>>>>>>>>> we should follow that work to see how this design aligns with >>>>>>>>>>>>>>>>> it. 
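The read path from the Column File Row Alignment notes above (materialize the sparse updates into a full buffer with nulls, then merge by position) can be sketched roughly as follows. This is plain Java, not Iceberg or Spark code: `SparseUpdateStitcher` and its methods are hypothetical names, and the sketch assumes a non-nullable long column so that null can stand for "row not updated".

```java
/** Hypothetical helper (not an Iceberg API) illustrating the sparse-update read path. */
final class SparseUpdateStitcher {
  private SparseUpdateStitcher() {}

  /**
   * Expands (position, value) pairs from a sparse update file into a buffer with one
   * slot per base-file row. Rows that were not updated stay null.
   */
  static Long[] materialize(int baseRowCount, int[] positions, long[] values) {
    if (positions.length != values.length) {
      throw new IllegalArgumentException("positions and values must have the same length");
    }
    Long[] buffer = new Long[baseRowCount];
    for (int i = 0; i < positions.length; i++) {
      buffer[positions[i]] = values[i];
    }
    return buffer;
  }

  /** Merges by position: take the updated value where present, otherwise the base value. */
  static long[] merge(long[] base, Long[] updates) {
    if (base.length != updates.length) {
      throw new IllegalArgumentException("base and update buffers must have the same length");
    }
    long[] merged = new long[base.length];
    for (int pos = 0; pos < base.length; pos++) {
      // Assumption: null means "not updated". A nullable column would need a separate mask.
      merged[pos] = updates[pos] != null ? updates[pos] : base[pos];
    }
    return merged;
  }
}
```

Because the materialized buffer has exactly one slot per base row, the merge is a single positional pass with no join on row position.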
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Next Steps >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Anurag: Update the design doc with the sparse file format, >>>>>>>>>>>>>>>>> the single-update-file rule and add details about how this >>>>>>>>>>>>>>>>> would work in various scenarios. >>>>>>>>>>>>>>>>> Anurag: Review the V4 CDC metadata proposal to ensure >>>>>>>>>>>>>>>>> alignment with the column update design. >>>>>>>>>>>>>>>>> Gábor: Continue developing the POC, focusing on the >>>>>>>>>>>>>>>>> synchronous rewrite logic and reader implementation. Anurag >>>>>>>>>>>>>>>>> will work on the Spark plumbing needed to materialize only >>>>>>>>>>>>>>>>> the changed rows and the planner changes in Spark 4.x >>>>>>>>>>>>>>>>> All: Schedule a follow-up meeting to review the updated >>>>>>>>>>>>>>>>> design doc. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> Anurag >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Mar 4, 2026 at 1:47 PM Anton Okolnychyi >>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Gabor, I know Anurag also expressed interest in extending >>>>>>>>>>>>>>>>>> Spark DML to accommodate column updates. I am happy to work >>>>>>>>>>>>>>>>>> with both of you to get the Spark piece designed and >>>>>>>>>>>>>>>>>> implemented. It is not something we would be able to handle >>>>>>>>>>>>>>>>>> in Iceberg via extensions. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Regarding partial updates, I agree we will have to iterate >>>>>>>>>>>>>>>>>> on open questions before making a call on whether to support >>>>>>>>>>>>>>>>>> this functionality. Can you elaborate on the last use case >>>>>>>>>>>>>>>>>> you mentioned? Why would we have to combine {1} with {2, 3}? >>>>>>>>>>>>>>>>>> Will it be possible to produce a column file with only >>>>>>>>>>>>>>>>>> affected columns in each write? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ср, 4 бер. 2026 р. 
о 12:13 Gábor Kaszab >>>>>>>>>>>>>>>>>> <[email protected]> пише: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hey All, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Apparently, the meeting dropped all of us after exactly one >>>>>>>>>>>>>>>>>>> hour :) At the end I just wanted to mention that during my >>>>>>>>>>>>>>>>>>> attempt to implement a PoC I found a couple of missing >>>>>>>>>>>>>>>>>>> building blocks (collecting the updated field IDs when >>>>>>>>>>>>>>>>>>> committing after a Spark write; tweaking UPDATE's plan e.g. >>>>>>>>>>>>>>>>>>> adding/removing columns compared to CoW) and also found >>>>>>>>>>>>>>>>>>> some interesting technical details/questions (e.g. how to >>>>>>>>>>>>>>>>>>> align rows when reading a split based on base file's >>>>>>>>>>>>>>>>>>> split_offsets) that we could discuss next time. I'll >>>>>>>>>>>>>>>>>>> collect all of these and share. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> In the meantime, I gave another thought to the partial >>>>>>>>>>>>>>>>>>> updates idea Anton mentioned where we can basically have >>>>>>>>>>>>>>>>>>> the same metadata and read path as for the full column >>>>>>>>>>>>>>>>>>> update approach, and we'd push the responsibility to the >>>>>>>>>>>>>>>>>>> writers to always merge existing updates with new ones. I >>>>>>>>>>>>>>>>>>> think in theory, this seems a reasonable design and seems >>>>>>>>>>>>>>>>>>> not that complicated to implement when the new update >>>>>>>>>>>>>>>>>>> aligns with the field IDs of some of the existing updates. >>>>>>>>>>>>>>>>>>> For instance, partially updating rows by field ID1 and then >>>>>>>>>>>>>>>>>>> updating different rows also for the same field ID seems >>>>>>>>>>>>>>>>>>> straightforward to merge these into a new file and refer >>>>>>>>>>>>>>>>>>> that file from the metadata. >>>>>>>>>>>>>>>>>>> However, I'm not sure how trivial it is when we update >>>>>>>>>>>>>>>>>>> overlapping but not entirely the same set of fields. 
E.g., >>>>>>>>>>>>>>>>>>> first partially updating by fields {1, 2} then by {2, 3}. I >>>>>>>>>>>>>>>>>>> don't think we want to merge these into one and have a single >>>>>>>>>>>>>>>>>>> update for {1, 2, 3} as that would have a snowball effect >>>>>>>>>>>>>>>>>>> of merging more and more columns together over time. But I don't >>>>>>>>>>>>>>>>>>> think we want to split them either, or require a separate >>>>>>>>>>>>>>>>>>> partial update for each field (that wouldn't be suitable for >>>>>>>>>>>>>>>>>>> column families later on either). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>> Gabor >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Micah Kornfield <[email protected]> wrote (on >>>>>>>>>>>>>>>>>>> Wed, Mar 4, 2026, 0:32): >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> If this is correct, it aligns well with the current >>>>>>>>>>>>>>>>>>>>> proposal and shouldn't introduce any additional >>>>>>>>>>>>>>>>>>>>> complexity. I will add it to the discussion points for >>>>>>>>>>>>>>>>>>>>> tomorrow's community sync. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Yes, this example aligns with what I was thinking (nit: >>>>>>>>>>>>>>>>>>>> "range" probably wouldn't be a string but I assume this >>>>>>>>>>>>>>>>>>>> was just for illustrative purposes) >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On the other hand, in the column family use case, >>>>>>>>>>>>>>>>>>>>> splitting columns is a strict requirement for >>>>>>>>>>>>>>>>>>>>> performance. I haven’t considered how this would work, >>>>>>>>>>>>>>>>>>>>> but perhaps we could introduce a table property for >>>>>>>>>>>>>>>>>>>>> column families to make this explicit, and compaction >>>>>>>>>>>>>>>>>>>>> jobs would have to respect >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Yeah, I don't want to get into the exact mechanics for >>>>>>>>>>>>>>>>>>>> column families.
I was just calling out that compaction to >>>>>>>>>>>>>>>>>>>> the base file is not desirable in all cases, so shouldn't >>>>>>>>>>>>>>>>>>>> be assumed as a solution for small files. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Micah >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Tue, Mar 3, 2026 at 3:11 PM Anurag Mantripragada >>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Micah, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Could you expand on the complexity you think this >>>>>>>>>>>>>>>>>>>>>> introduces (or more specifically "significant" part)? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I may have misunderstood your approach regarding packing >>>>>>>>>>>>>>>>>>>>> row ranges. To clarify, is the following what you had in >>>>>>>>>>>>>>>>>>>>> mind? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Initially, we have base_file_1.parquet (rows 1-1000) and >>>>>>>>>>>>>>>>>>>>> base_file_2.parquet (rows 1001-2000). If we update the >>>>>>>>>>>>>>>>>>>>> "score" column across both files and pack those updates >>>>>>>>>>>>>>>>>>>>> into a single larger file, packed_col_A.parquet, would >>>>>>>>>>>>>>>>>>>>> the metadata structure look like this? 
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>   "data_file_path": "base_file_1.parquet",
>>>>>>>>>>>>>>>>>>>>>   "column_updates": [
>>>>>>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>>>>>>       "field_id": 12,
>>>>>>>>>>>>>>>>>>>>>       "update_file_path": "packed_col_A.parquet",
>>>>>>>>>>>>>>>>>>>>>       "row_range": "0-1000"
>>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>>   ]
>>>>>>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>   "data_file_path": "base_file_2.parquet",
>>>>>>>>>>>>>>>>>>>>>   "column_updates": [
>>>>>>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>>>>>>       "field_id": 12,
>>>>>>>>>>>>>>>>>>>>>       "update_file_path": "packed_col_A.parquet",
>>>>>>>>>>>>>>>>>>>>>       "row_range": "1001-2000"
>>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>>   ]
>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> If this is correct, it aligns well with the current >>>>>>>>>>>>>>>>>>>>> proposal and shouldn't introduce any additional >>>>>>>>>>>>>>>>>>>>> complexity. I will add it to the discussion points for >>>>>>>>>>>>>>>>>>>>> tomorrow's community sync. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> This seems at odds with supporting column families in >>>>>>>>>>>>>>>>>>>>>> the future? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> In my opinion, there’s a distinction between the use >>>>>>>>>>>>>>>>>>>>> cases of column updates and column families. Column >>>>>>>>>>>>>>>>>>>>> updates are designed for fast writes while maintaining >>>>>>>>>>>>>>>>>>>>> reasonable read performance. Compaction is desirable to >>>>>>>>>>>>>>>>>>>>> reduce the complexity of the read side, if any. On the >>>>>>>>>>>>>>>>>>>>> other hand, in the column family use case, splitting >>>>>>>>>>>>>>>>>>>>> columns is a strict requirement for performance.
I >>>>>>>>>>>>>>>>>>>>> haven’t considered how this would work, but perhaps we >>>>>>>>>>>>>>>>>>>>> could introduce a table property for column families to >>>>>>>>>>>>>>>>>>>>> make this explicit, and compaction jobs would have to >>>>>>>>>>>>>>>>>>>>> respect it. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> ~Anurag >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Tue, Mar 3, 2026 at 12:02 PM Micah Kornfield >>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Hi Anurag, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Compaction and small files: If I understand the row >>>>>>>>>>>>>>>>>>>>>>> ranges idea correctly, packing multiple updates into >>>>>>>>>>>>>>>>>>>>>>> larger column files would require matching ranges to >>>>>>>>>>>>>>>>>>>>>>> base files based on predicates, which adds significant >>>>>>>>>>>>>>>>>>>>>>> planning complexity. Regular compaction, which rewrites >>>>>>>>>>>>>>>>>>>>>>> column files into the base file seems more practical. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Could you expand on the complexity you think this >>>>>>>>>>>>>>>>>>>>>> introduces (or more specifically "significant" part)? In >>>>>>>>>>>>>>>>>>>>>> this case the predicate should be pretty simple (i.e. >>>>>>>>>>>>>>>>>>>>>> read rows between X and Y only) and can be done >>>>>>>>>>>>>>>>>>>>>> efficiently via row group statistics. Smart writers >>>>>>>>>>>>>>>>>>>>>> could even partition rows for a specific base file into >>>>>>>>>>>>>>>>>>>>>> their own row group/pages to make the filter trivial. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Regular compaction, which rewrites column files into >>>>>>>>>>>>>>>>>>>>>>> the base file seems more practical. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> This seems at odds with supporting column families in >>>>>>>>>>>>>>>>>>>>>> the future? 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>> Micah >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Tue, Mar 3, 2026 at 11:43 AM Anurag Mantripragada >>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Sorry for the delayed response. I was on vacation and >>>>>>>>>>>>>>>>>>>>>>> catching up. Thanks for the continued discussion on >>>>>>>>>>>>>>>>>>>>>>> this topic. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Partial updates: I agree that MoR-style row-level >>>>>>>>>>>>>>>>>>>>>>> updates offer limited benefits beyond reducing the >>>>>>>>>>>>>>>>>>>>>>> writing of irrelevant columns. For use cases like >>>>>>>>>>>>>>>>>>>>>>> updating a subset of users, existing deletion vectors >>>>>>>>>>>>>>>>>>>>>>> and the new V4 manifest delete vectors should perform >>>>>>>>>>>>>>>>>>>>>>> well. Gabor’s suggestion for file-level partial updates >>>>>>>>>>>>>>>>>>>>>>> is a reasonable alternative, even with some write >>>>>>>>>>>>>>>>>>>>>>> amplification. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Compaction and small files: If I understand the row >>>>>>>>>>>>>>>>>>>>>>> ranges idea correctly, packing multiple updates into >>>>>>>>>>>>>>>>>>>>>>> larger column files would require matching ranges to >>>>>>>>>>>>>>>>>>>>>>> base files based on predicates, which adds significant >>>>>>>>>>>>>>>>>>>>>>> planning complexity. Regular compaction, which rewrites >>>>>>>>>>>>>>>>>>>>>>> column files into the base file seems more practical. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Column families: While splitting columns into families >>>>>>>>>>>>>>>>>>>>>>> is useful, the current design is more generic and >>>>>>>>>>>>>>>>>>>>>>> already supports packing families into column files. 
>>>>>>>>>>>>>>>>>>>>>>> Deciding how to group these columns (manually or via an >>>>>>>>>>>>>>>>>>>>>>> engine) can be addressed in separate follow-up work. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Next steps: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Gabor and I are developing a POC for metadata changes, >>>>>>>>>>>>>>>>>>>>>>> focusing on reading and writing column files using >>>>>>>>>>>>>>>>>>>>>>> Spark for integration. We will share more details soon. >>>>>>>>>>>>>>>>>>>>>>> I will update the doc in preparation for tomorrow's >>>>>>>>>>>>>>>>>>>>>>> sync. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> As a reminder we have a sync on column updates upcoming >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Efficient column updates sync >>>>>>>>>>>>>>>>>>>>>>> Wednesday, March 4 · 9:00 – 10:00am >>>>>>>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles >>>>>>>>>>>>>>>>>>>>>>> Google Meet joining info >>>>>>>>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/naf-tvvn-qup >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> ~ Anurag >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 25, 2026 at 1:32 PM Gábor Kaszab >>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hey All, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Nice to see the activity on this thread. Thanks to >>>>>>>>>>>>>>>>>>>>>>>> everyone who chimed in! >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Micah, I also feel that 1) (full column updates) and >>>>>>>>>>>>>>>>>>>>>>>> 2) (partial but file-level column updates) could be a >>>>>>>>>>>>>>>>>>>>>>>> good middle ground between perf improvement and >>>>>>>>>>>>>>>>>>>>>>>> keeping the code complexity low. In fact I had the >>>>>>>>>>>>>>>>>>>>>>>> chance to experiment in this area and the metadata + >>>>>>>>>>>>>>>>>>>>>>>> API part would be as simple as in this PoC. 
Just a >>>>>>>>>>>>>>>>>>>>>>>> side note for 3), from the SQL aspect I'm a bit >>>>>>>>>>>>>>>>>>>>>>>> hesitant how straightforward it is for the users to >>>>>>>>>>>>>>>>>>>>>>>> write predicates that align with file boundaries, >>>>>>>>>>>>>>>>>>>>>>>> though. >>>>>>>>>>>>>>>>>>>>>>>> For deciding on partial column updates, we probably >>>>>>>>>>>>>>>>>>>>>>>> can't get away without doing some measurements of how >>>>>>>>>>>>>>>>>>>>>>>> it compares to existing MoR. I have it on my roadmap, >>>>>>>>>>>>>>>>>>>>>>>> so I'll share it once I have something. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Wrapping multiple update files into one is an >>>>>>>>>>>>>>>>>>>>>>>> interesting idea. Let's bring this up on the next >>>>>>>>>>>>>>>>>>>>>>>> sync! Additionally, full column updates could add a >>>>>>>>>>>>>>>>>>>>>>>> huge overhead on the metadata files being created too >>>>>>>>>>>>>>>>>>>>>>>> (delete everything + write everything with updates), >>>>>>>>>>>>>>>>>>>>>>>> unless we decide to do some manifest >>>>>>>>>>>>>>>>>>>>>>>> rewrites/optimizations under the hood during the >>>>>>>>>>>>>>>>>>>>>>>> commit. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Peter, column families as a schema-like table metadata >>>>>>>>>>>>>>>>>>>>>>>> level information would definitely be useful. It seems >>>>>>>>>>>>>>>>>>>>>>>> like a natural follow-up of the column update work, >>>>>>>>>>>>>>>>>>>>>>>> but we have to keep in mind to choose a design that >>>>>>>>>>>>>>>>>>>>>>>> won't prevent us from implementing a more general >>>>>>>>>>>>>>>>>>>>>>>> column families concept (probably for inserts too). >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Best Regards, >>>>>>>>>>>>>>>>>>>>>>>> Gabor >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Micah Kornfield <[email protected]> ezt írta >>>>>>>>>>>>>>>>>>>>>>>> (időpont: 2026. febr. 21., Szo, 17:53): >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 1) and 3) are what I was thinking of as use-cases. 
I >>>>>>>>>>>>>>>>>>>>>>>>> agree unless there is a strong motivating use-case >>>>>>>>>>>>>>>>>>>>>>>>> for MoR style column updates we should try to avoid >>>>>>>>>>>>>>>>>>>>>>>>> this complexity and use the existing row based MoR. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> One other idea I was trying to think through is the >>>>>>>>>>>>>>>>>>>>>>>>> "small file problem" we would likely encounter for >>>>>>>>>>>>>>>>>>>>>>>>> single column additions/updates for fixed width data. >>>>>>>>>>>>>>>>>>>>>>>>> Would it make sense to add a record-range into the >>>>>>>>>>>>>>>>>>>>>>>>> metadata for column families, so that we can pack >>>>>>>>>>>>>>>>>>>>>>>>> column updates across files into reasonably sized >>>>>>>>>>>>>>>>>>>>>>>>> files (similar to what we do for DVs today in puffin >>>>>>>>>>>>>>>>>>>>>>>>> files)? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>> Micah >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 16, 2026 at 7:23 AM Gábor Kaszab >>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hey All, >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Anurag for the summary! >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> I regret we don't have a recording for the sync, but >>>>>>>>>>>>>>>>>>>>>>>>>> I had the impression that, even though there was a >>>>>>>>>>>>>>>>>>>>>>>>>> lengthy discussion about the implementation >>>>>>>>>>>>>>>>>>>>>>>>>> requirements for partial updates, there wasn't a >>>>>>>>>>>>>>>>>>>>>>>>>> strong consensus around the need and there were no >>>>>>>>>>>>>>>>>>>>>>>>>> strong use cases to justify partial updates either. 
>>>>>>>>>>>>>>>>>>>>>>>>>> Let me sum up where I see we are at now: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Scope of the updates >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> 1) Full column updates >>>>>>>>>>>>>>>>>>>>>>>>>> There is a consensus and common understanding that >>>>>>>>>>>>>>>>>>>>>>>>>> this use case makes sense. If this was the only >>>>>>>>>>>>>>>>>>>>>>>>>> supported use-case, the implementation would be >>>>>>>>>>>>>>>>>>>>>>>>>> relatively simple. We could guarantee there is no >>>>>>>>>>>>>>>>>>>>>>>>>> overlap in column updates by deduplicating the field >>>>>>>>>>>>>>>>>>>>>>>>>> IDs in the column update metadata. E.g. Let's say we >>>>>>>>>>>>>>>>>>>>>>>>>> have a column update on columns {1,2} and we write >>>>>>>>>>>>>>>>>>>>>>>>>> another column update for {2,3}: we can change the >>>>>>>>>>>>>>>>>>>>>>>>>> metadata for the first one to only cover {1} and not >>>>>>>>>>>>>>>>>>>>>>>>>> {1,2}. With this the write and the read/stitching >>>>>>>>>>>>>>>>>>>>>>>>>> process is also straightforward (if we decide not to >>>>>>>>>>>>>>>>>>>>>>>>>> support equality deletes together with column >>>>>>>>>>>>>>>>>>>>>>>>>> updates). 
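The field-ID deduplication described above ({1,2} followed by {2,3} leaves the first update covering only {1}) could look roughly like this. It is a sketch in plain Java under assumed names: `ColumnUpdateIndex`, `addUpdate` and the file names are hypothetical and do not reflect the actual metadata layout or any Iceberg API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-base-file index of full column updates, keyed by update file path. */
final class ColumnUpdateIndex {
  // Update file path -> field IDs that file is still authoritative for.
  private final Map<String, Set<Integer>> fieldIdsByFile = new LinkedHashMap<>();

  /** Registers a new full column update and narrows older entries that overlap with it. */
  void addUpdate(String updateFilePath, Set<Integer> fieldIds) {
    // The newest update wins, so older entries give up any overlapping field IDs.
    fieldIdsByFile.values().forEach(ids -> ids.removeAll(fieldIds));
    // An older update file that no longer covers any field can be dropped from metadata.
    fieldIdsByFile.values().removeIf(Set::isEmpty);
    fieldIdsByFile.put(updateFilePath, new LinkedHashSet<>(fieldIds));
  }

  /** Read-only view: each field ID appears under at most one update file. */
  Map<String, Set<Integer>> entries() {
    return Collections.unmodifiableMap(fieldIdsByFile);
  }
}
```

With this invariant a reader never has to choose between two update files for the same field, which is what keeps the stitching step straightforward for full column updates.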
>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Both row matching approaches could work here: >>>>>>>>>>>>>>>>>>>>>>>>>> - row number matching update files, where we >>>>>>>>>>>>>>>>>>>>>>>>>> fill the deleted rows with an arbitrary value >>>>>>>>>>>>>>>>>>>>>>>>>> (preferably null) >>>>>>>>>>>>>>>>>>>>>>>>>> - sparse update files with some auxiliary column >>>>>>>>>>>>>>>>>>>>>>>>>> written into the column update file, like the row >>>>>>>>>>>>>>>>>>>>>>>>>> position in the base file >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Partial column updates (row-level) >>>>>>>>>>>>>>>>>>>>>>>>>> I see two use cases mentioned for this: bug-fixing a >>>>>>>>>>>>>>>>>>>>>>>>>> subset of rows and updating features for active users. >>>>>>>>>>>>>>>>>>>>>>>>>> My initial impression here is that whether to use >>>>>>>>>>>>>>>>>>>>>>>>>> column updates or not heavily depends on the >>>>>>>>>>>>>>>>>>>>>>>>>> selectivity of the partial update queries. I'm sure >>>>>>>>>>>>>>>>>>>>>>>>>> there is a percentage of affected rows below which >>>>>>>>>>>>>>>>>>>>>>>>>> it's simply better to use the >>>>>>>>>>>>>>>>>>>>>>>>>> traditional row-level updates (CoW/MoR). I'm not >>>>>>>>>>>>>>>>>>>>>>>>>> entirely convinced that covering these scenarios is >>>>>>>>>>>>>>>>>>>>>>>>>> worth the extra complexity here: >>>>>>>>>>>>>>>>>>>>>>>>>> - We can't deduplicate the column updates by >>>>>>>>>>>>>>>>>>>>>>>>>> field IDs on the metadata side >>>>>>>>>>>>>>>>>>>>>>>>>> - We have two options for writers: >>>>>>>>>>>>>>>>>>>>>>>>>> - Merge the existing column update files >>>>>>>>>>>>>>>>>>>>>>>>>> themselves when writing a new one with an overlap of >>>>>>>>>>>>>>>>>>>>>>>>>> field IDs.
No need to sort out the different column >>>>>>>>>>>>>>>>>>>>>>>>>> update files and merge them on the read side, but >>>>>>>>>>>>>>>>>>>>>>>>>> there is overhead on the write side >>>>>>>>>>>>>>>>>>>>>>>>>> - Don't bother merging existing column >>>>>>>>>>>>>>>>>>>>>>>>>> updates when writing a new one. This adds overhead >>>>>>>>>>>>>>>>>>>>>>>>>> on the read side. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Handling of sparse update files is a must here, with >>>>>>>>>>>>>>>>>>>>>>>>>> the chance for optimisation if all the rows are >>>>>>>>>>>>>>>>>>>>>>>>>> covered by the update file, as Micah suggested. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> To sum up, I think to justify this approach we need >>>>>>>>>>>>>>>>>>>>>>>>>> to have strong use-cases and measurements to verify >>>>>>>>>>>>>>>>>>>>>>>>>> that the extra complexity yields convincingly >>>>>>>>>>>>>>>>>>>>>>>>>> better results compared to existing CoW/MoR >>>>>>>>>>>>>>>>>>>>>>>>>> approaches. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> 3) Partial column updates (file-level) >>>>>>>>>>>>>>>>>>>>>>>>>> This option wasn't brought up during our >>>>>>>>>>>>>>>>>>>>>>>>>> conversation but might be worth considering. This is >>>>>>>>>>>>>>>>>>>>>>>>>> basically a middle ground between the above two >>>>>>>>>>>>>>>>>>>>>>>>>> approaches. Partial updates are allowed as long as >>>>>>>>>>>>>>>>>>>>>>>>>> they affect entire data files, and it's allowed to >>>>>>>>>>>>>>>>>>>>>>>>>> only cover a subset of the files. One use-case would >>>>>>>>>>>>>>>>>>>>>>>>>> be to do column updates per partition for instance. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> With this approach the metadata representation could >>>>>>>>>>>>>>>>>>>>>>>>>> be as simple as in 1), where we can deduplicate the >>>>>>>>>>>>>>>>>>>>>>>>>> update files by field IDs.
Also there is no write >>>>>>>>>>>>>>>>>>>>>>>>>> and read overhead on top of 1) apart from the >>>>>>>>>>>>>>>>>>>>>>>>>> verification step to ensure that the WHERE filter on >>>>>>>>>>>>>>>>>>>>>>>>>> the update is doing the split on file boundaries. >>>>>>>>>>>>>>>>>>>>>>>>>> Also similarly to 1), sparse update files aren't a >>>>>>>>>>>>>>>>>>>>>>>>>> must here; we could consider row-matching update >>>>>>>>>>>>>>>>>>>>>>>>>> files too. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Row alignment >>>>>>>>>>>>>>>>>>>>>>>>>> Sparse update files are required for row-level >>>>>>>>>>>>>>>>>>>>>>>>>> partial updates, but if we decide to go with any of >>>>>>>>>>>>>>>>>>>>>>>>>> the other options we could also evaluate the "row >>>>>>>>>>>>>>>>>>>>>>>>>> count matching" approach. Even though it >>>>>>>>>>>>>>>>>>>>>>>>>> requires filling the missing rows with arbitrary >>>>>>>>>>>>>>>>>>>>>>>>>> values (null seems a good candidate) it would result >>>>>>>>>>>>>>>>>>>>>>>>>> in less write overhead (no need to write row >>>>>>>>>>>>>>>>>>>>>>>>>> position) and read overhead (no need to join rows by >>>>>>>>>>>>>>>>>>>>>>>>>> row position), which could be worth the inconvenience >>>>>>>>>>>>>>>>>>>>>>>>>> of having 'invalid' but inaccessible values in the >>>>>>>>>>>>>>>>>>>>>>>>>> files. The num nulls stats being off is a good >>>>>>>>>>>>>>>>>>>>>>>>>> argument against this, but I think we could have a >>>>>>>>>>>>>>>>>>>>>>>>>> way of fixing this too by keeping track of how many >>>>>>>>>>>>>>>>>>>>>>>>>> rows were deleted (and subtracting this value from the >>>>>>>>>>>>>>>>>>>>>>>>>> num nulls counter returned by the writer).
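The "row count matching" idea and the null-count correction above can be illustrated with a small sketch. It is plain Java with hypothetical names (`DenseUpdateColumn`, `build`, `correctedNullCount`), and the null counting here only stands in for whatever a real file writer would report; it is not how the PoC does it.

```java
/** Hypothetical dense update column: one slot per base row, deleted rows padded with null. */
final class DenseUpdateColumn {
  final Long[] values;    // positionally aligned with the base file
  final long paddedRows;  // deleted rows that were filled with null

  private DenseUpdateColumn(Long[] values, long paddedRows) {
    this.values = values;
    this.paddedRows = paddedRows;
  }

  /** Places live values in base-row order and pads deleted positions with null. */
  static DenseUpdateColumn build(boolean[] deleted, Long[] liveValues) {
    Long[] dense = new Long[deleted.length];
    int next = 0;
    long padded = 0;
    for (int pos = 0; pos < deleted.length; pos++) {
      if (deleted[pos]) {
        padded++;  // slot stays null; the value is never visible to readers
      } else {
        dense[pos] = liveValues[next++];
      }
    }
    if (next != liveValues.length) {
      throw new IllegalArgumentException("live value count does not match non-deleted rows");
    }
    return new DenseUpdateColumn(dense, padded);
  }

  /** Stand-in for the null count a file writer would report (padding included). */
  long writerNullCount() {
    long nulls = 0;
    for (Long value : values) {
      if (value == null) {
        nulls++;
      }
    }
    return nulls;
  }

  /** Null count with the padding for deleted rows subtracted. */
  long correctedNullCount() {
    return writerNullCount() - paddedRows;
  }
}
```

In this sketch a genuinely null live value still counts toward the corrected statistic, while nulls written only to keep positions aligned do not.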
>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Next steps >>>>>>>>>>>>>>>>>>>>>>>>>> I'm actively working on a very basic PoC >>>>>>>>>>>>>>>>>>>>>>>>>> implementation where we would be able to test the >>>>>>>>>>>>>>>>>>>>>>>>>> different approaches comparing pros and cons so that >>>>>>>>>>>>>>>>>>>>>>>>>> we can make a decision on the above questions. I'll >>>>>>>>>>>>>>>>>>>>>>>>>> sync with Anurag on this and will let you know once >>>>>>>>>>>>>>>>>>>>>>>>>> we have something. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards, >>>>>>>>>>>>>>>>>>>>>>>>>> Gabor >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Micah Kornfield <[email protected]> wrote >>>>>>>>>>>>>>>>>>>>>>>>>> (on Sat, Feb 14, 2026, 2:20): >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Given that, the sparse representation with >>>>>>>>>>>>>>>>>>>>>>>>>>>> alignment at read time (using dummy/null values) >>>>>>>>>>>>>>>>>>>>>>>>>>>> seems to provide the benefits of both efficient >>>>>>>>>>>>>>>>>>>>>>>>>>>> vectorized reads and stitching as well as support >>>>>>>>>>>>>>>>>>>>>>>>>>>> for partial column updates. Would you agree? >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thinking more about it, I think the sparse approach >>>>>>>>>>>>>>>>>>>>>>>>>>> is actually a superset approach, so it is not a >>>>>>>>>>>>>>>>>>>>>>>>>>> concern. If writers want, they can write out the >>>>>>>>>>>>>>>>>>>>>>>>>>> fully populated columns with position indexes from >>>>>>>>>>>>>>>>>>>>>>>>>>> 1 to N, and readers can take an optimized path if >>>>>>>>>>>>>>>>>>>>>>>>>>> they detect the number of rows in the update is >>>>>>>>>>>>>>>>>>>>>>>>>>> equal to the number of base rows. >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I still think there is a question on what writers >>>>>>>>>>>>>>>>>>>>>>>>>>> should do (i.e.
when do they decide to duplicate >>>>>>>>>>>>>>>>>>>>>>>>>>> data instead of trying to give sparse updates) but >>>>>>>>>>>>>>>>>>>>>>>>>>> that is an implementation question and not >>>>>>>>>>>>>>>>>>>>>>>>>>> necessarily something that needs to block spec work. >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>>>>>>>>>> Micah >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 13, 2026 at 11:29 AM Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>> Mantripragada <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Micah, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> This seems like a classic MoR vs CoW trade-off. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> But it seems like maybe both sparse and full >>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be available (I understand this adds >>>>>>>>>>>>>>>>>>>>>>>>>>>>> complexity). For adding a new column or >>>>>>>>>>>>>>>>>>>>>>>>>>>>> completely updating a new column, the performance >>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be better to prefill the data >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Our internal use cases are very similar to what >>>>>>>>>>>>>>>>>>>>>>>>>>>> you describe. We primarily deal with full column >>>>>>>>>>>>>>>>>>>>>>>>>>>> updates. However, the feedback on the proposal >>>>>>>>>>>>>>>>>>>>>>>>>>>> from the wider community indicated that partial >>>>>>>>>>>>>>>>>>>>>>>>>>>> updates (e.g., bug-fixing a subset of rows, >>>>>>>>>>>>>>>>>>>>>>>>>>>> updating features for active users) are also a >>>>>>>>>>>>>>>>>>>>>>>>>>>> very common and critical use case. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there evidence to say that partial column >>>>>>>>>>>>>>>>>>>>>>>>>>>>> updates are more common in practice than full >>>>>>>>>>>>>>>>>>>>>>>>>>>>> rewrites?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Personally, I don't have hard data on which use >>>>>>>>>>>>>>>>>>>>>>>>>>>> case is more common in the wild, only that both >>>>>>>>>>>>>>>>>>>>>>>>>>>> appear to be important. I also agree that a good >>>>>>>>>>>>>>>>>>>>>>>>>>>> long term solution should support both strategies. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Given that, the sparse representation with >>>>>>>>>>>>>>>>>>>>>>>>>>>> alignment at read time (using dummy/null values) >>>>>>>>>>>>>>>>>>>>>>>>>>>> seems to provide the benefits of both efficient >>>>>>>>>>>>>>>>>>>>>>>>>>>> vectorized reads and stitching as well as support >>>>>>>>>>>>>>>>>>>>>>>>>>>> for partial column updates. Would you agree? >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> ~ Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 13, 2026 at 9:33 AM Micah Kornfield >>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Anurag, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Data Representation: Sparse column files are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> preferred for compact representation and are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better suited for partial column updates. We can >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> optimize sparse representation for vectorized >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reads by filling in null or default values at >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> read time for missing positions from the base >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file, which avoids joins during reads. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> This seems like a classic MoR vs CoW trade-off. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> But it seems like maybe both sparse and full >>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be available (I understand this adds >>>>>>>>>>>>>>>>>>>>>>>>>>>>> complexity). 
For adding a new column or >>>>>>>>>>>>>>>>>>>>>>>>>>>>> completely updating a new column, the performance >>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be better to prefill the data (otherwise >>>>>>>>>>>>>>>>>>>>>>>>>>>>> one ends up duplicating the work that is already >>>>>>>>>>>>>>>>>>>>>>>>>>>>> happening under the hood in Parquet). >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there evidence to say that partial column >>>>>>>>>>>>>>>>>>>>>>>>>>>>> updates are more common in practice than full >>>>>>>>>>>>>>>>>>>>>>>>>>>>> rewrites? >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Micah >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Feb 12, 2026 at 3:32 AM Eduard >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Tudenhöfner <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey Anurag, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I wasn't able to make it to the sync but was >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hoping to watch the recording afterwards. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm curious what the reasons were for discarding >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the Parquet-native approach. Could you please share a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> summary of what was discussed in the sync >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on that topic? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 10, 2026 at 8:20 PM Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Mantripragada <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for attending today's sync. Please >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> find the meeting notes below.
I apologize that >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we were unable to record the session due to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> attendees not having record access. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Key updates and discussion points: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Decisions: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Table Format vs. Parquet: There is a general >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consensus that column update support should >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reside in the table format. Consequently, we >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have discarded the Parquet-native approach. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Metadata Representation: To maintain clean >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metadata and avoid complex resolution logic for >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> readers, the goal is to keep only one metadata >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file per column. However, achieving this is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> challenging if we support partial updates, as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> multiple column files may exist for the same >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> column (See open questions). >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Data Representation: Sparse column files are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> preferred for compact representation and are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better suited for partial column updates. We >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> can optimize sparse representation for >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> vectorized reads by filling in null or default >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> values at read time for missing positions from >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the base file, which avoids joins during reads. 
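The read-time alignment described in the "Data Representation" decision can be illustrated with a minimal sketch. It is not taken from the proposal or from any Iceberg API: the class `SparseAlign` and the methods `densify` and `stitch` are hypothetical names, and the sketch assumes that a missing position is filled with null and that null means "keep the base value".

```java
import java.util.Arrays;

class SparseAlign {
    // Expands a sparse update (row positions plus values) into one slot per base row.
    // Positions absent from the update stay null (assumption: null = "no update").
    static Integer[] densify(long[] positions, int[] values, int baseRowCount) {
        Integer[] dense = new Integer[baseRowCount];
        for (int i = 0; i < positions.length; i++) {
            dense[(int) positions[i]] = values[i];
        }
        return dense;
    }

    // Positional stitch: take the update value where present, else the base value.
    // No join is needed because both arrays are aligned by row position.
    static int[] stitch(int[] baseColumn, Integer[] denseUpdate) {
        int[] out = new int[baseColumn.length];
        for (int i = 0; i < baseColumn.length; i++) {
            out[i] = denseUpdate[i] != null ? denseUpdate[i] : baseColumn[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] base = {11, 22, 33, 44};
        // Sparse update touching only rows 1 and 3 of the base file.
        Integer[] dense = densify(new long[] {1, 3}, new int[] {220, 440}, base.length);
        System.out.println(Arrays.toString(stitch(base, dense)));
    }
}
```

Once the update is densified, the stitch is a plain positional loop, which is the property that makes the vectorized path cheap; the cost moves to the `densify` step at read time.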
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Open Questions: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We are still determining what restrictions are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> necessary when supporting partial updates. For >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> instance, we need to decide whether to add a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> new column and subsequently allow partial >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> updates on it. This would involve managing both >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a base column file and subsequent update files. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We need a better understanding of the use cases >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for partial updates. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We need to further discuss the handling of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> equality deletes. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I missed anything, or if others took notes, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> please share them here. Thanks! >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will go ahead and update the doc with what we >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have discussed so we can continue next time >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from where we left off. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ~ Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Feb 9, 2026 at 11:55 AM Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Mantripragada <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This design will be discussed tomorrow in a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dedicated sync. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Efficient column updates sync >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Tuesday, February 10 · 9:00 – 10:00am >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Google Meet joining info >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Video call link: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://meet.google.com/xsd-exug-tcd >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ~ Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 6, 2026 at 8:30 AM Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Mantripragada <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Gabor, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the detailed example. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with Steven that Option 2 seems >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reasonable. I will add a section to the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> design doc regarding equality delete >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> handling, and we can discuss this further >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> during our meeting on Tuesday. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ~Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 6, 2026 at 7:08 AM Steven Wu >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > 1) When deleting with eq-deletes: If there >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > is a column update on the equality-field >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > ID we use for the delete, reject deletion >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > 2) When adding a column update on a column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > that is part of the equality field IDs in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > some delete, we reject the column update >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Gabor, this is a good scenario. The 2nd >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> option makes sense to me, since equality ids >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are like primary key fields. If we have the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2nd rule enforced, the first option is not >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> applicable anymore. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 6, 2026 at 3:13 AM Gábor Kaszab >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for the proposal, Anurag! I made >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a pass recently and I think there is some >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interference between column updates and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> equality deletes.
Let me describe below: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Steps: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CREATE TABLE tbl (a int, b int); >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INSERT INTO tbl VALUES (1, 11), (2, 22); >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- creates the base data file >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> DELETE FROM tbl WHERE b=11; >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- creates an equality delete file >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> UPDATE tbl SET b=11; >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- writes a column update >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SELECT * FROM tbl; >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Expected result: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (2, 11) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Data and metadata created after the above >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> steps: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Base file >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (1, 11), (2, 22) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seqnum=1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> EQ-delete >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> b=11 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seqnum=2 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Column update >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Field ids: [field_id_for_col_b] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seqnum=3 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Data file content: (dummy_value),(11) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Read steps: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Stitch base file with column updates in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reader: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rows: (1,dummy_value), (2,11) (Note: the dummy >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> value can be either null or 11; see the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposal for more details) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Seqnum for base file=1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Seqnum for column update=3 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apply eq-delete b=11, seqnum=2 on the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> stitched result >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Query result depends on which seqnum we >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> carry forward to compare with the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> eq-delete's seqnum, but it's not correct in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> either case >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Use seqnum from base file: we get either an >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> empty result if 'dummy_value' is 11 or we >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> get (1, null) otherwise >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Use seqnum from last update file: don't >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> delete any rows, result set is (1, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dummy_value),(2,11) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Problem:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The EQ-delete should be applied midway through applying >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the column updates to the base file, based >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on sequence number, during the stitching >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> process. If I'm not mistaken, this is not >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feasible with the way readers work. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Proposal: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Don't allow equality deletes together with >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> column updates. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) When deleting with eq-deletes: If >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> there is a column update on the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> equality-field ID we use for the delete, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reject the deletion >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) When adding a column update on a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> column that is part of the equality field >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IDs in some delete, we reject the column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> update >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Alternatively, column updates could be >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> controlled by an (immutable) table >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> property, and eq-deletes would be rejected if the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> property indicates column updates are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> turned on for the table >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Let me know what you think!
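The scenario above can be replayed with a small sketch. This is an illustration only, not reader code from Iceberg: `EqDeleteStitchDemo`, `stitch`, and `applyEqDelete` are hypothetical names, rows of `tbl(a, b)` are modelled as two-element lists, and the dummy value is assumed to be null. The one rule taken as given is that an equality delete applies only to data whose sequence number is strictly lower than the delete's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EqDeleteStitchDemo {
    // Stitch: keep column a from the base row, take column b from the aligned update file.
    static List<List<Integer>> stitch(List<List<Integer>> base, List<Integer> updatedB) {
        List<List<Integer>> out = new ArrayList<>();
        for (int i = 0; i < base.size(); i++) {
            out.add(Arrays.asList(base.get(i).get(0), updatedB.get(i)));
        }
        return out;
    }

    // Applies the equality delete "b = key" only if the data is older than the delete.
    static List<List<Integer>> applyEqDelete(
            List<List<Integer>> rows, long dataSeq, Integer key, long deleteSeq) {
        if (dataSeq >= deleteSeq) {
            return rows; // delete is not newer than the data, so it does not apply
        }
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> row : rows) {
            if (!key.equals(row.get(1))) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> base = Arrays.asList(Arrays.asList(1, 11), Arrays.asList(2, 22));
        List<Integer> updateFile = Arrays.asList(null, 11); // null = dummy value for row 1
        List<List<Integer>> stitched = stitch(base, updateFile);

        // Carry the base file's seqnum (1) forward: the delete (seqnum 2) hits the stitched rows.
        System.out.println(applyEqDelete(stitched, 1L, 11, 2L));
        // Carry the column update's seqnum (3) forward: the delete is skipped entirely.
        System.out.println(applyEqDelete(stitched, 3L, 11, 2L));
    }
}
```

With a null dummy value the first call yields `[[1, null]]` and the second `[[1, null], [2, 11]]`; neither matches the expected `(2, 11)`, because the delete would have to run against the base rows before the column update is stitched in.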
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best Regards, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Gabor >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anurag Mantripragada >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (on Wed, Jan 28, 2026, 3:31): >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you everyone for the initial review >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> comments. It is exciting to see so much >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interest in this proposal. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently reviewing and responding to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> each comment. The general themes of the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feedback so far include: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Including partial updates (column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> updates on a subset of rows in a table). >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Adding details on how SQL engines will >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> write the update files. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Adding details on split planning and row >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> alignment for update files. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will think through these points and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> update the design accordingly.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jan 27, 2026 at 6:25 PM Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Mantripragada <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Xianjin, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Happy to learn from your experience in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> supporting backfill use cases. Please >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> feel free to review the proposal and add >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> your comments. I will wait for a couple >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> more days to ensure everyone has a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> chance to review the proposal. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ~ Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jan 27, 2026 at 6:42 AM Xianjin >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ye <[email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Anurag and Peter, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It’s great to see the partial column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> update has gained great interest in the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> community. I internally built a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BackfillColumns action to efficiently >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> backfill columns (by writing only the partial >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> columns and copying the binary data >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of other columns into a new DataFile).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The speedup could be 10x for wide tables >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but the write amplification is still >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> there. I would be happy to collaborate >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the work and eliminate the write >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> amplification. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2026/01/27 10:12:54 Péter Váry wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Hi Anurag, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > It’s great to see how much interest >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > there is in the community around this >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > potential new feature. Gábor and I >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > have actually submitted an Iceberg >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Summit talk proposal on this topic, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > and we would be very happy to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > collaborate on the work. I was mainly >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > waiting for the File Format API to be >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > finalized, as I believe this feature >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > should build on top of it. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > For reference, our related work >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > includes: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > - *Dev list thread:* >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > https://lists.apache.org/thread/h0941sdq9jwrb6sj0pjfjjxov8tx7ov9 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > - *Proposal document:* >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > https://docs.google.com/document/d/1OHuZ6RyzZvCOQ6UQoV84GzwVp3UPiu_cfXClsOi03ww >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > (not shared widely yet) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > - *Performance testing PR for >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > readers and writers:* >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > https://github.com/apache/iceberg/pull/13306 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > During earlier discussions about >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > possible metadata changes, another >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > option >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > came up that hasn’t been documented >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > yet: separating planner metadata from >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > reader metadata. Since the planner >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > does not need to know about the actual >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > files, we could store the file >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > composition in a separate file >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > (potentially >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > a Puffin file). 
This file could hold >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > the column_files metadata, while the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > manifest would reference the Puffin >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > file and blob position instead of the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > data filename. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > This approach has the advantage of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > keeping the existing metadata largely >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > intact, and it could also give us a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > natural place later to add file-level >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > indexes or Bloom filters for use >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > during reads or secondary filtering. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > The >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > downsides are the additional files and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > the increased complexity of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > identifying files that are no longer >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > referenced by the table, so this may >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > not be an ideal solution. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > I do have some concerns about the MoR >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > metadata proposal described in the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > document. At first glance, it seems to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > complicate distributed planning, as >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > all entries for a given file would >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > need to be collected and merged to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > provide the information required by >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > both the planner and the reader. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Additionally, when a new column is >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > added or updated, we would still need >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > add a new metadata entry for every >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > existing data file. If we immediately >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > write out the merged metadata, the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > total number of entries remains the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > same. The main benefit is avoiding >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > rewriting statistics, which can be >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > significant, but this comes at the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > cost of increased planning complexity. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > If we choose to store the merged >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > statistics in the column_families >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > entry, I >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > don’t see much benefit in excluding >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > the rest of the metadata, especially >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > since including it would simplify the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > planning process. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > As Anton already pointed out, we >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > should also discuss how this change >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > would >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > affect split handling, particularly >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > how to avoid double reads when row >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > groups are not aligned between the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > original data files and the new column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > files. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Finally, I’d like to see some >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > discussion around the Java API >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > implications. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > In particular, what API changes are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > required, and how SQL engines would >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > perform updates. Since the new column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > files must have the same number of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > rows as the original data files, with >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > a strict one-to-one relationship, SQL >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > engines would need access to the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > source filename, position, and deletion >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > status in the DataFrame in order to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > generate the new files. This is more >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > involved than a simple update and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > deserves some explicit consideration. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Looking forward to your thoughts. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Best regards, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Peter >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > On Tue, Jan 27, 2026, 03:58 Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > Mantripragada >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > Thanks Anton and others, for >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > providing some initial feedback. I >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > will >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > address all your comments soon. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > On Mon, Jan 26, 2026 at 11:10 AM >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > Anton Okolnychyi >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > <[email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> I had a chance to see the proposal >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> before it landed and I think it is a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> cool idea and both presented >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> approaches would likely work. I am >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> looking >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> forward to discussing the tradeoffs >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> and would encourage everyone to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> push/polish each approach to see >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> what issues can be mitigated and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> what are >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> fundamental. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> [1] Iceberg-native approach: better >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> visibility into column files from >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> metadata, potentially better >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> concurrency for non-overlapping >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> updates, no dep on Parquet. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> [2] Parquet-native approach: almost >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> no changes to the table format >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> metadata beyond tracking of base >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> files. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> I think [1] sounds a bit better on >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> paper but I am worried about the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> complexity in writers and readers >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> (especially around keeping row >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> groups >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> aligned and split planning). It >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> would be great to cover this in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> detail in >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> the proposal. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> On Mon, Jan 26, 2026 at 09:00 Anurag >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> Mantripragada < >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> [email protected]> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Hi all, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> "Wide tables" with thousands of >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> columns present significant >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> challenges >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> for AI/ML workloads, particularly >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> when only a subset of columns >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> needs to be >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> added or updated.
Current >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Copy-on-Write (COW) and >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Merge-on-Read (MOR) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> operations in Iceberg apply at the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> row level, which leads to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> substantial >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> write amplification in scenarios >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> such as: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> - Feature Backfilling & Column >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Updates: Adding new feature columns >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> (e.g., model embeddings) to >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> petabyte-scale tables. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> - Model Score Updates: Refresh >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> prediction scores after retraining. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> - Embedding Refresh: Updating >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> vector embeddings, which currently >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> triggers a rewrite of the >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> entire row. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> - Incremental Feature >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Computation: Daily updates to a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> small fraction >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> of features in wide tables. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> With the Iceberg V4 proposal introducing single-file commits and column
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> stats improvements, this is an ideal time to address column-level updates
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> to better support these use cases.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> I have drafted a proposal that explores both table-format enhancements
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> and file-format (Parquet) changes to enable more efficient updates.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Proposal Details:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> - GitHub Issue: #15146
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>   <https://github.com/apache/iceberg/issues/15146>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> - Design Document: Efficient Column Updates in Iceberg
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>   <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Next Steps:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> I plan to create POCs to benchmark the approaches described in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> document.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Please review the proposal and share your feedback.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>> Anurag
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> > >>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >
