jordepic opened a new pull request, #4658:
URL: https://github.com/apache/datafusion-comet/pull/4658

   ## Which issue does this PR close?
   
   Closes #4322.
   
   ## Rationale for this change
   
   Iceberg Spark writes are V2 operators that bundle three responsibilities: 
writing data files, writing metadata files, and committing to the catalog. Of 
these, Comet is only well-positioned to accelerate data file writing (assuming 
the files are Parquet). It is also crucial that the data-file-writing portion 
of the Iceberg write plan sits inside the AQE block of the Spark plan, so that 
writes are re-planned in response to runtime decisions about their upstream 
operators.
   
   Our split is fairly simple: the "writer" operator writes the data files as 
usual and serializes its output, which is then passed to the "committer" 
operator. In the future, we'll target just the "writer" operator for speedup 
with iceberg-rust.
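
   As a rough illustration of the writer/committer split described above, here 
is a minimal Python sketch. It is not the code in this PR: the names 
`DataFileInfo`, `writer_task`, and `committer`, the JSON serialization, and the 
file paths are all hypothetical, and no Parquet files or catalog calls are 
involved. It only shows the shape of the hand-off: each writer task returns 
serialized file metadata, and a single committer deserializes everything and 
commits once.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DataFileInfo:
    """Hypothetical stand-in for the metadata a writer reports per data file."""
    path: str
    record_count: int


def writer_task(partition_id, rows):
    """Hypothetical "writer" step: pretend to write one data file for a
    partition and return its metadata as serialized bytes."""
    # Assumption: a made-up path; nothing is actually written to disk here.
    path = f"data/part-{partition_id:05d}.parquet"
    info = DataFileInfo(path=path, record_count=len(rows))
    return json.dumps(asdict(info)).encode("utf-8")


def committer(serialized_outputs):
    """Hypothetical "committer" step: deserialize every writer's output and
    build a single commit summary (standing in for one catalog commit)."""
    files = [
        DataFileInfo(**json.loads(blob.decode("utf-8")))
        for blob in serialized_outputs
    ]
    return {
        "added_files": [f.path for f in files],
        "total_records": sum(f.record_count for f in files),
    }


# Two input partitions are written independently, then committed together.
partitions = [[1, 2, 3], [4, 5]]
outputs = [writer_task(i, rows) for i, rows in enumerate(partitions)]
snapshot = committer(outputs)
```

   Keeping the committer's input to plain serialized bytes is what would let 
the writer side be swapped for a different implementation later without 
touching the commit logic.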
   
   ## What changes are included in this PR?
   
   This PR contains 5 commits.
   
   1) Docs outlining the WHOLE Iceberg write acceleration feature, not just 
these changes (I'm happy to modify/remove as needed).
   2) Planning rules to move Iceberg append and overwrite operations to our 
"split operator" design.
   3) Planning rules to move Iceberg delete, update, and merge operations to 
our "split operator" design.
   4) Tests for commit 2.
   5) Tests for commit 3.
   
   ## How are these changes tested?
   
   We have unit tests for each operator that we're replacing. They ensure that 
the plan shape is correct, that we commit to the Iceberg table the expected 
number of times, and that the table's end state is correct when we scan it 
after a write operation. I've also been running with these changes locally, and 
they're performing as expected.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

