Hi Everyone, I'd like to revisit the feature request for specifying sort order during CREATE TABLE AS SELECT (CTAS) and REPLACE TABLE AS SELECT (RTAS) operations in Spark SQL. This has been previously raised in:
- https://github.com/apache/iceberg/issues/4889 (Support sort order in DDL)
- https://github.com/apache/iceberg/issues/14612

THE PROBLEM
-----------
Currently, the only way to set a sort order is through a separate ALTER TABLE ... WRITE ORDERED BY command after table creation. This creates significant operational challenges for full table refresh workflows.

Option A - Empty table window:

    REPLACE TABLE db.sample AS SELECT * FROM source WHERE 1=0;
    ALTER TABLE db.sample WRITE ORDERED BY category, id;
    INSERT INTO db.sample SELECT * FROM source;

Problem: the table is temporarily empty between steps 1 and 3, making it unavailable to consumers.

Option B - Double write:

    REPLACE TABLE db.sample AS SELECT * FROM source;
    CALL system.rewrite_data_files(
      table => 'db.sample',
      strategy => 'sort',
      sort_order => 'category ASC, id ASC',
      options => map('rewrite-all', 'true')
    );

Problem: the data is written twice - once unsorted, then again sorted - which doubles compute cost and storage churn.

IMPLEMENTATION PLAN
-------------------
Spark's DSv2 StagingTableCatalog interface doesn't have a sort order parameter, so TBLPROPERTIES is the natural channel to pass the information. The change would go in the Spark catalog, with a parseSortOrder step following the exact pattern already used by rewrite_data_files.

Please let me know if I am missing something here.

Thanks,
Hemanth Boyina
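To make the parseSortOrder idea concrete, here is a minimal sketch (in Python, for illustration only) of parsing a sort-order string of the same shape rewrite_data_files accepts, e.g. 'category ASC, id ASC'. The function name, the returned triples, and the default null ordering (NULLS FIRST for ASC, NULLS LAST for DESC, matching Iceberg's defaults) are my assumptions, not the actual Iceberg implementation:

```python
def parse_sort_order(spec: str):
    """Hypothetical sketch: parse 'col [ASC|DESC] [NULLS FIRST|LAST], ...'
    into (column, direction, null_order) triples."""
    fields = []
    for part in spec.split(","):
        tokens = part.strip().split()
        if not tokens:
            continue
        column = tokens[0]
        rest = [t.upper() for t in tokens[1:]]
        direction = "DESC" if "DESC" in rest else "ASC"
        if "NULLS" in rest:
            # Explicit null ordering, e.g. 'NULLS FIRST'
            i = rest.index("NULLS")
            null_order = "NULLS " + rest[i + 1]
        else:
            # Assumed default: follow the sort direction
            null_order = "NULLS FIRST" if direction == "ASC" else "NULLS LAST"
        fields.append((column, direction, null_order))
    return fields

print(parse_sort_order("category ASC, id ASC"))
# [('category', 'ASC', 'NULLS FIRST'), ('id', 'ASC', 'NULLS FIRST')]
```

The real change would of course live in the Iceberg Spark module and build an Iceberg SortOrder from the table-property string at CTAS/RTAS commit time; this sketch only shows the string-parsing shape.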
