Hi Everyone,

   I'd like to revisit the feature request for specifying sort order during
CREATE TABLE AS SELECT (CTAS) and REPLACE TABLE AS SELECT (RTAS) operations
in Spark SQL. This has been previously raised in:

- https://github.com/apache/iceberg/issues/4889 (Support sort order in DDL)
- https://github.com/apache/iceberg/issues/14612

THE PROBLEM
-----------

Currently, the only way to set a sort order is through a separate
ALTER TABLE ... WRITE ORDERED BY command after table creation. This creates
significant operational challenges for full table refresh workflows:

Option A - Empty table window:

    REPLACE TABLE db.sample AS SELECT * FROM source WHERE 1=0;
    ALTER TABLE db.sample WRITE ORDERED BY category, id;
    INSERT INTO db.sample SELECT * FROM source;

Problem: The table is temporarily empty between the REPLACE and the
INSERT, making it unavailable to consumers during that window.

Option B - Double write:

    REPLACE TABLE db.sample AS SELECT * FROM source;
    CALL system.rewrite_data_files(
        table => 'db.sample',
        strategy => 'sort',
        sort_order => 'category ASC,id ASC',
        options => map('rewrite-all', 'true')
    );

Problem: Data is written twice - once unsorted, then again sorted -
roughly doubling compute cost and storage churn.


IMPLEMENTATION PLAN
-------------------

Spark's DSv2 StagingTableCatalog interface does not expose a sort-order
parameter, so TBLPROPERTIES is the natural channel for passing this
information. On the Iceberg side, the Spark catalog would parse the sort
order from the property, following the same pattern the
rewrite_data_files procedure already uses to parse its sort_order
argument.
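
As a rough sketch of what this could look like (the property name and
value format below are illustrative assumptions on my part, not settled
syntax), a sorted full refresh would then collapse to a single statement:

    REPLACE TABLE db.sample
    TBLPROPERTIES ('sort-order' = 'category ASC, id ASC')
    AS SELECT * FROM source;

The catalog would read the property while staging the table and apply
the parsed sort order before the CTAS/RTAS write, so the data lands
sorted in one pass with no empty-table window and no second rewrite.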

Please let me know if I am missing something here.

Thanks
Hemanth Boyina
