It doesn't seem worth it to me, since this is a weird use of table
properties -- you're not setting a property. I understand that it's a bit
easier since Iceberg does the work of making the partitioning and sort
order work together. But I think I prefer the solution to use standard SQL
clauses rather than passing something unsupported by Spark through table
properties, even though it is a bit more difficult for the caller.
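
For concreteness, a sketch of the standard-SQL route (with the partition
column first in the ORDER BY so the sort survives the default hash
distribution; table and column names are illustrative):

    CREATE TABLE db.sample
    USING iceberg
    PARTITIONED BY (category)
    AS SELECT * FROM source
    ORDER BY category, id;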

On Sat, Mar 14, 2026 at 5:33 AM hemanth boyina <[email protected]>
wrote:

> Thanks Ryan,
>
>    ORDER BY in the SELECT can work for the physical sort, especially if
> the partition columns come first so the sort survives the default hash
> distribution.
>
>   The main difference I see is that with sort order metadata, the write
> path coordinates everything automatically: range distribution and
> within-partition ordering. Whereas with ORDER BY, users must manually
> ensure the column order aligns with the partition spec so that hash
> distribution does not break the sort.
>
>  Also, the bigger ask is about atomicity and tooling. ORDER BY plus a
> follow-up ALTER TABLE ... WRITE ORDERED BY is still a two-step process.
>
>  Would it make sense to support this via TBLPROPERTIES or a WRITE
> ORDERED BY clause in CTAS?
>
> Thanks
> Hemanth Boyina
>
> On Sat, 14 Mar 2026 at 2:27 AM, Ryan Blue <[email protected]> wrote:
>
>> Why not include an ORDER BY in your SELECT query? If you want to specify
>> the sort order in RTAS, then you don't need to be setting the table
>> properties. Just pass it to the SELECT instead.
>>
>> On Fri, Mar 13, 2026 at 11:10 AM hemanth boyina <
>> [email protected]> wrote:
>>
>>> Hi Everyone,
>>>
>>>    I'd like to revisit the feature request for specifying sort order
>>> during
>>> CREATE TABLE AS SELECT (CTAS) and REPLACE TABLE AS SELECT (RTAS)
>>> operations
>>> in Spark SQL. This has been previously raised in:
>>>
>>> - https://github.com/apache/iceberg/issues/4889 (Support sort order in
>>> DDL)
>>> - https://github.com/apache/iceberg/issues/14612
>>>
>>> THE PROBLEM
>>> -----------
>>>
>>> Currently, the only way to set a sort order is through a separate
>>> ALTER TABLE ... WRITE ORDERED BY command after table creation. This
>>> creates
>>> significant operational challenges for full table refresh workflows:
>>>
>>> Option A - Empty table window:
>>>
>>>     REPLACE TABLE db.sample AS SELECT * FROM source WHERE 1=0;
>>>     ALTER TABLE db.sample WRITE ORDERED BY category, id;
>>>     INSERT INTO db.sample SELECT * FROM source;
>>>
>>> Problem: Table is temporarily empty between steps 1 and 3, making it
>>> unavailable to consumers.
>>>
>>> Option B - Double write:
>>>
>>>     REPLACE TABLE db.sample AS SELECT * FROM source;
>>>     CALL system.rewrite_data_files(
>>>         table => 'db.sample',
>>>         strategy => 'sort',
>>>         sort_order => 'category ASC,id ASC',
>>>         options => map('rewrite-all', 'true')
>>>     );
>>>
>>> Problem: Data is written twice - once unsorted, then again sorted.
>>> Doubles compute cost and storage churn.
>>>
>>>
>>> Implementation Plan :
>>>
>>> Spark's DSv2 StagingTableCatalog interface doesn't have a sort order
>>> parameter, so TBLPROPERTIES is the natural channel to pass the
>>> information. The changes would go into the Spark catalog, with a
>>> parseSortOrder following the exact pattern used by rewrite data files.
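>>>
>>> For illustration, a sketch of what the CTAS could look like (the
>>> 'sort-order' property name here is hypothetical, not an existing
>>> Iceberg table property):
>>>
>>>     CREATE TABLE db.sample
>>>     USING iceberg
>>>     PARTITIONED BY (category)
>>>     TBLPROPERTIES ('sort-order' = 'category ASC, id ASC')
>>>     AS SELECT * FROM source;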
>>>
>>> Please let me know if I am missing something here.
>>>
>>> Thanks
>>> Hemanth Boyina
>>>
>>
