In order to begin prototyping, I would start with the following questions.

1) Does Iceberg need a sort spec?
        - I would say yes
2) Should Iceberg allow users to define a sort spec only if the table is 
bucketed?
        - I would say no, as it seems valid to have partitioned and sorted 
tables.
3) How should Iceberg encode sort specs?
        - Option #1 is to rely on table properties, which would let us use 
ALTER TABLE ... SET TBLPROPERTIES to configure sort specs. However, I am not 
sure it would be easy to encode non-trivial sort specs or to track sort spec 
evolution (if needed).
        - Option #2 is to extend PartitionSpec to cover sorting as well. This 
option will allow us to use transformations to encode non-trivial sorts and 
won't require many changes to the codebase.
        - Option #3 is to store SortSpec separately from PartitionSpec. This 
will require more changes compared to Option #2 but can also give us extra 
flexibility.

Each option has its own trade-offs, but I tend to think #2 is reasonable.
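
To make question 3 a bit more concrete, here is a rough sketch of a sort spec 
as an ordered list of fields, each with a source column, an optional transform, 
and a direction. All names (SortField, SortSpec, identity) are hypothetical and 
not part of the Iceberg API; Python is used only for brevity, and the same 
shape could live inside PartitionSpec (#2) or next to it (#3).

```python
from dataclasses import dataclass
from functools import cmp_to_key
from typing import Any, Callable, Dict, List, Tuple


def identity(value: Any) -> Any:
    """Default transform: sort on the raw column value."""
    return value


@dataclass(frozen=True)
class SortField:
    # Hypothetical structure: one entry of a sort spec.
    column: str
    ascending: bool = True
    transform: Callable[[Any], Any] = identity


@dataclass(frozen=True)
class SortSpec:
    # Ordered fields; earlier fields take precedence over later ones.
    fields: Tuple[SortField, ...]

    def compare(self, a: Dict[str, Any], b: Dict[str, Any]) -> int:
        for f in self.fields:
            x, y = f.transform(a[f.column]), f.transform(b[f.column])
            if x != y:
                result = -1 if x < y else 1
                return result if f.ascending else -result
        return 0

    def sort(self, rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return sorted(rows, key=cmp_to_key(self.compare))


spec = SortSpec((SortField("id", ascending=False), SortField("name")))
rows = [
    {"id": 1, "name": "b"},
    {"id": 2, "name": "a"},
    {"id": 1, "name": "a"},
]
ordered = spec.sort(rows)
```

A transform such as str.lower on a field would be one way to express a 
non-trivial sort without changing the overall structure.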

4) Which sort orders should Iceberg support?
        - I think we have to be flexible and support adding more sort orders 
later. In addition to what Owen said, we can add sorting based on 
multi-dimensional space-filling curves in the future.
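
As an illustration of what a space-filling curve sort could mean, here is a 
minimal sketch of a Z-order (Morton) key for two non-negative integer columns. 
This is only one possible curve, the function name and the 16-bit default are 
my assumptions, and nothing like this exists in Iceberg today.

```python
from typing import List, Tuple


def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single sort key.

    Bit i of x goes to position 2*i and bit i of y goes to position 2*i + 1,
    so rows that are close in both dimensions tend to get nearby keys.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def z_order_sort(points: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    # Sort rows by the interleaved key instead of by (x, y) lexicographically.
    return sorted(points, key=lambda p: z_order_key(p[0], p[1]))
```

Sorting by this key instead of by (x, y) keeps rows that are close in both 
columns near each other in the file, which is why it is interesting for 
predicates on more than one column.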


What do you think?

Thanks,
Anton

> On 1 Jul 2019, at 18:06, Owen O'Malley <owen.omal...@gmail.com> wrote:
> 
> My thought is that, just as Iceberg has to define partitioning and bucketing, 
> it has to define a canonical sort order. In particular, we can’t afford to 
> have Spark, Presto, and Hive writing files in different orders. I believe the 
> right approach is to define a sort order as a series of columns, where each 
> column is either ascending or descending, and to define the natural sort 
> order for each type.
> 
> The hard bit will be if we need to support non-natural sorts of strings, for 
> example case-insensitive sorts or the different collations that databases 
> support. I’d hope that we could start with the default of UTF-8 byte ordering 
> and expand as needed. If you are curious what the different collations look 
> like, see 
> https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database
> 
> .. Owen
> 
>> On Jul 1, 2019, at 4:18 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>> 
>> Hey folks,
>> 
>> Iceberg users are advised not only to partition their data but also to sort 
>> within partitions by columns in predicates in order to get the best 
>> performance. Right now, this process is mostly manual and performed by users 
>> before writing.
>> I am wondering if we should extend Iceberg metadata so that query engines 
>> can do this automatically in the future. We already have `sortColumns` in 
>> DataFile but they are not used.
>> Do we need a notion of sort columns in TableMetadata?
>> Spark’s sort spec is tightly coupled with bucketing and cannot be used 
>> alone. However, it seems reasonable to have partitioned and sorted tables 
>> without bucketing. How do we see this in Iceberg?
>> If we decide to have sort spec in the metadata, do we want to make it part 
>> of PartitionSpec or have it separately?
>> Thanks,
>> Anton
>> 
> 
