In order to begin prototyping, I would start with the following questions.

1) Does Iceberg need a sort spec?
- I would say yes.

2) Should Iceberg allow users to define a sort spec only if the table is bucketed?
- I would say no, as it seems valid to have partitioned and sorted tables.

3) How should Iceberg encode sort specs?
- Option #1 is to rely on table properties, which will allow us to use ALTER TABLE ... SET TBLPROPERTIES to configure sorting specs. However, I am not sure it would be easy to encode non-trivial sort specs and track sort spec evolution (if needed).
- Option #2 is to extend PartitionSpec to cover sorting as well. This option will allow us to use transformations to encode non-trivial sorts and won't require many changes to the codebase.
- Option #3 is to store SortSpec separately from PartitionSpec. This will require more changes compared to Option #2 but can also give us extra flexibility.
Each option has its own trade-offs, but I tend to think #2 is reasonable.

4) Which sort orders should Iceberg support?
- I think we have to be flexible and support adding more sort orders later. In addition to what Owen said, we can add sorting based on multi-dimensional space-filling curves in the future.

What do you think?

Thanks,
Anton

> On 1 Jul 2019, at 18:06, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> My thought is just like Iceberg has to define partitioning and bucketing, it
> has to define a canonical sort order. In particular, we can’t afford to have
> Spark, Presto, and Hive writing files in different orders. I believe the
> right approach is to define a sort order as a series of columns where each
> column is either ascending or descending and defining the natural sort order
> for each type.
>
> The hard bit will be if we need to support non-natural sorts of strings. For
> example, if we need to support case-insensitive sorts or the different
> collations that databases support, I’d hope that we could start with the
> default of utf-8 byte ordering and expand as needed. If you are curious what
> the different collations look like -
> https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database
>
> .. Owen
>
>> On Jul 1, 2019, at 4:18 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>>
>> Hey folks,
>>
>> Iceberg users are advised not only to partition their data but also to sort
>> within partitions by columns in predicates in order to get the best
>> performance. Right now, this process is mostly manual and performed by users
>> before writing.
>> I am wondering if we should extend Iceberg metadata so that query engines
>> can do this automatically in the future. We already have `sortColumns` in
>> DataFile but they are not used.
>> Do we need a notion of sort columns in TableMetadata?
>> Spark’s sort spec is tightly coupled with bucketing and cannot be used
>> alone. However, it seems reasonable to have partitioned and sorted tables
>> without bucketing. How do we see this in Iceberg?
>> If we decide to have sort spec in the metadata, do we want to make it part
>> of PartitionSpec or have it separately?
>>
>> Thanks,
>> Anton
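
To make the proposal quoted above more concrete, here is a minimal Java sketch of a sort order expressed as a series of columns, each ascending or descending, with a natural order per type and strings compared by UTF-8 byte order. All names in it (SortOrderSketch, SortField, Direction, compareUtf8, the "day" and "name" columns) are hypothetical and exist only for illustration. None of this is existing Iceberg API, and it does not settle whether the spec should live in PartitionSpec or separately.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not Iceberg API.
final class SortOrderSketch {

    enum Direction { ASC, DESC }

    // One entry of the sort order: a column name and a direction.
    static final class SortField {
        final String column;
        final Direction direction;

        SortField(String column, Direction direction) {
            this.column = column;
            this.direction = direction;
        }
    }

    // Strings compare by their UTF-8 bytes, treated as unsigned values.
    static int compareUtf8(String a, String b) {
        return Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8));
    }

    // Natural order per type: UTF-8 bytes for strings, compareTo otherwise.
    // Null handling is left out of this sketch.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareNatural(Object a, Object b) {
        if (a instanceof String && b instanceof String) {
            return compareUtf8((String) a, (String) b);
        }
        return ((Comparable) a).compareTo(b);
    }

    // Compare rows field by field; the first non-equal field decides.
    static Comparator<Map<String, Object>> comparator(List<SortField> order) {
        return (left, right) -> {
            for (SortField field : order) {
                int c = Integer.signum(
                    compareNatural(left.get(field.column), right.get(field.column)));
                if (c != 0) {
                    return field.direction == Direction.ASC ? c : -c;
                }
            }
            return 0;
        };
    }

    // Builds a sample row with the two hypothetical columns.
    static Map<String, Object> row(int day, String name) {
        Map<String, Object> r = new HashMap<>();
        r.put("day", day);
        r.put("name", name);
        return r;
    }

    // Sorts three sample rows by (day ASC, name DESC) and returns the names.
    static List<String> demo() {
        List<SortField> order = Arrays.asList(
            new SortField("day", Direction.ASC),
            new SortField("name", Direction.DESC));
        List<Map<String, Object>> rows = new ArrayList<>(Arrays.asList(
            row(2, "a"), row(1, "Zed"), row(1, "apple")));
        rows.sort(comparator(order));
        List<String> names = new ArrayList<>();
        for (Map<String, Object> r : rows) {
            names.add((String) r.get("name"));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Under byte ordering, "Zed" sorts before "apple" because uppercase ASCII letters have smaller byte values than lowercase ones. A case-insensitive or locale-aware sort would therefore need an explicit collation added on top of a default like this.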