Let me summarize what we talked here and follow up with a PR. - Iceberg should allow users to define a sort oder in its metadata that applies to partitions. - We should never assume the sort order is actually applied to all files in the table. - Sort orders might evolve and change over time. When this happens, existing files will not be rewritten. Query engines should follow the updated sort order during subsequent writes. As a result, files within a table or partition can be sorted differently at a given point in time. - We should be able to define a sort order even for unpartitioned tables, as opposed to current Spark tables that allow a sort order only for bucketed tables. - SortOrder is separate from PartitionSpec. - SortOrder will rely on transformations to define complex sort orders. - Files will be annotated with sort_order_id instead of sort_columns. We keep the question of file_ordinal open for now. - To begin with, we will support asc/desc natural sort orders (UTF8 ordering for Strings).
Thanks, Anton > On 16 Jul 2019, at 23:56, Ryan Blue <rb...@netflix.com.INVALID> wrote: > > I agree that Iceberg metadata should include a way to configure a desired > sort order. But I want to note that I don’t think that we can ever assume > that it has been applied. Table configuration will evolve as use changes. We > don’t want to require rewrites when a configuration gets updated, so an > assumption should be that data files might not be sorted. > > Files that are sorted should indicate how they are sorted, so that > optimizations are applied if the file’s metadata indicates it can be safely > applied. For example, if both deletes and data rows are sorted the same way, > you can merge the two streams instead of using a hash set to check whether a > record has been deleted. I think this should rely on the delete file’s sort > order matching the data file it is applied to. > > Should Iceberg allow users to define a sort spec only if the table is > bucketed? > > No. In Iceberg, bucketing is just another partition transform. > > However, I think we need to consider what a sort order will mean. Here are a > few observations: > > Each file can have a sort order for its rows (Spark’s sortWithinPartitions, > which sorts each task’s data) > Sorting is also used to cluster values across files so it makes sense for a > table sort order to be applied within partitions (ORDER BY) > Multiple writes to the same partition are not expected to rewrite existing > data, so a partition may only be partially sorted or may have multiple sorted > file sets > Partitioning is independent from sorting. Even when partitioning is > orthogonal to a sort order (i.e., bucketing), partitioning must still take > precedence. > My conclusion is that a configured sort order applies to partitions, not data > across partitions. Again, bucketing is just another type of partition. > > How should Iceberg encode sort specs? > > I don’t think this should be in table properties. The sort order should > reference columns by ID so it doesn’t need to be changed when columns are > renamed. I think this should be implemented like PartitionSpec. > > If sorting is applied within partitions, then I would make PartitionSpec and > SortOrder separate. I would still use transforms to produce more complex sort > orders. I think that’s a great idea, but we don’t need to mix partitioning > and sorting to reuse transforms. Like partition specs, I think a table should > be able to define multiple sort orders and each should be identified by ID. > Then each data file can encode which sort order it was written with, just > like manifests and partition specs. > > I think we should add sort-orders like partition-specs, and a > default-sort-order-id like default-spec-id. This would also require removing > sort_columns from data files in the spec > <http://iceberg.apache.org/spec/#manifests> and adding sort_order_id. We can > keep file_ordinal, but probably want to add some context to know the group of > files where it is valid. We could also remove it. > > Which sort orders should Iceberg support? > > I agree with what’s already been said: we should use a natural order for each > type > <https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/types/Comparators.java#L31-L45>, > ascending and descending. To start, Strings must use UTF-8’s natural > ordering and we can expand from there. > > Here’s what a sort order might look like: > > "sort-orders": [ > { "id": 1, "ordering": [ { "source-id": 4, "ascending": true } ] }, > { "id": 2, "ordering": [ { "source-id": 4, "ascending": true }, { > "source-id": 5, "ascending": false } ] }, > { "id": 3, "ordering": [ { "transform": "zorder", "ascending": false, > "source-ids": [4, 5] } ] }, > ] > > On Thu, Jul 4, 2019 at 9:46 AM Anton Okolnychyi > <aokolnyc...@apple.com.invalid> wrote: > In order to begin prototyping, I would start with the following questions. > > 1) Does Iceberg need a sort spec? > - I would say yes > 2) Should Iceberg allow users to define a sort spec only if the table is > bucketed? > - I would say no, as it seems valid to have partitioned and sorted > tables. > 3) How should Iceberg encode sort specs? > - Option #1 is to rely on table properties, which will allow us to use > ALTER TABLE ... SET TBLPROPERTIES to configure sorting specs. However, I am > not sure it would be easy to encode non-trivial sort specs and track sort > spec evolution (if needed). > - Option #2 is to extend PartitionSpec to cover sorting as well. This > option will allow us to use transformations to encode non-trivial sorts and > won't require many changes to the codebase. > - Option #3 is to store SortSpec separately from PartitionSpec. This > will require more changes compared to Option #2 but can also give us extra > flexibility. > > Each option has its own trade-offs, but I tend to think #2 is reasonable. > > 4) Which sort orders should Iceberg support? > - I think we have to be flexible and support adding more sort orders > later. In addition to what Owen said, we can add sorting based on > multi-dimensional space-filling curves in the future. > > > What do you think? > > Thanks, > Anton > >> On 1 Jul 2019, at 18:06, Owen O'Malley <owen.omal...@gmail.com >> <mailto:owen.omal...@gmail.com>> wrote: >> >> My thought is just like Iceberg has to define partitioning and bucketing, it >> has to define a canonical sort order. In particular, we can’t afford to have >> Spark, Presto, and Hive writing files in different orders. I believe the >> right approach is to define a sort order as a series of columns where each >> column is either ascending or descending and defining the natural sort order >> for each type. >> >> The hard bit will be if we need to support non-natural sorts of strings. For >> example, if we need to support case-insensitive sorts or the different >> collations that databases support, I’d hope that we could start with the >> default of utf-8 byte ordering and expand as needed. If you are curious what >> the different collations look like - >> https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database >> >> <https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database> >> . >> >> .. Owen >> >>> On Jul 1, 2019, at 4:18 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID >>> <mailto:aokolnyc...@apple.com.INVALID>> wrote: >>> >>> Hey folks, >>> >>> Iceberg users are advised not only to partition their data but also to sort >>> within partitions by columns in predicates in order to get the best >>> performance. Right now, this process is mostly manual and performed by >>> users before writing. >>> I am wondering if we should extend Iceberg metadata so that query engines >>> can do this automatically in the future. We already have `sortColumns` in >>> DataFile but they are not used. >>> Do we need a notion of sort columns in TableMetadata? >>> Spark’s sort spec is tightly coupled with bucketing and cannot be used >>> alone. However, it seems reasonable to have partitioned and sorted tables >>> without bucketing. How do we see this in Iceberg? >>> If we decide to have sort spec in the metadata, do we want to make it part >>> of PartitionSpec or have it separately? >>> Thanks, >>> Anton >>> >> > > > > -- > Ryan Blue > Software Engineer > Netflix