Hey folks,
Iceberg users are advised not only to partition their data but also to sort
within partitions by the columns used in query predicates in order to get the
best performance. Right now, this process is mostly manual and performed by
users before writing.
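To make the motivation concrete, here is a rough sketch of why sorting helps when files are skipped based on per-file min/max values of a column. It is a toy model in plain Python, not Iceberg code: the "files" are just lists, and `split_into_files` and `files_to_scan` are hypothetical helpers made up for this illustration.

```python
import random

# Toy model (assumption): a "file" is a list of values for one column, and
# file skipping only looks at each file's min and max for that column.

def split_into_files(values, rows_per_file):
    """Chunk a list of values into fixed-size 'files' (hypothetical helper)."""
    return [values[i:i + rows_per_file]
            for i in range(0, len(values), rows_per_file)]

def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for f in files if min(f) <= hi and max(f) >= lo)

random.seed(0)
values = [random.randrange(1_000_000) for _ in range(10_000)]

# Same rows, written either in arrival order or sorted by the predicate column.
unsorted_files = split_into_files(values, 1_000)
sorted_files = split_into_files(sorted(values), 1_000)

unsorted_hits = files_to_scan(unsorted_files, 500_000, 510_000)
sorted_hits = files_to_scan(sorted_files, 500_000, 510_000)

print("files scanned without sorting:", unsorted_hits)
print("files scanned with sorting:", sorted_hits)
```

In the unsorted case every file spans almost the whole value range, so a narrow predicate cannot rule any of them out. After sorting, each file covers a narrow slice and only the one or two files around the predicate range need to be read.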
I am wondering if we should extend Iceberg metadata so that query engines can
do this automatically in the future. We already have `sortColumns` in DataFile,
but it is not used.
Do we need a notion of sort columns in TableMetadata?
Spark’s sort spec is tightly coupled with bucketing and cannot be used alone.
However, it seems reasonable to have partitioned and sorted tables without
bucketing. How do we see this in Iceberg?
If we decide to have a sort spec in the metadata, do we want to make it part of
PartitionSpec or keep it separate?
Thanks,
Anton