To be more specific, I think it's sorting by the value after transformation?
On Wed, Jan 31, 2024 at 11:36 AM Amogh Jahagirdar <am...@tabular.io> wrote: > Yeah I think being able to specify the order of the columns to sort by > when rewriting the manifests makes a lot of sense. > > On Tue, Jan 30, 2024 at 5:47 PM Renjie Liu <liurenjie2...@gmail.com> > wrote: > >> Sounds reasonable to me. >> >> On Wed, Jan 31, 2024 at 7:56 AM <russell.spit...@gmail.com> wrote: >> >>> Sounds like a reasonable thing to add? Maybe we could check cardinality >>> to pick out the default order as well? >>> Sent from my iPhone >>> >>> On Jan 30, 2024, at 3:50 PM, Jack Ye <yezhao...@gmail.com> wrote: >>> >>> >>> Hi everyone, >>> >>> Today, the rewrite manifest procedure always orders the data files based >>> on their *data_file.partition* value. Specifically, it sorts data files >>> that have the same partition value, and then does a repartition by range >>> based on the target number of manifest files (ref >>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L257-L258> >>> ), >>> >>> I notice that this approach does not always yield the best performance >>> for scan planning because the resulting manifest entries order is basically >>> based on the default order of the partition columns. >>> >>> For example, consider a table partitioned by columns a and b. By default >>> the rewrite procedure will organize manifest entries based on column a and >>> then b. If most of my queries are using b as the predicate, rewriting >>> manifests by sorting first against column b and then a will yield a much >>> shorter scan planning time, because all manifest entries with similar b >>> values are close together, and manifest list can be used to prune many >>> files already without opening the manifest files. >>> >>> This happens a lot for cases like b is an event time timestamp column, >>> which is not the first partition column, but actually the column that is >>> read most frequently in every query. >>> >>> Translated to code, this means we can benefit from something like: >>> >>> SparkActions.rewriteManifests(table) >>> .sort("b", "a") >>> .commit() >>> >>> Any thoughts? >>> >>> Best, >>> Jack Ye >>> >>>