Hi everyone,

Today, the rewrite manifests procedure always orders data files by their
*data_file.partition* value. Specifically, it repartitions the manifest
entries by range based on the target number of manifest files, and then
sorts the entries within each output manifest by the partition value (ref
<https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L257-L258>
).
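
For reference, here is a minimal sketch of that clustering step
(simplified from the linked code; the method name and DataFrame shape are
just for illustration):

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Simplified version of the clustering in RewriteManifestsSparkAction:
// range-partition the manifest entries into the target number of output
// manifests, then sort within each one by the full partition struct.
Dataset<Row> clusterByPartition(Dataset<Row> manifestEntryDF, int targetNumManifests) {
  Column partitionColumn = manifestEntryDF.col("data_file.partition");
  return manifestEntryDF
      .repartitionByRange(targetNumManifests, partitionColumn)
      .sortWithinPartitions(partitionColumn);
}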

I have noticed that this approach does not always yield the best scan
planning performance, because the resulting manifest entry order simply
follows the default order of the partition columns.

For example, consider a table partitioned by columns a and b. By default,
the rewrite procedure organizes manifest entries by column a and then by
b. If most of my queries use b in their predicates, rewriting manifests
sorted first by column b and then by a yields a much shorter scan planning
time: all manifest entries with similar b values sit close together, so
the manifest list alone can already prune many manifest files without
opening them.
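
As a rough sketch of that idea (hypothetical code reusing the Spark SQL
imports above, and assuming the partition struct has fields a and b; the
nested column paths are illustrative):

// Hypothetical: cluster by a user-supplied column order (b first, then a)
// instead of the partition struct's declared order, so entries with
// similar b values land in the same manifests and the manifest list's
// partition summaries stay tight on b.
Dataset<Row> clusterBySortColumns(Dataset<Row> manifestEntryDF, int targetNumManifests) {
  Column b = manifestEntryDF.col("data_file.partition.b");
  Column a = manifestEntryDF.col("data_file.partition.a");
  return manifestEntryDF
      .repartitionByRange(targetNumManifests, b, a)
      .sortWithinPartitions(b, a);
}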

This happens a lot in cases where b is an event-time timestamp column
that is not the first partition column, but is the one filtered on most
frequently across queries.

Translated to code, this means we could benefit from something like:

SparkActions.get()
    .rewriteManifests(table)
    .sort("b", "a")
    .execute();
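
One possible shape for that API, purely illustrative (the sort method is
the proposed addition; the surrounding interface exists today in
org.apache.iceberg.actions):

public interface RewriteManifests
    extends SnapshotUpdate<RewriteManifests, RewriteManifests.Result> {
  // Proposed: cluster manifest entries by these partition columns, in
  // order, before assigning them to output manifests.
  RewriteManifests sort(String... partitionColumns);

  // ... existing methods such as rewriteIf(...) stay unchanged ...
}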

Any thoughts?

Best,
Jack Ye
