To be more specific, I think it's sorting by the value after transformation?

On Wed, Jan 31, 2024 at 11:36 AM Amogh Jahagirdar <am...@tabular.io> wrote:

> Yeah I think being able to specify the order of the columns to sort by
> when rewriting the manifests makes a lot of sense.
>
> On Tue, Jan 30, 2024 at 5:47 PM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
>
>> Sounds reasonable to me.
>>
>> On Wed, Jan 31, 2024 at 7:56 AM <russell.spit...@gmail.com> wrote:
>>
>>> Sounds like a reasonable thing to add? Maybe we could check cardinality
>>> to pick out the default order as well?
>>> Sent from my iPhone
>>>
>>> On Jan 30, 2024, at 3:50 PM, Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> 
>>> Hi everyone,
>>>
>>> Today, the rewrite manifest procedure always orders the data files based
>>> on their *data_file.partition* value. Specifically, it sorts data files
>>> that have the same partition value, and then does a repartition by range
>>> based on the target number of manifest files (ref
>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L257-L258>
>>> ),
>>>
>>> I notice that this approach does not always yield the best performance
>>> for scan planning because the resulting manifest entries order is basically
>>> based on the default order of the partition columns.
>>>
>>> For example, consider a table partitioned by columns a and b. By default
>>> the rewrite procedure will organize manifest entries based on column a and
>>> then b. If most of my queries are using b as the predicate, rewriting
>>> manifests by sorting first against column b and then a will yield a much
>>> shorter scan planning time, because all manifest entries with similar b
>>> values are close together, and manifest list can be used to prune many
>>> files already without opening the manifest files.
>>>
>>> This happens a lot for cases like b is an event time timestamp column,
>>> which is not the first partition column, but actually the column that is
>>> read most frequently in every query.
>>>
>>> Translated to code, this means we can benefit from something like:
>>>
>>> SparkActions.rewriteManifests(table)
>>>   .sort("b", "a")
>>>   .commit()
>>>
>>> Any thoughts?
>>>
>>> Best,
>>> Jack Ye
>>>
>>>

Reply via email to