Hi Ed,

In recent versions of Spark (3.5+), for both the hash and range
distribution modes you can control the partition size via the Spark property
"spark.sql.adaptive.advisoryPartitionSizeInBytes". This helps mitigate the
small files problem.
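For reference, here is a minimal sketch of the setup described above. The table, source, and merge condition are hypothetical placeholders; the byte value is illustrative, and the advisory size only takes effect when adaptive query execution is enabled:

```sql
-- Enable AQE and target roughly 128 MB output partitions (illustrative value).
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.advisoryPartitionSizeInBytes = 134217728;

-- Hypothetical upsert into a daily-partitioned table.
MERGE INTO db.events t
USING updates s
ON t.id = s.id AND t.day = s.day
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```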

Regards,
Namratha

On Mon, Apr 7, 2025 at 8:44 AM Ed Mancebo <edmanc...@gmail.com> wrote:

> Hi all,
>
> First time posting here.
>
> I’m using MERGE INTO to upsert into a table with daily partitions.  More
> recent days tend to have many more updates, which is causing skew in the
> write stage when write.distribution-mode=hash (the most recent day of data
> will get assigned to a single task, which takes much longer to finish than
> older days).
>
> I tried write.distribution-mode=range instead, but this only helps a
> little bit.  I think this does a good job of splitting up the most recent
> days across multiple tasks, but probably clusters the very oldest/smallest
> days on a single task, which is slow due to opening and closing too many
> small files.
>
> I’m wondering if there’s a mode that works well for this use case that I
> may have missed, or if not, is there any appetite for supporting one?  One
> idea is to add an option for a user-specified column in the clustering in
> SparkDistributionAndOrderingUtil.  This would allow the caller to provide
> an additional column to split up large partitions while writing, without
> changing the table partitioning.
>
> Thanks in advance -
>
> Ed
>
>
