Re: Hive 2 -> Hive 3 difference in Distribute by behaviour

Patrick Duin Mon, 22 Sep 2025 06:19:46 -0700

Sharing as we found the solution by setting:
hive.optimize.sort.dynamic.partition.threshold='-1' (default was 1 in
EMR7.6.0)


We already did set: hive.optimize.sort.dynamic.partition='false' But the
threshold needs to be set as well to get our behaviour back. The job
distributes data nicely to the reducers this way.



Op wo 17 sep 2025 om 15:34 schreef Patrick Duin <patd...@gmail.com>:

> Hi,
>
> We've noticed a difference in behaviour in doing our insert overwrite
> queries when moving to Hive3.
> We are using dynamic partitions where our queries look something like this:
>
> insert overwrite tableX partition('col0', 'col1') select * from t1
> distribute by col0, col1, hash(columns)%100
>
> Where 100 is the number of files we want in the partition folder. So we
> would generate 100 1GB files per partition.
>
> In Hive2 the 'distribute by' clause caused the data to get nicely
> distributed to the number of configured reducers. In Hive3 running with the
> same amount of reducers the data gets distributed to just 1 reducer.
> Generating 1 very large 100 GB file.
>
> The plan for the queries also differ in the 'Map-reduce partition columns'
> in the last reducer stage.
> For Hive 2 it says:  Map-reduce partition columns:_col0, col1,
> hash(cols)%100
> For Hive 3 it says:  Map-reduce partition columns:_col0, col1
> Which corresponds with the behaviour we see.
>
> We've been trying every possible conf that we can find but we can't find a
> way to reproduce our hive2 behaviour. Any suggestions?
>
>
> Thanks,
>  Patrick
>

Re: Hive 2 -> Hive 3 difference in Distribute by behaviour

Reply via email to