[ https://issues.apache.org/jira/browse/HIVE-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516906#comment-17516906 ]
Ádám Szita commented on HIVE-26110: ----------------------------------- Thanks for finding this [~rbalamohan]. I was able to reproduce this locally, and found what most probably was missing from my patch in HIVE-25975 dealing with this feature: The explain plan lacked this line for the Reduce Sink operator: {code:java} Map-reduce partition columns: _col23 (type: bigint) {code} ..so although the SortedDynPartitionOptimizer has taken care of setting the Iceberg table's partition column as a KEY column, it didn't mark it as _*partition*_ {*}column in Map/Reduce terms{*}. (I have probably missed this because of the confusing terminology, as I see it, this is not a Hive table partition, but a MR job relevant term for adjusting the shuffle phase... ) This made all reducers write a certain amount of rows irrespective of key distribution (nevertheless the rows within the reducer tasks were sorted on the key column - otherwise an exception would have been thrown from ClusteredWriter class as you mentioned it earlier) and we ended up with {code:java} 1 <= n <= reducerCount {code} files for each partition. Anyhow what we probably need is to add a partCols.add() invocation beside [https://github.com/apache/hive/pull/3060/files#diff-b28bcf13b1a3e2d73139df50ed102fae4625032ea73337aa1c295bb3069e1499R651] , it actually solved the issue on my local repro. > bulk insert into partitioned table creates lots of files in iceberg > ------------------------------------------------------------------- > > Key: HIVE-26110 > URL: https://issues.apache.org/jira/browse/HIVE-26110 > Project: Hive > Issue Type: Bug > Reporter: Rajesh Balamohan > Priority: Major > > For e.g, create web_returns table in tpcds in iceberg format and try to copy > over data from regular table. More like "insert into web_returns_iceberg as > select * from web_returns". > This inserts the data correctly, however there are lot of files present in > each partition. IMO, dynamic sort optimisation isn't working fine and this > causes records not to be grouped in the final phase. -- This message was sent by Atlassian Jira (v8.20.1#820001)