[jira] [Work logged] (HIVE-26110) bulk insert into partitioned table creates lots of files in iceberg

ASF GitHub Bot (Jira) Tue, 05 Apr 2022 02:01:04 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26110?focusedWorklogId=752721&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-752721
 ]


ASF GitHub Bot logged work on HIVE-26110:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Apr/22 09:00
            Start Date: 05/Apr/22 09:00
    Worklog Time Spent: 10m 
      Work Description: szlta commented on code in PR #3174:
URL: https://github.com/apache/hive/pull/3174#discussion_r842544398


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java:
##########
@@ -648,7 +648,12 @@ public ReduceSinkOperator getReduceSinkOp(List<Integer> partitionPositions, List
       ArrayList<ExprNodeDesc> partCols = Lists.newArrayList();
 
       for (Function<List<ExprNodeDesc>, ExprNodeDesc> customSortExpr : customSortExprs) {
-        keyCols.add(customSortExpr.apply(allCols));
+        ExprNodeDesc colExpr = customSortExpr.apply(allCols);
+        // Custom sort expressions are marked as KEYs, which is required for sorting the rows that are going for
+        // a particular reducer instance. They also need to be marked as 'partition' columns for MapReduce shuffle
+        // phase, in order to gather the same keys to the same reducer instances.
+        keyCols.add(colExpr);
+        partCols.add(colExpr);

Review Comment:
   Thx!
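
The effect described in the code comment above can be illustrated with a small standalone sketch. It is not Hive code: the class ShuffleSketch and the method countFiles are hypothetical names, rows without partition columns are modeled as being spread round-robin across reducers, and each distinct (reducer, partition value) pair is assumed to produce one output file.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hive.
class ShuffleSketch {

    // Counts distinct (reducer, partitionValue) pairs. Each pair models one
    // output file, assuming a reducer writes one file per partition value it sees.
    static int countFiles(List<String> partitionValues, int reducers, boolean usePartitionCols) {
        Set<String> files = new HashSet<>();
        for (int i = 0; i < partitionValues.size(); i++) {
            String value = partitionValues.get(i);
            int reducer = usePartitionCols
                    // Partition columns set: equal values always hash to the same reducer.
                    ? Math.floorMod(value.hashCode(), reducers)
                    // No partition columns: rows are spread out (modeled as round-robin).
                    : i % reducers;
            files.add(reducer + "/" + value);
        }
        return files.size();
    }

    public static void main(String[] args) {
        String[] values = {"a", "b", "c"};
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            rows.add(values[i % values.length]);
        }
        System.out.println("files without partition cols: " + countFiles(rows, 4, false));
        System.out.println("files with partition cols:    " + countFiles(rows, 4, true));
    }
}
```

With 12 rows over 3 partition values and 4 reducers, the round-robin case yields 12 modeled files, while routing on the partition value yields 3, one per partition value.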





Issue Time Tracking
-------------------

    Worklog Id:     (was: 752721)
    Time Spent: 50m  (was: 40m)

> bulk insert into partitioned table creates lots of files in iceberg
> -------------------------------------------------------------------
>
>                 Key: HIVE-26110
>                 URL: https://issues.apache.org/jira/browse/HIVE-26110
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> For example, create the web_returns table from TPC-DS in Iceberg format and
> try to copy data over from a regular table, with something like "insert into
> web_returns_iceberg as select * from web_returns".
> This inserts the data correctly; however, there are a lot of files present in
> each partition. IMO, the dynamic sort optimisation isn't working correctly,
> and this causes records not to be grouped in the final phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
