[jira] [Work logged] (HIVE-26110) bulk insert into partitioned table creates lots of files in iceberg

ASF GitHub Bot (Jira) Mon, 04 Apr 2022 20:19:05 -0700


     [ https://issues.apache.org/jira/browse/HIVE-26110?focusedWorklogId=752617&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-752617 ]


ASF GitHub Bot logged work on HIVE-26110:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Apr/22 03:18
            Start Date: 05/Apr/22 03:18
    Worklog Time Spent: 10m 
      Work Description: rbalamohan commented on code in PR #3174:
URL: https://github.com/apache/hive/pull/3174#discussion_r842299555


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java:
##########
@@ -648,7 +648,12 @@ public ReduceSinkOperator getReduceSinkOp(List<Integer> partitionPositions, List
       ArrayList<ExprNodeDesc> partCols = Lists.newArrayList();
 
       for (Function<List<ExprNodeDesc>, ExprNodeDesc> customSortExpr : customSortExprs) {
-        keyCols.add(customSortExpr.apply(allCols));
+        ExprNodeDesc colExpr = customSortExpr.apply(allCols);
+        // Custom sort expressions are marked as KEYs, which is required for sorting the rows that are going for
+        // a particular reducer instance. They also need to be marked as 'partition' columns for MapReduce shuffle
+        // phase, in order to gather the same keys to the same reducer instances.
+        keyCols.add(colExpr);
+        partCols.add(colExpr);

Review Comment:
   In the case of Iceberg, "getDPColNames", "getNumDPCols", etc. would not be 
available in the context. There are some historical assumptions that partition 
names will be present at the end of the schema. When Iceberg tables are used, 
these assumptions are not valid. 
   
   Would it be good to add "colExpr" to partCols when both "partitionPositions" 
and "dpCtx.getDPColNames()" are empty?
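
   The guard suggested above could look roughly like the following. This is 
only a minimal, self-contained sketch and not the code from the PR: plain 
Strings stand in for ExprNodeDesc, and the class name, the method 
"addCustomSortExprs" and its parameter list are hypothetical. Only the names 
keyCols, partCols, colExpr, customSortExprs, allCols and partitionPositions 
come from the diff, and "dpColNames" stands in for the result of 
dpCtx.getDPColNames().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PartColsGuardSketch {

    // Hypothetical stand-in for the loop in getReduceSinkOp. Strings replace ExprNodeDesc.
    static void addCustomSortExprs(
            List<Function<List<String>, String>> customSortExprs,
            List<String> allCols,
            List<Integer> partitionPositions,
            List<String> dpColNames,
            List<String> keyCols,
            List<String> partCols) {
        // True when no Hive-style dynamic partition information is available,
        // which the review comment describes as the Iceberg case.
        boolean noDynamicPartitionInfo =
                (partitionPositions == null || partitionPositions.isEmpty())
                && (dpColNames == null || dpColNames.isEmpty());

        for (Function<List<String>, String> customSortExpr : customSortExprs) {
            String colExpr = customSortExpr.apply(allCols);
            // Always a sort key.
            keyCols.add(colExpr);
            // Only a shuffle partition column when nothing else defines the partitioning.
            if (noDynamicPartitionInfo) {
                partCols.add(colExpr);
            }
        }
    }

    public static void main(String[] args) {
        Function<List<String>, String> bucketOfFirstCol = cols -> "bucket(" + cols.get(0) + ")";
        List<Function<List<String>, String>> exprs = List.of(bucketOfFirstCol);

        // Iceberg-like case: no partition positions and no DP column names.
        List<String> keyCols = new ArrayList<>();
        List<String> partCols = new ArrayList<>();
        addCustomSortExprs(exprs, List.of("c"), List.of(), List.of(), keyCols, partCols);
        System.out.println("keyCols=" + keyCols + " partCols=" + partCols);

        // Case with dynamic partition info present: partCols is left untouched here.
        keyCols = new ArrayList<>();
        partCols = new ArrayList<>();
        addCustomSortExprs(exprs, List.of("c"), List.of(1), List.of("p"), keyCols, partCols);
        System.out.println("keyCols=" + keyCols + " partCols=" + partCols);
    }
}
```

   In the first call both lists receive "bucket(c)"; in the second call only 
keyCols does, so any partition columns derived from partitionPositions would 
remain the sole shuffle keys.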





Issue Time Tracking
-------------------

    Worklog Id:     (was: 752617)
    Time Spent: 20m  (was: 10m)

> bulk insert into partitioned table creates lots of files in iceberg
> -------------------------------------------------------------------
>
>                 Key: HIVE-26110
>                 URL: https://issues.apache.org/jira/browse/HIVE-26110
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> For example, create the web_returns table from TPC-DS in Iceberg format and 
> try to copy over data from a regular table, with something like "insert into 
> web_returns_iceberg select * from web_returns".
> This inserts the data correctly; however, there are a lot of files present in 
> each partition. IMO, the dynamic sort optimisation isn't working properly, and 
> this causes records not to be grouped in the final phase.
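
To illustrate why ungrouped records lead to many files, here is a toy model 
of the shuffle. It is a sketch under simplifying assumptions and does not use 
any Hive or Iceberg code: every reducer is assumed to write one file per 
partition value it receives, round-robin routing stands in for whatever 
distribution applies when no partition columns are set, and the class and 
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShuffleFileCountSketch {

    // Counts distinct (reducer, partition value) pairs, i.e. files under the toy model.
    static int countFiles(List<String> partitionValues, int numReducers, boolean routeByPartitionValue) {
        Set<String> files = new HashSet<>();
        for (int i = 0; i < partitionValues.size(); i++) {
            String value = partitionValues.get(i);
            int reducer = routeByPartitionValue
                    // Same partition value always lands on the same reducer.
                    ? Math.floorMod(value.hashCode(), numReducers)
                    // Assumption: rows are spread without regard to their partition value.
                    : i % numReducers;
            files.add(reducer + "/" + value);
        }
        return files.size();
    }

    // 24 rows cycling through three partition values: p0, p1, p2.
    static List<String> sampleRows() {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 24; i++) {
            rows.add("p" + (i % 3));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> rows = sampleRows();
        System.out.println("files without partition columns: " + countFiles(rows, 4, false));
        System.out.println("files with partition columns:    " + countFiles(rows, 4, true));
    }
}
```

With 4 reducers and 3 partition values, the toy model yields 12 files when 
rows are routed without regard to the partition value and 3 files when rows 
are routed by it.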



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
