[ 
https://issues.apache.org/jira/browse/HIVE-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154115#comment-14154115
 ] 

Zhichun Wu commented on HIVE-8151:
----------------------------------

[~prasanth_j], I find that the explain output of the insert statement in the 
testcase differs slightly depending on whether this optimization is enabled. 
After digging into the code, it seems that before the NonBlockingOpDeDupProc 
optimization is applied, there are three select operators in a row before the 
FileSink operator. NonBlockingOpDeDupProc tries to deduplicate these select 
operators, and the cast of _col1 to int before writing to the file is lost 
during the deduplication process. More precisely, backtracking 
cSELExprNodeDesc fails because the columnExprMap is missing:
{code}
ExprNodeDesc newPSELExprNodeDesc =
    ExprNodeDescUtils.backtrack(cSELExprNodeDesc, cSEL, pSEL);
{code}
I tried including the columnExprMap in 
SemanticAnalyzer#genConversionSelectOperator, and with that change the 
testcase passes. Please correct me if I'm wrong.
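
To illustrate the deduplication step described above, here is a minimal toy model. It does not use Hive's classes: Select, merge and demo are hypothetical names, expressions are plain strings, and the fallback that passes a column reference through unchanged when no mapping exists is an assumption made for illustration, not a statement about what ExprNodeDescUtils.backtrack actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SelectDedupSketch {

    /** Hypothetical stand-in for a select operator (not a Hive class). */
    static final class Select {
        final List<String> exprs;             // expressions over this operator's input columns
        final Map<String, String> colExprMap; // output column name -> expression; may be null

        Select(List<String> exprs, Map<String, String> colExprMap) {
            this.exprs = exprs;
            this.colExprMap = colExprMap;
        }
    }

    /** Rewrites the child's column references in terms of the parent's input columns. */
    static List<String> merge(Select parent, Select child) {
        List<String> merged = new ArrayList<>();
        for (String col : child.exprs) {
            String mapped = parent.colExprMap == null ? null : parent.colExprMap.get(col);
            // Assumed fallback: with no mapping, the reference is kept as-is,
            // so any expression the parent applied to that column is dropped.
            merged.add(mapped != null ? mapped : col);
        }
        return merged;
    }

    /** Merges a casting parent select with a pass-through child select. */
    static List<String> demo(boolean parentHasMap) {
        List<String> parentExprs = Arrays.asList("_col0", "CAST(_col1 AS INT)");
        Map<String, String> map = null;
        if (parentHasMap) {
            map = new LinkedHashMap<>();
            map.put("_col0", parentExprs.get(0));
            map.put("_col1", parentExprs.get(1));
        }
        Select parent = new Select(parentExprs, map);
        Select child = new Select(Arrays.asList("_col0", "_col1"), null);
        return merge(parent, child);
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints [_col0, _col1]
        System.out.println(demo(true));  // prints [_col0, CAST(_col1 AS INT)]
    }
}
```

In this toy model the cast survives the merge only when the casting select carries a column-to-expression map, which mirrors the effect of adding the columnExprMap in genConversionSelectOperator.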

> Dynamic partition sort optimization inserts record wrongly to partition when 
> used with GroupBy
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8151
>                 URL: https://issues.apache.org/jira/browse/HIVE-8151
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.14.0, 0.13.1
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8151.1.patch, HIVE-8151.2.patch, HIVE-8151.3.patch, 
> HIVE-8151.4.patch, HIVE-8151.5.patch, HIVE-8151.6.patch, HIVE-8151.7.patch, 
> HIVE-8151.8.patch
>
>
> HIVE-6455 added dynamic partition sort optimization. It added startGroup() 
> method to FileSink operator to look for changes in reduce key for creating 
> partition directories. This method however is not reliable as the key called 
> with startGroup() is different from the key called with processOp(). 
> startGroup() is called with newly changed key whereas processOp() is called 
> with previously aggregated key. This will result in processOp() writing the 
> last row of previous group as the first row of next group. This happens only 
> when used with group by operator.
> The fix is to not rely on startGroup() and do the partition directory 
> creation in processOp() itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
