[ 
https://issues.apache.org/jira/browse/HIVE-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836034#comment-17836034
 ] 

Sungwoo Park commented on HIVE-28196:
-------------------------------------

For context, we are investigating this JIRA as part of an effort to 
stabilize the performance of Hive 4.0.0 on the 10TB TPC-DS benchmark.

Compared with Hive 3.1.3, query 24-b is hit the hardest: its running time 
goes from about 300 seconds in Hive 3.1.3 to about 750 seconds in Hive 4.0.0. 
We have observed that this slowdown is caused by Hive using the default value 
of avgColLen after applying UDF upper/lower.

After applying this patch, the running time drops back to about 300 seconds 
because the query plan correctly uses MapJoin instead of MergeJoin. This patch 
targets Hive 4.1.0, but ideally it should also be merged into 4.0.1.
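
To illustrate why an inflated avgColLen can flip the join choice, here is a 
rough sketch in Python. It is not Hive's planner code: the function names, the 
row count, the real average length of 10, and the 50 MB threshold are all 
made-up assumptions. Only the default of 100 comes from the issue description. 
In Hive, settings such as hive.auto.convert.join.noconditionaltask.size play 
the role of the threshold, and the real planner considers more factors.

```python
# Simplified, hypothetical model of size-based join selection.
# None of these names exist in Hive; they are for illustration only.

# Default of hive.stats.max.variable.length, as stated in the issue description.
DEFAULT_MAX_VARIABLE_LENGTH = 100


def estimated_size_bytes(num_rows: int, avg_col_len: int) -> int:
    """Estimate the size of a single string column as rows * average length."""
    return num_rows * avg_col_len


def choose_join(small_side_bytes: int, map_join_threshold_bytes: int) -> str:
    """Pick MapJoin only if the small side is estimated to fit under the threshold."""
    if small_side_bytes <= map_join_threshold_bytes:
        return "MapJoin"
    return "MergeJoin"


# Hypothetical inputs: 1M rows, keys averaging 10 bytes, 50 MB threshold.
NUM_ROWS = 1_000_000
REAL_AVG_COL_LEN = 10
THRESHOLD = 50_000_000

# Stats replaced by the default after upper()/lower(): size is overestimated.
before_patch = choose_join(
    estimated_size_bytes(NUM_ROWS, DEFAULT_MAX_VARIABLE_LENGTH), THRESHOLD
)
# Input stats preserved: the estimate reflects the actual key length.
after_patch = choose_join(
    estimated_size_bytes(NUM_ROWS, REAL_AVG_COL_LEN), THRESHOLD
)

print(before_patch, after_patch)
```

With these assumed numbers, the tenfold overestimate pushes the small side 
above the threshold, so the sketch falls back to MergeJoin. Keeping the input 
column's avgColLen keeps the estimate under the threshold.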


> Preserve column stats when applying UDF upper/lower.
> ----------------------------------------------------
>
>                 Key: HIVE-28196
>                 URL: https://issues.apache.org/jira/browse/HIVE-28196
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 4.0.0
>            Reporter: Seonggon Namgung
>            Assignee: Seonggon Namgung
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0
>
>
> Currently, Hive re-estimates column stats (including avgColLen) when it 
> encounters a UDF.
> In the case of upper and lower, Hive sets avgColLen to 
> hive.stats.max.variable.length.
> But these UDFs do not change column stats, and the default value (100) is too 
> high for string-type key columns, to which upper/lower are usually applied.
> This patch keeps the input data's avgColLen after applying UDF upper/lower so 
> that Hive produces a better query plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
