[ https://issues.apache.org/jira/browse/HIVE-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157994#comment-16157994 ]
Ke Jia commented on HIVE-17139:
-------------------------------

Uploaded the latest patch to fix the failed tests; the remaining failures do not appear to be related to the patch.

I tested the patch against the product_reviews table of TPCx-BB using the following SQL statement:
{code:java}
select
  case
    when pr_review_rating=4 then upper(pr_review_content)
    when pr_review_rating=3 then upper(pr_review_content)
  end
from product_reviews;
{code}
The cluster has 8 nodes with 230G per node. The CPU is Intel(R) Xeon(R) CPU E5-2699. With a 3TB data scale and Spark as the execution engine, the results are as follows:
|| ||without patch||with patch||improvement(s)||improvement(%)||
|Hos|28.25s|16.14s|12.11s|42.8%|
|VectorSelectOperator|12.58s|2.99s|9.59s|76.2%|

The results show that the Spark execution time drops from 28.25s to 16.14s and the time spent in VectorSelectOperator drops from 12.58s to 2.99s.

The total record count and the counts of "pr_review_rating=4" and "pr_review_rating=3" records are as follows:
|| ||count||
|total records|9934636|
|pr_review_rating=4 records|1897804|
|pr_review_rating=3 records|792278|

With this patch, only (1897804+792278) records go through the upper operation in the above SQL statement; without this patch, (9934636+9934636) records go through it.

> Conditional expressions optimization: skip the expression evaluation if the
> condition is not satisfied for vectorization engine.
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-17139
> URL: https://issues.apache.org/jira/browse/HIVE-17139
> Project: Hive
> Issue Type: Improvement
> Reporter: Ke Jia
> Assignee: Ke Jia
> Attachments: HIVE-17139.1.patch, HIVE-17139.2.patch,
> HIVE-17139.3.patch, HIVE-17139.4.patch, HIVE-17139.5.patch,
> HIVE-17139.6.patch, HIVE-17139.7.patch, HIVE-17139.8.patch
>
>
> The execution of CASE WHEN and IF statements in Hive vectorization is not
> optimal: in the current implementation, all the conditional and else
> expressions are evaluated for every row. The optimized approach is to update
> the selected array of the batch parameter after the conditional expression is
> executed, so that the else expression is evaluated only for the selected rows
> instead of all rows.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
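
The approach in the issue description can be illustrated with a small standalone sketch. It is not the code from the attached patches and does not use Hive's classes: `ConditionalSkipSketch`, `upper`, `caseWhenUpper`, and the `upperCalls` counter are hypothetical names, and plain arrays stand in for a vectorized batch and its selected array. After each WHEN condition is evaluated, the selected rows are narrowed, so the branch expression only touches the rows that matched, and later conditions only see the rows that have not matched yet.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical, simplified model of a vectorized CASE WHEN; not Hive's actual API.
final class ConditionalSkipSketch {
    // Counts how many rows the branch expression was evaluated for.
    static int upperCalls = 0;

    // Stand-in for a vectorized upper(): only processes rows listed in selected[0..size).
    static void upper(String[] in, String[] out, int[] selected, int size) {
        for (int j = 0; j < size; j++) {
            int row = selected[j];
            out[row] = in[row].toUpperCase(Locale.ROOT);
            upperCalls++;
        }
    }

    // CASE WHEN rating = whenValues[k] THEN upper(content) ... END, with no ELSE branch.
    static String[] caseWhenUpper(int[] rating, String[] content, int[] whenValues) {
        int n = rating.length;
        String[] out = new String[n];      // rows matching no branch stay null
        int[] remaining = new int[n];      // rows not yet claimed by any WHEN
        for (int i = 0; i < n; i++) {
            remaining[i] = i;
        }
        int remainingSize = n;
        int[] matched = new int[n];

        for (int when : whenValues) {
            int matchedSize = 0;
            int keep = 0;
            // Evaluate the condition only on rows still remaining, then split them.
            for (int j = 0; j < remainingSize; j++) {
                int row = remaining[j];
                if (rating[row] == when) {
                    matched[matchedSize++] = row;
                } else {
                    remaining[keep++] = row; // in-place compaction; keep <= j always holds
                }
            }
            remainingSize = keep;
            // Evaluate the THEN expression for the matched rows only.
            upper(content, out, matched, matchedSize);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] rating = {4, 1, 3, 5, 4};
        String[] content = {"good", "bad", "ok", "best", "great"};
        String[] result = caseWhenUpper(rating, content, new int[] {4, 3});
        System.out.println(Arrays.toString(result));
        System.out.println("upper() evaluated for " + upperCalls + " of " + rating.length + " rows");
    }
}
```

In this model the number of `upper` evaluations equals the number of rows that match a branch, which mirrors the (1897804+792278) versus (9934636+9934636) comparison in the comment above.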