[ https://issues.apache.org/jira/browse/HIVE-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xin Hao updated HIVE-13634:
---------------------------
    Description:
Hive-on-Spark performs worse than Hive-on-MR for queries with external scripts.

TPCx-BB Q2, Q3, and Q4 are Python streaming cases that call external scripts to handle reduce tasks. For these three queries, Hive-on-Spark is slower than Hive-on-MR when processing reduce tasks with external (Python) scripts, so ‘Improve HoS performance for queries with external scripts’ looks like a performance optimization opportunity.

The following shows the Q2/Q3/Q4 test results on an 8-worker-node cluster with the TPCx-BB 3TB data size.

TPCx-BB Query 2
(1) Hive-on-MR
Total Query Execution Time (sec): 2172.180
Execution Time of External Scripts (sec): 736
(2) Hive-on-Spark
Total Query Execution Time (sec): 2283.604
Execution Time of External Scripts (sec): 1197

TPCx-BB Query 3
(1) Hive-on-MR
Total Query Execution Time (sec): 1070.632
Execution Time of External Scripts (sec): 513
(2) Hive-on-Spark
Total Query Execution Time (sec): 1287.679
Execution Time of External Scripts (sec): 919

TPCx-BB Query 4
(1) Hive-on-MR
Total Query Execution Time (sec): 1781.864
Execution Time of External Scripts (sec): 1518
(2) Hive-on-Spark
Total Query Execution Time (sec): 2028.023
Execution Time of External Scripts (sec): 1599

  was:
Hive-on-Spark performed worse than Hive-on-MR, for queries with external scripts.

For TPCx-BB Q2/Q3/Q4, they are Python Streaming related cases and will call external scripts to handle reduce tasks. We found that for these 3 queries Hive-on-Spark shows lower performance than Hive-on-MR when processing reduce tasks with external (Python) scripts. So ‘Improve HoS performance for queries with external scripts’ seems a performance optimization opportunity.
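
As a rough reading of the timings reported in the updated description, the Hive-on-Spark slowdown relative to Hive-on-MR can be computed per query. This is only an illustrative sketch: the names RESULTS, slowdown, and SLOWDOWN are hypothetical and not part of the report, and the numbers are copied from the description as given.

```python
# Timings in seconds, copied from the updated description.
# Each entry is (total query execution time, external script execution time).
RESULTS = {
    "Q2": {"mr": (2172.180, 736), "spark": (2283.604, 1197)},
    "Q3": {"mr": (1070.632, 513), "spark": (1287.679, 919)},
    "Q4": {"mr": (1781.864, 1518), "spark": (2028.023, 1599)},
}


def slowdown(results):
    """Return Hive-on-Spark / Hive-on-MR time ratios for each query."""
    ratios = {}
    for query, engines in results.items():
        mr_total, mr_script = engines["mr"]
        spark_total, spark_script = engines["spark"]
        ratios[query] = {
            "total_ratio": spark_total / mr_total,
            "script_ratio": spark_script / mr_script,
        }
    return ratios


SLOWDOWN = slowdown(RESULTS)
for query, r in SLOWDOWN.items():
    print(f"{query}: total x{r['total_ratio']:.2f}, "
          f"external scripts x{r['script_ratio']:.2f}")
```

On these figures, external-script time under Hive-on-Spark is roughly 1.63x (Q2), 1.79x (Q3), and 1.05x (Q4) that of Hive-on-MR, while total query time is roughly 1.05x, 1.20x, and 1.14x.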
> Hive-on-Spark performed worse than Hive-on-MR, for queries with external scripts
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-13634
>                 URL: https://issues.apache.org/jira/browse/HIVE-13634
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Xin Hao