[ https://issues.apache.org/jira/browse/HIVE-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mustafa Iman reassigned HIVE-24262:
-----------------------------------

    Assignee: Mustafa Iman

> Optimise NullScanTaskDispatcher for cloud storage
> -------------------------------------------------
>
>                 Key: HIVE-24262
>                 URL: https://issues.apache.org/jira/browse/HIVE-24262
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Mustafa Iman
>            Priority: Major
>
> {noformat}
> select count(DISTINCT ss_sold_date_sk) from store_sales;
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.55 s
> ----------------------------------------------------------------------------------------------
> INFO : Status: DAG finished successfully in 5.44 seconds
> INFO :
> INFO : Query Execution Summary
> INFO : ----------------------------------------------------------------------------------------------
> INFO : OPERATION                            DURATION
> INFO : ----------------------------------------------------------------------------------------------
> INFO : Compile Query                         102.02s
> INFO : Prepare Plan                            0.51s
> INFO : Get Query Coordinator (AM)              0.01s
> INFO : Submit Plan                             0.33s
> INFO : Start DAG                               0.56s
> INFO : Run DAG                                 5.44s
> INFO : ----------------------------------------------------------------------------------------------
> {noformat}
> The reason for the 102-second compilation time is that it ends up doing an
> "isEmptyPath" check for every partition path, which takes a lot of time in
> the compilation phase.
> If all the paths share the same parent directory, we could do a recursive
> listing of that parent just once (instead of listing each directory one at a
> time, sequentially) in cloud storage systems.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L158
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L121
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L101
> With a temporary hacky fix, it comes down to 2 seconds from 100+ seconds.
> {noformat}
> INFO : Dag name: select count(DISTINCT ss_sold_...store_sales (Stage-1)
> INFO : Status: Running (Executing on YARN cluster with App id application_1602500203747_0003)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 1.23 s
> ----------------------------------------------------------------------------------------------
> INFO : Status: DAG finished successfully in 1.20 seconds
> INFO :
> INFO : Query Execution Summary
> INFO : ----------------------------------------------------------------------------------------------
> INFO : OPERATION                            DURATION
> INFO : ----------------------------------------------------------------------------------------------
> INFO : Compile Query                           0.85s
> INFO : Prepare Plan                            0.17s
> INFO : Get Query Coordinator (AM)              0.00s
> INFO : Submit Plan                             0.03s
> INFO : Start DAG                               0.03s
> INFO : Run DAG                                 1.20s
> INFO : ----------------------------------------------------------------------------------------------
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
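
The idea in the issue description (one recursive listing of the common parent instead of one "isEmptyPath" call per partition) can be sketched roughly as follows. This is not the Hive patch and not the reporter's temporary fix: the class `SingleListingEmptyCheck` and the method `nonEmptyPartitions` are hypothetical names, the example paths are made up, and the listing is passed in as plain strings so the sketch runs without Hadoop on the classpath. In Hive the listing would presumably come from something like Hadoop's `FileSystem.listFiles(parent, true)`. The sketch also treats every listed file as data and does not filter hidden files.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Hive code: works out which partition directories
// are non-empty from the result of ONE recursive listing of their parent.
final class SingleListingEmptyCheck {

    // partitionDirs: partition directory paths, without a trailing slash.
    // listedFiles:   file paths returned by a single recursive listing.
    static Set<String> nonEmptyPartitions(List<String> partitionDirs,
                                          List<String> listedFiles) {
        Set<String> known = new LinkedHashSet<>(partitionDirs);
        Set<String> nonEmpty = new LinkedHashSet<>();
        for (String file : listedFiles) {
            // Walk up the ancestors of the file until one is a known partition.
            int slash = file.lastIndexOf('/');
            while (slash > 0) {
                String dir = file.substring(0, slash);
                if (known.contains(dir)) {
                    nonEmpty.add(dir);
                    break;
                }
                slash = file.lastIndexOf('/', slash - 1);
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        // Made-up paths for illustration only.
        List<String> partitions = Arrays.asList(
            "s3a://bucket/store_sales/ss_sold_date_sk=1",
            "s3a://bucket/store_sales/ss_sold_date_sk=2");
        List<String> listing = Arrays.asList(
            "s3a://bucket/store_sales/ss_sold_date_sk=1/part-00000");
        Set<String> nonEmpty = nonEmptyPartitions(partitions, listing);
        for (String p : partitions) {
            System.out.println(p + " empty=" + !nonEmpty.contains(p));
        }
    }
}
```

Any partition that never shows up as an ancestor of a listed file is empty, so the per-partition check becomes a set lookup and the number of storage round trips no longer grows with the number of partitions.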