[ https://issues.apache.org/jira/browse/HIVE-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Chung reassigned HIVE-24948:
-----------------------------------


> Reducing overhead of OrcInputFormat getSplits with bucket pruning
> -----------------------------------------------------------------
>
>                 Key: HIVE-24948
>                 URL: https://issues.apache.org/jira/browse/HIVE-24948
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC, Query Processor, Tez
>            Reporter: Eugene Chung
>            Assignee: Eugene Chung
>            Priority: Major
>             Fix For: 4.0.0
>
>
> The summarized flow of generating input splits at the Tez AM (driven by 
> HiveSplitGenerator.initialize()) is as follows:
>  # Perform dynamic partition pruning
>  # Get the list of InputSplits by calling InputFormat.getSplits()
> https://github.com/apache/hive/blob/624f62aadc08577cafaa299cfcf17c71fa6cdb3a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L260-L260
>  # Perform bucket pruning on the list above, if possible (a simplified 
> sketch of this flow follows the list)
> https://github.com/apache/hive/blob/624f62aadc08577cafaa299cfcf17c71fa6cdb3a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L299-L301
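> Below is a minimal, self-contained sketch of this flow (stand-in types and 
> placeholder file names, not Hive's actual classes) to show where the wasted 
> work happens: step 2 materializes a split for every ORC file before step 3 
> throws most of them away.
> {noformat}
> import java.util.ArrayList;
> import java.util.BitSet;
> import java.util.List;
> 
> // Sketch of the current flow; 'partitions' stands for the result of
> // step 1 (dynamic partition pruning).
> class SplitFlowSketch {
>     static List<String> generateSplits(List<String> partitions,
>                                        BitSet prunedBuckets, int numBuckets) {
>         // 2. build one split per ORC file in every surviving partition
>         List<String> splits = new ArrayList<>();
>         for (String p : partitions) {
>             for (int b = 0; b < numBuckets; b++) {
>                 splits.add(p + "/bucket_" + b);  // placeholder for an OrcSplit
>             }
>         }
>         // 3. bucket pruning happens only after the full list exists
>         if (prunedBuckets != null) {
>             splits.removeIf(s -> !prunedBuckets.get(
>                 Integer.parseInt(s.substring(s.lastIndexOf('_') + 1))));
>         }
>         return splits;
>     }
> }{noformat}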
> But I observed that step 2, building the list of InputSplits, can introduce 
> significant overhead when the inputs are ORC files in HDFS.
> For example, consider an ORC table T partitioned by 'log_date', where each 
> partition is bucketed by a column 'q'. There are 240 buckets in each 
> partition and the size of each bucket (ORC file) is, say, 100MB.
> The SQL is like this:
>  
> {noformat}
> set hive.tez.bucket.pruning=true;
> select q, count(*) from T
> where log_date between '2020-01-01' and '2020-06-30'
>     and q = 'foobar'
> group by q;{noformat}
> It means there are 240 * 182 (days) = 43680 buckets, i.e. ORC files, in the 
> input path, but thanks to bucket pruning, only 182 buckets (one per 
> partition) need to be processed.
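> To make the arithmetic concrete, here is a small self-contained example. The 
> bucket-id computation uses the classic (hash & Integer.MAX_VALUE) % 
> numBuckets scheme purely as an illustration; the exact hash function depends 
> on the table's bucketing version.
> {noformat}
> public class BucketMathSketch {
>     public static void main(String[] args) {
>         int numBuckets = 240;
>         int days = 182;  // 2020-01-01 .. 2020-06-30 inclusive (2020 is a leap year)
>         System.out.println("total ORC files: " + numBuckets * days);  // 43680
> 
>         // a point predicate on the bucketing column selects exactly one
>         // bucket id, hence one file per partition
>         int bucketId = ("foobar".hashCode() & Integer.MAX_VALUE) % numBuckets;
>         System.out.println("files needed: " + days + " (bucket " + bucketId
>             + " in each partition)");
>     }
> }{noformat}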
>  
> In my company's environment, the whole processing time of the SQL was roughly 
> 5 minutes. However, I checked that it took more than 3 minutes just to build 
> the list of OrcSplits for the 43680 ORC files! The logs with 
> tez.am.log.level=DEBUG showed the following:
>  
> {noformat}
> 2021-03-25 01:21:31,850 [DEBUG] [InputInitializer {Map 1} #0] 
> |orc.OrcInputFormat|: getSplits started
> ...
> 2021-03-25 01:24:51,435 [DEBUG] [InputInitializer {Map 1} #0] 
> |orc.OrcInputFormat|: getSplits finished
> 2021-03-25 01:24:51,444 [INFO] [InputInitializer {Map 1} #0] 
> |io.HiveInputFormat|: number of splits 43680
> 2021-03-25 01:24:51,444 [DEBUG] [InputInitializer {Map 1} #0] 
> |log.PerfLogger|: </PERFLOG method=getSplits start=1616602891776 
> end=1616602891776 duration=199668 
> from=org.apache.hadoop.hive.ql.io.HiveInputFormat>
> ...
> 2021-03-25 01:26:03,385 [INFO] [Dispatcher thread {Central}] 
> |app.DAGAppMaster|: DAG completed, dagId=dag_1615862187190_731117_1, 
> dagState=SUCCEEDED{noformat}
>  
> With bucket pruning, building the whole list of ORC files is not necessary. 
> Therefore, I suggest changing the flow as follows (see the sketch after this 
> list):
>  # Perform dynamic partition pruning
>  # Get the list of InputSplits by calling InputFormat.getSplits()
>  ## OrcInputFormat.getSplits() returns the bucket-pruned list if the BitSet 
> from FixedBucketPruningOptimizer exists
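> A minimal sketch of the proposed check, assuming bucket file names like 
> 000057_0 and a BitSet of surviving bucket ids (illustrative only, not the 
> actual OrcInputFormat code):
> {noformat}
> import java.util.ArrayList;
> import java.util.BitSet;
> import java.util.List;
> 
> class PrunedGetSplitsSketch {
>     // Consult the pruning BitSet inside getSplits(), so files of
>     // non-matching buckets never become OrcSplits at all.
>     static List<String> getSplits(List<String> orcFiles, BitSet includedBuckets) {
>         List<String> splits = new ArrayList<>();
>         for (String file : orcFiles) {
>             if (includedBuckets == null || includedBuckets.get(bucketId(file))) {
>                 splits.add(file);
>             }
>         }
>         return splits;
>     }
> 
>     // assumes names like ".../000057_0" => bucket 57
>     static int bucketId(String file) {
>         String name = file.substring(file.lastIndexOf('/') + 1);
>         return Integer.parseInt(name.split("_")[0]);
>     }
> }{noformat}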



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
