Re: Query Using Stats

Prasanth Jayachandran Fri, 16 May 2014 17:48:07 -0700

Bryan,

The flag you are looking for is hive.compute.query.using.stats. By default this 
optimization is disabled. You might need to enable it to use it. Also the 
min/max/sum metadata are not looked up from the file but instead from 
metastore. Although file formats like ORC contains stats, they are not used to 
answer metadata only queries. Hive considers metastore as the only source of 
truth for answering such queries. You can look at this jira for further details 
https://issues.apache.org/jira/browse/HIVE-5483


Thanks
Prasanth Jayachandran

On May 16, 2014, at 5:35 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> All,
> 
> I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.
> 
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database -n 
> root --hiveconf hive.compute.query.using.stats=true -e "select min(seconds), 
> max(seconds), range from data where range > 1400204700 group by range"
> 
> 'range' above is our partition.  I would expect that this would provide a 
> reasonably fast response time by simply looking at the metadata for each file 
> in a given partition (maybe one mapper per range).  Instead we're seeing 140+ 
> mappers, and the query takes a long time.
> 
> Here is the explain plan:
> 
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database -n 
> root --hiveconf hive.compute.query.using.stats=true -e "explain select 
> min(seconds), max(seconds), range from data where range > 1400204700 group by 
> range"
> scan complete in 4ms
> Connecting to jdbc:hive2://server:10002/database
> Connected to: Apache Hive (version 0.13.0)
> Driver: Hive JDBC (version 0.13.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +---------------------------------------------------------------------------------------------------------------------+
> |                                                       Explain               
>                                         |
> +---------------------------------------------------------------------------------------------------------------------+
> | STAGE DEPENDENCIES:                                                         
>                                         |
> |   Stage-1 is a root stage                                                   
>                                         |
> |   Stage-0 is a root stage                                                   
>                                         |
> |                                                                             
>                                         |
> | STAGE PLANS:                                                                
>                                         |
> |   Stage: Stage-1                                                            
>                                         |
> |     Map Reduce                                                              
>                                         |
> |       Map Operator Tree:                                                    
>                                         |
> |           TableScan                                                         
>                                         |
> |             alias: data                                                     
>                                        |
> |             Statistics: Num rows: 4860251901 Data size: 38882015268 Basic 
> stats: PARTIAL Column stats: NONE         |
> |             Select Operator                                                 
>                                         |
> |               expressions: range (type: int), seconds (type: bigint)        
>                                |
> |               outputColumnNames: range, seconds                             
>                                |
> |               Statistics: Num rows: 4860251901 Data size: 38882015268 Basic 
> stats: PARTIAL Column stats: NONE       |
> |               Group By Operator                                             
>                                         |
> |                 aggregations: min(seconds), max(end_time_seconds)           
>                                |
> |                 keys: range (type: int)                                     
>                                         |
> |                 mode: hash                                                  
>                                         |
> |                 outputColumnNames: _col0, _col1, _col2                      
>                                         |
> |                 Statistics: Num rows: 4860251901 Data size: 38882015268 
> Basic stats: PARTIAL Column stats: NONE     |
> |                 Reduce Output Operator                                      
>                                         |
> |                   key expressions: _col0 (type: int)                        
>                                         |
> |                   sort order: +                                             
>                                         |
> |                   Map-reduce partition columns: _col0 (type: int)           
>                                         |
> |                   Statistics: Num rows: 4860251901 Data size: 38882015268 
> Basic stats: COMPLETE Column stats: NONE  |
> |                   value expressions: _col1 (type: bigint), _col2 (type: 
> bigint)                                     |
> |       Reduce Operator Tree:                                                 
>                                         |
> |         Group By Operator                                                   
>                                         |
> |           aggregations: min(VALUE._col0), max(VALUE._col1)                  
>                                         |
> |           keys: KEY._col0 (type: int)                                       
>                                         |
> |           mode: mergepartial                                                
>                                         |
> |           outputColumnNames: _col0, _col1, _col2                            
>                                         |
> |           Statistics: Num rows: 2430125950 Data size: 19441007630 Basic 
> stats: COMPLETE Column stats: NONE          |
> |           Select Operator                                                   
>                                         |
> |             expressions: _col1 (type: bigint), _col2 (type: bigint), _col0 
> (type: int)                              |
> |             outputColumnNames: _col0, _col1, _col2                          
>                                         |
> |             Statistics: Num rows: 2430125950 Data size: 19441007630 Basic 
> stats: COMPLETE Column stats: NONE        |
> |             File Output Operator                                            
>                                         |
> |               compressed: false                                             
>                                         |
> |               Statistics: Num rows: 2430125950 Data size: 19441007630 Basic 
> stats: COMPLETE Column stats: NONE      |
> |               table:                                                        
>                                         |
> |                   input format: org.apache.hadoop.mapred.TextInputFormat    
>                                         |
> |                   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                    
>      |
> |                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
>                                         |
> |                                                                             
>                                         |
> |   Stage: Stage-0                                                            
>                                         |
> |     Fetch Operator                                                          
>                                         |
> |       limit: -1                                                             
>                                         |
> |                                                                             
>                                         |
> +---------------------------------------------------------------------------------------------------------------------+
> 50 rows selected (0.417 seconds)
> Beeline version 0.13.0 by Apache Hive
> Closing: 0: jdbc:hive2://viper:10002/intrepid
> 
> Can anyone enlighten me as to how this could be optimized?
> 
> Regards,
> 
> Bryan Jeffrey
> 
> 


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Query Using Stats

Reply via email to