Re: Query Using Stats

Edward Capriolo Fri, 16 May 2014 21:00:07 -0700

Hive does not know that the values of column `seconds` and partition
`range` or related.


Hive can only use the WHERE clause to remove partitions that do not match
the range criteria. All the data inside the partition is not ordered in any
way so the minimum seconds and maximum seconds could be in any part of any
file.


On Fri, May 16, 2014 at 8:35 AM, Bryan Jeffrey <bryan.jeff...@gmail.com>wrote:

> All,
>
> I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.
>
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
> -n root --hiveconf hive.compute.query.using.stats=true -e "select
> min(seconds), max(seconds), range from data where range > 1400204700 group
> by range"
>
> 'range' above is our partition.  I would expect that this would provide a
> reasonably fast response time by simply looking at the metadata for each
> file in a given partition (maybe one mapper per range).  Instead we're
> seeing 140+ mappers, and the query takes a long time.
>
> Here is the explain plan:
>
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
> -n root --hiveconf hive.compute.query.using.stats=true -e "explain select
> min(seconds), max(seconds), range from data where range > 1400204700
> group by range"
> scan complete in 4ms
> Connecting to jdbc:hive2://server:10002/database
> Connected to: Apache Hive (version 0.13.0)
> Driver: Hive JDBC (version 0.13.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
>
> +---------------------------------------------------------------------------------------------------------------------+
> |                                                       Explain
>                                             |
>
> +---------------------------------------------------------------------------------------------------------------------+
> | STAGE DEPENDENCIES:
>                                             |
> |   Stage-1 is a root stage
>                                             |
> |   Stage-0 is a root stage
>                                             |
> |
>                                             |
> | STAGE PLANS:
>                                            |
> |   Stage: Stage-1
>                                            |
> |     Map Reduce
>                                            |
> |       Map Operator Tree:
>                                            |
> |           TableScan
>                                             |
> |             alias: data
>                                            |
> |             Statistics: Num rows: 4860251901 Data size: 38882015268
> Basic stats: PARTIAL Column stats: NONE         |
> |             Select Operator
>                                             |
> |               expressions: range (type: int), seconds (type: bigint)
>                                   |
> |               outputColumnNames: range, seconds
>                                    |
> |               Statistics: Num rows: 4860251901 Data size: 38882015268
> Basic stats: PARTIAL Column stats: NONE       |
> |               Group By Operator
>                                             |
> |                 aggregations: min(seconds), max(end_time_seconds)
>                                    |
> |                 keys: range (type: int)
>                                             |
> |                 mode: hash
>                                            |
> |                 outputColumnNames: _col0, _col1, _col2
>                                            |
> |                 Statistics: Num rows: 4860251901 Data size: 38882015268
> Basic stats: PARTIAL Column stats: NONE     |
> |                 Reduce Output Operator
>                                            |
> |                   key expressions: _col0 (type: int)
>                                            |
> |                   sort order: +
>                                             |
> |                   Map-reduce partition columns: _col0 (type: int)
>                                             |
> |                   Statistics: Num rows: 4860251901 Data size:
> 38882015268 Basic stats: COMPLETE Column stats: NONE  |
> |                   value expressions: _col1 (type: bigint), _col2 (type:
> bigint)                                     |
> |       Reduce Operator Tree:
>                                             |
> |         Group By Operator
>                                             |
> |           aggregations: min(VALUE._col0), max(VALUE._col1)
>                                            |
> |           keys: KEY._col0 (type: int)
>                                             |
> |           mode: mergepartial
>                                            |
> |           outputColumnNames: _col0, _col1, _col2
>                                            |
> |           Statistics: Num rows: 2430125950 Data size: 19441007630 Basic
> stats: COMPLETE Column stats: NONE          |
> |           Select Operator
>                                             |
> |             expressions: _col1 (type: bigint), _col2 (type: bigint),
> _col0 (type: int)                              |
> |             outputColumnNames: _col0, _col1, _col2
>                                            |
> |             Statistics: Num rows: 2430125950 Data size: 19441007630
> Basic stats: COMPLETE Column stats: NONE        |
> |             File Output Operator
>                                            |
> |               compressed: false
>                                             |
> |               Statistics: Num rows: 2430125950 Data size: 19441007630
> Basic stats: COMPLETE Column stats: NONE      |
> |               table:
>                                            |
> |                   input format: org.apache.hadoop.mapred.TextInputFormat
>                                            |
> |                   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>         |
> |                   serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 |
> |
>                                             |
> |   Stage: Stage-0
>                                            |
> |     Fetch Operator
>                                            |
> |       limit: -1
>                                             |
> |
>                                             |
>
> +---------------------------------------------------------------------------------------------------------------------+
> 50 rows selected (0.417 seconds)
> Beeline version 0.13.0 by Apache Hive
> Closing: 0: jdbc:hive2://viper:10002/intrepid
>
> Can anyone enlighten me as to how this could be optimized?
>
> Regards,
>
> Bryan Jeffrey
>
>
>

Re: Query Using Stats

Reply via email to