All, I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.
/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database -n root --hiveconf hive.compute.query.using.stats=true -e "select min(seconds), max(seconds), range from data where range > 1400204700 group by range"

'range' above is our partition column. I would expect this to return quickly by simply reading the metadata for each file in a given partition (perhaps one mapper per range). Instead we're seeing 140+ mappers, and the query takes a long time.

Here is the explain plan:

/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database -n root --hiveconf hive.compute.query.using.stats=true -e "explain select min(seconds), max(seconds), range from data where range > 1400204700 group by range"

scan complete in 4ms
Connecting to jdbc:hive2://server:10002/database
Connected to: Apache Hive (version 0.13.0)
Driver: Hive JDBC (version 0.13.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------------------------------------------------------------------------------------------------------------+
|                                                       Explain                                                       |
+---------------------------------------------------------------------------------------------------------------------+
| STAGE DEPENDENCIES:                                                                                                 |
|   Stage-1 is a root stage                                                                                           |
|   Stage-0 is a root stage                                                                                           |
|                                                                                                                     |
| STAGE PLANS:                                                                                                        |
|   Stage: Stage-1                                                                                                    |
|     Map Reduce                                                                                                      |
|       Map Operator Tree:                                                                                            |
|           TableScan                                                                                                 |
|             alias: data                                                                                             |
|             Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: PARTIAL Column stats: NONE         |
|             Select Operator                                                                                         |
|               expressions: range (type: int), seconds (type: bigint)                                                |
|               outputColumnNames: range, seconds                                                                     |
|               Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: PARTIAL Column stats: NONE       |
|               Group By Operator                                                                                     |
|                 aggregations: min(seconds), max(end_time_seconds)                                                   |
|                 keys: range (type: int)                                                                             |
|                 mode: hash                                                                                          |
|                 outputColumnNames: _col0, _col1, _col2                                                              |
|                 Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: PARTIAL Column stats: NONE     |
|                 Reduce Output Operator                                                                              |
|                   key expressions: _col0 (type: int)                                                                |
|                   sort order: +                                                                                     |
|                   Map-reduce partition columns: _col0 (type: int)                                                   |
|                   Statistics: Num rows: 4860251901 Data size: 38882015268 Basic stats: COMPLETE Column stats: NONE  |
|                   value expressions: _col1 (type: bigint), _col2 (type: bigint)                                     |
|       Reduce Operator Tree:                                                                                         |
|         Group By Operator                                                                                           |
|           aggregations: min(VALUE._col0), max(VALUE._col1)                                                          |
|           keys: KEY._col0 (type: int)                                                                               |
|           mode: mergepartial                                                                                        |
|           outputColumnNames: _col0, _col1, _col2                                                                    |
|           Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE Column stats: NONE          |
|           Select Operator                                                                                           |
|             expressions: _col1 (type: bigint), _col2 (type: bigint), _col0 (type: int)                              |
|             outputColumnNames: _col0, _col1, _col2                                                                  |
|             Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE Column stats: NONE        |
|             File Output Operator                                                                                    |
|               compressed: false                                                                                     |
|               Statistics: Num rows: 2430125950 Data size: 19441007630 Basic stats: COMPLETE Column stats: NONE      |
|               table:                                                                                                |
|                   input format: org.apache.hadoop.mapred.TextInputFormat                                            |
|                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                         |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                         |
|                                                                                                                     |
|   Stage: Stage-0                                                                                                    |
|     Fetch Operator                                                                                                  |
|       limit: -1                                                                                                     |
|                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------+
50 rows selected (0.417 seconds)
Beeline version 0.13.0 by Apache Hive
Closing: 0: jdbc:hive2://viper:10002/intrepid

Can anyone enlighten me as to how this could be optimized?

Regards,

Bryan Jeffrey
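P.S. For context on the stats path we expected to hit: below is a sketch of the ANALYZE TABLE statements that populate the basic and column-level statistics that hive.compute.query.using.stats reads, assuming the table `data` and partition column `range` from above. This is illustrative only; how fully partition-level column statistics are supported varies by Hive version, so it has not been verified against 0.13 specifically.

```sql
-- Gather basic statistics (row counts, sizes) for every partition of `data`.
-- The explain plan's "Basic stats: PARTIAL" suggests some partitions lack these.
ANALYZE TABLE data PARTITION (range) COMPUTE STATISTICS;

-- Gather per-column statistics (including min/max) for the aggregated column.
-- The plan's "Column stats: NONE" means Hive has no min/max metadata to use,
-- so it falls back to scanning the data.
ANALYZE TABLE data PARTITION (range) COMPUTE STATISTICS FOR COLUMNS seconds;
```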