Hi all, I am researching on the effect of bitmap indexes on queries and I happened to come across bitmap indexes on Hive. But I have a few areas that I need help in understanding how the bitmaps are implemented and used in hive.
1) The bitmap column of the index table is always the same value ([1,2,4,8589934592,1,0]). How can the bitmap of two different values be the same? 2) The bitmap index used query and the one that didn't use bitmap indexes gave the following results for their explain statement. The blue colored sections are the same steps. According to this explain statements, the bitmap index usage part seems redundant. Can you please help me to clear out these issues? Regards, Samadhi Poornima. hive> EXPLAIN SELECT count(*) FROM truck_mileage_stage WHERE truckid = "A30" AND driverid="A30"; OK STAGE DEPENDENCIES: Stage-7 is a root stage Stage-4 depends on stages: Stage-7 Stage-2 depends on stages: Stage-4 Stage-1 depends on stages: Stage-2 Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-7 Map Reduce Local Work Alias -> Map Local Tables: $hdt$_0:default__truck_mileage_stage_bittruck_index__ Fetch Operator limit: -1 Alias -> Map Local Operator Tree: $hdt$_0:default__truck_mileage_stage_bittruck_index__ TableScan alias: default__truck_mileage_stage_bittruck_index__ filterExpr: ((truckid = 'A30') and _bucketname is not null and _offset is not null) (type: boolean) Filter Operator predicate: ((truckid = 'A30') and _bucketname is not null and _offset is not null) (type: boolean) Select Operator expressions: _bucketname (type: string), _offset (type: bigint), _bitmaps (type: array<bigint>) outputColumnNames: _col0, _col1, _col2 HashTable Sink Operator keys: 0 _col0 (type: string), _col1 (type: bigint) 1 _col0 (type: string), _col1 (type: bigint) Stage: Stage-4 Map Reduce Map Operator Tree: TableScan alias: default__truck_mileage_stage_bitdriver_index__ filterExpr: ((driverid = 'A30') and _bucketname is not null and _offset is not null) (type: boolean) Filter Operator predicate: ((driverid = 'A30') and _bucketname is not null and _offset is not null) (type: boolean) Select Operator expressions: _bucketname (type: string), _offset (type: bigint), _bitmaps (type: array<bigint>) outputColumnNames: _col0, _col1, _col2 Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: string), _col1 (type: bigint) 1 _col0 (type: string), _col1 (type: bigint) outputColumnNames: _col0, _col2, _col4, _col5 Filter Operator predicate: (not EWAH_BITMAP_EMPTY(EWAH_BITMAP_AND(_col2,_col5))) (type: boolean) Select Operator expressions: _col0 (type: string), _col4 (type: bigint) outputColumnNames: _col0, _col4 Group By Operator aggregations: collect_set(_col4) keys: _col0 (type: string) mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) value expressions: _col1 (type: array<bigint>) Local Work: Map Reduce Local Work Reduce Operator Tree: Group By Operator aggregations: collect_set(VALUE._col0) keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0, _col1 File Output Operator compressed: false table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-2 Move Operator files: hdfs directory: true destination: file:/tmp/samadhik/392bef0d-0e26-448f-bdce-7621965b1ad0/hive_2016-12-30_09-56-12_455_2464238236026798959-1/-mr-10004 Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: truck_mileage_stage filterExpr: ((truckid = 'A30') and (driverid = 'A30')) (type: boolean) Statistics: Num rows: 1148 Data size: 229744 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((truckid = 'A30') and (driverid = 'A30')) (type: boolean) Statistics: Num rows: 287 Data size: 57436 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 287 Data size: 57436 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink hive> explain SELECT count(*) FROM truck_mileage_stage WHERE truckid = "A30" AND driverid="A30"; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: truck_mileage_stage filterExpr: ((truckid = 'A30') and (driverid = 'A30')) (type: boolean) Statistics: Num rows: 1148 Data size: 229744 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((truckid = 'A30') and (driverid = 'A30')) (type: boolean) Statistics: Num rows: 287 Data size: 57436 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 287 Data size: 57436 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink