Bitmap index implementation in Hive

poornima kumarasinghe Fri, 30 Dec 2016 07:04:25 -0800

Hi all,

I am researching on the effect of bitmap indexes on queries and I happened
to come across bitmap indexes on Hive. But I have a few areas that I need
help in understanding how the bitmaps are implemented and used in hive.


1) The bitmap column of the index table is always the same value
([1,2,4,8589934592,1,0]). How can the bitmap of two different values be the
same?

2) The bitmap index used query and the one that didn't use bitmap indexes
gave the following results for their explain statement. The blue colored
sections are the same steps. According to this explain statements, the
bitmap index usage part seems redundant. Can you please help me to clear
out these issues?

Regards,
Samadhi Poornima.

hive> EXPLAIN SELECT count(*) FROM truck_mileage_stage WHERE truckid =
"A30" AND driverid="A30";
OK
STAGE DEPENDENCIES:
  Stage-7 is a root stage
  Stage-4 depends on stages: Stage-7
  Stage-2 depends on stages: Stage-4
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-7
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:default__truck_mileage_stage_bittruck_index__
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:default__truck_mileage_stage_bittruck_index__
          TableScan
            alias: default__truck_mileage_stage_bittruck_index__
            filterExpr: ((truckid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
            Filter Operator
              predicate: ((truckid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
              Select Operator
                expressions: _bucketname (type: string), _offset (type:
bigint), _bitmaps (type: array<bigint>)
                outputColumnNames: _col0, _col1, _col2
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: string), _col1 (type: bigint)
                    1 _col0 (type: string), _col1 (type: bigint)

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: default__truck_mileage_stage_bitdriver_index__
            filterExpr: ((driverid = 'A30') and _bucketname is not null and
_offset is not null) (type: boolean)
            Filter Operator
              predicate: ((driverid = 'A30') and _bucketname is not null
and _offset is not null) (type: boolean)
              Select Operator
                expressions: _bucketname (type: string), _offset (type:
bigint), _bitmaps (type: array<bigint>)
                outputColumnNames: _col0, _col1, _col2
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: string), _col1 (type: bigint)
                    1 _col0 (type: string), _col1 (type: bigint)
                  outputColumnNames: _col0, _col2, _col4, _col5
                  Filter Operator
                    predicate: (not
EWAH_BITMAP_EMPTY(EWAH_BITMAP_AND(_col2,_col5))) (type: boolean)
                    Select Operator
                      expressions: _col0 (type: string), _col4 (type:
bigint)
                      outputColumnNames: _col0, _col4
                      Group By Operator
                        aggregations: collect_set(_col4)
                        keys: _col0 (type: string)
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Reduce Output Operator
                          key expressions: _col0 (type: string)
                          sort order: +
                          Map-reduce partition columns: _col0 (type: string)
                          value expressions: _col1 (type: array<bigint>)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: collect_set(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
    Move Operator
      files:
          hdfs directory: true
          destination:
file:/tmp/samadhik/392bef0d-0e26-448f-bdce-7621965b1ad0/hive_2016-12-30_09-56-12_455_2464238236026798959-1/-mr-10004

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: truck_mileage_stage
            filterExpr: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
            Statistics: Num rows: 1148 Data size: 229744 Basic stats:
COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
              Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
              Select Operator
                Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  Reduce Output Operator
                    sort order:
                    Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                    value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column
stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
            table:
                input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
                output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink


hive> explain SELECT count(*) FROM truck_mileage_stage WHERE truckid =
"A30" AND driverid="A30";
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: truck_mileage_stage
            filterExpr: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
            Statistics: Num rows: 1148 Data size: 229744 Basic stats:
COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((truckid = 'A30') and (driverid = 'A30')) (type:
boolean)
              Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
              Select Operator
                Statistics: Num rows: 287 Data size: 57436 Basic stats:
COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  Reduce Output Operator
                    sort order:
                    Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                    value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column
stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
            table:
                input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
                output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Bitmap index implementation in Hive

Reply via email to