RE: Hive indexing optimization

Bennie Leo Mon, 29 Jun 2015 17:26:43 -0700

I've attached the output. Thanks.
  B
                    
 Subject: Re: Hive indexing optimization
From: jpullokka...@hortonworks.com
To: user@hive.apache.org
Date: Mon, 29 Jun 2015 19:17:44 +0000







Could you post explain extended output?





From: Bennie Leo <tben...@hotmail.com>

Reply-To: "user@hive.apache.org" <user@hive.apache.org>

Date: Monday, June 29, 2015 at 10:35 AM

To: "user@hive.apache.org" <user@hive.apache.org>

Subject: RE: Hive indexing optimization







Here is the explain output:

 

STAGE PLANS:

  Stage: Stage-1

    Tez

      Edges:

        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)

      Vertices:

        Map 1 

            Map Operator Tree:

                TableScan

                  alias: logontable

                  filterExpr: isipv4(ip) (type: boolean)

                  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL 
Column stats: NONE

                  Filter Operator

                    predicate: isipv4(ip) (type: boolean)

                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

                    Select Operator

                      expressions: ip (type: bigint)

                      outputColumnNames: _col0

                      Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

                      Reduce Output Operator

                        sort order: 

                        Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

                        value expressions: _col0 (type: bigint)

        Map 3 

            Map Operator Tree:

                TableScan

                  alias: ipv4geotable

                  Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE

                  Select Operator

                    expressions: startip (type: bigint), endip (type: bigint), 
country (type: string)

                    outputColumnNames: _col0, _col1, _col2

                    Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE

                    Reduce Output Operator

                      sort order: 

                      Statistics: Num rows: 41641243 Data size: 5144651200 
Basic stats: COMPLETE Column stats: NONE

                      value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)

        Reducer 2 

            Reduce Operator Tree:

              Join Operator

                condition map:

                     Left Outer Join0 to 1

                condition expressions:

                  0 {VALUE._col0} 

                  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}

                filter predicates:

                  0 {isipv4(VALUE._col0)}

                  1 

                outputColumnNames: _col0, _col1, _col2, _col3

                Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE

                Filter Operator

                  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)

                  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

                  Select Operator

                    expressions: _col0 (type: bigint), _col3 (type: string)

                    outputColumnNames: _col0, _col1

                    Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

                    File Output Operator

                      compressed: false

                      Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

                      table:

                          input format: org.apache.hadoop.mapred.TextInputFormat

                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: -1


 

Thank you,

B

 

> Subject: Re: Hive indexing optimization

> From: jpullokka...@hortonworks.com

> To: user@hive.apache.org

> CC: tben...@hotmail.com

> Date: Sat, 27 Jun 2015 16:02:08 +0000

> 

> "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been

> rewritten as a scan against index table.

> 

> BitMap Indexes seems to support inequalities (<=, <, >=).

> 

> Post the explain plan.

> 

> On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:

> 

> >Hi,

> >

> >Hive indexes won¹t really help you speed up that query right now, because

> >of the plan it generates due to the <= clauses.

> >

> >> CREATETABLE ipv4table

> >> AS

> >> SELECT logon.IP, ipv4.Country

> >> FROM

> >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon

> >> LEFT OUTER JOIN

> >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON

> >> isIpv4(logon.IP)

> >> WHERE ipv4.StartIp <=logon.IP AND logon.IP <= ipv4.EndIp;

> >

> >That¹s a cross-product join, which can¹t be distributed at all & will take

> >forever, even if you use Tez/LLAP.

> >

> >Range join queries have partial distribution rewrites, but AFAIK none of

> >them use filter indexes in hive.

> >

> >But before I suggest a rewrite, can you post the ³explain <query>;² for

> >the above query, so that I can check if this is indeed producing a

> >cross-product + filter?

> >

> >Cheers,

> >Gopal

> >

> >

> >

> >

> >

> >

> >

> >

>

ABSTRACT SYNTAX TREE:
  
TOK_QUERY
   TOK_FROM
      TOK_LEFTOUTERJOIN
         TOK_SUBQUERY
            TOK_QUERY
               TOK_FROM
                  TOK_TABREF
                     TOK_TABNAME
                        logontable
               TOK_INSERT
                  TOK_DESTINATION
                     TOK_DIR
                        TOK_TMP_FILE
                  TOK_SELECT
                     TOK_SELEXPR
                        TOK_ALLCOLREF
                  TOK_WHERE
                     TOK_FUNCTION
                        isIpv4
                        TOK_TABLE_OR_COL
                           IP
            logon
         TOK_SUBQUERY
            TOK_QUERY
               TOK_FROM
                  TOK_TABREF
                     TOK_TABNAME
                        ipv4geotable
               TOK_INSERT
                  TOK_DESTINATION
                     TOK_DIR
                        TOK_TMP_FILE
                  TOK_SELECT
                     TOK_SELEXPR
                        TOK_TABLE_OR_COL
                           StartIp
                     TOK_SELEXPR
                        TOK_TABLE_OR_COL
                           EndIp
                     TOK_SELEXPR
                        TOK_TABLE_OR_COL
                           Country
            ipv4
         TOK_FUNCTION
            isIpv4
            .
               TOK_TABLE_OR_COL
                  logon
               IP
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  logon
               IP
         TOK_SELEXPR
            .            
            TOK_TABLE_OR_COL
               ipv4
            Country
         
      TOK_WHERE
         AND
            <=
               .
                  TOK_TABLE_OR_COL
                     ipv4
                  StartIp               
               .
                  TOK_TABLE_OR_COL
                     logon
                  IP
            <=               
               .
                  TOK_TABLE_OR_COL
                     logon
                  IP
               .
                  TOK_TABLE_OR_COL
                     ipv4
                  EndIp


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
      DagName: hdp_20150629235151_e4606465-cfaa-4072-a2d9-f577c3da9ab9:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: ipv4geotable
                  Statistics: Num rows: 41641243 Data size: 5144651200  Basic 
stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Select Operator
                    expressions: startip (type: bigint), endip (type: bigint), 
Country (type: string)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 41641243 Data size: 5144651200  Basic 
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      sort order: 
                      Statistics: Num rows: 41641243 Data size: 5144651200  
Basic stats: COMPLETE Column stats: NONE
                      tag: 1
                      value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)
            Path -> Alias:-
            cadoop.hive.serde2.lazy.LazySimpleSerDe
                
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      EXTERNAL TRUE
                      bucket_count -1
                      columns startip,endip,Country
                      columns.comments 
                      columns.types bigint:bigint:string
                      field.delim ,
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location -
                      name default.ipv4geotable
                      serialization.ddl struct ipv4geotable { i64 startip, i64 
endip, string Country}
                      serialization.format ,
                      serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      skip.header.line.count 1
                      transient_lastDdlTime 1435621873
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.ipv4geotable
                  name: default.ipv4geotable
            Truncated Path -> Alias: -
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: logontable
                  filterExpr: isipv4(ip) (type: boolean)
                  Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE 
Column stats: NONE
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: isipv4(ip) (type: boolean)
                    Statistics: Num rows: 2 Data size: 220 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ip (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2 Data size: 220 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        sort order: 
                        Statistics: Num rows: 2 Data size: 220 Basic stats: 
COMPLETE Column stats: NONE
                        tag: 0
                        value expressions: _col0 (type: bigint)
            Path -> Alias:-
            Path -> Partition:-
                Partition
                  base file name: logontable
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    bucket_count -1
                    columns ip
                    columns.comments 
                    columns.types bigint
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    location -
                    name default.logontable
                    serialization.ddl struct logontable {i64 ip}
                    serialization.format 1
                    serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    transient_lastDdlTime 1435606340
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    properties:
                      bucket_count -1
                      columns ip
                      columns.comments 
                      columns.types bigint
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      location -
                      name default.logontable
                      serialization.ddl struct logontable {i64 ip}
                      serialization.format 1
                      serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      transient_lastDdlTime 1435606340
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.logontable
                  name: default.logontable
            Truncated Path -> Alias:
              /logontable [logontable]
        Reducer 2 
            Needs Tagging: true
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Left Outer Join0 to 1
                condition expressions:
                  0 {VALUE._col0}
                  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                filter mappings:
                  0 [1, 1]
                filter predicates:
                  0 {isipv4(VALUE._col0)}
                  1 
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE
                Filter Operator
                  isSamplingPred: false
                  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)
                  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: expressions: _col0 (type: bigint), _col3 
(type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      directory: -
                      NumFilesPerFileSink: 1
                      Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
                      Stats Publishing Key Prefix: -
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          properties:
                            columns _col0,_col1
                            columns.types bigint:string
                            escape.delim \
                            hive.serialization.extend.nesting.levels true
                            serialization.format 1
                            serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      TotalFiles: 1
                      GatherStats: false
                      MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1

RE: Hive indexing optimization

Reply via email to