I've attached the output. Thanks. B Subject: Re: Hive indexing optimization From: jpullokka...@hortonworks.com To: user@hive.apache.org Date: Mon, 29 Jun 2015 19:17:44 +0000
Could you post explain extended output? From: Bennie Leo <tben...@hotmail.com> Reply-To: "user@hive.apache.org" <user@hive.apache.org> Date: Monday, June 29, 2015 at 10:35 AM To: "user@hive.apache.org" <user@hive.apache.org> Subject: RE: Hive indexing optimization Here is the explain output: STAGE PLANS: Stage: Stage-1 Tez Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) Vertices: Map 1 Map Operator Tree: TableScan alias: logontable filterExpr: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Select Operator expressions: ip (type: bigint) outputColumnNames: _col0 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: _col0 (type: bigint) Map 3 Map Operator Tree: TableScan alias: ipv4geotable Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: startip (type: bigint), endip (type: bigint), country (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: string) Reducer 2 Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} filter predicates: 0 {isipv4(VALUE._col0)} 1 outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 43281312 Data size: 5020632576 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: boolean) Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: bigint), _col3 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Thank you, B > Subject: Re: Hive indexing optimization > From: jpullokka...@hortonworks.com > To: user@hive.apache.org > CC: tben...@hotmail.com > Date: Sat, 27 Jun 2015 16:02:08 +0000 > > "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been > rewritten as a scan against index table. > > BitMap Indexes seems to support inequalities (<=, <, >=). > > Post the explain plan. > > On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote: > > >Hi, > > > >Hive indexes won¹t really help you speed up that query right now, because > >of the plan it generates due to the <= clauses. > > > >> CREATETABLE ipv4table > >> AS > >> SELECT logon.IP, ipv4.Country > >> FROM > >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon > >> LEFT OUTER JOIN > >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON > >> isIpv4(logon.IP) > >> WHERE ipv4.StartIp <=logon.IP AND logon.IP <= ipv4.EndIp; > > > >That¹s a cross-product join, which can¹t be distributed at all & will take > >forever, even if you use Tez/LLAP. > > > >Range join queries have partial distribution rewrites, but AFAIK none of > >them use filter indexes in hive. > > > >But before I suggest a rewrite, can you post the ³explain <query>;² for > >the above query, so that I can check if this is indeed producing a > >cross-product + filter? > > > >Cheers, > >Gopal > > > > > > > > > > > > > > > > >
ABSTRACT SYNTAX TREE: TOK_QUERY TOK_FROM TOK_LEFTOUTERJOIN TOK_SUBQUERY TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME logontable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF TOK_WHERE TOK_FUNCTION isIpv4 TOK_TABLE_OR_COL IP logon TOK_SUBQUERY TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME ipv4geotable TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL StartIp TOK_SELEXPR TOK_TABLE_OR_COL EndIp TOK_SELEXPR TOK_TABLE_OR_COL Country ipv4 TOK_FUNCTION isIpv4 . TOK_TABLE_OR_COL logon IP TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR . TOK_TABLE_OR_COL logon IP TOK_SELEXPR . TOK_TABLE_OR_COL ipv4 Country TOK_WHERE AND <= . TOK_TABLE_OR_COL ipv4 StartIp . TOK_TABLE_OR_COL logon IP <= . TOK_TABLE_OR_COL logon IP . TOK_TABLE_OR_COL ipv4 EndIp STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Tez Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) DagName: hdp_20150629235151_e4606465-cfaa-4072-a2d9-f577c3da9ab9:1 Vertices: Map 1 Map Operator Tree: TableScan alias: ipv4geotable Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE GatherStats: false Select Operator expressions: startip (type: bigint), endip (type: bigint), Country (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE tag: 1 value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: string) Path -> Alias:- cadoop.hive.serde2.lazy.LazySimpleSerDe input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat properties: EXTERNAL TRUE bucket_count -1 columns startip,endip,Country columns.comments columns.types bigint:bigint:string field.delim , file.inputformat org.apache.hadoop.mapred.TextInputFormat file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat location - name default.ipv4geotable serialization.ddl struct ipv4geotable { i64 startip, i64 endip, string Country} serialization.format , serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe skip.header.line.count 1 transient_lastDdlTime 1435621873 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: default.ipv4geotable name: default.ipv4geotable Truncated Path -> Alias: - Map 3 Map Operator Tree: TableScan alias: logontable filterExpr: isipv4(ip) (type: boolean) Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE Column stats: NONE GatherStats: false Filter Operator isSamplingPred: false predicate: isipv4(ip) (type: boolean) Statistics: Num rows: 2 Data size: 220 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ip (type: bigint) outputColumnNames: _col0 Statistics: Num rows: 2 Data size: 220 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 2 Data size: 220 Basic stats: COMPLETE Column stats: NONE tag: 0 value expressions: _col0 (type: bigint) Path -> Alias:- Path -> Partition:- Partition base file name: logontable input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat properties: bucket_count -1 columns ip columns.comments columns.types bigint file.inputformat org.apache.hadoop.mapred.TextInputFormat file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat location - name default.logontable serialization.ddl struct logontable {i64 ip} serialization.format 1 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe transient_lastDdlTime 1435606340 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat properties: bucket_count -1 columns ip columns.comments columns.types bigint file.inputformat org.apache.hadoop.mapred.TextInputFormat file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat location - name default.logontable serialization.ddl struct logontable {i64 ip} serialization.format 1 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe transient_lastDdlTime 1435606340 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: default.logontable name: default.logontable Truncated Path -> Alias: /logontable [logontable] Reducer 2 Needs Tagging: true Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} filter mappings: 0 [1, 1] filter predicates: 0 {isipv4(VALUE._col0)} 1 outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 43281312 Data size: 5020632576 Basic stats: COMPLETE Column stats: NONE Filter Operator isSamplingPred: false predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: boolean) Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: expressions: _col0 (type: bigint), _col3 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false GlobalTableId: 0 directory: - NumFilesPerFileSink: 1 Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE Stats Publishing Key Prefix: - table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat properties: columns _col0,_col1 columns.types bigint:string escape.delim \ hive.serialization.extend.nesting.levels true serialization.format 1 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe TotalFiles: 1 GatherStats: false MultiFileSpray: false Stage: Stage-0 Fetch Operator limit: -1