[ https://issues.apache.org/jira/browse/HIVE-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754322#comment-15754322 ]
Jesus Camacho Rodriguez edited comment on HIVE-15122 at 12/16/16 12:45 PM:
---------------------------------------------------------------------------

[~ashutoshc], could you review this patch?

For the new test case, PK-FK inference can be checked in the logs. For that particular case, observe the different row count, which is the same as the one in the query without the cast.

Stats without patch:
{code}
Statistics: Num rows: 889 Data size: 7112 Basic stats: COMPLETE Column stats: COMPLETE
{code}

While stats with patch:
{code}
Statistics: Num rows: 964 Data size: 7712 Basic stats: COMPLETE Column stats: COMPLETE
{code}


was (Author: jcamachorodriguez):
    [~ashutoshc], could you review this patch?

For new test case, PK-FK inference can be checked in the logs. For that particular case, stats without patch:
{code}
Statistics: Num rows: 889 Data size: 7112 Basic stats: COMPLETE Column stats: COMPLETE
{code}

While stats with patch:
{code}
Statistics: Num rows: 964 Data size: 7712 Basic stats: COMPLETE Column stats: COMPLETE
{code}

> Hive: Upcasting types should not obscure stats (min/max/ndv)
> ------------------------------------------------------------
>
>                 Key: HIVE-15122
>                 URL: https://issues.apache.org/jira/browse/HIVE-15122
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: HIVE-15122.01.patch, HIVE-15122.patch
>
>
> A UDFToLong breaks PK/FK inferences and triggers mis-estimation of joins in LLAP.
> Snippet from the bad plan.
> {code}
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: hive_20161031222730_a700058f-78eb-40d6-a67d-43add60a50e2:6
>       Edges:
>         Map 2 <- Map 1 (BROADCAST_EDGE)
>         Map 3 <- Map 2 (BROADCAST_EDGE)
>         Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE), Map 7 (CUSTOM_SIMPLE_EDGE), Map 8 (BROADCAST_EDGE), Map 9 (BROADCAST_EDGE)
>         Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
>         Reducer 6 <- Reducer 5 (SIMPLE_EDGE)
>       DagName:
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: supplier
>                   filterExpr: (s_suppkey is not null and s_nationkey is not null) (type: boolean)
>                   Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (s_suppkey is not null and s_nationkey is not null) (type: boolean)
>                     Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: s_suppkey (type: bigint), s_nationkey (type: bigint)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: bigint)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: bigint)
>                         Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col1 (type: bigint)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: lineitem
>                   filterExpr: (l_suppkey is not null and l_orderkey is not null) (type: boolean)
>                   Statistics: Num rows: 2285121364 Data size: 63983407882 Basic stats: COMPLETE Column stats: PARTIAL
>                   Filter Operator
>                     predicate: (l_suppkey is not null and l_orderkey is not null) (type: boolean)
>                     Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL
>                     Select Operator
>                       expressions: l_orderkey (type: bigint), l_suppkey (type: int), l_extendedprice (type: double), l_discount (type: double), l_shipdate (type: date)
>                       outputColumnNames: _col0, _col1, _col2, _col3, _col4
>                       Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col0 (type: bigint)
>                           1 UDFToLong(_col1) (type: bigint)
>                         outputColumnNames: _col1, _col2, _col4, _col5, _col6
>                         input vertices:
>                           0 Map 1
>                         Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL
>                         Reduce Output Operator
>                           key expressions: _col2 (type: bigint)
>                           sort order: +
>                           Map-reduce partition columns: _col2 (type: bigint)
>                           Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL
>                           value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>         Map 3
>             Map Operator Tree:
>                 TableScan
>                   alias: orders
>                   filterExpr: (o_orderkey is not null and o_custkey is not null) (type: boolean)
>                   Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (o_orderkey is not null and o_custkey is not null) (type: boolean)
>                     Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: o_orderkey (type: int), o_custkey (type: bigint)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col2 (type: bigint)
>                           1 UDFToLong(_col0) (type: bigint)
>                         outputColumnNames: _col1, _col4, _col5, _col6, _col8
>                         input vertices:
>                           0 Map 2
>                         Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           key expressions: _col8 (type: bigint)
>                           sort order: +
>                           Map-reduce partition columns: _col8 (type: bigint)
>                           Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE
>                           value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>         Map 7
> {code}
> Note the Map2 to Map3 output.
> This causes a rather large join (120GB) to be categorized as a map-join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
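
To illustrate why an upcast such as UDFToLong need not obscure min/max/ndv, here is a minimal Java sketch. It is not Hive's implementation and does not reflect the attached patch: the class CastStatsSketch, the ColumnStats holder, and both methods are hypothetical names introduced only for this example. The join estimate is the textbook rows(L) * rows(R) / max(ndv) formula, and the ndv of 10000000 assumed for l_suppkey is an illustrative value, not one taken from the plan.

```java
// Hypothetical sketch; none of these names come from Hive.
final class CastStatsSketch {

    /** Illustrative column statistics holder (min, max, number of distinct values). */
    static final class ColumnStats {
        public final long min;
        public final long max;
        public final long ndv;

        ColumnStats(long min, long max, long ndv) {
            this.min = min;
            this.max = max;
            this.ndv = ndv;
        }
    }

    /**
     * An int -> bigint cast is injective and order preserving, so the
     * minimum, maximum and distinct count of the column are unchanged.
     */
    static ColumnStats widenIntToLong(ColumnStats intColumn) {
        return new ColumnStats(intColumn.min, intColumn.max, intColumn.ndv);
    }

    /**
     * Textbook equi-join cardinality estimate:
     * rows(L) * rows(R) / max(ndv(L.key), ndv(R.key)).
     * multiplyExact throws on overflow instead of silently wrapping.
     */
    static long estimateJoinRows(long leftRows, ColumnStats leftKey,
                                 long rightRows, ColumnStats rightKey) {
        long maxNdv = Math.max(leftKey.ndv, rightKey.ndv);
        return Math.multiplyExact(leftRows, rightRows) / maxNdv;
    }

    public static void main(String[] args) {
        // Row counts taken from the plan; key ranges and ndv values are assumptions.
        ColumnStats suppkey = new ColumnStats(1, 10_000_000L, 10_000_000L);
        ColumnStats lSuppkeyInt = new ColumnStats(1, 10_000_000L, 10_000_000L);

        // Stats survive the cast, so the estimator can still use them.
        ColumnStats lSuppkeyLong = widenIntToLong(lSuppkeyInt);
        long rows = estimateJoinRows(10_000_000L, suppkey, 2_285_121_364L, lSuppkeyLong);
        System.out.println("Estimated join rows: " + rows);
    }
}
```

Under these assumed inputs the estimate stays at the foreign-key side's row count rather than collapsing to the primary-key side's, which is the kind of difference that decides whether a join is considered small enough for a map-join.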