[ https://issues.apache.org/jira/browse/HIVE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144548#comment-14144548 ]
Hive QA commented on HIVE-8196: ------------------------------- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12670585/HIVE-8196.1.patch {color:red}ERROR:{color} -1 due to 24 failed/errored test(s), 6313 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_partition_coltype org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_12 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_combine2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_opt_vectorization org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_11 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_input24 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_limit_partition_metadataonly org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_query_multiskew_3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_metadataonly1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nullgroup3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_constant_where org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_regex_col org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats10 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats12 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats13 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats7 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_view org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynamic_partition_pruning org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynpart_sort_opt_vectorization org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_metadataonly1 org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_limit_partition_stats org.apache.hive.hcatalog.pig.TestOrcHCatLoader.testReadDataPrimitiveTypes {noformat} Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/937/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/937/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-937/ Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 24 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12670585 > Joining on partition columns with fetch column stats enabled results it very > small CE which negatively affects query performance > --------------------------------------------------------------------------------------------------------------------------------- > > Key: HIVE-8196 > URL: https://issues.apache.org/jira/browse/HIVE-8196 > Project: Hive > Issue Type: Bug > Components: Physical Optimizer > Affects Versions: 0.14.0 > Reporter: Mostafa Mokhtar > Assignee: Prasanth J > Priority: Critical > Labels: performance > Fix For: 0.14.0 > > Attachments: HIVE-8196.1.patch > > > To make the best out of dynamic partition pruning joins should be on the > partitioning columns which results in dynamically pruning the partitions from > the fact table based on the qualifying column keys from the dimension table, > this type of joins negatively effects on cardinality estimates with fetch > column stats enabled. > Currently we don't have statistics for partition columns and as a result NDV > is set to row count, doing that negatively affects the estimated join > selectivity from the join. > Workaround is to capture statistics for partition columns or use number of > partitions incase dynamic partitioning is used. > In StatsUtils.getColStatisticsFromExpression is where count distincts gets > set to row count > {code} > if (encd.getIsPartitionColOrVirtualCol()) { > // vitual columns > colType = encd.getTypeInfo().getTypeName(); > countDistincts = numRows; > oi = encd.getWritableObjectInspector(); > {code} > Query used to repro the issue : > {code} > set hive.stats.fetch.column.stats=ture; > set hive.tez.dynamic.partition.pruning=true; > explain select d_date > from store_sales, date_dim > where > store_sales.ss_sold_date_sk = date_dim.d_date_sk and > date_dim.d_year = 1998; > {code} > Plan > {code} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > Edges: > Map 1 <- Map 2 (BROADCAST_EDGE) > DagName: mmokhtar_20140919180404_945d29f5-d041-4420-9666-1c5d64fa6540:8 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: store_sales > filterExpr: ss_sold_date_sk is not null (type: boolean) > Statistics: Num rows: 550076554 Data size: 47370018816 > Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {ss_sold_date_sk} > 1 {d_date_sk} {d_date} > keys: > 0 ss_sold_date_sk (type: int) > 1 d_date_sk (type: int) > outputColumnNames: _col22, _col26, _col28 > input vertices: > 1 Map 2 > Statistics: Num rows: 652 Data size: 66504 Basic stats: > COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (_col22 = _col26) (type: boolean) > Statistics: Num rows: 326 Data size: 33252 Basic stats: > COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col28 (type: string) > outputColumnNames: _col0 > Statistics: Num rows: 326 Data size: 30644 Basic > stats: COMPLETE Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 326 Data size: 30644 Basic > stats: COMPLETE Column stats: COMPLETE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Execution mode: vectorized > Map 2 > Map Operator Tree: > TableScan > alias: date_dim > filterExpr: (d_date_sk is not null and (d_year = 1998)) > (type: boolean) > Statistics: Num rows: 73049 Data size: 81741831 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (d_date_sk is not null and (d_year = 1998)) > (type: boolean) > Statistics: Num rows: 652 Data size: 66504 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: d_date_sk (type: int) > sort order: + > Map-reduce partition columns: d_date_sk (type: int) > Statistics: Num rows: 652 Data size: 66504 Basic stats: > COMPLETE Column stats: COMPLETE > value expressions: d_date (type: string) > Select Operator > expressions: d_date_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 652 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > keys: _col0 (type: int) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 652 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Dynamic Partitioning Event Operator > Target Input: store_sales > Partition key expr: ss_sold_date_sk > Statistics: Num rows: 652 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Target column: ss_sold_date_sk > Target Vertex: Map 1 > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)