[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297104#comment-16297104 ]
Hive QA commented on HIVE-18149:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902874/HIVE-18149.03.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 11528 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] (batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_15] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketizedhiveinputformat] (batchId=178)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part] (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1] (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=113)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=209)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8322/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8322/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8322/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 19 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12902874 - PreCommit-HIVE-Build

> Stats: rownum estimation from datasize underestimates in most cases
> -------------------------------------------------------------------
>
>                 Key: HIVE-18149
>                 URL: https://issues.apache.org/jira/browse/HIVE-18149
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>         Attachments: HIVE-18149.01.patch, HIVE-18149.01wip01.patch,
> HIVE-18149.02.patch, HIVE-18149.03.patch, HIVE-18149.03wip01.patch,
> HIVE-18149.03wip02.patch
>
>
> rownum estimation currently works as follows:
> * the datasize is taken from one of the following sources:
> ** basicstats aggregates the loaded "on-heap" row sizes; other readers are
> able to give a "raw size" estimate. I've checked ORC, and I'm sure the others
> do the same, although the API docs are a bit vague about the methods' purpose.
> ** if the basicstats-level info is not available, the filesystem-level
> "file-size-sums" are used as the "raw data size", which is multiplied by the
> [deserialization ratio|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L261],
> which is currently 1.
>
> The problem with all of this is that the deser factor is 1, and that the
> rowsize counts in the online object headers.
>
> Example: 20 rows are loaded into a partition in
> [columnstats_partlvl_dp.q|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L7].
> After HIVE-18108, [this explain|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L25]
> will estimate the rowsize of the table to be 404 bytes; however, the 20 rows
> of text are only 169 bytes, so it ends up with 0 rows.
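
The arithmetic in the quoted description can be reproduced with a minimal sketch. It is not Hive's StatsUtils code: the function name `estimate_row_count` and its parameters are hypothetical, and the truncating division is an assumption chosen because it matches the reported result of 0 rows.

```python
def estimate_row_count(data_size_bytes: int,
                       avg_row_size_bytes: int,
                       deser_factor: float = 1.0) -> int:
    """Hypothetical illustration of estimating rows from data size.

    Assumption: the raw data size is scaled by the deserialization
    factor and divided by the estimated row size, truncating toward zero.
    """
    if avg_row_size_bytes <= 0:
        raise ValueError("avg_row_size_bytes must be positive")
    scaled_size = data_size_bytes * deser_factor
    return int(scaled_size // avg_row_size_bytes)


# Figures from the description: 20 rows of text occupy 169 bytes on disk,
# while the estimated (on-heap) row size is 404 bytes.
print(estimate_row_count(169, 404))  # prints 0
```

With a factor of 1, a file-based size smaller than a single estimated row always truncates to 0 rows, no matter how many rows the file actually holds.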