[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295505#comment-16295505 ]
Hive QA commented on HIVE-18149:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902677/HIVE-18149.03wip02.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 41 failed/errored test(s), 11531 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table] (batchId=249)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table] (batchId=249)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_12] (batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] (batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_rollup1] (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input17] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input3_limit] (batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input4] (batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input5] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath2] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] (batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge5] (batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge6] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat1] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat2] (batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_gather_stats] (batchId=86)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_reduce_groupby_duplicate_cols] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketizedhiveinputformat] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_4] (batchId=179)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part] (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1] (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_12] (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=113)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=209)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.TestTxnCommandsForOrcMmTable.testInsertOverwriteWithDynamicPartition (batchId=278)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8300/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8300/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8300/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 41 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12902677 - PreCommit-HIVE-Build

> Stats: rownum estimation from datasize underestimates in most cases
> -------------------------------------------------------------------
>
>                 Key: HIVE-18149
>                 URL: https://issues.apache.org/jira/browse/HIVE-18149
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>         Attachments: HIVE-18149.01.patch, HIVE-18149.01wip01.patch, HIVE-18149.02.patch, HIVE-18149.03wip01.patch, HIVE-18149.03wip02.patch
>
>
> rownum estimation is based on the following fact as of now:
> * datasize being used from the following sources:
> ** basicstats aggregates the loaded "on-heap" row sizes ; other readers are able to give "raw size" estimation - I've checked orc; but I'm sure others will do the same....api docs are a bit vague about the methods purpose...
> ** if the basicstats level info is not available; the filesystem level "file-size-sums" are used as the "raw data size" ; which is multiplied by the [deserialization ratio|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L261] ; which is currently 1.
> the problem with all of this is that the deser factor is 1; and that rowsize counts in the online object headers.
> example: 20 rows are loaded into a partition [columnstats_partlvl_dp.q|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L7]
> after HIVE-18108 [this explain|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L25] will estimate the rowsize of the table to be 404 bytes; however the 20 rows of text are only 169 bytes...so it ends up with 0 rows...

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
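
The arithmetic in the quoted issue description can be illustrated with a minimal sketch. It assumes the row count is the integer quotient of (data size times deserialization factor) by the estimated row size, which is consistent with the "0 rows" outcome reported for 169 bytes and a 404-byte row size. The function and parameter names are hypothetical and this is not Hive's StatsUtils code.

```python
def estimate_row_count(data_size_bytes: int,
                       est_row_size_bytes: int,
                       deser_factor: float = 1.0) -> int:
    """Hypothetical sketch of a size-based row count estimate.

    Assumption: rows = floor(data_size * deser_factor / row_size).
    """
    if est_row_size_bytes <= 0:
        raise ValueError("estimated row size must be positive")
    # Scale the on-disk size by the deserialization factor (1.0 per the issue text).
    raw_data_size = int(data_size_bytes * deser_factor)
    # Integer division truncates, so a row size larger than the data size yields 0.
    return raw_data_size // est_row_size_bytes


# Numbers from the issue: 20 rows of text occupy 169 bytes on disk,
# while the estimated row size is 404 bytes.
print(estimate_row_count(169, 404))
```

With these inputs the estimate truncates to zero even though 20 rows were loaded, because the per-row estimate (which includes in-memory overhead) exceeds the whole file size on disk.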