[ https://issues.apache.org/jira/browse/HIVE-20574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624054#comment-16624054 ]
Ajay Jadhav commented on HIVE-20574: ------------------------------------ Hive exposes this setting- [https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.stats.ndv.error] which determines the error tolerance for "distinct_count". Error tolerance provides a tradeoff between accuracy vs compute cost. In order to get the correct count, suggest setting {color:#FF0000}hive.stats.ndv.error = 0{color} I have tested this on EMR cluster and it indeed improves the accuracy to 100% Another interesting property, if you are using partitions, is to consider tuning hive.metastore.stats.ndv.tuner to be closer to 1. The default is 0. > Column statistics give erraneous numDistinct > -------------------------------------------- > > Key: HIVE-20574 > URL: https://issues.apache.org/jira/browse/HIVE-20574 > Project: Hive > Issue Type: Bug > Components: Metastore, Statistics > Affects Versions: 2.3.2 > Environment: Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0. > Reporter: Mikko Kivistö > Priority: Major > Labels: Statistics, statsCollection > > 1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) > using some tool (aws cli, hdfs command or anything) > - S3: s3://www.smartdatahub.io/data/test.parquet > - HTTP: [http://www.smartdatahub.io/data/test.parquet] > - or the attachmen > eg. with aws cli, wget/curl/distcp can also be used > {\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }} > {\{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }} > {{ hdfs:///tmp/testi_parquet/test.parquet}} > 2) Create table default.testi_parquet2 on top of that using the schema > provided > {{CREATE TABLE `default.testi_parquet2`(}} > {{ `rakennustu` int, }} > {{ `kohdenimi` string, }} > {{ `tekstisuun` int, }} > {{ `tekstikoko` float, }} > {{ `tekstifont` string, }} > {{ `buix_bid` int, }} > {{ `paivitetty` string, }} > {{ `datanomist` string, }} > {{ `geom_geojson` string, }} > {{ `geom` binary, }} > {{ `extractdate` string)}} > {{ ROW FORMAT SERDE }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }} > {{ STORED AS INPUTFORMAT }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }} > {{ OUTPUTFORMAT }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}} > {{ LOCATION}} > {{ 'hdfs:///tmp/testi_parquet/';}} > {{ \{{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE > FROM STEP 1}}}} > 3) To collect the values showing you the actual reality of the data: Query > the distinct count, min and max of column "tekstisuun" > {\{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM > default.testi_parquet2; }} > and note them (min 0, max 0, distinct 1) > 4) Compute statistics for the table using > {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}} > 5) See erroneous statistics entry for numDistincts: Query the statistics by > using " > {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}} > " and note the ERRANEOUS numDistincts value: 2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)