Mikko Kivistö created HIVE-20574: ------------------------------------ Summary: Column statistics give erraneous numDistinct Key: HIVE-20574 URL: https://issues.apache.org/jira/browse/HIVE-20574 Project: Hive Issue Type: Bug Components: Metastore, Statistics Affects Versions: 2.3.2 Environment: Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0. Reporter: Mikko Kivistö
1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) using some tool (aws cli, hdfs command or anything) - S3: s3://www.smartdatahub.io/data/test.parquet - HTTP: [http://www.smartdatahub.io/data/test.parquet] - or the attachmen eg. with aws cli, wget/curl/distcp can also be used {{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet .}} {{hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}{{hdfs dfs -put test.parquet }} {{hdfs:///tmp/testi_parquet/test.parquet}} 2) Create table default.testi_parquet2 on top of that using the schema provided {{ CREATE TABLE `default.testi_parquet2`(}} {{ `rakennustu` int, }} {{ `kohdenimi` string, }} {{ `tekstisuun` int, }} {{ `tekstikoko` float, }} {{ `tekstifont` string, }} {{ `buix_bid` int, }} {{ `paivitetty` string, }} {{ `datanomist` string, }} {{ `geom_geojson` string, }} {{ `geom` binary, }} {{ `extractdate` string)}} {{ ROW FORMAT SERDE }} {{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }} {{ STORED AS INPUTFORMAT }} {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }} {{ OUTPUTFORMAT }} {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}} {{ LOCATION}} {{ 'hdfs:///tmp/testi_parquet/';}} {{ -- CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE FROM STEP 1 IF IT DIFFERS FROM THE EXAMPLE}} 3) To collect the values showing you the actual reality of the data: Query the distinct count, min and max of column "tekstisuun" {{SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM default.testi_parquet2; }} and note them (min 0, max 0, distinct 1) 4) Compute statistics for the table using {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}} 5) See erroneous statistics entry for numDistincts: Query the statistics by using " {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}} " and note the ERRANEOUS numDistincts value: 2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)