[ 
https://issues.apache.org/jira/browse/HIVE-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061674#comment-18061674
 ] 

Thomas Rebele commented on HIVE-29476:
--------------------------------------

Interesting idea. We would need to ensure that the output of DESCRIBE 
FORMATTED is deterministic, though. While investigating HIVE-29334, I 
discovered that the result of merging two KLL sketches is non-deterministic. 
Some tables in TPC-DS are partitioned. If I understand Hive correctly, the 
histograms of the partitions are merged to obtain the histogram of the 
table. (Indeed, some columns, e.g., wr_refunded_customer_sk, appear in 
PART_COL_STATS but not in TAB_COL_STATS.) So I think there is a risk of a 
non-deterministic "histogram" entry in DESCRIBE FORMATTED.
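
One possible way to keep such a test stable would be to mask the histogram 
value before comparing outputs. The following is only a rough sketch of that 
idea in Python, not Hive's qtest machinery: the row layout (a "histogram" 
label followed by whitespace and a value), the sample values, and the 
mask_histogram helper are all assumptions made for illustration.

```python
import re

# Assumed layout: the row starts with the label "histogram", followed by
# spaces/tabs and then the value. The real DESCRIBE FORMATTED layout may differ.
HISTOGRAM_ROW = re.compile(r"^(histogram[ \t]+).*$", re.MULTILINE)


def mask_histogram(describe_output: str) -> str:
    """Replace the value of every 'histogram' row with a fixed placeholder."""
    return HISTOGRAM_ROW.sub(r"\1#Masked#", describe_output)


# Two made-up outputs that differ only in the histogram values.
run_a = (
    "col_name            \twr_refunded_customer_sk\n"
    "min                 \t1\n"
    "histogram           \tQ1: 10, Q2: 20, Q3: 30\n"
)
run_b = (
    "col_name            \twr_refunded_customer_sk\n"
    "min                 \t1\n"
    "histogram           \tQ1: 11, Q2: 20, Q3: 29\n"
)

# The raw outputs differ, but they compare equal once the row is masked.
assert run_a != run_b
assert mask_histogram(run_a) == mask_histogram(run_b)
```

The trade-off is that a masked row no longer verifies the histogram content, 
only that the entry is present.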

> Add tests for TPC-DS 30TB metastore content
> -------------------------------------------
>
>                 Key: HIVE-29476
>                 URL: https://issues.apache.org/jira/browse/HIVE-29476
>             Project: Hive
>          Issue Type: Test
>          Components: Test
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>
> The [TPC-DS 30TB plan regression 
> suite|https://github.com/apache/hive/blob/2fa85ab5f6683e16125b30b63b4189b95b098b5a/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestTezTPCDS30TBPerfCliDriver.java]
>  is based on a pre-built database dump that is loaded via dockerized 
> [Postgres 
> database|https://github.com/apache/hive/blob/2fa85ab5f6683e16125b30b63b4189b95b098b5a/standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/dbinstall/rules/PostgresTPCDS.java].
>  The content of the dump is not validated anywhere and we can only verify 
> what's inside either by manually inspecting the dump or inferring implicit 
> conclusions from the query plans. The dump has been updated a few times 
> already, and another update is imminent in HIVE-26830. The creation of 
> the dump is a manual process, so it would be 
> helpful to have a basic set of tests that verify the state of the metastore 
> and how the dump evolves.
> Interesting information that we would like to capture includes:
>  * table and column data types
>  * constraints (FK, NOT NULL)
>  * basic table stats such as num_rows, numPartitions, etc.
>  * basic column stats such as min, max, NDV, num_nulls, etc.
> The above can be captured by adding DESCRIBE FORMATTED qtests for each TPC-DS 
> table and column. As an added bonus this will increase the coverage for 
> DESCRIBE statements.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
