Stamatis Zampetakis created HIVE-29476:
------------------------------------------

             Summary: Add tests for TPC-DS 30TB metastore content
                 Key: HIVE-29476
                 URL: https://issues.apache.org/jira/browse/HIVE-29476
             Project: Hive
          Issue Type: Test
          Components: Test
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


The [TPC-DS 30TB plan regression 
suite|https://github.com/apache/hive/blob/2fa85ab5f6683e16125b30b63b4189b95b098b5a/itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestTezTPCDS30TBPerfCliDriver.java]
 is based on a pre-built database dump that is loaded via a dockerized [Postgres 
database|https://github.com/apache/hive/blob/2fa85ab5f6683e16125b30b63b4189b95b098b5a/standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/dbinstall/rules/PostgresTPCDS.java].
 The content of the dump is not validated anywhere, so we can only verify 
what's inside by manually inspecting the dump or by inferring it indirectly 
from the query plans. The dump has already been updated a few times, and 
another update is imminent in HIVE-26830. Since creating the dump is a manual 
process, it would be helpful to have a basic set of tests that verify the 
state of the metastore and track how the dump evolves.

Interesting information that we would like to capture includes:
 * table and column data types
 * constraints (FK, NOT NULL)
 * basic table stats such as numRows, numPartitions, etc.
 * basic column stats such as min, max, NDV, num_nulls, etc.

The above can be captured by adding DESCRIBE FORMATTED qtests for each TPC-DS 
table and column. As an added bonus, this will increase the test coverage for 
DESCRIBE statements.
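
As a rough illustration of the idea, a qtest for a single table could look 
something like the sketch below. The table and column names come from the 
standard TPC-DS schema; which columns to cover and how the files are organized 
are assumptions made for illustration, not a settled layout.

```sql
-- Hypothetical qtest sketch; the file name and the choice of columns are
-- assumptions, not part of the existing test suite.

-- Table level: column data types, constraints, and basic table stats.
DESCRIBE FORMATTED store_sales;

-- Column level: basic column stats such as min, max, NDV, and num_nulls.
DESCRIBE FORMATTED store_sales ss_item_sk;
DESCRIBE FORMATTED store_sales ss_quantity;
```

With the output recorded in the usual golden file, any change in the dump 
would then surface as a diff in the expected results.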


