Csaba Ringhofer created IMPALA-14367:
----------------------------------------
Summary: Reduce/rationalize test vector set for compressed file formats
Key: IMPALA-14367
URL: https://issues.apache.org/jira/browse/IMPALA-14367
Project: IMPALA
Issue Type: Test
Components: Infrastructure, Test
Reporter: Csaba Ringhofer
During exhaustive tests, many test vectors are created for some rarely used
file formats (e.g. rc, sequence), because these files can also be compressed
and each file format/compression pair is considered a new item in the
file_format dimension. Block vs record level compression can be an extra
dimension (e.g. seq/gzip/record). Meanwhile, Parquet, the most commonly used
file format, can also use several compression types at page level, but only
snappy compression is heavily tested.
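To illustrate how the file_format dimension grows, here is a minimal sketch in
Python. The format and codec lists, and the assumption that only seq
distinguishes block from record compression, are made up for illustration and
are not Impala's actual test matrix.

```python
from itertools import product

# Hypothetical inputs for illustration only; not Impala's real test matrix.
FORMATS = ["text", "rc", "seq", "avro"]
CODECS = ["none", "gzip", "snappy", "bzip"]
# Assumption: only "seq" distinguishes block vs record level compression.
COMPRESSION_TYPES = {"seq": ["block", "record"]}


def file_format_dimension(formats, codecs, compression_types):
    """Enumerate (format, codec, compression_type) items of the dimension."""
    values = []
    for fmt, codec in product(formats, codecs):
        if codec == "none":
            # Uncompressed data has no block/record distinction.
            values.append((fmt, codec, "none"))
            continue
        for ctype in compression_types.get(fmt, ["block"]):
            values.append((fmt, codec, ctype))
    return values


values = file_format_dimension(FORMATS, CODECS, COMPRESSION_TYPES)
print(len(values))  # 19 items from only 4 file formats
```

Each item is then combined with the other test dimensions, so every extra
format/compression pair increases the number of generated vectors.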
As an example, https://gerrit.cloudera.org/#/c/23342/ fixed pairwise test
vector generation, bumping exhaustive EE/custom cluster tests from 11000 to
17000, and restricting some tests to use only a single compression per file
format (single_compression_constraint()) reduced it to 16000.
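The idea of keeping a single compression per file format could be sketched
roughly as below. This is not the actual single_compression_constraint()
implementation; the PREFERRED_CODEC mapping and the function name
keeps_single_compression are hypothetical and only show the filtering idea.

```python
# Hypothetical mapping: the one codec kept for each file format.
PREFERRED_CODEC = {
    "text": "none",
    "rc": "snappy",
    "seq": "snappy",
    "avro": "snappy",
}
CODECS = ["none", "gzip", "snappy", "bzip"]


def keeps_single_compression(file_format, codec):
    """Return True only for the single codec chosen for this file format."""
    return PREFERRED_CODEC.get(file_format) == codec


# All format/codec pairs before the constraint is applied.
vectors = [(fmt, codec) for fmt in PREFERRED_CODEC for codec in CODECS]
# Pairs that survive the constraint: one per file format.
kept = [v for v in vectors if keeps_single_compression(*v)]
print(len(vectors), len(kept))  # 16 4
```

Tests that do not exercise the decompression path would lose little coverage
under such a constraint, which is why it may be a reasonable default for them.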
A few questions arise:
1. What is the priority of testing different file formats? IMO this depends
both on the frequency of usage and on the development activity in that area.
2. Which tests should have a file_format dimension at all?
3. Which tests should consider compression in the file_format dimension?
4. Is it possible to also remove some vectors from test data generation, or are
all of them needed to get good coverage? It is possible that some tables are
created but never touched by tests.