[ 
https://issues.apache.org/jira/browse/IMPALA-14367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riza Suminto updated IMPALA-14367:
----------------------------------
    Attachment: Screenshot 2025-08-29 at 5.24.53 PM.png

> Reduce/rationalize test vector set for compressed file formats
> --------------------------------------------------------------
>
>                 Key: IMPALA-14367
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14367
>             Project: IMPALA
>          Issue Type: Test
>          Components: Infrastructure, Test
>            Reporter: Csaba Ringhofer
>            Priority: Major
>         Attachments: Screenshot 2025-08-29 at 5.24.53 PM.png
>
>
> During exhaustive tests, many test vectors are created for some rarely 
> used file formats (e.g. rc, sequence), because these files can also be 
> compressed and each file format/compression pair is considered a new item in 
> the file_format dimension. Block vs. record level compression can be an extra 
> dimension (e.g. seq/gzip/record). Meanwhile, the most commonly used file 
> format, Parquet, can also use several compression types at page level, but only 
> snappy compression is heavily tested.
> As an example, https://gerrit.cloudera.org/#/c/23342/ fixed pairwise test 
> vector generation, bumping exhaustive EE/custom cluster tests from 11000 to 
> 17000, and restricting some tests to use only a single compression per 
> file format (single_compression_constraint()) reduced it to 16000.
> A few questions arise:
> 1. What is the priority of testing different file formats? IMO this depends 
> both on the frequency of usage and the development activity in that area.
> 2. Which tests should have a file_format dimension at all?
> 3. Which tests should consider compression in the file format dimension?
> 4. Is it possible to also remove some vectors from test data generation, or 
> are all of them needed to get good coverage? It is possible that some tables 
> are created but never touched by tests.
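
To make the growth of the file_format dimension described above concrete, here is a minimal Python sketch. It does not use Impala's test framework: the format/codec table, the `file_format_dimension` helper, the `BATCH_SIZES` dimension and the way the single-compression restriction is modelled (keep the first listed codec and only block-level compression) are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical format -> codec table, for illustration only.
# This is NOT Impala's actual test matrix.
CODECS = {
    "text": ["none"],
    "parquet": ["snappy", "none"],
    "rc": ["snappy", "gzip", "bzip", "none"],
    "seq": ["snappy", "gzip", "bzip", "none"],
}

# Formats assumed here to distinguish block- vs record-level compression.
BLOCK_OR_RECORD = {"seq"}


def file_format_dimension(single_compression=False):
    """Return (format, codec, level) tuples for the file_format dimension.

    With single_compression=True, only the first listed codec per format
    is kept (an assumed stand-in for a single-compression constraint).
    """
    values = []
    for fmt, codecs in CODECS.items():
        if single_compression:
            codecs = codecs[:1]
        for codec in codecs:
            levels = [None]
            if codec != "none" and fmt in BLOCK_OR_RECORD:
                levels = ["block"] if single_compression else ["block", "record"]
            values.extend((fmt, codec, level) for level in levels)
    return values


# Hypothetical second dimension, crossed exhaustively with file_format.
BATCH_SIZES = [0, 1, 16]

full = list(product(file_format_dimension(), BATCH_SIZES))
reduced = list(product(file_format_dimension(single_compression=True), BATCH_SIZES))

print(len(file_format_dimension()), len(full))          # 14 42
print(len(file_format_dimension(True)), len(reduced))   # 4 12
```

In this toy table the two rarely used formats (rc, seq) contribute 11 of the 14 file_format values, so every additional dimension crossed with file_format multiplies mostly their vectors.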



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
