[
https://issues.apache.org/jira/browse/IMPALA-14367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016747#comment-18016747
]
Csaba Ringhofer commented on IMPALA-14367:
------------------------------------------
Trying to give some (not too well informed) answers:
1. file format priorities:
high: Parquet
medium: text, ORC
low: avro, json
very low: seq, rc
2. I would only remove the dimension from tests that are obviously not
file-format dependent, and check whether coverage decreases.
Removing the file_format dimension from tests is tricky, as there can be
file-format-specific logic for predicate pushdown / cancellation / limit
handling / resource handling (e.g. small buffers due to mem_limit); a sketch
of pinning such a format-independent test to a single format is shown below.
Ideally test_scanners.py would give almost full coverage for the file format
code, but this is unlikely to be the case.
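To make that concrete, here is a minimal sketch of what pinning a
format-independent test to a single file format could look like, based on
the usual ImpalaTestMatrix/add_constraint pattern in the Python end-to-end
tests (the class, test and workload names below are placeholders, not a
proposal for a concrete test):
{code:python}
# Sketch only: restrict a test that is not file format dependent to a single
# uncompressed format, so the file_format/compression pairs stop multiplying
# its vector count. Class/test names are placeholders.
from tests.common.impala_test_suite import ImpalaTestSuite


class TestFormatIndependentFeature(ImpalaTestSuite):

  @classmethod
  def get_workload(cls):
    return 'functional-query'

  @classmethod
  def add_test_dimensions(cls):
    super(TestFormatIndependentFeature, cls).add_test_dimensions()
    # Keep only uncompressed text instead of every file_format/compression
    # combination generated by the exploration strategy.
    cls.ImpalaTestMatrix.add_constraint(lambda v:
        v.get_value('table_format').file_format == 'text' and
        v.get_value('table_format').compression_codec == 'none')

  def test_format_independent_feature(self, vector):
    self.run_test_case('QueryTest/format-independent-feature', vector)
{code}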
3. Exact compression codecs are likely to be secondary, but the type of
compression used can lead to very different branches in the code.
What really matters:
- in Parquet (and probably ORC): any compression vs. no compression (the
Parquet scanner handles buffers differently in the uncompressed case)
- formats with file-level compression (text, JSON, ... ?): uncompressed /
streaming compression (e.g. gzip) / non-streaming compression (e.g. snappy)
- formats with block vs. record compression: uncompressed / block compression
/ record compression (streaming compression is probably never used there?)
I think that at minimum we should cover the compression types above in
test_scanners.py; I am not sure whether it is important in other tests. A
sketch of a constraint that keeps one representative per compression type
follows below.
Compression codecs could also be prioritized; again an uninformed guess:
high: snappy, gzip, lz4, zstd
low: deflate, bzip
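One possible way to encode the "one representative per compression type"
idea as a test matrix constraint (a sketch only; the attribute names follow
the usual table_format dimension, and the representative combinations below
are illustrative assumptions, not a vetted list):
{code:python}
# Sketch: keep a single representative vector per kind of compression path
# instead of every codec. The (file_format, codec, compression_type) triples
# below are illustrative assumptions, not a vetted list.
REPRESENTATIVE_FORMATS = {
  ('parquet', 'none', 'none'),   # uncompressed Parquet buffer handling
  ('parquet', 'snap', 'block'),  # "any" page-level compression
  ('text', 'none', 'none'),      # uncompressed file-level format
  ('text', 'gzip', 'block'),     # streaming file-level compression
  ('seq', 'snap', 'block'),      # block compression
  ('seq', 'snap', 'record'),     # record compression
}


def compression_kind_constraint(v):
  """Accept only the representative file_format/codec/type combinations."""
  tf = v.get_value('table_format')
  return (tf.file_format, tf.compression_codec, tf.compression_type) \
      in REPRESENTATIVE_FORMATS


# Usage inside a test's add_test_dimensions():
#   cls.ImpalaTestMatrix.add_constraint(compression_kind_constraint)
{code}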
4. It would be great to eliminate some test databases completely (or at least
in core tests, if we split core/exhaustive dataload).
If dropping a database is not possible (e.g. we want a few tests on
functional_rc_bzip), it would still be nice to create only those tables that
are actually used in tests; a rough sketch for spotting unreferenced
databases follows below.
Besides tests, I don't see a real need to keep these tables; during
development I never use the low-priority formats, and compression doesn't
matter.
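A rough heuristic for finding databases that are loaded but never referenced
could be a simple scan over the test sources and workload files (paths assume
a standard Impala checkout; a database can still be exercised indirectly via
the table_format dimension, so a missing textual reference is only a hint,
not proof):
{code:python}
#!/usr/bin/env python
# Rough heuristic sketch: report whether candidate low-priority databases are
# referenced anywhere under tests/ or testdata/workloads/. A database can
# still be hit indirectly (e.g. through the table_format dimension), so
# "no direct reference" is only a starting point for investigation.
import io
import os
import re
import sys

CANDIDATE_DBS = ['functional_rc_bzip']  # extend with other suspects
SEARCH_ROOTS = ['tests', 'testdata/workloads']
EXTENSIONS = ('.py', '.test', '.sql')


def referenced_dbs(repo_root):
  pattern = re.compile('|'.join(re.escape(db) for db in CANDIDATE_DBS))
  found = set()
  for search_root in SEARCH_ROOTS:
    for dirpath, _, filenames in os.walk(os.path.join(repo_root, search_root)):
      for name in filenames:
        if not name.endswith(EXTENSIONS):
          continue
        path = os.path.join(dirpath, name)
        with io.open(path, encoding='utf-8', errors='ignore') as f:
          found.update(pattern.findall(f.read()))
  return found


if __name__ == '__main__':
  used = referenced_dbs(sys.argv[1] if len(sys.argv) > 1 else '.')
  for db in CANDIDATE_DBS:
    print('%s: %s' % (db, 'referenced' if db in used else 'no direct reference'))
{code}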
> Reduce/rationalize test vector set for compressed file formats
> --------------------------------------------------------------
>
> Key: IMPALA-14367
> URL: https://issues.apache.org/jira/browse/IMPALA-14367
> Project: IMPALA
> Issue Type: Test
> Components: Infrastructure, Test
> Reporter: Csaba Ringhofer
> Priority: Major
>
> During exhaustive tests, a lot of test vectors are created for some rarely
> used file formats (e.g. rc, sequence), because these files can also be
> compressed and each file format/compression pair is considered a new item in
> the file_format dimension. Block vs. record-level compression can be an
> extra dimension (e.g. seq/gzip/record). Meanwhile, the most commonly used
> file format, Parquet, can also use several compression types at the page
> level, but only snappy compression is heavily tested.
> As an example, https://gerrit.cloudera.org/#/c/23342/ fixed pairwise test
> vector generation, bumping exhaustive EE/custom cluster tests from 11000 to
> 17000, and restricting some tests to use only a single compression per
> file format (single_compression_constraint()) reduced it to 16000.
> A few questions arise:
> 1. What is the priority of testing different file formats? IMO this depends
> both on the frequency of usage and the development activity in that area.
> 2. Which tests should have a file_format dimension at all?
> 3. Which tests should consider compression in the file format dimension?
> 4. Is it possible to also remove some vectors from test data generation, or
> are all of them needed for good coverage? It is possible that some tables
> are created but never touched by tests.