Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22718 )

Change subject: IMPALA-13923: Support more compression levels for ZSTD and ZLIB
......................................................................


Patch Set 3:

(4 comments)

There is some logic in hdfs-parquet-table-writer.cc that you need to look at:
https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1329-L1331
https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1362-L1365
https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1383

I think those locations should be preserving the std::optional so that it is 
not set unless there is an explicit value from the user.

If we are supporting negative compression levels, that has some implications 
for some of our performance testing infrastructure. The compression level gets 
baked into the database name used for loading data. "-" is not valid in a 
database name, so the negative levels need some adjustment. See 
https://gerrit.cloudera.org/#/c/17265/
That doesn't necessarily need to be fixed as part of this. It could be 
addressed in a separate JIRA.

For zlib, we should consider not supporting 0/-1 as valid level values. 0 is 
uncompressed. If someone wants to use the default level (-1), they can just 
skip specifying the level.

http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/codec.cc
File be/src/util/codec.cc:

http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/codec.cc@84
PS3, Line 84: Status Codec::CreateCompressor(MemPool* mem_pool, bool reuse, 
const string& codec,
            :     scoped_ptr<Codec>* compressor) {
            :   CodecMap::const_iterator type = CODEC_MAP.find(codec);
            :   if (type == CODEC_MAP.end()) {
            :     return Status(Substitute("$0$1", UNKNOWN_CODEC_ERROR, codec));
            :   }
            :
            :   CodecInfo codec_info(type->second);
            :   RETURN_IF_ERROR(CreateCompressor(mem_pool, reuse, codec_info, 
compressor));
            :   return Status::OK();
            : }
I believe this function is unused.


http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h
File be/src/util/compress.h:

http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h@48
PS3, Line 48:       int compression_level = Z_DEFAULT_COMPRESSION);
I think it would make sense to carry through the std::optional usage through 
here. Same for ZstandardCompressor.


http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h@66
PS3, Line 66:   int compression_level_;
Now that we have more than one compressor that uses levels, it would make sense 
to have a std::optional<int> on Codec that is inherited by compressors. Both 
GzipCompressor and ZstandardCompressor can use it.


http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/parse-util.cc
File be/src/util/parse-util.cc:

http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/parse-util.cc@135
PS3, Line 135:     switch(enum_type) {
             :       case THdfsCompression::ZSTD:
             :       case THdfsCompression::GZIP:
             :       case THdfsCompression::ZLIB:
             :       case THdfsCompression::DEFLATE:
             :         break;
             :       default:
             :         return Status("Compression level only supported for ZSTD 
or"
             :             " ZLIB(GZIP, DEFLATE)");
             :     }
             :
             :     StringParser::ParseResult status;
             :     string& clevel = tokens[1];
             :     trim(clevel);
             :     int compression_level = StringParser::StringToInt<int>(
             :         clevel.c_str(), static_cast<int>(clevel.size()), 
&status);
             :
             :     bool invalid_compression_level = false;
             :     if (status != StringParser::PARSE_SUCCESS) {
             :       invalid_compression_level = true;
             :     } else {
             :       switch(enum_type) {
             :         case THdfsCompression::ZSTD:
             :           invalid_compression_level = compression_level < 
ZSTD_minCLevel()
             :               || compression_level > ZSTD_maxCLevel();
             :           break;
             :         case THdfsCompression::GZIP:
             :         case THdfsCompression::ZLIB:
             :         case THdfsCompression::DEFLATE:
             :           invalid_compression_level = compression_level < 
Z_DEFAULT_COMPRESSION
             :               || compression_level > Z_BEST_COMPRESSION;
             :           break;
             :         default:
             :           invalid_compression_level = true;
             :       }
             :     }
I think it would be cleaner for a lot of the compression-codec specific logic 
to be in codec.h/.cc and/or compress.h/.cc.

For example, this parsing code could verify that the level parses as an integer 
(and return status if it does not), but then call into a function on Codec to 
verify the compression level. e.g.

RETURN_IF_ERROR(Codec::ValidateCompressionLevel(enum_type, compression_level));

That would keep this file from needing to know about zstd levels and zlib 
levels (and potentially other compression codec in future).



--
To view, visit http://gerrit.cloudera.org:8080/22718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5b98c735246f08e04598a4e752c8cca04e31a88a
Gerrit-Change-Number: 22718
Gerrit-PatchSet: 3
Gerrit-Owner: Surya Hebbar <sheb...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Reviewer: Surya Hebbar <sheb...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Apr 2025 21:38:29 +0000
Gerrit-HasComments: Yes

Reply via email to