Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/22718 )
Change subject: IMPALA-13923: Support more compression levels for ZSTD and ZLIB ...................................................................... Patch Set 3: (4 comments) There is some logic in hdfs-parquet-table-writer.cc that you need to look at: https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1329-L1331 https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1362-L1365 https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1383 I think those locations should be preserving the std::optional so that it is not set unless there is an explicit value from the user. If we are supporting negative compression levels, that has some implications for some of our performance testing infrastructure. The compression level gets baked into the database name used for loading data. "-" is not valid in a database name, so the negative levels need some adjustment. See https://gerrit.cloudera.org/#/c/17265/ That doesn't necessarily need to be fixed as part of this. It could be addressed in a separate JIRA. For zlib, we should consider not supporting 0/-1 as valid level values. 0 is uncompressed. If someone wants to use the default level (-1), they can just skip specifying the level. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/codec.cc File be/src/util/codec.cc: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/codec.cc@84 PS3, Line 84: Status Codec::CreateCompressor(MemPool* mem_pool, bool reuse, const string& codec, : scoped_ptr<Codec>* compressor) { : CodecMap::const_iterator type = CODEC_MAP.find(codec); : if (type == CODEC_MAP.end()) { : return Status(Substitute("$0$1", UNKNOWN_CODEC_ERROR, codec)); : } : : CodecInfo codec_info(type->second); : RETURN_IF_ERROR(CreateCompressor(mem_pool, reuse, codec_info, compressor)); : return Status::OK(); : } I believe this function is unused. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h File be/src/util/compress.h: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h@48 PS3, Line 48: int compression_level = Z_DEFAULT_COMPRESSION); I think it would make sense to carry through the std::optional usage through here. Same for ZstandardCompressor. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h@66 PS3, Line 66: int compression_level_; Now that we have more than one compressor that uses levels, it would make sense to have a std::optional<int> on Codec that is inherited by compressors. Both GzipCompressor and ZstandardCompressor can use it. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/parse-util.cc File be/src/util/parse-util.cc: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/parse-util.cc@135 PS3, Line 135: switch(enum_type) { : case THdfsCompression::ZSTD: : case THdfsCompression::GZIP: : case THdfsCompression::ZLIB: : case THdfsCompression::DEFLATE: : break; : default: : return Status("Compression level only supported for ZSTD or" : " ZLIB(GZIP, DEFLATE)"); : } : : StringParser::ParseResult status; : string& clevel = tokens[1]; : trim(clevel); : int compression_level = StringParser::StringToInt<int>( : clevel.c_str(), static_cast<int>(clevel.size()), &status); : : bool invalid_compression_level = false; : if (status != StringParser::PARSE_SUCCESS) { : invalid_compression_level = true; : } else { : switch(enum_type) { : case THdfsCompression::ZSTD: : invalid_compression_level = compression_level < ZSTD_minCLevel() : || compression_level > ZSTD_maxCLevel(); : break; : case THdfsCompression::GZIP: : case THdfsCompression::ZLIB: : case THdfsCompression::DEFLATE: : invalid_compression_level = compression_level < Z_DEFAULT_COMPRESSION : || compression_level > Z_BEST_COMPRESSION; : break; : default: : invalid_compression_level = true; : } : } I think it would be cleaner for a lot of the compression-codec specific logic to be in codec.h/.cc and/or compress.h/.cc. For example, this parsing code could verify that the level parses as an integer (and return status if it does not), but then call into a function on Codec to verify the compression level. e.g. RETURN_IF_ERROR(Codec::ValidateCompressionLevel(enum_type, compression_level)); That would keep this file from needing to know about zstd levels and zlib levels (and potentially other compression codec in future). -- To view, visit http://gerrit.cloudera.org:8080/22718 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5b98c735246f08e04598a4e752c8cca04e31a88a Gerrit-Change-Number: 22718 Gerrit-PatchSet: 3 Gerrit-Owner: Surya Hebbar <sheb...@cloudera.com> Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <ale...@apache.org> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com> Gerrit-Reviewer: Surya Hebbar <sheb...@cloudera.com> Gerrit-Comment-Date: Tue, 15 Apr 2025 21:38:29 +0000 Gerrit-HasComments: Yes