Surya Hebbar has posted comments on this change. ( http://gerrit.cloudera.org:8080/22718 )
Change subject: IMPALA-13923: Support more compression levels for ZSTD and ZLIB ...................................................................... Patch Set 4: (10 comments) > > There is some logic in hdfs-parquet-table-writer.cc that you need > to look at: > https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1329-L1331 > https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1362-L1365 > https://github.com/apache/impala/blob/master/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1383 > > I think those locations should be preserving the std::optional so > that it is not set unless there is an explicit value from the user. Thank you for the comments, I have corrected it now. > If we are supporting negative compression levels, that has some > implications for some of our performance testing infrastructure. > The compression level gets baked into the database name used for > loading data. "-" is not valid in a database name, so the negative > levels need some adjustment. See https://gerrit.cloudera.org/#/c/17265/ > That doesn't necessarily need to be fixed as part of this. It could > be addressed in a separate JIRA. I will make sure to address this in a separate patch. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/runtime/tmp-file-mgr.h File be/src/runtime/tmp-file-mgr.h: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/runtime/tmp-file-mgr.h@368 PS3, Line 368: /// and ignored otherwise. > Please update the comment to reflect support for negative levels. Done http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/service/query-options-test.cc File be/src/service/query-options-test.cc: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/service/query-options-test.cc@626 PS3, Line 626: // Test with compression levels for supported codecs i.e. ZLIB and ZSTD > Is this really "all codecs"? It doesn't cover snappy, lz4, etc. Done http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/service/query-options-test.cc@652 PS3, Line 652: // Test minimum negative compression level > typo: comression -> compression Done http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/service/query-options-test.cc@657 PS3, Line 657: // The values could begin from -11000 or even less > This sentence seems incomplete. Done http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/service/query-options.cc@310 PS3, Line 310: std::optional<int> compression_level; > nit: this is the default value, explicit initialization to nullopt isn't ne Done http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/codec.cc File be/src/util/codec.cc: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/codec.cc@84 PS3, Line 84: Status Codec::CreateCompressor(MemPool* mem_pool, bool reuse, const CodecInfo& codec_info, : scoped_ptr<Codec>* compressor) { : THdfsCompression::type format = codec_info.format_; : switch (format) { : case THdfsCompression::NONE: : compressor->reset(nullptr); : return Status::OK(); : case THdfsCompression::GZIP: : compressor->reset(new GzipCompressor(GzipCompressor::GZIP, mem_pool, reuse, : codec_info.compression_level_.value_or(Z_DEFAULT_COMPRESSION))); : > I believe this function is unused. Done. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h File be/src/util/compress.h: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h@48 PS3, Line 48: int compression_level = Z_DEFAULT_COMPRESSION); > I think it would make sense to carry through the std::optional usage throug As the implementations still need a default value, this seems to be the cleanest way. Please let me know, if something else is preferred. I think, CodecInfo being optional is sufficient. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/compress.h@66 PS3, Line 66: int compression_level_; > Now that we have more than one compressor that uses levels, it would make s If we approach this, CodecInfo will only have a single attribute "format", also the compression levels are not a part of many codecs like snappy. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/parse-util.cc File be/src/util/parse-util.cc: http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/parse-util.cc@135 PS3, Line 135: string& clevel = tokens[1]; : trim(clevel); : int compression_level = StringParser::StringToInt<int>( : clevel.c_str(), static_cast<int>(clevel.size()), &status); : : bool invalid_compression_level = false; : if (status == StringParser::PARSE_SUCCESS) { : return Codec::ValidateCompressionLevel(enum_type, compression_level); : } else { : return Status(Substitute("Invalid compression level value - $0", clevel)); : } : } : *type = enum_type; : return Status::OK(); : } : : // Return all enum values in a string format, e.g. FOO(1), BAR(2), BAZ(3). : string GetThriftEnumValues(const map<int, const char*>& enum_values_to_names) { : bool first = true; : stringstream ss; : for (const auto& e : enum_values_to_names) { : if (!first) { : ss << ", "; : } else { : first = false; : } : ss << e.second << "(" << e.first << ")"; : } : return ss.str(); : } : } : : : : : > I think it would be cleaner for a lot of the compression-codec specific log Done. I have moved the validation into Codec class. http://gerrit.cloudera.org:8080/#/c/22718/3/be/src/util/parse-util.cc@173 PS3, Line 173: > I have a slight preference for providing the error message at each site and Ack. I will add these error messages instead. -- To view, visit http://gerrit.cloudera.org:8080/22718 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5b98c735246f08e04598a4e752c8cca04e31a88a Gerrit-Change-Number: 22718 Gerrit-PatchSet: 4 Gerrit-Owner: Surya Hebbar <sheb...@cloudera.com> Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <ale...@apache.org> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com> Gerrit-Reviewer: Surya Hebbar <sheb...@cloudera.com> Gerrit-Comment-Date: Tue, 22 Apr 2025 11:57:21 +0000 Gerrit-HasComments: Yes