Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 14: (13 comments) http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG@22 PS14, Line 22: confusion of allowing both serde and tbl properties. Can you mention IMPALA-13748? http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h@452 PS14, Line 452: nit: extra line http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h File be/src/util/char-codec.h: http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@40 PS14, Line 40: , nit: extra , http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@47 PS14, Line 47: progressivly nit: progressively http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@57 PS14, Line 57: MemTrackerHelper this could be mover to mem_tracker.h, e.g. as MemTracker::ScopedAllocation http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@80 PS14, Line 80: /// Delimiter used to separate tuples. : const char tuple_delim_; : : /// Buffer to hold the partial symbol that could not be decoded in the previous call to : /// DecodeBuffer. : std::vector<uint8_t> partial_symbol_; : : /// Can we reuse the output buffer or do we need to allocate on each call? : bool reuse_buffer_; : : /// Buffer to hold transformed data. : /// Either passed from the caller or allocated from memory_pool_. : uint8_t* out_buffer_ = nullptr; : : /// Length of the output buffer. : int64_t buffer_length_ = 0; : }; These are only used by DecodeBuffer - can you add a comment about that, e.g. the following member are only used by decoding, not encoding? http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77 PS8, Line 77: return memory_pool_->mem_tracker()->MemLimitExceeded(nullptr, details, *bytes_read); : } > Agree, added TryConsume() within a helper class to secure Release() for eve I think that the current estimates are good enough. http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113 PS8, Line 113: partial_symbol_.clear(); > Agree, but instead of releasing here, I employed the mechanism from Decompr I think that we are still suboptimal here - for example if I understand correctly in case of stream compression + !reuse_buffer we'll attach both the decompressed buffer and the decoded buffer to the pool, while the the decompressed buffer is no longer used. Probably the best would be to clean this up in a follow up commit. >Potential problem is a significant exceeding of row_batch's limit My understanding is that we can't be much smarter here - for example if decompression leads to a very large uncompressed block we need to attach it to the row batch's pool. A way to solve it would be to copy string from the buffer in case the buffer is very large. http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@25 PS14, Line 25: #include "gutil/strings/substitute.h" nit: usually includes to Impala headers are sorted alphabetically http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@50 PS14, Line 50: } Both branches mean that out_buffer_ is no longer usable, right? I think that it would be clearer to set it to nullptr and check only for out_buffer_ == nullptr in line 71. http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@57 PS14, Line 57: (buf_end - buf_start) * 2 Can you add a comment about reason for this size? http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@71 PS14, Line 71: if (!reuse_buffer_ || out_buffer_ == nullptr) { : buffer_length_ = std::max(STREAM_OUT_BUF_SIZE, *bytes_read); Isn't it possible for out_buffer_ to be set but not large enough? For example if there are two compressed blocks and the uncompressed size of the second is smaller than the first's. I think that we should check for out_buffer_ == nullptr || buffer_length_ < *bytes_read and free the old buffer it is not large enough (see also my comment at line 50). I also think that std::max(STREAM_OUT_BUF_SIZE, *bytes_read); can lead to allocating too much memory, for example in case of small files or when using stream compression - if reuse_buffer_ is false, than simply allocating *bytes_read seems the best. If reuse_buffer_ is true, then it may make sense to overallocate to avoid multiple allocations when *bytes_read is growing, but using *bytes_read looks also good and makes the code simpler. http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py@77 PS14, Line 77: lambda v: v.get_value('table_format').file_format == 'text' Is there any test for SequenceFile? -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 14 Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Comment-Date: Thu, 20 Mar 2025 08:27:48 +0000 Gerrit-HasComments: Yes