[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Csaba Ringhofer (Code Review) Thu, 20 Mar 2025 01:28:13 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 14:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/14//COMMIT_MSG@22
PS14, Line 22: confusion of allowing both serde and tbl properties.
Can you mention IMPALA-13748?


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/runtime/descriptors.h@452
PS14, Line 452:
nit: extra line


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@40
PS14, Line 40:  ,
nit: extra ,


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@47
PS14, Line 47: progressivly
nit: progressively


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@57
PS14, Line 57: MemTrackerHelper
this could be mover to mem_tracker.h, e.g. as MemTracker::ScopedAllocation


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.h@80
PS14, Line 80:   /// Delimiter used to separate tuples.
             :   const char tuple_delim_;
             : 
             :   /// Buffer to hold the partial symbol that could not be 
decoded in the previous call to
             :   /// DecodeBuffer.
             :   std::vector<uint8_t> partial_symbol_;
             :
             :   /// Can we reuse the output buffer or do we need to allocate 
on each call?
             :   bool reuse_buffer_;
             :
             :   /// Buffer to hold transformed data.
             :   /// Either passed from the caller or allocated from 
memory_pool_.
             :   uint8_t* out_buffer_ = nullptr;
             :
             :   /// Length of the output buffer.
             :   int64_t buffer_length_ = 0;
             : };
These are only used by DecodeBuffer - can you add a comment about that, e.g. 
the following member are only used by decoding, not encoding?


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77
PS8, Line 77:       return 
memory_pool_->mem_tracker()->MemLimitExceeded(nullptr, details, *bytes_read);
            :     }
> Agree, added TryConsume() within a helper class to secure Release() for eve
I think that the current estimates are good enough.


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113
PS8, Line 113:     partial_symbol_.clear();
> Agree, but instead of releasing here, I employed the mechanism from Decompr
I think that we are still suboptimal here - for example if I understand 
correctly in case of stream compression + !reuse_buffer we'll attach both the 
decompressed buffer and the decoded buffer to the pool, while the the 
decompressed buffer is no longer used. Probably the best would be to clean this 
up in a follow up commit.

>Potential problem is a significant exceeding of row_batch's limit

My understanding is that we can't be much smarter here - for example if 
decompression leads to a very large uncompressed block we need to attach it to 
the row batch's pool.
A way to solve it would be to copy string from the buffer in case the buffer is 
very large.


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@25
PS14, Line 25: #include "gutil/strings/substitute.h"
nit: usually includes to Impala headers are sorted alphabetically


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@50
PS14, Line 50:     }
Both branches mean that out_buffer_ is no longer usable, right? I think that it 
would be clearer to set it to nullptr and check only for out_buffer_ == nullptr 
in line 71.


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@57
PS14, Line 57: (buf_end - buf_start) * 2
Can you add a comment about reason for this size?


http://gerrit.cloudera.org:8080/#/c/22049/14/be/src/util/char-codec.cc@71
PS14, Line 71:   if (!reuse_buffer_ || out_buffer_ == nullptr) {
             :     buffer_length_ = std::max(STREAM_OUT_BUF_SIZE, *bytes_read);
Isn't it possible for out_buffer_ to be set but not large enough? For example 
if there are two compressed blocks and the uncompressed size of the second is 
smaller than the first's.

I think that we should check for out_buffer_ == nullptr || buffer_length_ < 
*bytes_read and free the old buffer it is not large enough (see also my comment 
at line 50).

I also think that std::max(STREAM_OUT_BUF_SIZE, *bytes_read); can lead to 
allocating too much memory, for example in case of small files or when using 
stream compression - if reuse_buffer_ is false, than simply allocating 
*bytes_read seems the best. If reuse_buffer_  is true, then it may make sense 
to overallocate to avoid multiple allocations when *bytes_read is growing, but 
using *bytes_read looks also good and makes the code simpler.


http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/14/tests/query_test/test_charcodec.py@77
PS14, Line 77: lambda v: v.get_value('table_format').file_format == 'text'
Is there any test for SequenceFile?



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 14
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Comment-Date: Thu, 20 Mar 2025 08:27:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Reply via email to