Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................

Patch Set 8:

(12 comments)

The decoding logic looks good to me; my new comments focus mainly on memory management.

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@19
PS8, Line 19: impelents
typo

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@33
PS8, Line 33: 4. Test complex types
Complex types do not seem relevant here, as they are only supported in Parquet/ORC.

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@35
PS8, Line 35: 5. Test per-partition cases
It doesn't seem that critical to me to test a different encoding per partition - generally we should test that per-partition serde properties are respected, but it is not necessary to do that for each individual serde property.

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@37
PS8, Line 37: 6. Juxtapose java vs boost encodings supported.
This could be done by exposing a C++ function that checks compatibility by encoding some string with the given encoding. In practice I don't think that anyone would use an encoding that is not supported on one of the two sides, so I am OK with skipping this one. If someone uses an encoding that gets through Java but is rejected in C++, the current code returns a proper error.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc@162
PS8, Line 162: RETURN_IF_ERROR(encoder_->EncodeBuffer(
             :     rowbatch_stringstream_.str(), rowbatch_string));
We don't need to care about splitting multibyte characters because flush only happens after full lines, right? A comment could be added about this.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@532
PS8, Line 532: *eosr = stream_->eosr();
What will happen with decoder_ if it is the end of the stream but the last characters could not be decoded? If I understand things correctly, the remaining bytes will stay in partial_symbol_ and be ignored. I think we should return a decode error in this case.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@546
PS8, Line 546: RETURN_IF_ERROR(decoder_->DecodeBuffer(reinterpret_cast<uint8_t**>(&byte_buffer_ptr_),
             :     &byte_buffer_read_size_));
optional: to me it looks clearer to add two separate output parameters instead of reusing the same ptr+int pair for both input and output.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h@47
PS8, Line 47: char tuple_delim_;
This could be const IMO. If the line delimiter could change, the question of multibyte characters would come up again, so char may not be enough to store it.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@36
PS8, Line 36: DecodeBuffer
It would be very nice to have a comment that describes the steps of this function, as it is quite complicated. I would also consider extracting a few smaller functions, though I don't have exact ideas for the interfaces; one possible shape is sketched below.
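Purely to illustrate what I mean - all names below are invented, and the std::string-based interfaces are just for the sketch (the real code works on uint8_t buffers from a pool), so treat this as one possible decomposition, not a concrete proposal:

#include <cstddef>
#include <string>
#include <utility>

#include <boost/locale/encoding.hpp>

namespace sketch {

// Bytes of an incomplete multibyte symbol carried over between calls.
struct DecodeState {
  std::string partial_symbol;
};

// Step 1: glue the carried-over bytes in front of the newly read raw buffer.
std::string PrependPartial(DecodeState* state, const char* buf, std::size_t len) {
  std::string joined;
  joined.swap(state->partial_symbol);
  joined.append(buf, len);
  return joined;
}

// Step 2: cut 'buf' after the last tuple delimiter; the tail (which may end
// in an incomplete symbol) is carried over to the next call.
void SplitAtLastDelim(DecodeState* state, std::string* buf, char delim) {
  std::size_t pos = buf->rfind(delim);
  if (pos == std::string::npos) {
    // No complete tuple yet: carry the whole buffer over.
    state->partial_symbol = std::move(*buf);
    buf->clear();
    return;
  }
  state->partial_symbol.assign(*buf, pos + 1, std::string::npos);
  buf->resize(pos + 1);
}

// Step 3: convert the complete prefix to UTF-8, failing on invalid bytes
// (boost::locale::conv::stop throws on input that cannot be decoded).
std::string ToUtf8(const std::string& buf, const std::string& encoding) {
  return boost::locale::conv::to_utf<char>(
      buf, encoding, boost::locale::conv::stop);
}

}  // namespace sketch

The point is mainly that each step gets a name and a one-line contract, which would also double as the step-by-step comment asked for above.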
http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@47
PS8, Line 47:
Can you add DCHECK_LT(partial_symbol_.size(), MAX_SYMBOL)?

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77
PS8, Line 77: result_core = boost::locale::conv::to_utf<char>(
            :     (char*)buf_start, (char*)last_delim + 1, encoding_, boost::locale::conv::stop);
This can make quite a large untracked memory allocation. MemTracker's Consume/TryConsume/Release can be used to notify Impala's memory management about memory used by libraries that do not go through Impala's memory functions. While we do not know the size needed for the UTF-8 buffer, a generally safe estimate could work, e.g. 2*encode_buffer_size (see the sketch at the end of this message).

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113
PS8, Line 113:
We could release the decompressed buffer's memory at this point if we didn't allocate the decoded buffer from the same pool. This mainly matters if the decompression buffer is large: for example, a 256 MB decompressed buffer leads to 768 MB peak memory usage per scanner, and 512 MB after result_core is released. Impala can hold on to this memory for a long time, as it won't be released until the last row batch of 1024 rows is extracted from it. As this feature is probably not used that often, I don't insist on optimizing it, but a follow-up JIRA could be created to streamline the memory management. Note that I don't know whether compressions for which Impala doesn't support streaming (snappy, lz4, ...) are actually used for compressed text files. So far I have only met gzip, where streaming is supported, so the decoded blocks won't be that large.
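To make the tracking suggestion for line 77 concrete, here is a minimal sketch of the pattern. MemTrackerLike is only a stand-in for Impala's MemTracker (just the TryConsume/Release calls mentioned above are modeled, and this toy version never rejects a reservation), DecodeTracked is an invented name, and the 2x reservation is the rough estimate suggested above:

#include <atomic>
#include <cstdint>
#include <stdexcept>
#include <string>

#include <boost/locale/encoding.hpp>

// Toy stand-in for impala::MemTracker.
class MemTrackerLike {
 public:
  bool TryConsume(int64_t bytes) { consumed_ += bytes; return true; }
  void Release(int64_t bytes) { consumed_ -= bytes; }
  int64_t consumed() const { return consumed_; }
 private:
  std::atomic<int64_t> consumed_{0};
};

// Decode [begin, end) to UTF-8 while keeping the tracker informed. The exact
// UTF-8 size is unknown up front, so reserve a conservative 2x estimate and
// correct it once the conversion has finished.
std::string DecodeTracked(const char* begin, const char* end,
                          const std::string& encoding, MemTrackerLike* tracker) {
  const int64_t estimate = 2 * (end - begin);
  if (!tracker->TryConsume(estimate)) {
    throw std::runtime_error("memory limit exceeded");  // real code: return a Status
  }
  std::string utf8 = boost::locale::conv::to_utf<char>(
      begin, end, encoding, boost::locale::conv::stop);
  // Adjust the reservation to the actual buffer size; in this toy class a
  // negative Release acts as a Consume if the estimate was too small.
  tracker->Release(estimate - static_cast<int64_t>(utf8.capacity()));
  return utf8;  // caller should Release(utf8.capacity()) when the buffer dies
}

In the real code the error path would of course return a Status instead of throwing, and the final Release would happen when the pool holding the decoded buffer is freed.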