Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................

Patch Set 8:

(12 comments)

The decoding logic looks good to me; my new comments focus mainly on memory management.

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@19
PS8, Line 19: impelents
typo

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@33
PS8, Line 33: 4. Test complex types
Complex types do not seem relevant here, as they are only supported in Parquet/ORC.

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@35
PS8, Line 35: 5. Test per-partition cases
It doesn't seem that critical to me to test a different encoding per partition - generally we should test that per-partition serde properties are respected, but it is not necessary to do that for each individual serde property.

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@37
PS8, Line 37: 6. Juxtapose java vs boost encodings supported.
This could be done by exposing a C++ function that checks compatibility by encoding some string with the given encoding. In practice I don't think that anyone would use an encoding that is not supported on one of the two sides, so I am OK with skipping this one. If someone uses an encoding that gets through Java but is rejected in C++, the current code returns a proper error.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc@162
PS8, Line 162: RETURN_IF_ERROR(encoder_->EncodeBuffer(
             :     rowbatch_stringstream_.str(), rowbatch_string));
We don't need to care about splitting multibyte characters because flush only happens after full lines, right? A comment could be added about this.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@532
PS8, Line 532: *eosr = stream_->eosr();
What will happen with decoder_ if it is the end of the stream but the last characters could not be decoded? If I understand things correctly, the remaining bytes will stay in partial_symbol_ and be ignored. I think we should return a decode error in this case.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@546
PS8, Line 546: RETURN_IF_ERROR(decoder_->DecodeBuffer(reinterpret_cast<uint8_t**>(&byte_buffer_ptr_),
             :     &byte_buffer_read_size_));
optional: to me it looks clearer to add two separate output parameters instead of reusing the same ptr+int pair for both input and output.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h@47
PS8, Line 47: char tuple_delim_;
This could be const IMO. If the line delimiter could change, the question of multibyte characters would come up again, so char may not be enough to store it.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@36
PS8, Line 36: DecodeBuffer
It would be very nice to have a comment that describes the steps of this function, as it is quite complicated. I would also consider extracting a few smaller functions, though I don't have exact ideas for the interfaces; one possible shape is sketched below.
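Purely to illustrate what I mean - all names below are invented, and the std::string-based interfaces are just for the sketch (the real code works on uint8_t buffers from a pool), so treat this as one possible decomposition, not a concrete proposal:

#include <cstddef>
#include <string>
#include <utility>

#include <boost/locale/encoding.hpp>

namespace sketch {

// Bytes of an incomplete multibyte symbol carried over between calls.
struct DecodeState {
  std::string partial_symbol;
};

// Step 1: glue the carried-over bytes in front of the newly read raw buffer.
std::string PrependPartial(DecodeState* state, const char* buf, std::size_t len) {
  std::string joined;
  joined.swap(state->partial_symbol);
  joined.append(buf, len);
  return joined;
}

// Step 2: cut 'buf' after the last tuple delimiter; the tail (which may end
// in an incomplete symbol) is carried over to the next call.
void SplitAtLastDelim(DecodeState* state, std::string* buf, char delim) {
  std::size_t pos = buf->rfind(delim);
  if (pos == std::string::npos) {
    // No complete tuple yet: carry the whole buffer over.
    state->partial_symbol = std::move(*buf);
    buf->clear();
    return;
  }
  state->partial_symbol.assign(*buf, pos + 1, std::string::npos);
  buf->resize(pos + 1);
}

// Step 3: convert the complete prefix to UTF-8, failing on invalid bytes
// (boost::locale::conv::stop throws on input that cannot be decoded).
std::string ToUtf8(const std::string& buf, const std::string& encoding) {
  return boost::locale::conv::to_utf<char>(
      buf, encoding, boost::locale::conv::stop);
}

}  // namespace sketch

The point is mainly that each step gets a name and a one-line contract, which would also double as the step-by-step comment asked for above.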
http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@47
PS8, Line 47:
Can you add DCHECK_LT(partial_symbol_.size(), MAX_SYMBOL)?

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77
PS8, Line 77: result_core = boost::locale::conv::to_utf<char>(
            :     (char*)buf_start, (char*)last_delim + 1, encoding_, boost::locale::conv::stop);
This can make quite a large untracked memory allocation. MemTracker's Consume/TryConsume/Release can be used to notify Impala's memory management about memory used by libraries that do not go through Impala's memory functions. While we do not know the size needed for the UTF-8 buffer, a generally safe estimate could work, e.g. 2*encode_buffer_size (see the sketch at the end of this message).

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113
PS8, Line 113:
We could release the decompressed buffer's memory at this point if we didn't allocate the decoded buffer from the same pool. This mainly matters if the decompression buffer is large: for example, a 256 MB decompressed buffer leads to 768 MB peak memory usage per scanner, and 512 MB after result_core is released. Impala can hold on to this memory for a long time, as it won't be released until the last row batch of 1024 rows is extracted from it. As this feature is probably not used that often, I don't insist on optimizing it, but a follow-up JIRA could be created to streamline the memory management. Note that I don't know whether compressions for which Impala doesn't support streaming (snappy, lz4, ...) are actually used for compressed text files. So far I have only met gzip, where streaming is supported, so the decoded blocks won't be that large.
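To make the tracking suggestion for line 77 concrete, here is a minimal sketch of the pattern. MemTrackerLike is only a stand-in for Impala's MemTracker (just the TryConsume/Release calls mentioned above are modeled, and this toy version never rejects a reservation), DecodeTracked is an invented name, and the 2x reservation is the rough estimate suggested above:

#include <atomic>
#include <cstdint>
#include <stdexcept>
#include <string>

#include <boost/locale/encoding.hpp>

// Toy stand-in for impala::MemTracker.
class MemTrackerLike {
 public:
  bool TryConsume(int64_t bytes) { consumed_ += bytes; return true; }
  void Release(int64_t bytes) { consumed_ -= bytes; }
  int64_t consumed() const { return consumed_; }
 private:
  std::atomic<int64_t> consumed_{0};
};

// Decode [begin, end) to UTF-8 while keeping the tracker informed. The exact
// UTF-8 size is unknown up front, so reserve a conservative 2x estimate and
// correct it once the conversion has finished.
std::string DecodeTracked(const char* begin, const char* end,
                          const std::string& encoding, MemTrackerLike* tracker) {
  const int64_t estimate = 2 * (end - begin);
  if (!tracker->TryConsume(estimate)) {
    throw std::runtime_error("memory limit exceeded");  // real code: return a Status
  }
  std::string utf8 = boost::locale::conv::to_utf<char>(
      begin, end, encoding, boost::locale::conv::stop);
  // Adjust the reservation to the actual buffer size; in this toy class a
  // negative Release acts as a Consume if the estimate was too small.
  tracker->Release(estimate - static_cast<int64_t>(utf8.capacity()));
  return utf8;  // caller should Release(utf8.capacity()) when the buffer dies
}

In the real code the error path would of course return a Status instead of throwing, and the final Release would happen when the pool holding the decoded buffer is freed.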