Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 17: (18 comments) http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h@324 PS17, Line 324: > nit: extra space. Same goes for next line. ack http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h File be/src/exec/hdfs-text-table-writer.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h@93 PS17, Line 93: boost::scoped_ptr > We prefer std::unique_ptr over scoped_ptr yeah, good idea http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@56 PS17, Line 56: HdfsTextTableWriter::~HdfsTextTableWriter() { : } > Why did you move the empty destructor to the .cc file? After adding a scoped_ptr<CharCodec> to class, its destructor now requires a complete CharCodec. This persists for std::unique_ptr (but not for shared_ptr) and IIRC has to do with the implementation details of the unique_ptr's deleter. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@62 PS17, Line 62: auto > No need to copy the string, it could be const auto& Yes, good point. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@64 PS17, Line 64: NULL > nit: we prefer nullptr over NULL ack http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@161 PS17, Line 161: if (encoder_.get() != nullptr) > This could be just 'if (encoder_)' Agree, I just saw this pattern all over the code. However not in this particular file, so probably ok to change it here. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc File be/src/exec/text/hdfs-text-scanner.cc: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@253 PS17, Line 253: auto > This could be const auto& Ack http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@545 PS17, Line 545: if (decoder_.get() != nullptr) > nit: Could be 'if (decoder_)' In this and related modules it is scoped_ptr with this pattern everywhere. Even in this very function there is 'if (decompressor_.get() == nullptr)' above. What is the general approach in such cases, if any? To clarify all similar comments. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h@402 PS17, Line 402: std::string > Do you need to return a new object? If not, it could return const std::stri Agree, fixed http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h File be/src/util/char-codec.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@21 PS17, Line 21: #include <boost/scoped_ptr.hpp> > Unused include ack http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@41 PS17, Line 41: > nit: redundant space ack http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@63 PS17, Line 63: buf_end - buf_start > nit: *bytes_read ack http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@86 PS17, Line 86: *buffer > This should be 'out_buffer_'. Yeah, my bad. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@104 PS17, Line 104: std::vector<uint8_t> prefix = partial_symbol_; > You could reserve memory after this line to decrease number of re-allocatio Good idea, but I think it's even more simple: prefix.reserve(MAX_SYMBOL); http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@940 PS17, Line 940: > nit: indentation is off ack http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@29 PS17, Line 29: def get_table_loc(db, tbl_name): : return '%s/%s.db/%s/' % (WAREHOUSE, db, tbl_name,) > We should probably move this to ImpalaTestSuite. We have a similar function Actually there's a similar function already: 'ImpalaTestSuite::_get_table_location()', which is totally fine for this test. So I'll just use that one instead. http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@34 PS17, Line 34: problematic_symbols_hiragana = {0x3040, 0x3094, 0x3095, 0x3096, 0x3097, 0x3098, 0x3099, > Please add comment why they are problematic. boost::locale::conv() stambles on these and if I get it right strictly speaking they are not even part of the encoding. In any case it is not crucial for this test to use the exhaustive set of all possible symbols. I rewrote this part to avoid confusion. http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@44 PS17, Line 44: u > flake8: F821 undefined name 'unichr' @Zoltan, do you know how to fix this btw? -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 17 Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Sun, 27 Apr 2025 11:57:39 +0000 Gerrit-HasComments: Yes