Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 17: (19 comments) Overall looks good. Found only minor issues. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h@324 PS17, Line 324: nit: extra space. Same goes for next line. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h File be/src/exec/hdfs-text-table-writer.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h@93 PS17, Line 93: boost::scoped_ptr We prefer std::unique_ptr over scoped_ptr http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@56 PS17, Line 56: HdfsTextTableWriter::~HdfsTextTableWriter() { : } Why did you move the empty destructor to the .cc file? http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@62 PS17, Line 62: auto No need to copy the string, it could be const auto& http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@64 PS17, Line 64: NULL nit: we prefer nullptr over NULL http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@161 PS17, Line 161: if (encoder_.get() != nullptr) This could be just 'if (encoder_)' http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.h File be/src/exec/text/hdfs-text-scanner.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.h@272 PS17, Line 272: boost::scoped_ptr We prefer std::unique_ptr over boost:scoped_ptr http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc File be/src/exec/text/hdfs-text-scanner.cc: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@253 PS17, Line 253: auto This could be const auto& http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@545 PS17, Line 545: if (decoder_.get() != nullptr) nit: Could be 'if (decoder_)' http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@548 PS17, Line 548: decompressor_.get() != nullptr nit: 'decompressor_ == nullptr' should be enough. Same goos for L749. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h@402 PS17, Line 402: std::string Do you need to return a new object? If not, it could return const std::string& http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h File be/src/util/char-codec.h: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@21 PS17, Line 21: #include <boost/scoped_ptr.hpp> Unused include http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@41 PS17, Line 41: nit: redundant space http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@63 PS17, Line 63: buf_end - buf_start nit: *bytes_read http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@86 PS17, Line 86: *buffer This should be 'out_buffer_'. http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@104 PS17, Line 104: std::vector<uint8_t> prefix = partial_symbol_; You could reserve memory after this line to decrease number of re-allocations and copyings: std::vector<uint8_t> prefix; prefix.reserve(partial_symbol_.size() + buf_end - *buf_start); prefix.assign(partial_symbol_.begin(), partial_symbol_.end(); http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@940 PS17, Line 940: nit: indentation is off http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@29 PS17, Line 29: def get_table_loc(db, tbl_name): : return '%s/%s.db/%s/' % (WAREHOUSE, db, tbl_name,) We should probably move this to ImpalaTestSuite. We have a similar function in test_iceberg.py as well. http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@34 PS17, Line 34: problematic_symbols_hiragana = {0x3040, 0x3094, 0x3095, 0x3096, 0x3097, 0x3098, 0x3099, Please add comment why they are problematic. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 17 Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Fri, 25 Apr 2025 12:56:05 +0000 Gerrit-HasComments: Yes