Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )

Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 17:

(19 comments)

Overall looks good. Found only minor issues.

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h@324
PS17, Line 324:
nit: extra space. Same goes for next line.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h
File be/src/exec/hdfs-text-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h@93
PS17, Line 93: boost::scoped_ptr
We prefer std::unique_ptr over scoped_ptr


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@56
PS17, Line 56: HdfsTextTableWriter::~HdfsTextTableWriter() {
             : }
Why did you move the empty destructor to the .cc file?


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@62
PS17, Line 62: auto
No need to copy the string, it could be const auto&


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@64
PS17, Line 64: NULL
nit: we prefer nullptr over NULL


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@161
PS17, Line 161: if (encoder_.get() != nullptr)
This could be just 'if (encoder_)'


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.h
File be/src/exec/text/hdfs-text-scanner.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.h@272
PS17, Line 272: boost::scoped_ptr
We prefer std::unique_ptr over boost:scoped_ptr


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@253
PS17, Line 253: auto
This could be const auto&


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@545
PS17, Line 545: if (decoder_.get() != nullptr)
nit: Could be 'if (decoder_)'


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@548
PS17, Line 548: decompressor_.get() != nullptr
nit: 'decompressor_ == nullptr' should be enough. Same goos for L749.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h@402
PS17, Line 402: std::string
Do you need to return a new object? If not, it could return const std::string&


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@21
PS17, Line 21: #include <boost/scoped_ptr.hpp>
Unused include


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@41
PS17, Line 41:
nit: redundant space


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@63
PS17, Line 63: buf_end - buf_start
nit: *bytes_read


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@86
PS17, Line 86: *buffer
This should be 'out_buffer_'.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@104
PS17, Line 104:     std::vector<uint8_t> prefix = partial_symbol_;
You could reserve memory after this line to decrease number of re-allocations 
and copyings:

 std::vector<uint8_t> prefix;
 prefix.reserve(partial_symbol_.size() + buf_end - *buf_start);
 prefix.assign(partial_symbol_.begin(), partial_symbol_.end();


http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@940
PS17, Line 940:
nit: indentation is off


http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@29
PS17, Line 29: def get_table_loc(db, tbl_name):
             :   return '%s/%s.db/%s/' % (WAREHOUSE, db, tbl_name,)
We should probably move this to ImpalaTestSuite. We have a similar function in 
test_iceberg.py as well.


http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@34
PS17, Line 34: problematic_symbols_hiragana = {0x3040, 0x3094, 0x3095, 0x3096, 
0x3097, 0x3098, 0x3099,
Please add comment why they are problematic.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 17
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Fri, 25 Apr 2025 12:56:05 +0000
Gerrit-HasComments: Yes

Reply via email to