Mihaly Szjatinya has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )

Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 17:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-scanner.h@324
PS17, Line 324:
> nit: extra space. Same goes for next line.
ack


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h
File be/src/exec/hdfs-text-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.h@93
PS17, Line 93: boost::scoped_ptr
> We prefer std::unique_ptr over scoped_ptr
yeah, good idea


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@56
PS17, Line 56: HdfsTextTableWriter::~HdfsTextTableWriter() {
             : }
> Why did you move the empty destructor to the .cc file?
After adding a scoped_ptr<CharCodec> to class, its destructor now requires a 
complete CharCodec. This persists for std::unique_ptr (but not for shared_ptr) 
and IIRC has to do with the implementation details of the unique_ptr's deleter.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@62
PS17, Line 62: auto
> No need to copy the string, it could be const auto&
Yes, good point.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@64
PS17, Line 64: NULL
> nit: we prefer nullptr over NULL
ack


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/hdfs-text-table-writer.cc@161
PS17, Line 161: if (encoder_.get() != nullptr)
> This could be just 'if (encoder_)'
Agree, I just saw this pattern all over the code. However not in this 
particular file, so probably ok to change it here.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@253
PS17, Line 253: auto
> This could be const auto&
Ack


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/exec/text/hdfs-text-scanner.cc@545
PS17, Line 545: if (decoder_.get() != nullptr)
> nit: Could be 'if (decoder_)'
In this and related modules it is scoped_ptr with this pattern everywhere. Even 
in this very function there is 'if (decompressor_.get() == nullptr)' above.

What is the general approach in such cases, if any? To clarify all similar 
comments.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/runtime/descriptors.h@402
PS17, Line 402: std::string
> Do you need to return a new object? If not, it could return const std::stri
Agree, fixed


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@21
PS17, Line 21: #include <boost/scoped_ptr.hpp>
> Unused include
ack


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.h@41
PS17, Line 41:
> nit: redundant space
ack


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@63
PS17, Line 63: buf_end - buf_start
> nit: *bytes_read
ack


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@86
PS17, Line 86: *buffer
> This should be 'out_buffer_'.
Yeah, my bad.


http://gerrit.cloudera.org:8080/#/c/22049/17/be/src/util/char-codec.cc@104
PS17, Line 104:     std::vector<uint8_t> prefix = partial_symbol_;
> You could reserve memory after this line to decrease number of re-allocatio
Good idea, but I think it's even more simple:
prefix.reserve(MAX_SYMBOL);


http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

http://gerrit.cloudera.org:8080/#/c/22049/17/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@940
PS17, Line 940:
> nit: indentation is off
ack


http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@29
PS17, Line 29: def get_table_loc(db, tbl_name):
             :   return '%s/%s.db/%s/' % (WAREHOUSE, db, tbl_name,)
> We should probably move this to ImpalaTestSuite. We have a similar function
Actually there's a similar function already: 
'ImpalaTestSuite::_get_table_location()', which is totally fine for this test. 
So I'll just use that one instead.


http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@34
PS17, Line 34: problematic_symbols_hiragana = {0x3040, 0x3094, 0x3095, 0x3096, 
0x3097, 0x3098, 0x3099,
> Please add comment why they are problematic.
boost::locale::conv() stambles on these and if I get it right strictly speaking 
they are not even part of the encoding. In any case it is not crucial for this 
test to use the exhaustive set of all possible symbols. I rewrote this part to 
avoid confusion.


http://gerrit.cloudera.org:8080/#/c/22049/17/tests/query_test/test_charcodec.py@44
PS17, Line 44: u
> flake8: F821 undefined name 'unichr'
@Zoltan, do you know how to fix this btw?



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 17
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Apr 2025 11:57:39 +0000
Gerrit-HasComments: Yes

Reply via email to