[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters

Daniel Becker (Code Review) Mon, 29 Apr 2024 05:40:15 -0700

Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21131 )


Change subject: IMPALA-11499: Refactor UrlEncode function to handle special 
characters
......................................................................


Patch Set 11:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@9
PS11, Line 9: The
An error - we're introducing it now.


http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@14
PS11, Line 14: was matching with
matched one of the ...


http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@15
PS11, Line 15: '\u00FF'
'\u00FF' is not a valid character literal in C++. "\u00FF" was included in a 
string, it was part of a string literal. At last I found out that it can 
actually be used in C++ to specify arbitrary unicode values, but of course it 
may be encoded on multiple bytes, see 
https://en.cppreference.com/w/cpp/language/escape.

Also include in the commit message that including this character was probably a 
mistake from the beginning, it should have been '\x7F'.


http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@18
PS11, Line 18: '\xFF'
See the comment above, we should explain that it was a mistake from the 
beginning.


http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@22
PS11, Line 22: Some
It would be good to include in which file the new tests are.


http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@22
PS11, Line 22: for both parquet and iceberg
Parquet and Iceberg are not exclusive things, we also use Parquet in our 
Iceberg tables. Parquet is a file format and Iceberg is a table format. We 
could say "tests on both traditional Hive tables and Iceberg tables ...".


http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc
File be/src/util/coding-util.cc:

http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@25
PS11, Line 25: #include <boost/algorithm/string.hpp>
The includes within a group should be in alphabetical order, so this should go 
before L23.


http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@31
PS11, Line 31: #include "sasl/saslutil.h"
Includes from the standard library should be the first group after the header, 
i.e. this should come after #include <sstream>.


http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@51
PS11, Line 51: '\x20'
> Represents space character
For those characters that have a readable literal representation, e.g. " ", 
"\n" etc., we should use that even if Hive doesn't.


http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@57
PS11, Line 57: Used
It would be clearer like this: "uppercase" and "hex" only affect the insertion 
of integers, not that of char values.


http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@62
PS11, Line 62:     // Hive-compat mode, and the character is not alphanumeric 
or is one
This doesn't match the code. Try this:
"... b) we are not in Hive-compat mode and the character is not alphanumeric 
and it is not one of the
characters specifically excluded from escaping (see ShouldNotEscape())."


http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test
File 
testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test:

http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@302
PS11, Line 302: show tables;
I don't think it's needed. Also, if another table is added later before this 
query, this will have to be changed as well.


http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@309
PS11, Line 309: insert into unicode_partition_values partition(p='运营业务数据') 
values (0);
You don't need to put these inserts into separate QUERY segments. It could also 
go together with the select on L318.


http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@327
PS11, Line 327: drop table unicode_partition_values;
You don't need to drop the table, the tables are created in a unique database, 
see test_column_unicode.py::TestColumnUnicode::test_create_unicode_table.


http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@337
PS11, Line 337: show tables;
I don't think this is needed.


http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@344
PS11, Line 344: insert into unicode_partition_values_iceberg values (0, '运');
You should do the same insertions into the Iceberg table as into the other one.


http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@354
PS11, Line 354: drop table unicode_partition_values_iceberg;
See L327.



--
To view, visit http://gerrit.cloudera.org:8080/21131
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Gerrit-Change-Number: 21131
Gerrit-PatchSet: 11
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Mon, 29 Apr 2024 12:34:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters

Reply via email to