Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21131 )

Change subject: IMPALA-11499: Refactor UrlEncode function to handle special 
characters
......................................................................


Patch Set 6:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@7
PS6, Line 7: IMPALA-11499: Refactor UrlEncode function to handle special
We don't usually break the title into two lines, and it is not too long either.


http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@12
PS6, Line 12: specialCharacterMap
Nit: surround with quotes ('specialCharacterMap').


http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@18
PS6, Line 18: Additionally, the function clears the output string
'*out' was assigned to also before this change, so its old contents were 
discarded. This sentence could be removed.


http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc
File be/src/util/coding-util.cc:

http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc@48
PS3, Line 48:     {'#', "%23"},
> Should be const.
This comment has not been addressed.


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc
File be/src/util/coding-util.cc:

http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@23
PS6, Line 23: #include <iostream>
I don't think we need iostream.


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@46
PS6, Line 46: std::unordered_map
Could also be static. Alternatively, we could put all these static functions 
and variables into an anonymous namespace, which is a more modern way of 
ensuring that these are not visible outside this translation unit.


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@55
PS6, Line 55: '\xFF
The problem in the old code was that "\u00FF" translates to two bytes: 
[194,191]. Note that the notation '\uXXXX' is Java-specific. The process is 
broken because the code takes the input bytes one by one.

The hanzi character '?', in UTF-8, is [232, 191, 144]. Its second character, 
191, matched with the second character of "\u00FF" and that caused the problem.

I don't know why '\xFF' was included in the first place. If we'd like to handle 
UTF-8 only, then '\xFF' is not a valid byte. If we'd like to handle non-UTF-8 
bytes, then we should handle all other bytes > 127.

After some digging in Hive, I found this commit that expanded the list of 
escaped characters: 
https://github.com/apache/hive/commit/32b046bf93e3d041b2bb07a1c9b4e1ef5d977ddf
The list before that commit is very similar to what we have here in the old 
code, except they have '\u007F' instead of '\u00FF'. That makes more sense as 
that is a control character for DELETE, see 
https://www.compart.com/en/unicode/U+007F.

If we remove '\xFF', we can see that the escaped values in the map are the same 
as how we HEX encode values on L85, so there is no need for these values to be 
included in a map. Instead, we could have a set (std::unordered_set) with the 
keys in the current map (without '\xFF' but including '\x7F'), and we would 
encode input characters if they are in the set (taking into account hive_compat 
of course). That would also simplify the code.


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@66
PS6, Line 66:   (*out).clear();
No need to clear it if it is assigned to at the end of the function.


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@69
PS6, Line 69:   for (int i = 0; i < in_len; ++i) {
Could use a range-based for-loop: for (char ch : in) {...}.


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@74
PS6, Line 74: is not
            :     // alphanumeric or one of the commonly excluded characters
Could be misleading: does "not" also apply to "one of the commonly..."? If not, 
we could write for example "is not alphanumeric or is one of the commonly 
excluded ...".


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@80
PS6, Line 80: !(std::isalnum(ch) || ShouldNotEscape(ch))
This is easier to understand if converted to a form without parentheses:

!std::isalnum(ch) && !ShouldNotEscape(ch)


http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@85
PS6, Line 85: std::uppercase << std::hex
We could put std::uppercase and std::hex after L68, before the loop, as these 
settings don't get reset automatically, so no need to set them every time. On 
the other hand, I would include a comment stating that this only applies to 
when integer values are inserted and not when char values, therefore it doesn't 
cause a problem when not escaping characters (like on L88).


http://gerrit.cloudera.org:8080/#/c/21131/3/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test
File 
testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test:

http://gerrit.cloudera.org:8080/#/c/21131/3/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@298
PS3, Line 298: my_part_tbl
> A more descriptive name would be better, for example "unicode_partition_val
This comment has not been addressed.



--
To view, visit http://gerrit.cloudera.org:8080/21131
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Gerrit-Change-Number: 21131
Gerrit-PatchSet: 6
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Fri, 26 Apr 2024 11:42:01 +0000
Gerrit-HasComments: Yes

Reply via email to