Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 )
Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters ...................................................................... Patch Set 6: (13 comments) http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@7 PS6, Line 7: IMPALA-11499: Refactor UrlEncode function to handle special We don't usually break the title into two lines, and it is not too long either. http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@12 PS6, Line 12: specialCharacterMap Nit: surround with quotes ('specialCharacterMap'). http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@18 PS6, Line 18: Additionally, the function clears the output string '*out' was assigned to also before this change, so its old contents were discarded. This sentence could be removed. http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc@48 PS3, Line 48: {'#', "%23"}, > Should be const. This comment has not been addressed. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@23 PS6, Line 23: #include <iostream> I don't think we need iostream. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@46 PS6, Line 46: std::unordered_map Could also be static. Alternatively, we could put all these static functions and variables into an anonymous namespace, which is a more modern way of ensuring that these are not visible outside this translation unit. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@55 PS6, Line 55: '\xFF The problem in the old code was that "\u00FF" translates to two bytes: [194,191]. Note that the notation '\uXXXX' is Java-specific. The process is broken because the code takes the input bytes one by one. The hanzi character '?', in UTF-8, is [232, 191, 144]. Its second character, 191, matched with the second character of "\u00FF" and that caused the problem. I don't know why '\xFF' was included in the first place. If we'd like to handle UTF-8 only, then '\xFF' is not a valid byte. If we'd like to handle non-UTF-8 bytes, then we should handle all other bytes > 127. After some digging in Hive, I found this commit that expanded the list of escaped characters: https://github.com/apache/hive/commit/32b046bf93e3d041b2bb07a1c9b4e1ef5d977ddf The list before that commit is very similar to what we have here in the old code, except they have '\u007F' instead of '\u00FF'. That makes more sense as that is a control character for DELETE, see https://www.compart.com/en/unicode/U+007F. If we remove '\xFF', we can see that the escaped values in the map are the same as how we HEX encode values on L85, so there is no need for these values to be included in a map. Instead, we could have a set (std::unordered_set) with the keys in the current map (without '\xFF' but including '\x7F'), and we would encode input characters if they are in the set (taking into account hive_compat of course). That would also simplify the code. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@66 PS6, Line 66: (*out).clear(); No need to clear it if it is assigned to at the end of the function. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@69 PS6, Line 69: for (int i = 0; i < in_len; ++i) { Could use a range-based for-loop: for (char ch : in) {...}. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@74 PS6, Line 74: is not : // alphanumeric or one of the commonly excluded characters Could be misleading: does "not" also apply to "one of the commonly..."? If not, we could write for example "is not alphanumeric or is one of the commonly excluded ...". http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@80 PS6, Line 80: !(std::isalnum(ch) || ShouldNotEscape(ch)) This is easier to understand if converted to a form without parentheses: !std::isalnum(ch) && !ShouldNotEscape(ch) http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@85 PS6, Line 85: std::uppercase << std::hex We could put std::uppercase and std::hex after L68, before the loop, as these settings don't get reset automatically, so no need to set them every time. On the other hand, I would include a comment stating that this only applies to when integer values are inserted and not when char values, therefore it doesn't cause a problem when not escaping characters (like on L88). http://gerrit.cloudera.org:8080/#/c/21131/3/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test File testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test: http://gerrit.cloudera.org:8080/#/c/21131/3/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@298 PS3, Line 298: my_part_tbl > A more descriptive name would be better, for example "unicode_partition_val This comment has not been addressed. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zihao Ye <[email protected]> Gerrit-Comment-Date: Fri, 26 Apr 2024 11:42:01 +0000 Gerrit-HasComments: Yes
