Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21131 )

Change subject: IMPALA-11499: Refactor UrlEncode function to handle special 
characters
......................................................................


Patch Set 10:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc
File be/src/util/coding-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc@94
PS10, Line 94:   TestUrl("/home/impala/directory/", 
"%2Fhome%2Fimpala%2Fdirectory%2F", false);
             :   TestUrl("/home/impala/directory/", 
"%2Fhome%2Fimpala%2Fdirectory%2F", true);
Please add some tests like these for newly added marks like '\n', '\r', '^', 
etc.

EDIT: please include all the escaped characters.


http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc
File be/src/util/coding-util.cc:

http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@a40
PS10, Line 40:
             :
             :
             :
Can you keep this comment? We should still mention 
common/src/java/org/apache/hadoop/hive/common/FileUtils.java since the list 
would change there.

Also mention here that "Impala creates the partition directories using these 
encoded strings. So it's crucial to keep it consistent with Hive."


http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@45
PS10, Line 45: static const std::unordered_set<char> SpecialCharacters = {'"', 
'#', '\\', '*', '/',
             :     ':', '=', '?', '%', '^', '[', ']', '{', '\x7F'};
We are still missing several characters here, e.g. single quote, '\n' and '\r'. 
I think we should use the same list as Hive's. The list in Hive consists of 
control characters, i.e. ASCII characters from 0 to 31 and 127 (0x7F), and 
marks like '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '{', '[', ']', 
'^'.

Code snipper of Hive:

  static BitSet charToEscape = new BitSet(128);
  static {
    for (char c = 0; c < ' '; c++) {
      charToEscape.set(c);
    }

    /**
     * ASCII 01-1F are HTTP control characters that need to be escaped.
     * \u000A and \u000D are \n and \r, respectively.
     */
    char[] clist = new char[] {'\u0001', '\u0002', '\u0003', '\u0004',
        '\u0005', '\u0006', '\u0007', '\u0008', '\u0009', '\n', '\u000B',
        '\u000C', '\r', '\u000E', '\u000F', '\u0010', '\u0011', '\u0012',
        '\u0013', '\u0014', '\u0015', '\u0016', '\u0017', '\u0018', '\u0019',
        '\u001A', '\u001B', '\u001C', '\u001D', '\u001E', '\u001F',
        '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '\u007F', '{',
        '[', ']', '^'};

    for (char c : clist) {
      charToEscape.set(c);
    }

Note that Hive adds some characters twice. ASCII code of ' ' is 32. So in the 
first loop, 0-31 are already added.
In Java, char is a "16-bit integer" that represents a single 16-bit Unicode 
character. So Hive code uses unicodes like '\u007F'. But in C++, such unicodes 
are represented by UTF-8 bytes. And char in C++ is just 8 bits. So in the 
original code "\u00FF" is actually two chars.


http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test@452
PS10, Line 452: INSERT INTO TABLE insert_string_partitioned PARTITION(s2="_.~ 
+")
              : SELECT "value" FROM functional.alltypessmall LIMIT 1;
              : ---- RESULTS
              : s2=_.~ +: 1
              : ====
              : ---- QUERY
              : # select with unencoded partition key
              : SELECT * FROM insert_string_partitioned;
              : ---- RESULTS
              : 'value','_.~ +'
              : ---- TYPES
              : string, string
e2e tests for ASCII characters can be added here, e.g.

INSERT INTO TABLE insert_string_partitioned PARTITION(s2="O'Reilly") values 
('0');

Or use a string with all escaped marks.



--
To view, visit http://gerrit.cloudera.org:8080/21131
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Gerrit-Change-Number: 21131
Gerrit-PatchSet: 10
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Sun, 28 Apr 2024 02:12:16 +0000
Gerrit-HasComments: Yes

Reply via email to