Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21131 )

Change subject: IMPALA-11499: Refactor UrlEncode function to handle special 
characters
......................................................................


Patch Set 15:

(2 comments)

Thanks Pranav! In the meantime I discovered something else, but that shouldn't 
be too hard to fix.

http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc
File be/src/util/coding-util.cc:

http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc@66
PS15, Line 66:       ss << '%' << static_cast<uint32_t>(ch);
I played with it a bit and it doesn't work with for example '\t' and '\n' (for 
example
  insert into my_part_tbl partition(p='a\tt') values (0);
).
Let's take '\t'. Its value is 9, which only has a single hexadecimal digit. Now 
we'll escape it as "%9" but Hive has "%09". To match Hive's behaviour, do this:
1. Add this include in the appropriate include group: #include <iomanip>
2. Add "<< setfill('0')" after "hex" on L58.
3. Add "<< setw(2)" after inserting the % sign on L66.

This will pad hex values so that they are always at least 2 chars wide.
Note that setw is reset after insertion, so it has to be within the loop.

This also reveals a test deficiency: we should ideally test every character 
that is in 'SpecialCharacters'. I don't know if we can insert arbitrary byte 
values in Impala strings, maybe Quanlong can help there. But even if that's not 
possible, we should cover the ones for which  an escape sequence exists (i.e. 
'\t', '\n', '\v' etc.).


http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc@66
PS15, Line 66: ch
I think it would be good if we first cast it to unsigned char because of what 
happens if we encounter a character > 127. This is possible if the input 
contains one, and we are in non-Hive-compat mode.

If char is a signed type (that is implementation dependent), then the value > 
127 will be interpreted as a negative number. When we convert it to uint32_t, 
it will be sign-extended. For example, instead of "%FF" we will have 
"%FFFFFFFF". If we convert it first to unsigned char, it will not be 
sign-extended.



--
To view, visit http://gerrit.cloudera.org:8080/21131
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Gerrit-Change-Number: 21131
Gerrit-PatchSet: 15
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Tue, 30 Apr 2024 14:01:31 +0000
Gerrit-HasComments: Yes

Reply via email to