[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Csaba Ringhofer (Code Review) Mon, 16 Dec 2024 10:35:18 -0800

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc@545
PS4, Line 545: DecodeBuffer
Is it possible for the end of buffer to cut a multi-line character in two? In 
this case decoding the buffer would not succeed/give incorrect result.

I have two ideas on how to handle this (both rely on the line separator \n 
being single character):
1. find the last line separator in the buffer and save the remainder as a state 
of CharCodec and append it to the start of the next buffer
2. do the decoding per line instead of per buffer


http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc
File be/src/util/decoder.cc:

http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@45
PS4, Line 45: encoding_ == "UTF-8"
hdfs-text-scanner.cc doesn't create a decoder if encoding is UTF-8, so I think 
that this won't get executed


http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@51
PS4, Line 51: to_utf
Based on 
https://beta.boost.org/doc/libs/1_85_0/libs/locale/doc/html/group__Charset.html#ga2f2e2da92b8ca7041241f7f1082d5cf8
 this can throw exception, which would result in a crash in Impala. The 
possible exceptions should be caught and converted to error status.


http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@52
PS4, Line 52:       (char *)(*buffer + start), (char *)(*buffer + *bytes_read), 
encoding_);
There is also a "method_type how" optional parameter that decides what to do on 
decode error. The default is skipping the character, which seems error prone to 
me. It may be better to return an error in this case.


http://gerrit.cloudera.org:8080/#/c/22049/4/fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
File fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java:

http://gerrit.cloudera.org:8080/#/c/22049/4/fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java@233
PS4, Line 233:     // Extract encoding
             :     String encoding = DEFAULT_ENCODING;
             :     String encodingValue = 
parameters.get(serdeConstants.SERIALIZATION_ENCODING);
             :     if (encodingValue != null) {
             :       encoding = encodingValue;
             :     }
It is not clear to me, does the patch support using this this property as table 
property, serde property, or both? I saw examples for both among the tests, 
e.g.  stored as textfile tblproperties("serialization.encoding"="GBK"); vs 
ALTER TABLE {table_name} SET SERDEPROPERTIES('serialization.encoding'='GBK');

This decision should be also mentioned in the commit message, as it seems a 
difference compared to the Jira.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 4
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Comment-Date: Mon, 16 Dec 2024 18:29:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Reply via email to