Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 4: (5 comments) http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc File be/src/exec/text/hdfs-text-scanner.cc: http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc@545 PS4, Line 545: DecodeBuffer Is it possible for the end of buffer to cut a multi-line character in two? In this case decoding the buffer would not succeed/give incorrect result. I have two ideas on how to handle this (both rely on the line separator \n being single character): 1. find the last line separator in the buffer and save the remainder as a state of CharCodec and append it to the start of the next buffer 2. do the decoding per line instead of per buffer http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc File be/src/util/decoder.cc: http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@45 PS4, Line 45: encoding_ == "UTF-8" hdfs-text-scanner.cc doesn't create a decoder if encoding is UTF-8, so I think that this won't get executed http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@51 PS4, Line 51: to_utf Based on https://beta.boost.org/doc/libs/1_85_0/libs/locale/doc/html/group__Charset.html#ga2f2e2da92b8ca7041241f7f1082d5cf8 this can throw exception, which would result in a crash in Impala. The possible exceptions should be caught and converted to error status. http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@52 PS4, Line 52: (char *)(*buffer + start), (char *)(*buffer + *bytes_read), encoding_); There is also a "method_type how" optional parameter that decides what to do on decode error. The default is skipping the character, which seems error prone to me. It may be better to return an error in this case. http://gerrit.cloudera.org:8080/#/c/22049/4/fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java File fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java: http://gerrit.cloudera.org:8080/#/c/22049/4/fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java@233 PS4, Line 233: // Extract encoding : String encoding = DEFAULT_ENCODING; : String encodingValue = parameters.get(serdeConstants.SERIALIZATION_ENCODING); : if (encodingValue != null) { : encoding = encodingValue; : } It is not clear to me, does the patch support using this this property as table property, serde property, or both? I saw examples for both among the tests, e.g. stored as textfile tblproperties("serialization.encoding"="GBK"); vs ALTER TABLE {table_name} SET SERDEPROPERTIES('serialization.encoding'='GBK'); This decision should be also mentioned in the commit message, as it seems a difference compared to the Jira. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 4 Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me> Gerrit-Comment-Date: Mon, 16 Dec 2024 18:29:48 +0000 Gerrit-HasComments: Yes