Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 5: (5 comments) 1. Fixed some of the comments. 2. First draft text encoding + encoding-decoding tests. 3. Restructured tests so far. http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc File be/src/exec/text/hdfs-text-scanner.cc: http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc@545 PS4, Line 545: DecodeBuffer > Is it possible for the end of buffer to cut a multi-line character in two? Good point, although I'm not sure HdfsTextScanner doesn't handle this on a different level. Yet there is a special case for line.delim being two bytes: \r\n. I'll try to fix this in the next patch. http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc File be/src/util/decoder.cc: http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@45 PS4, Line 45: > hdfs-text-scanner.cc doesn't create a decoder if encoding is UTF-8, so I th Removed. http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@51 PS4, Line 51: > Based on https://beta.boost.org/doc/libs/1_85_0/libs/locale/doc/html/group_ Agree, done. http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/util/decoder.cc@52 PS4, Line 52: > There is also a "method_type how" optional parameter that decides what to d Good point. http://gerrit.cloudera.org:8080/#/c/22049/4/fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java File fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java: http://gerrit.cloudera.org:8080/#/c/22049/4/fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java@233 PS4, Line 233: // Extract encoding : String encoding = DEFAULT_ENCODING; : String encodingValue = parameters.get(serdeConstants.SERIALIZATION_ENCODING); : if (encodingValue != null) { : encoding = encodingValue; : } > I saw examples for both among the tests I guess those were statements from HIVE_QUERY section, which support both. Updated to use only serdeproperties. > This decision should be also mentioned in the commit message, Agree, done. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 5 Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me> Gerrit-Comment-Date: Mon, 23 Dec 2024 00:08:55 +0000 Gerrit-HasComments: Yes