Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................


Patch Set 7:

(2 comments)

Changes:
1. Fixed split symbols by storing partial symbol.
2. Added self generating tests for arbitrarily large volumes.
3. Improved 'alter table' analysis to check for current line.delim instead of just '\n'.
4. Changed encodingValue from required to optional
5. Bugfixing.

http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/4/be/src/exec/text/hdfs-text-scanner.cc@545
PS4, Line 545: r) {
> Good point, although I'm not sure HdfsTextScanner doesn't handle this on a
Implemented the 1st option for this. Applied heuristic to find split symbol at the beginning and at the end of the buffer, to avoid copying.


http://gerrit.cloudera.org:8080/#/c/22049/6/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/22049/6/common/thrift/CatalogObjects.thrift@359
PS6, Line 359:   9: optional string encodingValue
> Please use 'optional' instead of 'required'
Ack



-- 
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 7
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Comment-Date: Sun, 26 Jan 2025 23:11:05 +0000
Gerrit-HasComments: Yes
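
For readers unfamiliar with the "split symbol" problem mentioned in the change list: when a multi-byte character straddles two read buffers, the trailing bytes of the first buffer cannot be decoded on their own. The following is only a minimal sketch of the general idea of holding back a partial symbol, not the code from this patch. It assumes UTF-8 input for concreteness (the change itself targets arbitrary encodings), and the names `IncompleteUtf8TailLen` and `ChunkStitcher` are hypothetical. Unlike the patch, which states that it avoids copying, this sketch copies for simplicity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the Impala patch): returns how many bytes at
// the end of `buf` belong to an incomplete UTF-8 sequence, or 0 if the buffer
// ends on a character boundary. Assumes otherwise well-formed UTF-8.
std::size_t IncompleteUtf8TailLen(const std::string& buf) {
  const std::size_t n = buf.size();
  // A UTF-8 sequence is at most 4 bytes long, so look back at most 4 bytes.
  for (std::size_t back = 1; back <= 4 && back <= n; ++back) {
    const unsigned char c = static_cast<unsigned char>(buf[n - back]);
    if ((c & 0xC0) == 0x80) continue;  // continuation byte: keep looking back
    // `c` is a lead byte (or ASCII); work out the full sequence length.
    std::size_t need = 1;
    if ((c & 0xE0) == 0xC0) need = 2;
    else if ((c & 0xF0) == 0xE0) need = 3;
    else if ((c & 0xF8) == 0xF0) need = 4;
    // If the sequence needs more bytes than are present, the tail is partial.
    return need > back ? back : 0;
  }
  return 0;  // empty buffer, or only continuation bytes: nothing to hold back
}

// Hypothetical wrapper: feeds chunks in order and returns only complete
// characters, carrying any partial tail over to the next call.
class ChunkStitcher {
 public:
  std::string Feed(const std::string& chunk) {
    std::string data = carry_ + chunk;  // prepend the stored partial symbol
    const std::size_t tail = IncompleteUtf8TailLen(data);
    carry_ = data.substr(data.size() - tail);  // store the new partial symbol
    data.resize(data.size() - tail);
    return data;
  }

 private:
  std::string carry_;  // bytes of a symbol split across buffer boundaries
};
```

With this sketch, feeding a chunk that ends in the first two bytes of a three-byte character returns only the text before it, and the character is emitted whole once the next chunk supplies its final byte.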