Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 2: (3 comments) Added some high level comments. The code looks good, but there are some decisions to be made about the feature that will affect the implementation. http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@7 PS2, Line 7: Support The patch only adds read support, this could be highlighted in the title. It is also a question what to do when someone tries to insert to a table where serialization.encoding is set - currently Impala won't respect the encoding. My suggestion is the following: - by default give an error when trying to write a table where encoding is set - add a way to completely ignore serialization.encoding, both during reading and writing - not sure which granularity would be the best (flag, query option, table property), at first a flag would be the easiest to implement http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@9 PS2, Line 9: [Draft] In Impala project the non-final status of a patch is usually marked in the title, e.g. to make it clear that it is work in progress: WIP MPALA-10319: Support ... http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@23 PS2, Line 23: Note: although the described behavior allows keeping the range split Can you add some pointers to the encoder logic in Hive? Are there comments in Hive code that explain why it was implemented this way? It is not clear to me whether this is intentional behavior in Hive or they had some specific encodings in mind and forgot about multi byte encodings. My guess is that this is an oversight - the single character new line MAY make sense for splitting, but I see no valid reason behind including BOM in every line. Found a Hive ticket that treats this as a bug: https://issues.apache.org/jira/browse/HIVE-14846 (there is no confirmation though from Hive people on whether that consider this a bug). Probably it would make sense to aim for encodings where the new line character is the same as in ascii. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 2 Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Comment-Date: Tue, 12 Nov 2024 09:26:46 +0000 Gerrit-HasComments: Yes