[Impala-ASF-CR] IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Csaba Ringhofer (Code Review) Tue, 12 Nov 2024 01:39:08 -0800

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................


Patch Set 2:

(3 comments)

Added some high level comments. The code looks good, but there are some 
decisions to be made about the feature that will affect the implementation.

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@7
PS2, Line 7: Support
The patch only adds read support, this could be highlighted in the title.

It is also a question what to do when someone tries to insert to a table where 
serialization.encoding is set - currently Impala won't respect the encoding. My 
suggestion is the following:
- by default give an error when trying to write a table where encoding is set
- add a way to completely ignore serialization.encoding, both during reading 
and writing - not sure which granularity would be the best (flag, query option, 
table property), at first a flag would be the easiest to implement


http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@9
PS2, Line 9: [Draft]
In Impala project the non-final status of a patch is usually marked in the 
title, e.g. to make it clear that it is work in progress: WIP MPALA-10319: 
Support ...


http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@23
PS2, Line 23: Note: although the described behavior allows keeping the range 
split
Can you add some pointers to the encoder logic in Hive? Are there comments in 
Hive code that explain why it was implemented this way?

It is not clear to me whether this is intentional behavior in Hive or they had 
some specific encodings in mind and forgot about multi byte encodings. My guess 
is that this is an oversight - the single character new line MAY make sense for 
splitting, but I see no valid reason behind including BOM in every line.

Found a Hive ticket that treats this as a bug: 
https://issues.apache.org/jira/browse/HIVE-14846 (there is no confirmation 
though from Hive people on whether that consider this a bug). Probably it would 
make sense to aim for encodings where the new line character is the same as in 
ascii.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 2
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Nov 2024 09:26:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Reply via email to