Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@7
PS2, Line 7: 19: Sup
> Frankly, at this point it's probably easier to simply add support for writi
I am OK with implementing writing too; it doesn't look hard and would make
loading test data easier (though it is still useful to test Hive interop +
fixed text files).

> add a way to completely ignore serialization.encoding, both during reading and
> writing
This still seems useful to me; it would be simple to add a flag that ignores
the encoding.


http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@23
PS2, Line 23: logic simpler, on the negative side it renders Hive files not readable
> I haven't found the concrete code for this in Hive yet. I was able to debug
We discussed this offline with Mihaly, and the current consensus is to
implement this only for encodings where the Hive behavior seems OK, i.e. to
skip UTF-16/UTF-32. I tried exporting to CSV with OpenOffice: when the
encoding is UTF-16, it also encodes the newline characters (and adds no BOM),
so Hive won't read the file correctly. AFAIK Excel works the same way. So I am
fairly certain that HIVE-14846 is a bug, but I would still prefer to avoid
implementing a solution that is incompatible with Hive, so it seems better to
just give an error for encodings where \n and \r are encoded differently than
in ASCII (see the byte-level sketch at the end of this message).


http://gerrit.cloudera.org:8080/#/c/22049/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/22049/3/common/thrift/CatalogObjects.thrift@527
PS3, Line 527: encodingValue
This should be stored in THdfsStorageDescriptor together with the other SERDE
properties:
https://github.com/apache/impala/blob/2085edbe1cdeb6e01b14b83113b5bc5b9e70f86f/common/thrift/CatalogObjects.thrift#L350
That would also make it possible to set the encoding per partition.


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py
File tests/query_test/test_decoding.py:

http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@36
PS3, Line 36: CREATE TABLE
It would be nice to also create a table during data load:
https://github.com/apache/impala/blob/master/testdata/datasets/functional/functional_schema_template.sql
The DEPENDENT_LOAD_HIVE section can be used to run DML statements in Hive.


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@49
PS3, Line 49: prepare_test_table('gbk_names', 'gbk_names.txt', 'GBK')
These small tables are useful for testing the encodings, but a bigger table
would provide more coverage: if all rows are returned in a single row batch,
some parts of the memory handling may never get executed (see the generator
sketch at the end of this message).


--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 3
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Comment-Date: Tue, 19 Nov 2024 12:50:56 +0000
Gerrit-HasComments: Yes
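
Sketch referenced from the COMMIT_MSG@23 comment: a minimal standalone Python 3
illustration (not Impala code; Python 3.8+ for bytes.hex(sep), and the listed
encodings are just examples) of why a delimiter scan that looks for the ASCII
byte 0x0A breaks once the encoding re-encodes '\n' itself:

    # How '\n' is encoded in a few character sets. A scanner searching for the
    # single byte 0x0A will miss the UTF-16/UTF-32 delimiters entirely.
    for enc in ("ascii", "gbk", "utf-8", "utf-16-le", "utf-32-le"):
        print(enc, "\n".encode(enc).hex(" "))
    # ascii      0a
    # gbk        0a
    # utf-8      0a
    # utf-16-le  0a 00
    # utf-32-le  0a 00 00 00

    # The reverse problem also exists: a non-newline character whose UTF-16
    # bytes happen to contain 0x0A, which an ASCII-oriented scan would treat
    # as a line break.
    print("\u010a".encode("utf-16-le").hex(" "))  # 0a 01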
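
Sketch referenced from the test_decoding.py@49 comment: one way to get a bigger
table is to generate the input file from the test itself. This is only a rough
idea; the file name, row count and the write_big_gbk_file() helper are made up
here, only prepare_test_table() comes from the patch:

    def write_big_gbk_file(path, num_rows=100000):
        """Write a GBK-encoded, tab-separated file with enough rows that the
        scanner has to produce several row batches instead of just one."""
        with open(path, "wb") as f:
            for i in range(num_rows):
                # Two columns: an integer id and a short Chinese string.
                line = u"%d\t\u5f20\u4e09_%d\n" % (i, i)
                f.write(line.encode("gbk"))

    # Hypothetical usage, mirroring the existing small-table helper:
    #   write_big_gbk_file("/tmp/gbk_names_big.txt")
    #   self.prepare_test_table('gbk_names_big', 'gbk_names_big.txt', 'GBK')
    # then verify count(*) and a few spot-checked rows against the generator.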