Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@7
PS2, Line 7: 19: Sup
> Frankly, at this point it's probably easier to simply add support for writi
I am OK with implementing writing too; it doesn't look hard and would make
loading test data easier (though it is still useful to test Hive interop +
fixed text files).

> add a way to completely ignore serialization.encoding, both during reading and
> writing
This still seems useful to me; it would be simple to add a flag that ignores
the encoding.


http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@23
PS2, Line 23: logic simpler, on the negative side it renders Hive files not readable
> I haven't found the concrete code for this in Hive yet. I was able to debug
We discussed this offline with Mihaly, and the current consensus is to
implement this only for encodings where the Hive behavior seems OK, i.e. to
skip UTF-16/UTF-32. I tried exporting to CSV with OpenOffice: when the
encoding is UTF-16, it also encodes the newline characters (and adds no BOM),
so Hive won't read the file correctly. AFAIK Excel works the same way. So I am
fairly certain that HIVE-14846 is a bug, but I would still prefer to avoid
implementing a solution that is incompatible with Hive, so it seems better to
just give an error for encodings where \n and \r are encoded differently than
in ASCII (see the byte-level sketch at the end of this message).


http://gerrit.cloudera.org:8080/#/c/22049/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/22049/3/common/thrift/CatalogObjects.thrift@527
PS3, Line 527: encodingValue
This should be stored in THdfsStorageDescriptor together with the other SERDE
properties:
https://github.com/apache/impala/blob/2085edbe1cdeb6e01b14b83113b5bc5b9e70f86f/common/thrift/CatalogObjects.thrift#L350
That would also make it possible to set the encoding per partition.


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py
File tests/query_test/test_decoding.py:

http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@36
PS3, Line 36: CREATE TABLE
It would be nice to also create a table during data load:
https://github.com/apache/impala/blob/master/testdata/datasets/functional/functional_schema_template.sql
The DEPENDENT_LOAD_HIVE section can be used to run DML statements in Hive.


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@49
PS3, Line 49: prepare_test_table('gbk_names', 'gbk_names.txt', 'GBK')
These small tables are useful for testing the encodings, but a bigger table
would provide more coverage: if all rows are returned in a single row batch,
some parts of the memory handling may never get executed (see the generator
sketch at the end of this message).


--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 3
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Comment-Date: Tue, 19 Nov 2024 12:50:56 +0000
Gerrit-HasComments: Yes
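
Sketch referenced from the COMMIT_MSG@23 comment: a minimal standalone Python 3
illustration (not Impala code; Python 3.8+ for bytes.hex(sep), and the listed
encodings are just examples) of why a delimiter scan that looks for the ASCII
byte 0x0A breaks once the encoding re-encodes '\n' itself:

    # How '\n' is encoded in a few character sets. A scanner searching for the
    # single byte 0x0A will miss the UTF-16/UTF-32 delimiters entirely.
    for enc in ("ascii", "gbk", "utf-8", "utf-16-le", "utf-32-le"):
        print(enc, "\n".encode(enc).hex(" "))
    # ascii      0a
    # gbk        0a
    # utf-8      0a
    # utf-16-le  0a 00
    # utf-32-le  0a 00 00 00

    # The reverse problem also exists: a non-newline character whose UTF-16
    # bytes happen to contain 0x0A, which an ASCII-oriented scan would treat
    # as a line break.
    print("\u010a".encode("utf-16-le").hex(" "))  # 0a 01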
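
Sketch referenced from the test_decoding.py@49 comment: one way to get a bigger
table is to generate the input file from the test itself. This is only a rough
idea; the file name, row count and the write_big_gbk_file() helper are made up
here, only prepare_test_table() comes from the patch:

    def write_big_gbk_file(path, num_rows=100000):
        """Write a GBK-encoded, tab-separated file with enough rows that the
        scanner has to produce several row batches instead of just one."""
        with open(path, "wb") as f:
            for i in range(num_rows):
                # Two columns: an integer id and a short Chinese string.
                line = u"%d\t\u5f20\u4e09_%d\n" % (i, i)
                f.write(line.encode("gbk"))

    # Hypothetical usage, mirroring the existing small-table helper:
    #   write_big_gbk_file("/tmp/gbk_names_big.txt")
    #   self.prepare_test_table('gbk_names_big', 'gbk_names_big.txt', 'GBK')
    # then verify count(*) and a few spot-checked rows against the generator.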