Mihaly Szjatinya has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/22049 )

Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................

WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

As proposed in the Jira, this implements decoding of text buffers for
Impala/Hive text tables. Given a table with the 'serialization.encoding'
property set, Hive encodes inserted data into the specified charset
before saving it to a text file. The objective is to perform the
opposite operation when reading data buffers from text files, using
boost::locale::conv::to_utf().

Since Hive doesn't encode line delimiters, charsets that would store
delimiters differently from ASCII are not allowed.

TODO / Nice to Have:
1. Once the design is approved, this should be extracted into a class
   (hierarchy) and used for Sequence files. Potentially expandable to
   encoding by writers.
2. Test / think through all possible delimiter (FIELD_DELIM,
   LINE_DELIM, COLLECTION_DELIM, ESCAPE_CHAR) scenarios.
3. Test archived scenarios.
4. Test complex types.
5. Test per-partition cases.
6. Juxtapose the encodings supported by Java vs. boost.
7. Test all error/warning scenarios and corner cases.
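
For illustration only, the read-side idea described above can be sketched in Python, the language of the change's own test file. This is not the patch's code: Python's built-in codecs stand in for boost::locale::conv::to_utf(), and the function names and the delimiter set ('\n' and ',') are assumptions made for this sketch. It decodes a buffer to UTF-8 and rejects charsets that would store a delimiter differently from ASCII.

```python
# Hypothetical sketch: Python codecs stand in for boost::locale::conv::to_utf().
# Function names and the delimiter set are illustrative, not taken from the patch.

ILLUSTRATIVE_DELIMITERS = ("\n", ",")


def decode_to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode a buffer stored in `charset` and re-encode it as UTF-8."""
    return raw.decode(charset).encode("utf-8")


def delimiters_are_ascii_compatible(charset: str,
                                    delimiters=ILLUSTRATIVE_DELIMITERS) -> bool:
    """Return True if each delimiter is stored as the same single byte as in ASCII."""
    for delim in delimiters:
        try:
            encoded = delim.encode(charset)
        except (LookupError, UnicodeEncodeError):
            # Unknown charset, or the delimiter cannot be represented in it.
            return False
        if encoded != delim.encode("ascii"):
            return False
    return True


if __name__ == "__main__":
    stored = "Привет,мир\n".encode("cp1251")  # bytes as a writer might store them
    if delimiters_are_ascii_compatible("cp1251"):
        print(decode_to_utf8(stored, "cp1251").decode("utf-8"), end="")
    print(delimiters_are_ascii_compatible("utf-16-le"))
```

Under this check, single-byte charsets such as cp1251 or koi8-r pass, while UTF-16 variants fail because '\n' is stored as two bytes. The check looks only at how the delimiter itself is encoded.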

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exec/text/hdfs-text-scanner.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/util/CMakeLists.txt
A be/src/util/decoder.cc
A be/src/util/decoder.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/decoding/cp1251_names.txt
A testdata/decoding/gbk_names.txt
A testdata/decoding/koi8r_names.txt
A testdata/decoding/latin1_names.txt
A testdata/decoding/shift_jis_names.txt
A testdata/decoding/utf16_names.txt
A testdata/decoding/utf16be_names.txt
A testdata/decoding/utf16le_names.txt
A testdata/workloads/functional-query/queries/QueryTest/decoding.test
A testdata/workloads/functional-query/queries/QueryTest/decoding_big.test
A tests/query_test/test_decoding.py
25 files changed, 537 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/22049/4
--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 4
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>