Mihaly Szjatinya has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/22049 )

Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................

WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

As proposed in Jira, this implements decoding of text buffers for
Impala/Hive text tables. When a table has the 'serialization.encoding'
property set, Hive encodes inserted data into the specified charset
before saving it to a text file. The objective is to perform the
opposite operation when reading data buffers from text files, using
boost::locale::conv::to_utf().
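
A rough illustration of the decoding step follows. It is a Python sketch
only, not the patch's C++ code (which relies on
boost::locale::conv::to_utf()); the helper name decode_text_buffer and
the sample string are hypothetical.

```python
def decode_text_buffer(raw: bytes, encoding: str) -> bytes:
    """Convert a buffer stored in `encoding` into UTF-8 bytes.

    Hypothetical helper mirroring what the scanner would do with each
    text buffer before parsing it.
    """
    return raw.decode(encoding).encode("utf-8")


# Simulate what Hive would write for a cp1251 table, then read it back.
cp1251_bytes = "Алексей".encode("cp1251")
utf8_bytes = decode_text_buffer(cp1251_bytes, "cp1251")
```

Each Cyrillic letter takes one byte in cp1251 and two bytes in UTF-8, so
the decoded buffer is twice as long as the stored one.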

Since Hive doesn't encode line delimiters, charsets that store
delimiters differently from ASCII are not allowed.
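
One way to picture this restriction is the Python sketch below. It is an
assumption about how such a check could look, not the validation code in
this patch; the function name delimiters_match_ascii is hypothetical.

```python
def delimiters_match_ascii(encoding: str, delimiters: str = "\n") -> bool:
    """Return True if every delimiter encodes to the same bytes as ASCII.

    Hypothetical check: a charset fails if a delimiter is unknown to it
    or is stored as different bytes than in ASCII.
    """
    for ch in delimiters:
        try:
            if ch.encode(encoding) != ch.encode("ascii"):
                return False
        except (LookupError, UnicodeEncodeError):
            # Unknown charset, or the delimiter is not representable.
            return False
    return True


cp1251_ok = delimiters_match_ascii("cp1251")
gbk_ok = delimiters_match_ascii("gbk")
utf16_ok = delimiters_match_ascii("utf-16-le")
```

Under this sketch, single-byte and ASCII-compatible multi-byte charsets
such as cp1251 and GBK pass, while UTF-16 fails because it stores the
newline as two bytes.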

One difference from Hive is that Impala implements
'serialization.encoding' only as a per-partition serdeproperty, to avoid
the confusion of allowing both serde and tbl properties.

TODO / Nice to Have:
1. Once the design is approved, this should be extracted into a class
(hierarchy) and used for Sequence files. It could potentially be
extended to cover encoding by writers.

2. Test / think through all possible delimiter (FIELD_DELIM, LINE_DELIM,
COLLECTION_DELIM, ESCAPE_CHAR) scenarios.

3. Test archived scenarios.

4. Test complex types.

5. Test per-partition cases.

6. Compare the encodings supported by Java and by Boost.

7. Test all error/warning scenarios and corner cases.

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exec/text/hdfs-text-scanner.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/util/CMakeLists.txt
A be/src/util/char-codec.cc
A be/src/util/char-codec.h
M common/thrift/CatalogObjects.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
A testdata/charcodec/cp1251_big_utf8.txt
A testdata/charcodec/cp1251_names.txt
A testdata/charcodec/cp1251_names_utf8.txt
A testdata/charcodec/gbk_big_utf8.txt
A testdata/charcodec/gbk_names.txt
A testdata/charcodec/gbk_names_utf8.txt
A testdata/charcodec/koi8r_big_utf8.txt
A testdata/charcodec/koi8r_names.txt
A testdata/charcodec/koi8r_names_utf8.txt
A testdata/charcodec/latin1_big_utf8.txt
A testdata/charcodec/latin1_names.txt
A testdata/charcodec/latin1_names_utf8.txt
A testdata/charcodec/shift_jis_big_utf8.txt
A testdata/charcodec/shift_jis_names.txt
A testdata/charcodec/shift_jis_names_utf8.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/decoding-alltypes.test
A testdata/workloads/functional-query/queries/QueryTest/decoding-small.test
A testdata/workloads/functional-query/queries/QueryTest/encoding-decoding-alltypes.test
A testdata/workloads/functional-query/queries/QueryTest/encoding-decoding-big.test
A testdata/workloads/functional-query/queries/QueryTest/encoding-decoding-small.test
A tests/query_test/test_charcodec.py
38 files changed, 50,972 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/22049/6
--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 6
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
