Mihaly Szjatinya has uploaded a new patch set (#4). ( 
http://gerrit.cloudera.org:8080/22049 )

Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................

WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

As proposed in Jira, this implements decoding of text buffers for
Impala/Hive text tables. Given a table with the 'serialization.encoding'
property set, Hive encodes inserted data into the specified charset
before saving it to a text file. The objective is to perform the
opposite operation when reading data buffers from text files, using
boost::locale::conv::to_utf().
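
The round trip described above can be sketched as follows. This is only
an illustration in Python using its built-in codecs, not the patch's C++
code, which relies on boost::locale::conv::to_utf(). The helper name
decode_to_utf8 and the sample row are hypothetical.

```python
def decode_to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode `raw` from `charset` and re-encode it as UTF-8.

    Hypothetical helper: a stand-in for the reader-side conversion,
    not the API introduced by the patch.
    """
    # Strict mode raises UnicodeDecodeError on bytes invalid in `charset`.
    return raw.decode(charset, errors="strict").encode("utf-8")


# Simulate what a writer would store for a table whose
# 'serialization.encoding' is cp1251 (sample data is made up).
stored = "Иван,42\n".encode("cp1251")

# Reader side: convert the raw buffer back to UTF-8 before parsing.
utf8_row = decode_to_utf8(stored, "cp1251")
```

Note that the field and line delimiters in the sample are single ASCII
bytes in both the stored and the decoded buffer.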

Since Hive doesn't encode line delimiters, charsets that would store
delimiters differently from ASCII are not allowed.
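
One way to picture this restriction is a check that each delimiter
encodes to the same bytes in the target charset as in ASCII. The sketch
below is a Python illustration under that assumption; the function name
and the default delimiter set are hypothetical and do not reflect how
the patch validates charsets.

```python
def delimiters_are_ascii_compatible(charset: str,
                                    delimiters: str = "\n,\x01") -> bool:
    """Return True if every delimiter has the same byte form as in ASCII.

    Hypothetical check; the default delimiters are illustrative only.
    """
    for ch in delimiters:
        try:
            if ch.encode(charset) != ch.encode("ascii"):
                return False
        except (LookupError, UnicodeEncodeError):
            # Unknown charset, or the delimiter is not representable.
            return False
    return True


# Single-byte, ASCII-compatible charsets pass.
cp1251_ok = delimiters_are_ascii_compatible("cp1251")
# UTF-16 variants store '\n' as two bytes, so they fail.
utf16le_ok = delimiters_are_ascii_compatible("utf-16le")
```

This check is deliberately narrow: it does not detect multi-byte
charsets in which a trailing byte of a character can coincide with an
ASCII delimiter byte.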

TODO / Nice to Have:
1. Once the design is approved, this should be extracted into a class
(hierarchy) and used for Sequence files. It could potentially be
extended to encoding by writers.

2. Test / think through all possible delimiter (FIELD_DELIM, LINE_DELIM,
COLLECTION_DELIM, ESCAPE_CHAR) scenarios.

3. Test archived scenarios.

4. Test complex types.

5. Test per-partition cases.

6. Compare the encodings supported by Java and Boost.

7. Test all error/warning scenarios and corner cases.

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exec/text/hdfs-text-scanner.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/util/CMakeLists.txt
A be/src/util/decoder.cc
A be/src/util/decoder.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/decoding/cp1251_names.txt
A testdata/decoding/gbk_names.txt
A testdata/decoding/koi8r_names.txt
A testdata/decoding/latin1_names.txt
A testdata/decoding/shift_jis_names.txt
A testdata/decoding/utf16_names.txt
A testdata/decoding/utf16be_names.txt
A testdata/decoding/utf16le_names.txt
A testdata/workloads/functional-query/queries/QueryTest/decoding.test
A testdata/workloads/functional-query/queries/QueryTest/decoding_big.test
A tests/query_test/test_decoding.py
25 files changed, 537 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/22049/4
--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 4
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Mihaly Szjatinya <msz...@pm.me>
