[Impala-ASF-CR] IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Mihaly Szjatinya (Code Review) Sun, 10 Nov 2024 10:42:11 -0800

Mihaly Szjatinya has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................

IMPALA-10319: Support arbitrary encodings on Text/Sequence files

[Draft]
As proposed in Jira, this implements decoding of text buffers for
Impala/Hive text tables. Given a table with the 'serialization.encoding'
property set, Hive encodes the inserted data into the specified charset
before saving it to a text file. The objective is to perform the
opposite operation when reading data buffers from text files, using
boost::locale::conv::to_utf().

Special handling is provided for different BOM scenarios, as well as for
line delimiters, which are not part of the encoding on the Hive side.
For example, given a standard line delimiter '\n', represented by a
single byte '0x0A' in UTF-8 and by two bytes '0x0A 0x00' in UTF-16LE,
Hive inserts it as a single '0x0A'.
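
To make the byte layout concrete, here is a small Python sketch. It is
illustrative only: the patch itself is C++ and uses boost::locale, and
the helper name hive_style_encode is hypothetical, not part of Hive or
Impala.

```python
def hive_style_encode(lines, charset):
    """Encode each line in `charset`, but append a raw single-byte 0x0A.

    Mimics the layout described above: the line delimiter is not encoded.
    """
    return b"".join(line.encode(charset) + b"\n" for line in lines)


# Hive-style layout: UTF-16LE payload, single-byte delimiter.
hive_bytes = hive_style_encode(["ab", "cd"], "utf-16-le")
print(hive_bytes.hex(" "))

# A regular UTF-16LE encoder emits the delimiter as two bytes (0a 00).
regular_bytes = "ab\ncd\n".encode("utf-16-le")
print(regular_bytes.hex(" "))
```

The two outputs differ only at the delimiter positions, which is the
case the special handling has to cover.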

Note: although the described behavior keeps the range split logic
simpler, it also makes Hive-written files unreadable by regular
decoders. Conversely, regular multi-byte files with encoded delimiters
are not read correctly by Hive/Impala. This limitation has no effect on
single-byte charsets.
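
The limitation can be reproduced with Python's standard codecs. This is
a sketch under the layout described above, not code from the patch;
strict_decode_ok is a hypothetical helper.

```python
def strict_decode_ok(buf: bytes, charset: str) -> bool:
    """Return True if a regular (strict) decoder accepts the whole buffer."""
    try:
        buf.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# Hive-style layout: UTF-16LE payload, raw single-byte 0x0A delimiter.
hive_style = "ab".encode("utf-16-le") + b"\n" + "cd".encode("utf-16-le")

# Regular layout: the delimiter is encoded as well (0x0A 0x00).
regular = "ab\ncd".encode("utf-16-le")

# Splitting a regular file on the raw 0x0A byte leaves the delimiter's
# 0x00 half at the start of the next line, misaligning its code units.
pieces = regular.split(b"\n")

# Single-byte charset: both layouts are byte-identical.
latin1_regular = "é\nà".encode("latin-1")
latin1_hive = "é".encode("latin-1") + b"\n" + "à".encode("latin-1")

print(strict_decode_ok(hive_style, "utf-16-le"))
print(pieces)
print(latin1_regular == latin1_hive)
```

The odd-length Hive-style buffer is rejected by the strict UTF-16LE
decoder, while the Latin-1 buffers are the same in both layouts.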

Design Notes:
1. Memory management. The current draft is a naive
<conv::to_utf() std::string segments -> concat into std::string -> pool>
approach. This can be improved in two steps:
a. Get rid of the intermediate string used for gathering segments and
allocate directly on the pool.
b. Provide a custom allocator for the segment strings returned by
conv::to_utf(), so that they are allocated (and hopefully RVO'd)
directly on the pool as well. At the very least, those strings could be
tracked via TrackedString.

2. Decoding filled buffer as a whole. Currently decoding is performed
line by line upon the whole buffer in HdfsTextScanner::FillByteBuffer(),
after it has been read and, if needed, decompressed. This has the
definite advantage of having all the decoding logic in one place,
potentially extractable to a separate class.

Alternatives would be:
a. Decoding tuples/lines upon their materialization in later stages of
HdfsTextScanner. This would reuse the existing splitting logic, even
though decoding itself would still be done line by line.
b. Transforming the malformed line delimiters up front so that
conv::to_utf() can decode the buffer in one go.

3. As an optimization, delimiters need to be skipped during decoding
only for multi-byte encodings; single-byte encodings could be decoded
in one go.
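
Design notes 2 and 3 can be sketched in Python as follows. This is not
the patch's C++ code: the function names are hypothetical, bytes.decode()
stands in for conv::to_utf(), and the sketch assumes the raw delimiter
byte never occurs inside an encoded character.

```python
def decode_line_by_line(buf: bytes, charset: str, delim: bytes = b"\n") -> bytes:
    """Note 2: decode each delimiter-separated segment on its own.

    The raw delimiter bytes are kept as-is; the result is UTF-8.
    """
    return delim.join(
        seg.decode(charset).encode("utf-8") for seg in buf.split(delim)
    )


def decode_after_widening_utf16le(buf: bytes) -> bytes:
    """Alternative (b), UTF-16LE only: widen each raw 0x0A to 0x0A 0x00,
    then decode the whole buffer in a single call."""
    return buf.replace(b"\n", b"\n\x00").decode("utf-16-le").encode("utf-8")


def decode_single_byte(buf: bytes, charset: str) -> bytes:
    """One reading of note 3: for single-byte charsets the delimiter needs
    no special care, so the whole buffer is decoded at once."""
    return buf.decode(charset).encode("utf-8")


# Hive-style buffer: UTF-16LE payload with single-byte 0x0A delimiters.
hive_utf16le = b"".join(s.encode("utf-16-le") + b"\n" for s in ["Zoë", "Ünal"])
expected = "Zoë\nÜnal\n".encode("utf-8")

assert decode_line_by_line(hive_utf16le, "utf-16-le") == expected
assert decode_after_widening_utf16le(hive_utf16le) == expected
assert decode_single_byte("Zoë\nÜnal\n".encode("latin-1"), "latin-1") == expected
```

All three paths produce the same UTF-8 buffer here; they differ in how
many decoder calls are made and in how much delimiter handling happens
before decoding.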

TODO / Nice to Have:
1. Once the design is approved, this should be extracted into a class
(hierarchy) and used for Sequence files. It could potentially be
extended to cover encoding by writers.

2. Test / think through all possible delimiter (FIELD_DELIM, LINE_DELIM,
COLLECTION_DELIM, ESCAPE_CHAR) scenarios.

3. Test archived scenarios.

4. Compare the encodings supported by Java vs. Boost.

5. Test all error/warning scenarios and corner cases.

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exec/text/hdfs-text-scanner.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
A testdata/decoding/cp1251_names.txt
A testdata/decoding/gbk_names.txt
A testdata/decoding/koi8r_names.txt
A testdata/decoding/latin1_names.txt
A testdata/decoding/shift_jis_names.txt
A testdata/decoding/utf16_names.txt
A testdata/decoding/utf16be_names.txt
A testdata/decoding/utf16le_names.txt
A testdata/workloads/functional-query/queries/QueryTest/decoding.test
A tests/query_test/test_decoding.py
17 files changed, 550 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/22049/2
--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 2
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
