Mihaly Szjatinya has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/22049 )

Change subject: IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................

IMPALA-10319: Support arbitrary encodings on Text/Sequence files

[Draft]
As proposed in Jira, this implements decoding of text buffers for
Impala/Hive text tables. Given a table with the
'serialization.encoding' property set, Hive encodes the inserted data
into the specified charset and saves it to a text file. The objective
is to perform the opposite operation when reading data buffers from
text files, using boost::locale::conv::to_utf().

Special handling is provided for different BOM scenarios, as well as
for line delimiters, which are not part of the encoding on the Hive
side. For example, given a standard line delimiter '\n', represented by
a single byte '0x0A' in UTF-8 and by two bytes '0x0A 0x00' in UTF-16LE,
Hive inserts it as a single '0x0A'.

Note: although the described behavior keeps the range split logic
simpler, it makes Hive files unreadable by regular decoders.
Conversely, regular multi-byte files with delimiters are not read
correctly by Hive/Impala. This limitation has no effect on single-byte
charsets.

Design Notes:
1. Memory management. The current draft is a naive
   <conv::to_utf() std::string segments -> concat into std::string ->
   pool> approach. This can be improved in two steps:
   a. Get rid of the string used for gathering segments and allocate
      directly on the pool.
   b. Provide a custom allocator for the segment strings returned by
      conv::to_utf(), so that they are allocated, and hopefully RVO'd,
      directly on the pool as well. At the very least, those strings
      could be tracked via TrackedString.
2. Decoding the filled buffer as a whole. Currently, decoding is
   performed line by line over the whole buffer in
   HdfsTextScanner::FillByteBuffer(), after the buffer has been read
   and, if needed, decompressed. This has the definite advantage of
   keeping all the decoding logic in one place, potentially extractable
   to a separate class. Alternatives would be:
   a. Decoding tuples/lines upon their materialization at later stages
      of HdfsTextScanner. This would reuse the existing splitting
      logic, even though the decoding itself would still be done line
      by line.
   b. Transforming the malformed line delimiters beforehand, so that
      conv::to_utf() can run smoothly in one go.
3. An optimization would be to work around delimiters only for
   multi-byte encodings.

TODO / Nice to Have:
1. Once the design is approved, this should be extracted into a class
   (hierarchy) and used for Sequence files. It could potentially be
   extended to encoding by writers.
2. Test / think through all possible delimiter (FIELD_DELIM,
   LINE_DELIM, COLLECTION_DELIM, ESCAPE_CHAR) scenarios.
3. Test archived scenarios.
4. Compare the encodings supported by Java and by Boost.
5. Test all error/warning scenarios and corner cases.

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exec/text/hdfs-text-scanner.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
A testdata/decoding/cp1251_names.txt
A testdata/decoding/gbk_names.txt
A testdata/decoding/koi8r_names.txt
A testdata/decoding/latin1_names.txt
A testdata/decoding/shift_jis_names.txt
A testdata/decoding/utf16_names.txt
A testdata/decoding/utf16be_names.txt
A testdata/decoding/utf16le_names.txt
A testdata/workloads/functional-query/queries/QueryTest/decoding.test
A tests/query_test/test_decoding.py
17 files changed, 550 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/22049/2
--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 2
Gerrit-Owner: Mihaly Szjatinya <msz...@pm.me>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
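
For illustration only, here is a rough sketch of the line-by-line decoding described in the commit message: split the buffer on the single-byte line delimiter, decode each segment on its own, and pass the delimiter through unchanged. The patch itself relies on boost::locale::conv::to_utf(). This sketch swaps in a small hand-written UTF-16LE decoder so that it needs only the standard library. The names AppendUtf8, DecodeUtf16LeSegment and DecodeBuffer are hypothetical and are not taken from the patch, and BOM handling is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
void AppendUtf8(uint32_t cp, std::string* out) {
  if (cp < 0x80) {
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Hypothetical stand-in for conv::to_utf(): decode one UTF-16LE segment
// (no line delimiter inside) to UTF-8. Malformed input (unpaired
// surrogate, dangling odd byte) is replaced with U+FFFD.
std::string DecodeUtf16LeSegment(const std::string& seg) {
  std::string out;
  std::size_t i = 0;
  while (i + 1 < seg.size()) {
    uint32_t unit = static_cast<uint8_t>(seg[i]) |
                    (static_cast<uint32_t>(static_cast<uint8_t>(seg[i + 1])) << 8);
    i += 2;
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      uint32_t low = 0;
      if (i + 1 < seg.size()) {
        low = static_cast<uint8_t>(seg[i]) |
              (static_cast<uint32_t>(static_cast<uint8_t>(seg[i + 1])) << 8);
      }
      if (low >= 0xDC00 && low <= 0xDFFF) {
        i += 2;
        unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
      } else {
        unit = 0xFFFD;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      unit = 0xFFFD;  // Low surrogate without a preceding high surrogate.
    }
    AppendUtf8(unit, &out);
  }
  if (i < seg.size()) AppendUtf8(0xFFFD, &out);  // Dangling odd byte.
  return out;
}

// Split `buf` on the single-byte delimiter, decode every segment, and
// re-emit the delimiter unchanged. Limitation of this sketch: a code unit
// that happens to contain the delimiter byte is split incorrectly.
std::string DecodeBuffer(const std::string& buf, char line_delim = '\n') {
  std::string out;
  std::size_t start = 0;
  while (start < buf.size()) {
    std::size_t end = buf.find(line_delim, start);
    if (end == std::string::npos) end = buf.size();
    out += DecodeUtf16LeSegment(buf.substr(start, end - start));
    if (end < buf.size()) out.push_back(line_delim);
    start = end + 1;
  }
  return out;
}
```

Decoding per segment keeps the delimiter bytes out of the converter's input, which is the reason the draft splits lines before converting. A real implementation would also have to cover BOMs and the other delimiters listed in the TODO section.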