[ https://issues.apache.org/jira/browse/IMPALA-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959610#comment-17959610 ]
ASF subversion and git services commented on IMPALA-10319:
----------------------------------------------------------
Commit 2de7b8287d44aa4e31ceeb51c0a3921670fd7308 in impala's branch
refs/heads/master from Mihaly Szjatinya
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2de7b8287 ]
IMPALA-14136: test_charcodec fails with Ozone
A regression from IMPALA-10319 in the Ozone environment. A hardcoded
'/test-warehouse' in the path was causing some of the 'test_charcodec'
tests to fail. It turns out the 'makedir' part is not necessary.
Change-Id: If1f74b1ddc481a996d82843041f0f031580f14e5
Reviewed-on: http://gerrit.cloudera.org:8080/23004
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Support arbitrary encodings on Text files
> -----------------------------------------
>
> Key: IMPALA-10319
> URL: https://issues.apache.org/jira/browse/IMPALA-10319
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Mihaly Szjatinya
> Priority: Critical
> Attachments: gbk_names.txt
>
>
> ORC/Parquet/Avro files store strings in UTF-8 encoded bytes. However, Text
> and Sequence files can be in arbitrary encodings. Hive supports specifying
> arbitrary encoding on tables using LazySimpleSerDe with the
> "serialization.encoding" table property (HIVE-7142). Impala is currently not
> aware of this table property and treats all strings as byte arrays. It would be
> good to support at least reading from these text/sequence files.
> *Example*
> Create a text table in Hive using GBK encoding and load a GBK encoded text
> file into it:
> {code:sql}
> hive> create table gbk_names (name string) stored as textfile
> tblproperties("serialization.encoding"="GBK");
> hive> load data local inpath '/home/quanlong/workspace/Impala/gbk_names.txt'
> into table gbk_names;
> hive> select * from gbk_names;
> +-----------------+
> | gbk_names.name |
> +-----------------+
> | 张三 |
> | 李四 |
> | 王五 |
> +-----------------+
> {code}
> Impala reads strings as byte arrays, so it can't decode them correctly:
> {code:sql}
> impala-shell> invalidate metadata gbk_names;
> impala-shell> select * from gbk_names;
> +------+
> | name |
> +------+
> | ���� |
> | ���� |
> | ���� |
> +------+
> {code}
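The mismatch in the quoted example can be reproduced outside Hive and Impala. The following is a minimal Python sketch, not Impala or Hive code: `decode_field` and the `tblproperties` dict are hypothetical names used only for illustration, and falling back to UTF-8 when the property is unset is an assumption. It shows GBK-encoded bytes turning into replacement characters when the declared encoding is ignored, and decoding correctly when "serialization.encoding" is honored.

```python
# Illustrative only: mimics the behavior described in the issue in plain Python.
# decode_field and tblproperties are hypothetical, not part of Impala or Hive.

def decode_field(raw: bytes, tblproperties: dict) -> str:
    """Decode a text field using the table's declared encoding (UTF-8 if unset)."""
    encoding = tblproperties.get("serialization.encoding", "UTF-8")
    # errors="replace" substitutes U+FFFD for byte sequences invalid in `encoding`.
    return raw.decode(encoding, errors="replace")

names = ["张三", "李四", "王五"]
# Bytes as they would be stored in a GBK-encoded text file.
gbk_rows = [name.encode("gbk") for name in names]

# Ignoring the table property: the GBK bytes are not valid UTF-8, so the
# decoded strings contain replacement characters.
garbled = [decode_field(row, {}) for row in gbk_rows]

# Honoring the table property recovers the original strings.
decoded = [decode_field(row, {"serialization.encoding": "GBK"}) for row in gbk_rows]

print(garbled)
print(decoded)
```

Reading the encoding from the table properties keeps the decision per table, which matches how the Hive property is described in the issue.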