[ 
https://issues.apache.org/jira/browse/IMPALA-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950491#comment-17950491
 ] 

Mihaly Szjatinya edited comment on IMPALA-10319 at 5/9/25 10:30 AM:
--------------------------------------------------------------------

Restricting this to Text files only. Created a follow-up, IMPALA-14063, for 
the Sequence format.


was (Author: JIRAUSER306412):
Restricting this for Text files only. Created a follow-up  
https://issues.apache.org/jira/browse/IMPALA-14063 for Sequence format.

> Support arbitrary encodings on Text files
> -----------------------------------------
>
>                 Key: IMPALA-10319
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10319
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>            Reporter: Quanlong Huang
>            Assignee: Mihaly Szjatinya
>            Priority: Critical
>         Attachments: gbk_names.txt
>
>
> ORC/Parquet/Avro files store strings as UTF-8 encoded bytes. However, Text 
> and Sequence files can be in arbitrary encodings. Hive supports specifying 
> an arbitrary encoding on tables using LazySimpleSerDe with the 
> "serialization.encoding" table property (HIVE-7142). Impala is currently not 
> aware of this table property and treats all strings as byte arrays. It would 
> be good to support at least reading from these text/sequence files.
> *Example*
> Create a text table in Hive using GBK encoding and load a GBK encoded text 
> file into it: 
> {code:sql}
> hive> create table gbk_names (name string) stored as textfile 
> tblproperties("serialization.encoding"="GBK");
> hive> load data local inpath '/home/quanlong/workspace/Impala/gbk_names.txt' 
> into table gbk_names;
> hive> select * from gbk_names;
> +-----------------+
> | gbk_names.name  |
> +-----------------+
> | 张三              |
> | 李四              |
> | 王五              |
> +-----------------+
> {code}
> Impala reads strings as byte arrays, so it can't decode them correctly:
> {code:sql}
> impala-shell> invalidate metadata gbk_names;
> impala-shell> select * from gbk_names;
> +------+
> | name |
> +------+
> | ���� |
> | ���� |
> | ���� |
> +------+
> {code}
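
The mismatch in the example above can be reproduced outside Impala. The following is a minimal Python sketch, not Impala code: the helper name decode_field is hypothetical, and the sample names are the ones shown in the Hive output. It assumes the reader falls back to UTF-8 when it ignores "serialization.encoding", which is one plausible way to end up with replacement characters.

```python
# Hypothetical illustration only; this is not how Impala is implemented.
NAMES = ["张三", "李四", "王五"]  # the names shown in the Hive output


def decode_field(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode one text-file field, substituting U+FFFD for undecodable bytes."""
    return raw.decode(encoding, errors="replace")


# Simulate the on-disk bytes of a GBK-encoded text file.
gbk_rows = [name.encode("gbk") for name in NAMES]

# Ignoring the table property and assuming UTF-8 garbles every row.
as_utf8 = [decode_field(row) for row in gbk_rows]

# Honoring serialization.encoding=GBK recovers the original names.
as_gbk = [decode_field(row, "GBK") for row in gbk_rows]

for bad, good in zip(as_utf8, as_gbk):
    print(repr(bad), "->", good)
```

Decoding with the encoding named in the table property is the only step that differs between the two results, which is why reading the property is enough for the read path.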


