[jira] [Updated] (IMPALA-14063) Support arbitrary encodings on Sequence files

Mihaly Szjatinya (Jira) Fri, 09 May 2025 03:28:10 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-14063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mihaly Szjatinya updated IMPALA-14063:
--------------------------------------
    Description: A follow-up for 10319 for Sequence files.  (was: 
ORC/Parquet/Avro files store strings as UTF-8 encoded bytes. However, Text and 
Sequence files can be in arbitrary encodings. Hive supports specifying an 
arbitrary encoding on tables using LazySimpleSerDe with the 
"serialization.encoding" table property (HIVE-7142). Impala is currently not 
aware of this table property and treats all strings as byte arrays. It would be 
good to support at least reading from these text/sequence files.

*Example*

Create a text table in Hive using GBK encoding and load a GBK encoded text file 
into it: 
{code:sql}
hive> create table gbk_names (name string) stored as textfile 
tblproperties("serialization.encoding"="GBK");
hive> load data local inpath '/home/quanlong/workspace/Impala/gbk_names.txt' 
into table gbk_names;
hive> select * from gbk_names;
+-----------------+
| gbk_names.name  |
+-----------------+
| 张三              |
| 李四              |
| 王五              |
+-----------------+
{code}
Impala reads strings as byte arrays, so it can't decode them correctly:
{code:sql}
impala-shell> invalidate metadata gbk_names;
impala-shell> select * from gbk_names;
+------+
| name |
+------+
| ���� |
| ���� |
| ���� |
+------+
{code})
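
The mismatch in the example above can be reproduced outside of Hive and Impala. 
The following is a minimal Python sketch, not Impala's implementation: the 
helper decode_row is hypothetical, and the rows are built in memory rather than 
read from the gbk_names.txt file. It only illustrates that GBK bytes 
interpreted as UTF-8 turn into replacement characters, while decoding them with 
the encoding named in "serialization.encoding" recovers the original names.

```python
# Names from the example table, encoded the way a GBK text file would store them.
names = ["张三", "李四", "王五"]
gbk_rows = [name.encode("gbk") for name in names]


def decode_row(raw: bytes, encoding: str) -> str:
    """Hypothetical helper: decode raw column bytes with the given encoding.

    Invalid byte sequences become U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


# Treating the GBK bytes as UTF-8 yields replacement characters.
as_utf8 = [decode_row(raw, "utf-8") for raw in gbk_rows]

# Decoding with the table's declared encoding recovers the names.
as_gbk = [decode_row(raw, "gbk") for raw in gbk_rows]

# Re-encoding as UTF-8 gives bytes that a UTF-8 based reader can display.
utf8_rows = [name.encode("utf-8") for name in as_gbk]

for wrong, right in zip(as_utf8, as_gbk):
    print(repr(wrong), "->", right)
```

The sketch assumes the encoding is known up front, which is what the table 
property would provide.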

> Support arbitrary encodings on Sequence files
> ---------------------------------------------
>
>                 Key: IMPALA-14063
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14063
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>            Reporter: Mihaly Szjatinya
>            Assignee: Mihaly Szjatinya
>            Priority: Critical
>
> A follow-up for 10319 for Sequence files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
