xubo245 created HIVE-21626:
------------------------------

             Summary: Why hive can't load normal string as binary from csv?
                 Key: HIVE-21626
                 URL: https://issues.apache.org/jira/browse/HIVE-21626
             Project: Hive
          Issue Type: Bug
         Environment: hive client: hive1.2.2
            Reporter: xubo245


Why can't Hive load a plain string as binary from a CSV file?
Reproduced on Hive 1.2.2:
```
hive>  CREATE TABLE IF NOT EXISTS hivetable (
    >     id int,
    >     label boolean,
    >     name string,
    >     image binary,
    >     autoLabel boolean)
    >  row format delimited fields terminated by '|';
OK
Time taken: 0.068 seconds
hive> LOAD DATA LOCAL INPATH 
'/Users/xubo/Desktop/xubo/git/carbondata3/integration/spark-common-test/src/test/resources/binarystringdata2.csv'
 INTO TABLE hivetable;
Loading data to table default.hivetable
Table default.hivetable stats: [numFiles=1, totalSize=82]
OK
Time taken: 0.122 seconds
hive> select * from hivetable;
OK
2       false   2.png   i�      true
3       false   3.png   n*%�
                                false
1       true    1.png   ^Ayard duty^B   true
```

The contents of binarystringdata2.csv are:

```
2|false|2.png|abc|true
3|false|3.png|biology|false
1|true|1.png|^Ayard duty^B|true
```
binarystringdata2.csv is without \u0001, like the over1k file of the Hive project.

For the "abc" value in the CSV, reading from Hive after loading should return
abc, so why is it "i�"? The bytes of abc are byte[] 97 98 99. After
org.apache.hadoop.hive.serde2.lazy.LazyBinary#decodeIfNeeded, which treats the
value as Base64 and decodes it, the result is byte[] 105 -73:
```
  public static byte[] decodeIfNeeded(byte[] recv) {
    boolean arrayByteBase64 = Base64.isArrayByteBase64(recv);
    if (LOG.isDebugEnabled() && arrayByteBase64) {
      LOG.debug("Data only contains Base64 alphabets only so try to decode the data.");
    }
    return arrayByteBase64 ? Base64.decodeBase64(recv) : recv;
  }
```
When we query with SQL in Spark, it returns byte[] 0x69 0xB7; the Hive
client/beeline returns the string "i�" (char array 105 65533).
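
The decoding step can be reproduced outside Hive. The following is a minimal
sketch, not Hive's code: it uses java.util.Base64 instead of the commons-codec
Base64 class that LazyBinary calls, and looksLikeBase64 and
decodeIfLooksLikeBase64 are hypothetical helpers that only approximate
Base64.isArrayByteBase64 and decodeIfNeeded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class LazyBinaryDecodeSketch {

    // Simplified stand-in (assumption) for commons-codec Base64.isArrayByteBase64:
    // true only if every byte is one of A-Z, a-z, 0-9, '+' or '/'.
    static boolean looksLikeBase64(byte[] data) {
        for (byte b : data) {
            boolean ok = (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                    || (b >= '0' && b <= '9') || b == '+' || b == '/';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    // Mirrors the shape of decodeIfNeeded: decode when the check passes, else pass through.
    static byte[] decodeIfLooksLikeBase64(byte[] recv) {
        // java.util.Base64 rejects input whose length mod 4 is 1, so such input is
        // returned unchanged here; commons-codec may behave differently.
        if (!looksLikeBase64(recv) || recv.length % 4 == 1) {
            return recv;
        }
        return Base64.getDecoder().decode(recv);
    }

    public static void main(String[] args) {
        // The three values of the binary column in binarystringdata2.csv.
        String[] values = {"abc", "biology", "\u0001yard duty\u0002"};
        for (String v : values) {
            byte[] in = v.getBytes(StandardCharsets.UTF_8);
            byte[] out = decodeIfLooksLikeBase64(in);
            System.out.println(Arrays.toString(in) + " -> " + Arrays.toString(out));
        }
    }
}
```

Under these assumptions, "abc" comes out as [105, -73] and "biology" as
[110, 42, 37, -94, 12], while the "^Ayard duty^B" value fails the alphabet
check and is returned unchanged, which matches the three rows in the query
output.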

Why do the input and output data differ for Hive LOAD DATA? INSERT INTO works
correctly.

Is this a bug or a limitation? Does a binary column loaded from CSV only
support Base64-encoded data, or strings for which the isBase64 check returns
false?
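
If the behavior is a limitation rather than a bug, one possible workaround is
to Base64-encode the binary field before writing the CSV, so that the decode in
decodeIfNeeded restores the original bytes. The sketch below only illustrates
that round trip with java.util.Base64; the class and method names are
hypothetical, and whether Hive's commons-codec path round-trips every value
identically is an assumption that has not been verified here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryColumnEncoder {

    // Hypothetical helper: Base64-encode a raw value for the binary column.
    static String encodeForBinaryColumn(String raw) {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String encoded = encodeForBinaryColumn("abc");
        // Row as it would be written to the '|'-delimited file.
        System.out.println("2|false|2.png|" + encoded + "|true");
        // Decoding the stored field gives back the original bytes.
        byte[] back = Base64.getDecoder().decode(encoded);
        System.out.println(new String(back, StandardCharsets.UTF_8));
    }
}
```

Here "abc" is written as YWJj, and decoding YWJj yields the bytes 97 98 99
again.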



