[ https://issues.apache.org/jira/browse/HIVE-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948343#comment-15948343 ]
Zhizhen Hou commented on HIVE-16332:
------------------------------------

## Reason Analysis

An ObjectInspectorConverters$ListConverter instance does not clear the data of the previous record. When the array in the current row is shorter than the array in the previous row, the output list is not fully overwritten, and the elements left over from the previous row are emitted.

## Code Analysis

In FetchOperator.nextRow, the value is first deserialized with currSerDe, the SerDe of the partition. Then, if ObjectConverter (an instance of ObjectInspectorConverters$StructConverter) is set, the deserialized row is converted:

```
Object deserialized = currSerDe.deserialize(value);
if (ObjectConverter != null) {
  deserialized = ObjectConverter.convert(deserialized);
}
```

In StructConverter.convert, each field value is read in turn and converted with the corresponding field converter. After the table format is changed to ORC, the converter for the array-typed field is an ObjectInspectorConverters$ListConverter.

```
@Override
public Object convert(Object input) {
  if (input == null) {
    return null;
  }

  int minFields = Math.min(inputFields.size(), outputFields.size());
  // Convert the fields
  for (int f = 0; f < minFields; f++) {
    Object inputFieldValue = inputOI.getStructFieldData(input, inputFields.get(f));
    Object outputFieldValue = fieldConverters.get(f).convert(inputFieldValue);
    outputOI.setStructFieldData(output, outputFields.get(f), outputFieldValue);
  }

  // set the extra fields to null
  for (int f = minFields; f < outputFields.size(); f++) {
    outputOI.setStructFieldData(output, outputFields.get(f), null);
  }

  return output;
}
```

In ObjectInspectorConverters$ListConverter.convert, a separate element converter is first created for each element. Then outputOI.resize(output, size) is called. Finally, each converted element is set on outputOI:
```
@Override
public Object convert(Object input) {
  if (input == null) {
    return null;
  }

  // Create enough elementConverters
  // NOTE: we have to have a separate elementConverter for each element,
  // because the elementConverters can reuse the internal object.
  // So it's not safe to use the same elementConverter to convert multiple
  // elements.
  int size = inputOI.getListLength(input);
  while (elementConverters.size() < size) {
    elementConverters.add(getConverter(inputElementOI, outputElementOI));
  }

  // Convert the elements
  outputOI.resize(output, size);
  for (int index = 0; index < size; index++) {
    Object inputElement = inputOI.getListElement(input, index);
    Object outputElement = elementConverters.get(index).convert(inputElement);
    outputOI.set(output, index, outputElement);
  }
  return output;
}
```

The problem is in resize: it does not clear the previous contents, it simply calls ensureCapacity. Since ensureCapacity grows only the capacity and never shrinks the size, a shorter current row does not fully overwrite the list, and the elements left over from the previous row are emitted.

```
public Object resize(Object list, int newSize) {
  ((ArrayList) list).ensureCapacity(newSize);
  return list;
}
```

## Proposed Fix

```
public Object resize(Object list, int newSize) {
  ((ArrayList) list).clear();
  return list;
}
```

> We create a partitioned text-format table with one partition; after we change the format of the table to ORC, the array-typed field may be output incorrectly.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16332
>                 URL: https://issues.apache.org/jira/browse/HIVE-16332
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 2.1.1
>            Reporter: Zhizhen Hou
>            Priority: Critical
>
> ## Steps to reproduce
> 1. First, create a text-format table with an array-typed field in Hive.
> ```
> create table test_text_orc (
>   col_int bigint,
>   col_text string,
>   col_array array<string>,
>   col_map map<string, string>
> )
> PARTITIONED BY (
>   day string
> )
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
>   COLLECTION ITEMS TERMINATED BY ']'
>   MAP KEYS TERMINATED BY ':'
> ;
> ```
> 2. Create a new text file hive-orc-text-file-array-error-test.txt:
> ```
> 1,text_value1,array_value1]array_value2]array_value3, map_key1:map_value1,map_key2:map_value2
> 2,text_value2,array_value4, map_key1:map_value3
> ,text_value3,, map_key1:]map_key3:map_value3
> ```
> 3. Load the data into one partition:
> ```
> LOAD DATA local INPATH '.hive-orc-text-file-array-error-test.txt' overwrite into table test_text_orc partition(day=20170329)
> ```
> 4. Select the data to verify the result:
> ```
> hive> select * from test.test_text_orc;
> OK
> 1     text_value1   ["array_value1","array_value2","array_value3"]   {" map_key1":"map_value1","map_key2":"map_value2"}   20170329
> 2     text_value2   ["array_value4"]                                 {"map_key1":"map_value3"}                            20170329
> NULL  text_value3   []                                               {" map_key1":"","map_key3":"map_value3"}             20170329
> ```
> 5. Alter the file format of the table to ORC:
> ```
> alter table test_text_orc set fileformat orc;
> ```
> 6. Check the result again, and you can see the erroneous output:
> ```
> hive> select * from test.test_text_orc;
> OK
> 1     text_value1   ["array_value1","array_value2","array_value3"]   {" map_key1":"map_value1","map_key2":"map_value2"}   20170329
> 2     text_value2   ["array_value4","array_value2","array_value3"]   {"map_key1":"map_value3"}                            20170329
> NULL  text_value3   ["array_value4","array_value2","array_value3"]   {"map_key3":"map_value3"," map_key1":""}             20170329
> ```

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
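As a footnote to the analysis above, the stale-element behavior can be reproduced in isolation with plain ArrayLists. This is a minimal sketch, not Hive's real classes: `padSet` is a hypothetical setter that pads with nulls up to the index, and the two resize variants contrast an ensureCapacity-only resize (which only grows capacity, never shrinks size) with a clear()-based one.

```
import java.util.ArrayList;
import java.util.List;

public class StaleListDemo {

    // Hypothetical setter: pad with nulls up to index, then set.
    static void padSet(List<Object> list, int index, Object element) {
        while (list.size() <= index) {
            list.add(null);
        }
        list.set(index, element);
    }

    // Mirrors the reported resize(): grows capacity, never shrinks size.
    static void buggyResize(ArrayList<Object> list, int newSize) {
        list.ensureCapacity(newSize);
    }

    // Mirrors the proposed fix: drop all previous elements first.
    static void fixedResize(ArrayList<Object> list, int newSize) {
        list.clear();
    }

    // Converts all rows through ONE reused output list, the way the
    // converter reuses its internal object, and returns the final state.
    static List<Object> convertRows(boolean fixed, List<List<String>> rows) {
        ArrayList<Object> output = new ArrayList<>(); // reused across rows
        for (List<String> row : rows) {
            if (fixed) {
                fixedResize(output, row.size());
            } else {
                buggyResize(output, row.size());
            }
            for (int i = 0; i < row.size(); i++) {
                padSet(output, i, row.get(i));
            }
        }
        return output; // state after the last (shorter) row
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("array_value1", "array_value2", "array_value3"),
                List.of("array_value4"));
        // Buggy: the tail of the previous row leaks into the short row.
        System.out.println(convertRows(false, rows)); // [array_value4, array_value2, array_value3]
        // Fixed: only the current row's element survives.
        System.out.println(convertRows(true, rows));  // [array_value4]
    }
}
```

The buggy variant yields exactly the corrupted row seen in step 6 above, while clearing the reused list before conversion restores the step 4 output.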