Yong Zhang created HIVE-4223:
--------------------------------

             Summary: LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of Hive table
                 Key: HIVE-4223
                 URL: https://issues.apache.org/jira/browse/HIVE-4223
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 0.9.0
         Environment: Hive 0.9.0
            Reporter: Yong Zhang


LazySimpleSerDe throws an IndexOutOfBoundsException when the column type is a 
struct that contains an array of structs. 
I have a table with one column defined like this:

columnA
array <
    struct<
       col1:primiType,
       col2:primiType,
       col3:primiType,
       col4:primiType,
       col5:primiType,
       col6:primiType,
       col7:primiType,
       col8:array<
            struct<
              col1:primiType,
              col2:primiType,
              col3:primiType,
              col4:primiType,
              col5:primiType,
              col6:primiType,
              col7:primiType,
              col8:primiType,
              col9:primiType
            >
       >
    >
>

In this example, the outer struct has 8 columns (including the array), and 
the inner struct has 9 columns. As long as the outer struct has FEWER columns 
than the inner struct, I think we will get the following exception, shown in 
the stack trace below, when LazySimpleSerDe tries to serialize a row:

Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
        at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
        ... 9 more

I am not sure of the exact cause of this problem. I believe that the method
  public static void serialize(ByteStream.Output out, Object obj, 
ObjectInspector objInspector, byte[] separators, int level, Text 
nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape)
invokes itself recursively when it encounters a nested structure. For a nested 
struct, however, the list reference gets messed up, and size() returns the 
wrong value.

In the example above, for these 2 lines:

      List<? extends StructField> fields = soi.getAllStructFieldRefs();
      list = soi.getStructFieldsDataAsList(obj);

my StructObjectInspector (soi) returns the CORRECT data from both 
getAllStructFieldRefs() and getStructFieldsDataAsList(). For example, in one 
row the outer 8-column struct has 2 elements in its inner array of structs, 
and each element has 9 columns (as there are 9 columns in the inner struct). 
At runtime, after adding more logging to LazySimpleSerDe, I see the following 
behavior in the log:

for 8 outside column, loop
    for 9 inside columns, loop for serialize
    for 9 inside columns, loop for serialize
The code breaks here: the outer loop then tries to access a 9th element, 
which does not exist in the outer struct. That is why the stack trace shows an 
access at index 8 of a list of size 8.

I changed the following line of code, and that appears to fix the problem. I 
don't know whether it is the right fix, but it did resolve the problem for me 
on the Hive 0.9.0 code:

481c481,482
<         for (int i = 0; i < list.size(); i++) {
---
>         int listSize = list.size();
>         for (int i = 0; i < listSize; i++) {

I believe the cause of this bug is that, with the loop written the current 
way, like
        for (int i = 0; i < list.size(); i++)

list.size() is invoked on every iteration. In a nested structure, list.size() 
returns a different result after the recursive call, and that causes the 
problem I am facing.
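
The hypothesis above can be illustrated with a minimal sketch in plain Java. 
This is not Hive's code: the Struct class, the BUFFER field, and 
fieldsDataAsList() are hypothetical stand-ins, and the reuse of one list 
across calls is only an assumption about how the data list could change 
during recursion. The sketch uses an outer struct with 2 fields whose last 
field is an inner struct with 3 fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SharedListDemo {
    /** Hypothetical struct: field names plus values; a value may itself be a Struct. */
    static final class Struct {
        final List<String> names;
        final Object[] values;

        Struct(List<String> names, Object... values) {
            this.names = names;
            this.values = values;
        }
    }

    // ASSUMPTION: a single buffer is reused for every call, so a recursive
    // call overwrites the list the caller is still iterating over.
    private static final List<Object> BUFFER = new ArrayList<>();

    static List<Object> fieldsDataAsList(Struct s) {
        BUFFER.clear();
        BUFFER.addAll(Arrays.asList(s.values));
        return BUFFER;
    }

    static void serialize(Struct s, StringBuilder out, boolean cacheSize) {
        List<String> fields = s.names;            // always correct for this level
        List<Object> list = fieldsDataAsList(s);  // shared, may change during recursion
        int listSize = list.size();               // snapshot taken before any recursion
        for (int i = 0; i < (cacheSize ? listSize : list.size()); i++) {
            String name = fields.get(i);          // fails once i passes this level's field count
            Object value = list.get(i);
            if (value instanceof Struct) {
                out.append(name).append("={");
                serialize((Struct) value, out, cacheSize);
                out.append('}');
            } else {
                out.append(name).append('=').append(value);
            }
            out.append(';');
        }
    }

    /** Outer struct with 2 fields; the last one is an inner struct with 3 fields. */
    static Struct sample() {
        Struct inner = new Struct(Arrays.asList("x", "y", "z"), "1", "2", "3");
        return new Struct(Arrays.asList("a", "b"), "1", inner);
    }

    public static void main(String[] args) {
        StringBuilder ok = new StringBuilder();
        serialize(sample(), ok, true);
        System.out.println(ok);
        try {
            serialize(sample(), new StringBuilder(), false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("uncached size failed: " + e.getMessage());
        }
    }
}
```

In this sketch, caching the size only helps because the nested struct is the 
last field of the outer struct, as col8 is in my table. If more fields 
followed it, the outer loop would still read values from the overwritten 
buffer, so copying the list before looping might be the safer change.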

Thanks

Yong Zhang

