Bruno Pusztahazi created PIG-5404:
-------------------------------------

             Summary: FLATTEN infers wrong datatype
                 Key: PIG-5404
                 URL: https://issues.apache.org/jira/browse/PIG-5404
             Project: Pig
          Issue Type: Bug
          Components: piggybank
    Affects Versions: 0.17.0
            Reporter: Bruno Pusztahazi


In version 0.12 (branch-0.12 checked out), the script below works as expected.

Input file test.csv:

 
{code:java}
John_5,18,4.0F
Mary_6,19,3.8F
Bill_7,20,3.9F
Joe_8,18,3.8F{code}
 

 
{code:java}

A = LOAD 'test.csv' USING PigStorage (',') AS 
(name:chararray,age:int,gpr:float);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(name,'_')) as 
(name1:chararray,name2:chararray),age,gpr;
DESCRIBE B;{code}
DESCRIBE produces the following output:

 
{code:java}
B: {name1: chararray,name2: chararray,age: int,gpr: float}
{code}
This is the expected output, since the AS clause declares the fields produced by FLATTEN as chararray.

 

When using version 0.17 (branch-0.17 checked out), the same script produces:
{code:java}
B: {name1: bytearray,name2: bytearray,age: int,gpr: float}
{code}
This shows that FLATTEN somehow infers the wrong data types (bytearray instead of 
chararray).

 

Using explicit casting as a workaround on 0.17:
{code:java}
B1 = FOREACH B GENERATE (chararray)name1,(chararray)name2,age,gpr;
DESCRIBE B1;{code}
produces
{code:java}
B1: {name1: chararray,name2: chararray,age: int,gpr: float}
{code}
This time with the expected data types.
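
For convenience, the statements above can be collected into one script to reproduce both the problem and the workaround on 0.17. This is only a sketch assembled from the snippets in this report; the file name repro.pig is a hypothetical choice, and the comments restate the behaviour described above rather than new findings.
{code:java}
-- repro.pig (hypothetical file name); expects test.csv as shown above
A = LOAD 'test.csv' USING PigStorage(',') AS (name:chararray, age:int, gpr:float);

-- On 0.17, DESCRIBE B reports name1/name2 as bytearray despite the AS clause
B = FOREACH A GENERATE FLATTEN(STRSPLIT(name, '_')) AS (name1:chararray, name2:chararray), age, gpr;
DESCRIBE B;

-- Workaround: explicit casts, after which DESCRIBE B1 reports chararray
B1 = FOREACH B GENERATE (chararray)name1, (chararray)name2, age, gpr;
DESCRIBE B1;
{code}
Running it in local mode (for example with pig -x local repro.pig) should be enough, since only the DESCRIBE output matters here.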

 

The explain plan shows some strange cast operators that do not seem to be used (or 
at least the resulting data types are wrong):
{code:java}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
B: (Name: LOStore Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
|
|---B: (Name: LOForEach Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
    |   |
    |   (Name: LOGenerate[false,false,false,false] Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)ColumnPrune:OutputUids=[121, 105, 122, 106]ColumnPrune:InputUids=[121, 105, 122, 106]
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 121)
    |   |   |
    |   |   |---name1:(Name: Project Type: bytearray Uid: 121 Input: 0 Column: 0)
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 122)
    |   |   |
    |   |   |---name2:(Name: Project Type: bytearray Uid: 122 Input: 1 Column: 0)
    |   |   |
    |   |   age:(Name: Project Type: int Uid: 105 Input: 2 Column: 0)
    |   |   |
    |   |   gpr:(Name: Project Type: float Uid: 106 Input: 3 Column: 0)
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: name1#121:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: name2#122:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: age#105:int)
    |   |
    |   |---(Name: LOInnerLoad[3] Schema: gpr#106:float)
    |
    |---B: (Name: LOForEach Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
        |   |
        |   (Name: LOGenerate[true,false,false] Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
        |   |   |
        |   |   (Name: UserFunc(org.apache.pig.builtin.STRSPLIT) Type: tuple Uid: 132)
        |   |   |
        |   |   |---(Name: Cast Type: chararray Uid: 104)
        |   |   |   |
        |   |   |   |---name:(Name: Project Type: bytearray Uid: 104 Input: 0 Column: (*))
        |   |   |
        |   |   |---(Name: Constant Type: chararray Uid: 131)
        |   |   |
        |   |   (Name: Cast Type: int Uid: 105)
        |   |   |
        |   |   |---age:(Name: Project Type: bytearray Uid: 105 Input: 1 Column: (*))
        |   |   |
        |   |   (Name: Cast Type: float Uid: 106)
        |   |   |
        |   |   |---gpr:(Name: Project Type: bytearray Uid: 106 Input: 2 Column: (*))
        |   |
        |   |---(Name: LOInnerLoad[0] Schema: name#104:bytearray)
        |   |
        |   |---(Name: LOInnerLoad[1] Schema: age#105:bytearray)
        |   |
        |   |---(Name: LOInnerLoad[2] Schema: gpr#106:bytearray)
        |
        |---A: (Name: LOLoad Schema: name#104:bytearray,age#105:bytearray,gpr#106:bytearray)RequiredFields:null
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
