[ 
https://issues.apache.org/jira/browse/PIG-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-5404:
------------------------------
    Attachment: pig-5404-v01.patch

Sorry for the delay.
 The report on this jira actually covered two separate issues.
 * Describe not showing the correct schema.
 * Join failing with "join statement has datatype double which is incompatible 
with type of corresponding column in earlier relation"

The former was less critical since it only affected "describe". 
 I uploaded a patch for it in PIG-5243.

The latter is a critical bug. 
 The problem was that when a ForEach with an as-clause was referenced by more 
than one relation, the type-cast foreach was inserted for only one of those 
relations and not the others. 
 As a result, the remaining relations were getting an incorrect schema.
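
For illustration, a script of roughly the following shape has a ForEach with 
an as-clause (B) referenced by two relations (C and D). This is only a sketch 
reusing the reporter's test.csv. The two filters and the join are assumptions 
made up for the example, and I have not verified that this exact script 
triggers the join error on 0.17.

{code}
A = LOAD 'test.csv' USING PigStorage(',') AS (name:chararray, age:int, gpr:float);
-- ForEach with an as-clause on the flattened columns
B = FOREACH A GENERATE FLATTEN(STRSPLIT(name, '_')) AS (name1:chararray, name2:chararray), age, gpr;
-- B is referenced by more than one relation (hypothetical filters)
C = FILTER B BY age >= 19;
D = FILTER B BY age < 19;
-- Join on a column whose type comes from the as-clause
E = JOIN C BY name1, D BY name1;
DESCRIBE E;
{code}

With the typecast applied to every relation that reads from B, both C and D 
should see name1 and name2 as chararray.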

Attaching a patch that inserts the typecast for all of the remaining relations.

[~daijy], can you review this jira and PIG-5243 for us?

> FLATTEN infers wrong datatype
> -----------------------------
>
>                 Key: PIG-5404
>                 URL: https://issues.apache.org/jira/browse/PIG-5404
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.17.0
>            Reporter: Bruno Pusztahazi
>            Assignee: Koji Noguchi
>            Priority: Blocker
>              Labels: datatypes, flatten
>         Attachments: pig-5404-v01.patch
>
>
> In version 0.12 (checked out branch-0.12) the following code works as 
> expected, given the following input file test.csv:
>  
> {code:java}
> John_5,18,4.0F
> Mary_6,19,3.8F
> Bill_7,20,3.9F
> Joe_8,18,3.8F{code}
>  
>  
> {code:java}
> A = LOAD 'test.csv' USING PigStorage (',') AS 
> (name:chararray,age:int,gpr:float);
> B = FOREACH A GENERATE FLATTEN(STRSPLIT(name,'_')) as 
> (name1:chararray,name2:chararray),age,gpr;
> DESCRIBE B;{code}
> and produces the following output:
>  
> {code:java}
> B: {name1: chararray,name2: chararray,age: int,gpr: float}
> {code}
> This is the expected output, since the as-clause declares the flattened 
> fields as chararray.
>  
> When using version 0.17 (checked out branch-0.17) the code produces:
> {code:java}
> B: {name1: bytearray,name2: bytearray,age: int,gpr: float}
> {code}
> This shows that FLATTEN somehow inferred the wrong data types (bytearray 
> instead of chararray).
>  
> Using explicit casting as a workaround on 0.17:
> {code:java}
> B1 = FOREACH B GENERATE (chararray)name1,(chararray)name2,age,gpr;
> DESCRIBE B1;{code}
> produces
> {code:java}
> B1: {name1: chararray,name2: chararray,age: int,gpr: float}
> {code}
> This time with the expected data types.
>  
> The explain plan shows some strange cast operators that are not really used 
> (or at least the actual data types are wrong):
> {code:java}
> #-----------------------------------------------
> # New Logical Plan:
> #-----------------------------------------------
> B: (Name: LOStore Schema: 
> name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
> |
> |---B: (Name: LOForEach Schema: 
> name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
>     |   |
>     |   (Name: LOGenerate[false,false,false,false] Schema: 
> name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)ColumnPrune:OutputUids=[121,
>  105, 122, 106]ColumnPrune:InputUids=[121, 105, 122, 106]
>     |   |   |
>     |   |   (Name: Cast Type: chararray Uid: 121)
>     |   |   |
>     |   |   |---name1:(Name: Project Type: bytearray Uid: 121 Input: 0 
> Column: 0)
>     |   |   |
>     |   |   (Name: Cast Type: chararray Uid: 122)
>     |   |   |
>     |   |   |---name2:(Name: Project Type: bytearray Uid: 122 Input: 1 
> Column: 0)
>     |   |   |
>     |   |   age:(Name: Project Type: int Uid: 105 Input: 2 Column: 0)
>     |   |   |
>     |   |   gpr:(Name: Project Type: float Uid: 106 Input: 3 Column: 0)
>     |   |
>     |   |---(Name: LOInnerLoad[0] Schema: name1#121:bytearray)
>     |   |
>     |   |---(Name: LOInnerLoad[1] Schema: name2#122:bytearray)
>     |   |
>     |   |---(Name: LOInnerLoad[2] Schema: age#105:int)
>     |   |
>     |   |---(Name: LOInnerLoad[3] Schema: gpr#106:float)
>     |
>     |---B: (Name: LOForEach Schema: 
> name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
>         |   |
>         |   (Name: LOGenerate[true,false,false] Schema: 
> name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
>         |   |   |
>         |   |   (Name: UserFunc(org.apache.pig.builtin.STRSPLIT) Type: tuple 
> Uid: 132)
>         |   |   |
>         |   |   |---(Name: Cast Type: chararray Uid: 104)
>         |   |   |   |
>         |   |   |   |---name:(Name: Project Type: bytearray Uid: 104 Input: 0 
> Column: (*))
>         |   |   |
>         |   |   |---(Name: Constant Type: chararray Uid: 131)
>         |   |   |
>         |   |   (Name: Cast Type: int Uid: 105)
>         |   |   |
>         |   |   |---age:(Name: Project Type: bytearray Uid: 105 Input: 1 
> Column: (*))
>         |   |   |
>         |   |   (Name: Cast Type: float Uid: 106)
>         |   |   |
>         |   |   |---gpr:(Name: Project Type: bytearray Uid: 106 Input: 2 
> Column: (*))
>         |   |
>         |   |---(Name: LOInnerLoad[0] Schema: name#104:bytearray)
>         |   |
>         |   |---(Name: LOInnerLoad[1] Schema: age#105:bytearray)
>         |   |
>         |   |---(Name: LOInnerLoad[2] Schema: gpr#106:bytearray)
>         |
>         |---A: (Name: LOLoad Schema: 
> name#104:bytearray,age#105:bytearray,gpr#106:bytearray)RequiredFields:null
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
