[jira] [Commented] (PIG-5404) FLATTEN infers wrong datatype

Bruno Pusztahazi (Jira) Thu, 10 Sep 2020 00:47:09 -0700


    [ 
https://issues.apache.org/jira/browse/PIG-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193442#comment-17193442
 ]


Bruno Pusztahazi commented on PIG-5404:
---------------------------------------

[~knoguchi] please check the following:

Here it is not only DESCRIBE that is affected; the last JOIN also fails with an 
error:
{code:java}
Data Used to create table:
create 'Test', 'CM'
put 'Test','202006181604008928049-9223370442808339240','CM:NUM','1'
put 'Test','202007010956112120091-9223370442807585497','CM:NUM','2'

===============================================================================

Sample Pig Script to replicate in grunt shell:

REGISTER hbase/lib/*.jar;

CA = LOAD 'hbase://Test' USING 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('CM:NUM', '-loadKey true') AS 
(RowKey:bytearray,NUMBER:chararray);
         
B = FOREACH CA GENERATE RowKey AS CA_RowKey,REPLACE(RowKey,'C-','C') AS 
RowKey1,NUMBER;
describe B;
B: {CA_RowKey: bytearray,RowKey1: chararray,NUMBER: chararray}

C = FOREACH B GENERATE CA_RowKey,FLATTEN(STRSPLIT(RowKey1,'-')) as 
(CA_CASEID:chararray,CA_EPOCH:chararray),NUMBER;
describe C;
C: {CA_RowKey: bytearray,CA_CASEID: bytearray,CA_EPOCH: bytearray,NUMBER: 
chararray}

G = GROUP C BY CA_CASEID;
G1 = FOREACH G  GENERATE group as CA_CASEID, MIN(C.CA_EPOCH) AS CA_EPOCH;
J = join C by (CA_CASEID, CA_EPOCH) , G1 by (CA_CASEID, CA_EPOCH);
DESCRIBE J;
{code}
The issue was:
{code:java}
2020-08-14 12:09:06,257 [main] org.apache.pig.tools.grunt.Grunt - 1130: <line 
6, column 4> join column no. 2 in relation no. 2 of join statement has datatype 
double which is incompatible with type of corresponding column in earlier 
relation(s) in the statement
{code}
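A possible workaround, sketched here untested and assuming the explicit cast 
from the issue description below also applies to this script: because C reports 
CA_EPOCH as bytearray, MIN(C.CA_EPOCH) is typed as double (per the Pig 
documentation, MIN casts bytearray input to double), which would explain the 
mismatch in the JOIN. Casting the flattened fields right after the FLATTEN 
should keep both join keys as chararray. The alias names C1, G_C1, G2 and J1 
are made up for illustration:
{code:java}
-- Cast the flattened fields explicitly, since DESCRIBE C reports them as bytearray.
C1 = FOREACH C GENERATE CA_RowKey, (chararray)CA_CASEID AS CA_CASEID,
     (chararray)CA_EPOCH AS CA_EPOCH, NUMBER;
DESCRIBE C1;

-- With a chararray field, MIN should return chararray instead of double.
G_C1 = GROUP C1 BY CA_CASEID;
G2 = FOREACH G_C1 GENERATE group AS CA_CASEID, MIN(C1.CA_EPOCH) AS CA_EPOCH;
J1 = JOIN C1 BY (CA_CASEID, CA_EPOCH), G2 BY (CA_CASEID, CA_EPOCH);
DESCRIBE J1;
{code}
Note that MIN on chararray compares strings lexicographically, which matches 
numeric order only while the CA_EPOCH values all have the same length.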

> FLATTEN infers wrong datatype
> -----------------------------
>
>                 Key: PIG-5404
>                 URL: https://issues.apache.org/jira/browse/PIG-5404
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.17.0
>            Reporter: Bruno Pusztahazi
>            Assignee: Koji Noguchi
>            Priority: Critical
>              Labels: datatypes, flatten
>
> In version 0.12 (checked out branch-0.12) the following code works as 
> expected with this input file, test.csv:
>  
> {code:java}
> John_5,18,4.0F
> Mary_6,19,3.8F
> Bill_7,20,3.9F
> Joe_8,18,3.8F{code}
>  
>  
> {code:java}
> A = LOAD 'test.csv' USING PigStorage (',') AS 
> (name:chararray,age:int,gpr:float);
> B = FOREACH A GENERATE FLATTEN(STRSPLIT(name,'_')) as 
> (name1:chararray,name2:chararray),age,gpr;
> DESCRIBE B;{code}
> and produces the following output:
>  
> {code:java}
> B: {name1: chararray,name2: chararray,age: int,gpr: float}
> {code}
> This is the expected output, since the fields produced by FLATTEN are declared as chararray.
>  
> When using version 0.17 (checked out branch-0.17) the code produces:
> {code:java}
> B: {name1: bytearray,name2: bytearray,age: int,gpr: float}
> {code}
> This shows that somehow FLATTEN inferred the wrong data types (bytearray instead 
> of chararray).
>  
> Using explicit casting as a workaround on 0.17:
> {code:java}
> B1 = FOREACH B GENERATE (chararray)name1,(chararray)name2,age,gpr;
> DESCRIBE B1;{code}
> produces
> {code:java}
> B1: {name1: chararray,name2: chararray,age: int,gpr: float}
> {code}
> This time with the expected data types.
>  
> The explain plan shows some strange cast operators that are not really used 
> (or at least the actual data types are wrong):
> {code:java}
> #-----------------------------------------------
> # New Logical Plan:
> #-----------------------------------------------
> B: (Name: LOStore Schema: 
> name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
> |
> |---B: (Name: LOForEach Schema: 
> name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
>     |   |
>     |   (Name: LOGenerate[false,false,false,false] Schema: 
> name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)ColumnPrune:OutputUids=[121,
>  105, 122, 106]ColumnPrune:InputUids=[121, 105, 122, 106]
>     |   |   |
>     |   |   (Name: Cast Type: chararray Uid: 121)
>     |   |   |
>     |   |   |---name1:(Name: Project Type: bytearray Uid: 121 Input: 0 
> Column: 0)
>     |   |   |
>     |   |   (Name: Cast Type: chararray Uid: 122)
>     |   |   |
>     |   |   |---name2:(Name: Project Type: bytearray Uid: 122 Input: 1 
> Column: 0)
>     |   |   |
>     |   |   age:(Name: Project Type: int Uid: 105 Input: 2 Column: 0)
>     |   |   |
>     |   |   gpr:(Name: Project Type: float Uid: 106 Input: 3 Column: 0)
>     |   |
>     |   |---(Name: LOInnerLoad[0] Schema: name1#121:bytearray)
>     |   |
>     |   |---(Name: LOInnerLoad[1] Schema: name2#122:bytearray)
>     |   |
>     |   |---(Name: LOInnerLoad[2] Schema: age#105:int)
>     |   |
>     |   |---(Name: LOInnerLoad[3] Schema: gpr#106:float)
>     |
>     |---B: (Name: LOForEach Schema: 
> name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
>         |   |
>         |   (Name: LOGenerate[true,false,false] Schema: 
> name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
>         |   |   |
>         |   |   (Name: UserFunc(org.apache.pig.builtin.STRSPLIT) Type: tuple 
> Uid: 132)
>         |   |   |
>         |   |   |---(Name: Cast Type: chararray Uid: 104)
>         |   |   |   |
>         |   |   |   |---name:(Name: Project Type: bytearray Uid: 104 Input: 0 
> Column: (*))
>         |   |   |
>         |   |   |---(Name: Constant Type: chararray Uid: 131)
>         |   |   |
>         |   |   (Name: Cast Type: int Uid: 105)
>         |   |   |
>         |   |   |---age:(Name: Project Type: bytearray Uid: 105 Input: 1 
> Column: (*))
>         |   |   |
>         |   |   (Name: Cast Type: float Uid: 106)
>         |   |   |
>         |   |   |---gpr:(Name: Project Type: bytearray Uid: 106 Input: 2 
> Column: (*))
>         |   |
>         |   |---(Name: LOInnerLoad[0] Schema: name#104:bytearray)
>         |   |
>         |   |---(Name: LOInnerLoad[1] Schema: age#105:bytearray)
>         |   |
>         |   |---(Name: LOInnerLoad[2] Schema: gpr#106:bytearray)
>         |
>         |---A: (Name: LOLoad Schema: 
> name#104:bytearray,age#105:bytearray,gpr#106:bytearray)RequiredFields:null
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
