[ https://issues.apache.org/jira/browse/PIG-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193442#comment-17193442 ]
Bruno Pusztahazi commented on PIG-5404:
---------------------------------------

[~knoguchi] please check the following: here not only the DESCRIBE is affected, but the last JOIN also causes extra issues:
{code:java}
Data Used to create table:

create 'Test', 'CM'
put 'Test','202006181604008928049-9223370442808339240','CM:NUM','1'
put 'Test','202007010956112120091-9223370442807585497','CM:NUM','2'
===============================================================================
Sample Pig Script to replicate in grunt shell:

REGISTER hbase/lib/*.jar;
CA = LOAD 'hbase://Test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('CM:NUM', '-loadKey true') AS (RowKey:bytearray,NUMBER:chararray);
B = FOREACH CA GENERATE RowKey AS CA_RowKey,REPLACE(RowKey,'C-','C') AS RowKey1,NUMBER;
describe B;
B: {CA_RowKey: bytearray,RowKey1: chararray,NUMBER: chararray}
C = FOREACH B GENERATE CA_RowKey,FLATTEN(STRSPLIT(RowKey1,'-')) as (CA_CASEID:chararray,CA_EPOCH:chararray),NUMBER;
describe C;
C: {CA_RowKey: bytearray,CA_CASEID: bytearray,CA_EPOCH: bytearray,NUMBER: chararray}
G = GROUP C BY CA_CASEID;
G1 = FOREACH G GENERATE group as CA_CASEID, MIN(C.CA_EPOCH) AS CA_EPOCH;
J = join C by (CA_CASEID, CA_EPOCH) , G1 by (CA_CASEID, CA_EPOCH);
DESCRIBE J;
{code}
The issue was:
{code:java}
2020-08-14 12:09:06,257 [main] org.apache.pig.tools.grunt.Grunt - 1130: <line 6, column 4> join column no. 2 in relation no. 2 of join statement has datatype double which is incompatible with type of corresponding column in earlier relation(s) in the statement
{code}

> FLATTEN infers wrong datatype
> -----------------------------
>
>                 Key: PIG-5404
>                 URL: https://issues.apache.org/jira/browse/PIG-5404
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.17.0
>            Reporter: Bruno Pusztahazi
>            Assignee: Koji Noguchi
>            Priority: Critical
>              Labels: datatypes, flatten
>
> In version 0.12 (checked out branch-0.12) the following code works as expected.
> With the following input file test.csv:
>
> {code:java}
> John_5,18,4.0F
> Mary_6,19,3.8F
> Bill_7,20,3.9F
> Joe_8,18,3.8F{code}
>
> the script
>
> {code:java}
> A = LOAD 'test.csv' USING PigStorage (',') AS (name:chararray,age:int,gpr:float);
> B = FOREACH A GENERATE FLATTEN(STRSPLIT(name,'_')) as (name1:chararray,name2:chararray),age,gpr;
> DESCRIBE B;{code}
> produces the following output:
>
> {code:java}
> B: {name1: chararray,name2: chararray,age: int,gpr: float}
> {code}
> This is the expected output, as the result of FLATTEN is declared as chararrays.
>
> When using version 0.17 (checked out branch-0.17) the same code produces:
> {code:java}
> B: {name1: bytearray,name2: bytearray,age: int,gpr: float}
> {code}
> This shows that FLATTEN somehow inferred the wrong data types (bytearray instead of chararray).
>
> Using explicit casting as a workaround on 0.17:
> {code:java}
> B1 = FOREACH B GENERATE (chararray)name1,(chararray)name2,age,gpr;
> DESCRIBE B1;{code}
> produces
> {code:java}
> B1: {name1: chararray,name2: chararray,age: int,gpr: float}
> {code}
> This time with the expected data types.
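>
> The same explicit casts may also sidestep the JOIN type error (1130) reported in the comment at the top of this message. The following is a minimal, untested sketch under that assumption: it reuses relation C from the comment's script, the relation name C1 is hypothetical, and it assumes that MIN on a chararray column returns a chararray (a lexicographic minimum, which only matches numeric order while the epoch strings have equal length).
> {code:java}
> -- Untested sketch; C1 is a hypothetical name, C comes from the comment's script.
> -- Cast the FLATTEN output explicitly, mirroring the B1 workaround above.
> C1 = FOREACH C GENERATE CA_RowKey, (chararray)CA_CASEID AS CA_CASEID, (chararray)CA_EPOCH AS CA_EPOCH, NUMBER;
> DESCRIBE C1;
> -- Assumption: with chararray input, MIN no longer yields a double.
> G = GROUP C1 BY CA_CASEID;
> G1 = FOREACH G GENERATE group AS CA_CASEID, MIN(C1.CA_EPOCH) AS CA_EPOCH;
> J = JOIN C1 BY (CA_CASEID, CA_EPOCH), G1 BY (CA_CASEID, CA_EPOCH);
> DESCRIBE J;
> {code}
> If the assumption holds, both join keys are chararray on each side and the type check should pass; this does not address the underlying schema inference bug.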
>
> The explain plan shows some strange cast operators that are not really used (or at least the actual data types are wrong):
> {code:java}
> #-----------------------------------------------
> # New Logical Plan:
> #-----------------------------------------------
> B: (Name: LOStore Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
> |
> |---B: (Name: LOForEach Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
>     |   |
>     |   (Name: LOGenerate[false,false,false,false] Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)ColumnPrune:OutputUids=[121, 105, 122, 106]ColumnPrune:InputUids=[121, 105, 122, 106]
>     |   |   |
>     |   |   (Name: Cast Type: chararray Uid: 121)
>     |   |   |
>     |   |   |---name1:(Name: Project Type: bytearray Uid: 121 Input: 0 Column: 0)
>     |   |   |
>     |   |   (Name: Cast Type: chararray Uid: 122)
>     |   |   |
>     |   |   |---name2:(Name: Project Type: bytearray Uid: 122 Input: 1 Column: 0)
>     |   |   |
>     |   |   age:(Name: Project Type: int Uid: 105 Input: 2 Column: 0)
>     |   |   |
>     |   |   gpr:(Name: Project Type: float Uid: 106 Input: 3 Column: 0)
>     |   |
>     |   |---(Name: LOInnerLoad[0] Schema: name1#121:bytearray)
>     |   |
>     |   |---(Name: LOInnerLoad[1] Schema: name2#122:bytearray)
>     |   |
>     |   |---(Name: LOInnerLoad[2] Schema: age#105:int)
>     |   |
>     |   |---(Name: LOInnerLoad[3] Schema: gpr#106:float)
>     |
>     |---B: (Name: LOForEach Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
>         |   |
>         |   (Name: LOGenerate[true,false,false] Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
>         |   |   |
>         |   |   (Name: UserFunc(org.apache.pig.builtin.STRSPLIT) Type: tuple Uid: 132)
>         |   |   |
>         |   |   |---(Name: Cast Type: chararray Uid: 104)
>         |   |   |   |
>         |   |   |   |---name:(Name: Project Type: bytearray Uid: 104 Input: 0 Column: (*))
>         |   |   |
>         |   |   |---(Name: Constant Type: chararray Uid: 131)
>         |   |   |
>         |   |   (Name: Cast Type: int Uid: 105)
>         |   |   |
>         |   |   |---age:(Name: Project Type: bytearray Uid: 105 Input: 1 Column: (*))
>         |   |   |
>         |   |   (Name: Cast Type: float Uid: 106)
>         |   |   |
>         |   |   |---gpr:(Name: Project Type: bytearray Uid: 106 Input: 2 Column: (*))
>         |   |
>         |   |---(Name: LOInnerLoad[0] Schema: name#104:bytearray)
>         |   |
>         |   |---(Name: LOInnerLoad[1] Schema: age#105:bytearray)
>         |   |
>         |   |---(Name: LOInnerLoad[2] Schema: gpr#106:bytearray)
>         |
>         |---A: (Name: LOLoad Schema: name#104:bytearray,age#105:bytearray,gpr#106:bytearray)RequiredFields:null
> {code}
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)