Sergey created PIG-3402:
---------------------------

             Summary: Incorrect ORDER BY after UNION ONSCHMEA. Pig handles Long 
atom as chararray
                 Key: PIG-3402
                 URL: https://issues.apache.org/jira/browse/PIG-3402
             Project: Pig
          Issue Type: Bug
            Reporter: Sergey


Here is a part of script:
{code}
lastEndPoints24h = LOAD '$lastEndPoints24h' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage();
describe lastEndPoints24h;
dump lastEndPoints24h;

lastEndPoints24hProj = FOREACH lastEndPoints24h GENERATE msisdn, 
toLong((chararray)ts) as ts:long,
                                                               center_lon, 
center_lat,
                                                               lac, cid, lon, 
lat, cell_type, is_active, azimuth, hpbw, max_dist,
                                                               tile_id, 
zone_col, zone_row,
                                                               is_end_point, 
end_point_type;
describe lastEndPoints24hProj;
dump lastEndPoints24hProj;

unionOfPivotsAndLastEndPoints = UNION ONSCHEMA validPivotsProj, 
lastEndPoints24hProj;
describe unionOfPivotsAndLastEndPoints;
dump unionOfPivotsAndLastEndPoints;


groupedValidPivots = GROUP unionOfPivotsAndLastEndPoints BY msisdn;

pivotsWithEndPoints = FOREACH groupedValidPivots {
                ordered = ORDER unionOfPivotsAndLastEndPoints BY ts;

{code}
The problem is that unionOfPivotsAndLastEndPoints are not correctly sorted. 
Looks like PIg assumes that ts field is chararray.

Here are dumps and schemas of relations:
{code}
lastEndPoints24h: {msisdn: long,ts: long,center_lon: double,center_lat: 
double,lac: int,cid: int,lon: double,lat: double,cell_type: 
chararray,is_active: boolean,azimuth: int,hpbw: int,max_dist: int,tile_id: 
int,zone_col: int,zone_row: int,is_end_point: boolean,end_point_type: chararray}
--dump
(79263332100,1374521131,37.553441893272755,55.880436657140294,7712,24316,37.5473,55.8792,OUTDOOR,true,75,60,1102,49646,469,410,true,JITTER_START)
{code}

{code}
lastEndPoints24hProj: {msisdn: long,ts: long,center_lon: double,center_lat: 
double,lac: int,cid: int,lon: double,lat: double,cell_type: 
chararray,is_active: boolean,azimuth: int,hpbw: int,max_dist: int,tile_id: 
int,zone_col: int,zone_row: int,is_end_point: boolean,end_point_type: chararray}
(79263332100,1374521131,37.553441893272755,55.880436657140294,7712,24316,37.5473,55.8792,OUTDOOR,true,75,60,1102,49646,469,410,true,JITTER_START)
{code}

{code}
unionOfPivotsAndLastEndPoints: {msisdn: long,ts: long,lac: int,cid: int,lon: 
double,lat: double,azimuth: int,hpbw: int,max_dist: int,cell_type: 
chararray,branch_id: int,center_lon: double,center_lat: double,tile_id: 
int,zone_col: int,zone_row: int,is_active: boolean,is_end_point: 
boolean,end_point_type: chararray}
--union dump:
(79263332100,1374529463,7712,5258,37.5564,55.8845,210,60,765,OUTDOOR,5145,37.55330379777028,55.881137048806984,49646,469,410,true,,)
(79263332100,1374550275,7712,24316,37.5473,55.8792,75,60,1102,OUTDOOR,5145,37.55614891372749,55.88052982685867,49646,471,410,true,,)
--more lines here...
--the last one came from projection lastEndPoints24hProj
(79263332100,1374521131,7712,24316,37.5473,55.8792,75,60,1102,OUTDOOR,,37.553441893272755,55.880436657140294,49646,469,410,true,true,JITTER_START)
{code}

Looks like everything is OK, but it's not true!
Here is input for UDF after ORDER BY:
{code}
--a part of code
groupedValidPivots = GROUP unionOfPivotsAndLastEndPoints BY msisdn;

pivotsWithEndPoints = FOREACH groupedValidPivots {
                ordered = ORDER unionOfPivotsAndLastEndPoints BY ts;
                GENERATE FLATTEN(udf.mark_end_points(ordered, 'ts:1, lac:2, 
cid:3, is_end_point:17, lon:4, lat:5, azimuth:6, hpbw:7, max_dist:8'))
{code}

ordered projection print from UDF:
{code}
ITERATE PIVOTS: 0 ) (79263332100L, 1374529463, 7712, 5258, 37.5564, 55.8845, 
210, 60, 765, u'OUTDOOR', 5145, 37.55330379777028, 55.881137048806984, 49646, 
469, 410, True, None, None)
--more lines here...
ITERATE PIVOTS: 22 ) (79263332100L, 1374521131L, 7712, 24316, 37.5473, 55.8792, 
75, 60, 1102, u'OUTDOOR', None, 37.553441893272755, 55.880436657140294, 49646, 
469, 410, True, True, u'JITTER_START')
{code}

See that 1374521131L has "L" and 1374529463 doesn't have (it's ts atom value)
See that 1374529463 > 1374521131, but tuple with ts=1374521131L is at the end 
of list. Looks like sorting was applied to ts:hararray, not to ts:long.

It's weird. :(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to