Could you share a portion of the data, or even just the offending values, e.g. a Min and Max pair that isn't correct, along with the values you expected for the Min and Max?
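If it helps with producing the expected values, here is a minimal sketch of an independent cross-check outside Pig. It assumes a local tab-separated sample with the same eight columns as in the script, groups on the third column (field3), and takes the plain string min/max of the eighth column (field8). The function name `expected_min_max` and the sample rows are hypothetical, made up for illustration. It also assumes empty fields should be skipped the way nulls are, and that the values are ASCII, so Python's string ordering should match Java's.

```python
def expected_min_max(lines, group_col=2, value_col=7):
    """Return {group: (min, max)} using plain string comparison.

    group_col and value_col are 0-based, so the defaults point at
    field3 and field8 of the eight-column layout in the script.
    """
    result = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= max(group_col, value_col):
            continue  # skip short or malformed rows
        key, val = row[group_col], row[value_col]
        if val == "":
            continue  # assumption: empty field is treated as null and ignored
        if key not in result:
            result[key] = (val, val)
        else:
            lo, hi = result[key]
            result[key] = (min(lo, val), max(hi, val))
    return result


# Hypothetical sample rows, not taken from the real data set.
sample = [
    "1\t2\t10\ta\tb\tc\td\t2015-03-01",
    "1\t2\t10\ta\tb\tc\td\t2015-01-15",
    "1\t2\t20\ta\tb\tc\td\t2014-12-31",
]

for group, (lo, hi) in sorted(expected_min_max(sample).items()):
    print(group, lo, hi)
```

Running this against the same subset you feed to Pig gives a Min/Max pair per field3 value to compare with Minval and Maxval from the job output.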

On Sun, Mar 29, 2015 at 7:46 AM, Ronald Green <[email protected]> wrote:

> I can share demo data to go with the script. Anyone has any clue?
>
> On 24 March 2015 at 14:04, Ronald Green <[email protected]> wrote:
>
> > Hi,
> >
> > I stumbled upon a case where MIN/MAX on strings results with values that
> > are definitely not the minimum or the maximum:
> >
> > When executed on 1 million records the following script results in wrong
> > values for MIN/MAX:
> >
> > ```
> > src = LOAD 's3n://.../' USING PigStorage('\t','-noschema') AS
> > (field1:int,
> > field2:int, field3:int, field4:chararray, field5:chararray,
> > field6:chararray, field7:chararray, field8:chararray);
> > agg = GROUP src BY (field3);
> > proj = FOREACH agg GENERATE group AS field3, COUNT_STAR(proj) AS
> > countme, datafu.pig.stats.HyperLogLogPlusPlus(proj.field5) AS HLL1,
> > MIN(proj.field8) AS Minval, MAX(proj.field8) AS Maxval;
> > STORE copy_of_destination14 INTO 's3n://...' USING PigStorage('\t');
> > ```
> >
> > If I make the following changes, the results for MIN and MAX are as
> > expected:
> >
> > 1. Remove use of HyperLogLogPlusPlus
> > 2. If I treat field8 as a datetime field instead of chararray
> > 3. If I only execute this on 1/100 of the data
> >
> > Note that the job is comprised of a single map/reduce job with a single
> > map task and a single reduce task.
> >
> > Any idea?
> >
> > Thanks,
> > Ron
> >

--
https://github.com/bearrito
@deepbearrito
