Could you share a portion of the data, or even just the offending values,
e.g. a Min and Max pair that isn't correct, along with the values you
expected the Min and Max to be?
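
If it helps to produce the expected values, here is a rough sketch of how they could be computed outside Pig on a local tab-separated sample. It assumes field3 is the third column and field8 the eighth, as in your LOAD schema; the function name and the sample rows are made up for illustration. For plain ASCII values, Python's string ordering should agree with how Pig compares chararrays, but treat that as an assumption to verify on your data.

```python
def expected_min_max(lines, group_col=2, value_col=7):
    """Map each group key to (min, max) of the value column, compared as strings."""
    stats = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= max(group_col, value_col):
            continue  # skip short or malformed rows
        key, value = fields[group_col], fields[value_col]
        if value == "":
            # Assumption: an empty field is loaded as null and ignored by MIN/MAX.
            continue
        if key not in stats:
            stats[key] = (value, value)
        else:
            lo, hi = stats[key]
            stats[key] = (min(lo, value), max(hi, value))
    return stats


# Made-up sample rows, only to show the call; replace with lines from a real file.
sample = [
    "1\t2\t10\ta\tb\tc\td\t2015-03-01\n",
    "1\t2\t10\ta\tb\tc\td\t2015-01-05\n",
    "1\t2\t20\ta\tb\tc\td\tzeta\n",
]
for key, (lo, hi) in sorted(expected_min_max(sample).items()):
    print(key, lo, hi, sep="\t")
```

Passing an open file object instead of the list works the same way, so the output can be compared group by group against what the Pig job stored.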



On Sun, Mar 29, 2015 at 7:46 AM, Ronald Green <[email protected]>
wrote:

> I can share demo data to go with the script. Does anyone have a clue?
>
> On 24 March 2015 at 14:04, Ronald Green <[email protected]> wrote:
>
> > Hi,
> >
> > I stumbled upon a case where MIN/MAX on strings returns values that
> > are definitely not the minimum or the maximum:
> >
> > When executed on 1 million records the following script results in wrong
> > values for MIN/MAX:
> >
> > ```
> > src = LOAD 's3n://.../' USING PigStorage('\t','-noschema') AS
> > (field1:int, field2:int, field3:int, field4:chararray, field5:chararray,
> > field6:chararray, field7:chararray, field8:chararray);
> > agg = GROUP src BY (field3);
> > proj = FOREACH agg GENERATE group AS field3, COUNT_STAR(src) AS
> > countme, datafu.pig.stats.HyperLogLogPlusPlus(src.field5) AS HLL1,
> > MIN(src.field8) AS Minval, MAX(src.field8) AS Maxval;
> > STORE proj INTO 's3n://...' USING PigStorage('\t');
> > ```
> >
> > If I make the following changes, the results for MIN and MAX are as
> > expected:
> >
> > 1. Remove the use of HyperLogLogPlusPlus
> > 2. Treat field8 as a datetime field instead of chararray
> > 3. Execute this on only 1/100 of the data
> >
> > Note that the job consists of a single map/reduce job with a single
> > map task and a single reduce task.
> >
> > Any idea?
> >
> > Thanks,
> > Ron
> >
>



-- 


https://github.com/bearrito
@deepbearrito
