Hi,
I ran into a case where MIN/MAX on strings (chararray) returns values that
are definitely not the minimum or the maximum. When executed on 1 million
records, the following script produces wrong values for MIN/MAX:
```
src = LOAD 's3n://.../' USING PigStorage('\t','-noschema') AS (field1:int,
field2:int, field3:int, field4:chararray, field5:chararray,
field6:chararray, field7:chararray, field8:chararray);
agg = GROUP src BY (field3);
proj = FOREACH agg GENERATE group AS field3, COUNT_STAR(src) AS
countme, datafu.pig.stats.HyperLogLogPlusPlus(src.field5) AS HLL1,
MIN(src.field8) AS Minval, MAX(src.field8) AS Maxval;
STORE proj INTO 's3n://...' USING PigStorage('\t');
```
With the following changes, the results for MIN and MAX are as expected:
1. Removing the use of HyperLogLogPlusPlus
2. Treating field8 as a datetime field instead of a chararray
3. Running the script on only 1/100 of the data
Note that the script runs as a single map/reduce job with a single map
task and a single reduce task.
Any idea what might cause this?
Thanks,
Ron