Putting aside for a moment the question of hashing -0 and +0, I wonder if this 
could be addressed by ordering floating point numbers using the IEEE 754 
totalOrder predicate and, when a file contains a NaN, omitting the affected 
field from manifest_entry.data_file.{sort_columns, lower_bounds, upper_bounds}.
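To make the idea concrete, here is a rough sketch in Python. It is only an 
illustration, not a proposal for the actual implementation: the function names 
total_order_key and column_bounds are hypothetical, and the key function uses 
the common bit-pattern trick for doubles rather than anything taken from the 
spec.

```python
import math
import struct
from typing import Optional, Sequence, Tuple


def total_order_key(x: float) -> int:
    """Map a double to an int whose natural order follows IEEE 754 totalOrder."""
    # Reinterpret the 64-bit pattern as a signed integer.
    bits = struct.unpack(">q", struct.pack(">d", x))[0]
    # For negative floats, flip the lower 63 bits so that a larger magnitude
    # sorts lower. This also places -0.0 immediately below +0.0.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


def column_bounds(values: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (lower, upper) under totalOrder, or None if the bounds should be omitted.

    Hypothetical helper: None stands for "leave this field out of
    lower_bounds / upper_bounds" when the column contains a NaN.
    """
    if not values or any(math.isnan(v) for v in values):
        return None
    lower = min(values, key=total_order_key)
    upper = max(values, key=total_order_key)
    return lower, upper
```

With this, column_bounds([2.0, -1.0]) gives (-1.0, 2.0), while any input 
containing a NaN gives None, so a reader sees no bounds at all rather than 
bounds it might misread.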

The logic here is that, though ham-fisted, this would also prevent engines from 
misinterpreting these fields. A natural follow-up question is, "should we 
populate these values in some other way less likely to be misinterpreted by 
compute engines?" IIRC, parquet's transition from {min,max} to 
{min_value,max_value} was motivated by an ambiguity or bug in the spec. This 
starts to get a bit arcane, but maybe we WANT a speed bump to stop engines from 
prune or search by using the non-total-order operators like <=.
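As a small illustration of why the ordinary comparison operators are risky 
here, the sketch below computes bounds with Python's built-in min and max, 
which rely on < and >. The helper name naive_bounds is made up for this 
example. Because every comparison against NaN is false, the result depends on 
where the NaN happens to sit in the input.

```python
import math
from typing import Sequence, Tuple


def naive_bounds(values: Sequence[float]) -> Tuple[float, float]:
    """Compute (min, max) with the ordinary, non-total comparison operators."""
    return min(values), max(values)


nan = float("nan")

# NaN first: nothing compares less than or greater than it, so it "wins" both.
nan_first = naive_bounds([nan, 1.0, 2.0])

# NaN in the middle: it never beats the running min or max, so it vanishes.
nan_middle = naive_bounds([1.0, nan, 2.0])

print(nan_first, nan_middle)
```

The same set of values yields either NaN bounds or bounds that silently ignore 
the NaN, which is the kind of ambiguity an engine pruning with <= could trip 
over.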

Thoughts?
