wgtmac commented on PR #2971:
URL: https://github.com/apache/parquet-java/pull/2971#issuecomment-2796402724

   Thanks for the review @mkaravel @paleolimbot @jiayuasu!
   
   Let me clarify some things about Parquet statistics first:
   - Parquet does not consider `NaN` when collecting min/max values for 
float/double values. So by no means we can know if any NaN value exists simply 
for the column statistics.
   - There is [an ongoing 
effort](https://github.com/apache/parquet-format/pull/221
   ) to fix the above issue by introducing a total order: `-Inf < -NaN < 
Negative < -0.0 < +0.0 < Positive < NaN < +Inf`
   - Null values do not interfere with min/max values in the column statistics. 
If all values are null, we can deduce it by checking `null_count == 
value_count` to ignore the empty min/max values.
   
   Based on the above facts, I think we don't need to consider null values for 
geometry values. We still need to take care of NaN and invalid values.
   
   Then I have some questions:
   - What is an invalid value in a geometry feature? `NaN`? `+/-Inf`? Anything 
else?
   - From the above discussion, it seems that `+/-Inf` are invalid values in 
terms of a bbox. If that's true, definitely we should not make it as the final 
bbox to persist in the file. Is `NaN` a valid value in a bbox? Is it a good way 
to use `NaN` values for an empty bbox?
   - Is it a good approach to drop the entire bbox if any NaN or invalid value 
appears? In this way, we do not fail the writer at the cost of missing bbox. 
I'm in favor of this so we do not produce any confusing stats to users. It is 
really hard to downstream users to decide if the provided stats are reliable 
for predicate push down.
   
   I think @mkaravel's proposal (pasted as below) looks reasonable:
   ```
   - Initialize the box to (inf, -inf, inf, -inf).
   - Compute the bounding box of the current geometry.
   - If any of the box's coordinates is NaN ignore that box. Otherwise merge it 
with the existing box for the group.
   - If the box is still (inf, -inf, inf, -inf) output (NaN, NaN, NaN, NaN) for 
the group. Otherwise, output the box we have already computed.
   ```
   For the last step, I think it is better to drop the bbox instead of writing 
all `NaN`s to confuse users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to