mkaravel commented on PR #2971: URL: https://github.com/apache/parquet-java/pull/2971#issuecomment-2778773392
> Thank you for the review! > > > If at the end of processing a row group you still see a bbox of (inf, -inf, inf, -inf) then expose it as (NaN, NaN, NaN, NaN). This means you have not found any non-empty geometry... > > Probably the main issue here is that in the Thrift `BoundingBox`, only `zmin/max` and `mmin/max` are `optional`, and I forgot when reviewing that about the 100% EMPTY case for x and y. The entire `BoundingBox` is technically `optional` in Thrift, but we use an "unset" `BoundingBox` to definitively mean that there were no x/y values it needs to be very clear in the spec. Either way we should add a clarification that unset `zmin/max` and `mmin/max` can only occur if there were no non-NaN Z or M values encountered (if that is indeed what we meant by that). > > If `NaN`s are the convention we land on (instead of unset BoundingBox), it's critical that that `NaN`s are not generated for any other case (for example, it's critical to ensure that a column chunk containing the invalid geometry `LINESTRING (1 2, nan nan, 3 4)`, does _not_ expose NaN, NaN, NaN, NaN as its x--y statistics, because that would result in a predicate pushdown skipping the entire array which may contain other values that do satisfy the predicate). If my reading of the current state of this PR is correct, this still needs to be fixed here. > > I personally like exposing inf, -inf, inf, -inf better than NaN (currently what C++ does) because it is impossible (I think) to generate this by accident and "just works" with the formula provided in the Parquet format for interval containment (we only provided the formula for `x`, but it's reasonable for somebody to read it and apply the non-wraparound version to other dimensions). > > > Could you please share a link to the C++ implementation? Would be very interested in taking a look. > > The PR is here: [apache/arrow#45459](https://github.com/apache/arrow/pull/45459) ...and the files are geospatial_statistics.h/.cc and geospatial_util_internal.cc. Comments welcome! @paleolimbot Thank you for the detailed answer. You are touching upon a few important things, I think we need to figure them out. First at a very high level: I think we should update the Parquet spec (and Iceberg I would say) to explicitly define the behavior in the edge case where the column has all its geometries empty. I also think we should first define the behavior we want and then move on with the implementation rather than try to do it the other way around. I can see the following interesting cases/considerations which are of course can be combined (and thank you for bringing up the issue with geometries with invalid coordinates): * NULL values. * Empty geometries. * Geometries with invalid coordinates. My understanding is that for the purposes of statistics computation we want to skip all such rows. Another option would be to fail reading or writing when we have geometries with invalid coordinates, but I would like to hear opinions about this direction which is quite invasive. So for now I would assume that we do not want to fail in such cases, but rather compute the statistics in a safe manner. So the next question is how do we compute bounding boxes in the presence of such geometries (empty and invalid). Regarding empty geometries I proposed to use NaNs for the box values because NaNs are also typically used to represent the coordinates of empty points (JTS does that). It made more sense to me to have this "symmetry" (for example think of extracting the west-south corner of a box as a point: if you use NaNs you immediately get an empty point which makes sense to be conceptually). This is why I proposed to use NaNs for the case of entire column of empty geometries. As you very well pointed out, this leaves us with a question about how to deal with geometries with invalid coordinates (the question is even more interesting for geographies, but let's leave this for another thread/discussion). I can see two kinds of invalid coordinates: NaNs and infinity values. Everything else (for geometries) should be considered as valid. Let's assume now that we have a bounding box computation algorithm that will just compare coordinates using any double comparison operator except `!=` (I believe the JTS implementation does that). I am also assuming that a bounding box computation algorithm will also skip empty point, empty points in multipoints, and empty points inside geometry collections, appearing either as points or points inside a multipoint (I believe the JTS implementation also does that). In this case any NaN coordinates in the input will surface as NaNs in the resulting box. Inf coordinates may or may not surface up. In the end for a given geometry we end up with a bounding box that possibly has NaN and inf coordinates. By thinking is that we can ignore geometries (for the row group bounding box computation) that have at least one NaN coordinate in their box and keep the inf coordinates as they are. So basically, the flow for computing the bounding box for a group of geometries would be like: * Initialize the box to `(inf, -inf, inf, -inf)`. * Compute the bounding box of the current geometry. * If any of the box's coordinates is NaN ignore that box. Otherwise merge it with the existing box for the group. * If the box is still `(inf, -inf, inf, -inf)` output `(NaN, NaN, NaN, NaN)` for the group. Otherwise, output the box we have already computed. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
