paleolimbot commented on code in PR #494:
URL: https://github.com/apache/parquet-format/pull/494#discussion_r2048178831

##########
Geospatial.md:
##########
@@ -104,6 +104,19 @@ crosses the antimeridian line. In geographic terminology, the concepts of `xmin`
 For `GEOGRAPHY` types, X and Y values are restricted to the canonical ranges of [-180, 180] for X and [-90, 90] for Y.

+When `GeospatialStatistics` is present, writers must omit zmin and zmax if and
+only if there are zero non-NaN Z values in the column chunk, and must omit mmin
+and mmax if and only if there are zero non-NaN M values. The bounding box must
+be omitted entirely if and only if there are zero non-NaN X values or zero
+non-NaN Y values in the column chunk. If Z or M values are missing, the writer
+may still include a bounding box using only the available dimensions.
+
+Readers may interpret the absence of a bounding box, zmin/zmax, or mmin/mmax as
+an indication that all corresponding values are null, and may use this
+information to skip data during predicate evaluation. For example, a reader may
+skip a row group if the bounding box is absent, indicating that all X and Y
+coordinates are null.

Review Comment:
I think that's the idea with this language... we need the absent-ness to be significant so that there is a path for a reader to skip an all empty/null row group, and the other ways of communicating absent-ness were also confusing (use NaNs, use Inf/-Inf). We can also add another field like `optional dimensions_that_have_zero_non_nan_values` (with a less verbose name)?
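
For illustration, the omit-if-and-only-if rule in the proposed wording could be sketched roughly as below. This is a minimal Python sketch, not code from any Parquet implementation: `BoundingBox`, `write_bbox`, `can_skip_chunk`, and the `needs_z` flag are hypothetical names, coordinates are assumed to arrive as `(x, y, z, m)` tuples with `None` or NaN for missing ordinates, and the antimeridian wraparound case for `GEOGRAPHY` is ignored.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Ordinate = Optional[float]


@dataclass
class BoundingBox:
    # Hypothetical stand-in for the bounding box carried by GeospatialStatistics.
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    zmin: Optional[float] = None
    zmax: Optional[float] = None
    mmin: Optional[float] = None
    mmax: Optional[float] = None


def _min_max(values: Iterable[Ordinate]) -> Optional[Tuple[float, float]]:
    """Return (min, max) over the non-NaN values, or None if there are none."""
    kept = [v for v in values if v is not None and not math.isnan(v)]
    if not kept:
        return None
    return min(kept), max(kept)


def write_bbox(
    coords: Iterable[Tuple[Ordinate, Ordinate, Ordinate, Ordinate]],
) -> Optional[BoundingBox]:
    """Writer side: omit the box, or zmin/zmax, or mmin/mmax, exactly when
    the corresponding dimension has zero non-NaN values."""
    coords = list(coords)
    x = _min_max(c[0] for c in coords)
    y = _min_max(c[1] for c in coords)
    if x is None or y is None:
        # Zero non-NaN X or Y values: the whole bounding box is omitted.
        return None
    z = _min_max(c[2] for c in coords)
    m = _min_max(c[3] for c in coords)
    return BoundingBox(
        xmin=x[0], xmax=x[1], ymin=y[0], ymax=y[1],
        zmin=None if z is None else z[0],
        zmax=None if z is None else z[1],
        mmin=None if m is None else m[0],
        mmax=None if m is None else m[1],
    )


def can_skip_chunk(bbox: Optional[BoundingBox], needs_z: bool = False) -> bool:
    """Reader side: treat absence as 'no usable values in that dimension'.
    Only valid for predicates that cannot match empty or null coordinates."""
    if bbox is None:
        return True
    if needs_z and bbox.zmin is None:
        return True
    return False
```

Under these assumptions, a chunk holding only empty or NaN coordinates yields no bounding box, so a spatial filter could skip it, while a chunk with XY data but no Z still gets a box and is skipped only by predicates that require Z.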
