jorisvandenbossche commented on PR #240: URL: https://github.com/apache/parquet-format/pull/240#issuecomment-2638004832
I follow what Dewey has already answered, but I am just trying to additionally clarify a few points from Ryan's post:

> Also, please correct me if I'm wrong here. My current understanding is that the WKB data will correspond to the CRS even if the bounding box dimensions override it.

@rdblue if I understand you correctly, then I think that is not correct. WKB data is defined to be x/y, and almost any producer of WKB values or file format using WKB under the hood (including GeoParquet) will use the mapping of x=lon / y=lat. So for example when using EPSG:4326 (defined with an axis order of lat/lon), the WKB will not correspond to the CRS.

> This specifically states that the order of dimensions in bounding box metadata must differ from the CRS in some cases. To me, that seems like a big implementation risk if people don't know to swap them. In addition, the names that we use for the bounding box values (xmin, ymin, xmax, ymax) are misleading when the WKB values use x=latitude, y=longitude but x and y in metadata must be x=longitude, y=latitude.

So given my answer above, your last sentence here is also not correct (I am considering GeoParquet here for a moment). We define both the bbox and the WKB values to use the convention of x=lon / y=lat, so that the bbox and the WKB data are always consistent with each other. This actually ensures that you can read and filter data based on the bbox statistics _without_ having to inspect the CRS of the column.

You mention _"I think we want to avoid needing everything to understand the CRS"_, but that is exactly what GeoParquet tries to achieve by saying that x=lon and y=lat. Because if you are not sure whether the bbox and WKB data are lon/lat or lat/lon, then you always have to inspect the CRS first before you know how to specify the bbox filter and how to parse the WKB values.

---

This is clearly all confusing, and it is easy to misunderstand or misinterpret each other, which is IMO a good reason to make this more explicit in the spec.
So I am personally not a fan of Dewey's last suggestion of leaving this vague and then letting implementations choose how to handle this (which will then in practice be how GeoParquet does it, I would guess, but which is the opposite of what you _could_ read in the current version of the spec).
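
To make the x=lon / y=lat convention described above concrete, here is a minimal Python sketch. It is not taken from GeoParquet or from the Parquet spec: the helper names (`encode_wkb_point`, `decode_wkb_point`, `bbox_of`, `bbox_intersects`) and the sample coordinates are hypothetical and only for illustration. It hand-encodes two WKB points as x=lon / y=lat, derives a bbox from them, and compares a filter written in the same x/y order against one written in the lat/lon axis order of EPSG:4326.

```python
import struct


def encode_wkb_point(x, y):
    """Encode a 2D point as little-endian WKB: byte order, geometry type 1, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_wkb_point(wkb):
    """Decode a 2D WKB point and return (x, y) exactly as stored."""
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError("only 2D points are handled in this sketch")
    return x, y


def bbox_of(wkb_values):
    """Compute xmin/ymin/xmax/ymax from the stored x/y values, ignoring the CRS."""
    xs, ys = zip(*(decode_wkb_point(w) for w in wkb_values))
    return {"xmin": min(xs), "ymin": min(ys), "xmax": max(xs), "ymax": max(ys)}


def bbox_intersects(a, b):
    """Return True if two boxes overlap (boundaries included)."""
    return (
        a["xmin"] <= b["xmax"] and a["xmax"] >= b["xmin"]
        and a["ymin"] <= b["ymax"] and a["ymax"] >= b["ymin"]
    )


# Assumed sample data: two points written as x=lon, y=lat.
column = [
    encode_wkb_point(4.9, 52.37),   # roughly Amsterdam
    encode_wkb_point(2.35, 48.86),  # roughly Paris
]
stats = bbox_of(column)

# A filter expressed in the same x=lon / y=lat order matches without reading the CRS.
query_lon_lat = {"xmin": 0.0, "ymin": 45.0, "xmax": 10.0, "ymax": 55.0}
# The same region expressed in lat/lon axis order misses the data.
query_lat_lon = {"xmin": 45.0, "ymin": 0.0, "xmax": 55.0, "ymax": 10.0}

print(bbox_intersects(stats, query_lon_lat))  # True
print(bbox_intersects(stats, query_lat_lon))  # False
```

The second query is the failure mode to worry about: nothing errors, the filter just returns no rows, which is why it seems worth pinning the axis order down in the spec text.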