2010YOUY01 commented on PR #560:
URL: https://github.com/apache/sedona-db/pull/560#issuecomment-3835471153

   This PR has been reworked. The TL;DR for the option semantics is:
   
   1. For a regular parquet file with a binary column that is physically WKB-encoded, use this option to mark that binary column as geometry:
   ```
   sd.read_parquet(
       "geo_legacy.parquet",
       geometry_columns={
           "geometry": {"encoding": "WKB", "crs": "EPSG:4326", "edges": "planar"}
       },
   )
   ```
   2. If a column is already a geometry column (inferred from the parquet metadata), this option can be used to provide optional fields that are missing from the metadata. If a field is already inferred from the metadata and is set again through the option, an error occurs. This feels safer to me, but I'm open to other opinions.
   ```
   # Inferred option from metadata:
   #     {"encoding": "WKB"} # "crs" is missing
   
   # Provided 'crs' option from `geometry_columns` is allowed
   sd.read_parquet(
       "geo.parquet",
       geometry_columns={
           "geometry": {"crs": "EPSG:4326"}
       },
   )
   # Now 'geometry' column is a geometry column with crs=4326
   ```
   
   ```
   # Inferred option from metadata:
   #     {"encoding": "WKB", "crs": "EPSG:4326"}
   
   # Not allowed to provide option that is already inferred from schema
   sd.read_parquet(
       "geo.parquet",
       geometry_columns={
           "geometry": {"crs": "EPSG:3857"}
       },
   )
   # Errors...
   ```
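
The two cases above boil down to a "fill in, never override" rule. Here is a minimal Python sketch of that rule, not the PR's actual Rust code: `merge_column_options` is a hypothetical helper, and plain dicts stand in for the per-column options.

```python
from typing import Optional


def merge_column_options(inferred: dict, provided: Optional[dict]) -> dict:
    """Combine options inferred from file metadata with user-provided ones.

    Hypothetical helper: fields absent from `inferred` are filled in from
    `provided`; a field present in both is rejected, not overridden.
    """
    merged = dict(inferred)
    for key, value in (provided or {}).items():
        if key in merged:
            raise ValueError(
                f"option '{key}' is already set by file metadata "
                f"({merged[key]!r}); cannot set it again to {value!r}"
            )
        merged[key] = value
    return merged


# Missing 'crs' is filled in from the user option.
print(merge_column_options({"encoding": "WKB"}, {"crs": "EPSG:4326"}))

# 'crs' is already inferred, so providing it again is an error.
try:
    merge_column_options(
        {"encoding": "WKB", "crs": "EPSG:4326"}, {"crs": "EPSG:3857"}
    )
except ValueError as err:
    print(f"error: {err}")
```

For a plain binary column (case 1) nothing is inferred, so every provided field is accepted.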
   
   ## Implementation/Key changes
   ```text
   (existing)
   geoparquet metadata --> (per col) GeoParquetColumnMetadata --> schema
   
   (PR)
   geoparquet metadata --> (per col) GeoParquetColumnMetadata ----+
                                                                  | (combine)
                                                                  |
   user option geometry_columns --> GeoParquetColumnMetadata -----+--> schema
   ```
   1. Parse the option with `serde_json::from_str`, the same way the parquet metadata is parsed, and store the column options in `GeoParquetFormat -> TableGeoParquetOption`, since the `TableFormat` trait is used to build the schema. When `infer_schema()` is called, combine the `GeoParquetColumnMetadata` from both the file metadata and the `geometry_columns` option.
   2. Refactor `GeoParquetColumnMetadata` to make its `encoding` field optional. Since this field is required by the GeoParquet spec, assertions are added to the existing deserializer to ensure it exists in the parquet metadata.
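
To illustrate item 2, here is a rough Python analogue of the parsing step. It is only a sketch, not the Rust implementation: `ColumnMetadata` and `parse_column_metadata` are hypothetical names, and `crs` is simplified to a string. Both sources share one JSON parser, but `encoding` is enforced only when the input comes from the file metadata.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetadata:
    # Hypothetical Python stand-in for the Rust `GeoParquetColumnMetadata`.
    encoding: Optional[str] = None  # optional so user options may omit it
    crs: Optional[str] = None       # simplified: treated as a plain string here
    edges: Optional[str] = None


def parse_column_metadata(text: str, from_file_metadata: bool) -> dict:
    """Parse a JSON object mapping column name -> column options."""
    columns = {}
    for name, fields in json.loads(text).items():
        column = ColumnMetadata(**fields)  # unknown keys raise TypeError
        # The GeoParquet spec requires `encoding`, so enforce it only for
        # metadata read from the file, not for user-provided options.
        if from_file_metadata and column.encoding is None:
            raise ValueError(f"column '{name}': 'encoding' is required")
        columns[name] = column
    return columns


# A user option may carry only the fields it wants to add.
user = parse_column_metadata(
    '{"geometry": {"crs": "EPSG:4326"}}', from_file_metadata=False
)
print(user["geometry"])
```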
   

