2010YOUY01 opened a new issue, #530:
URL: https://github.com/apache/sedona-db/issues/530

   # Motivation
   
   To convert legacy Parquet files that store geometry as a `BINARY` column 
whose payload is WKB into GeoParquet, the snippet below can be used. It 
explicitly converts the binary WKB payload into a geometry value (and sets the 
SRID), so that SedonaDB recognizes the column as geometry and `to_parquet()` 
can write GeoParquet metadata correctly.
   
   ```python
   # geo_legacy.parquet schema
   # - geo_bin: Binary (payload is WKB)
   # - c1: Int32
   # - c2: Int32
   
   import sedona.db
   
   sd = sedona.db.connect()
   df = sd.read_parquet("/data/geo_legacy.parquet")
   
   # Register the DataFrame as a view so the SQL below can refer to it as "t"
   df.to_view("t", overwrite=True)
   
   df = sd.sql("""
     SELECT
       ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
       * EXCLUDE (geo_bin)
     FROM t
   """)
   
   df.to_parquet("geo_geoparquet.parquet")
   ```
   
   # Proposed new API
   
   It would be helpful to have an easier API for this. Using a dedicated method 
(instead of fusing the cast into `read_parquet()` or `to_parquet()`) makes the 
conversion more flexible, especially when “logically geometry, physically 
WKB-in-binary” columns come from other sources or are produced mid-query.
   
   ```python
   def with_geometry(...):
       """
       Convert one or more binary WKB columns into geometry columns.
   
       Args:
           columns: A column name or list of column names containing WKB payloads.
           crs: Optional CRS identifier (e.g., 4326 or "EPSG:4326").
           validate: If True, validate WKB payloads while converting.
           primary: Optional name to mark as the primary geometry column.
   
       The converted geometry columns are projected first (in the order of
       ``columns`` or with ``primary`` first), followed by the remaining
       columns.
   
       Examples:
           >>> sd = sedona.db.connect()
           >>> df = sd.read_parquet("geo_legacy.parquet").with_geometry(
           ...     columns=["geo_bin"],
           ...     crs="EPSG:4326",
           ...     validate=True,
           ...     primary="geo_bin",
           ... )
       """
   ```
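
The column-ordering rule in the docstring above could be sketched as follows. This is only an illustration of one possible reading of the rule, not part of SedonaDB: the helper name `projection_order` and its error handling are hypothetical.

```python
from typing import List, Optional, Sequence, Union


def projection_order(
    all_columns: Sequence[str],
    columns: Union[str, Sequence[str]],
    primary: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: geometry columns first, then the remaining columns."""
    # Accept either a single column name or a list of names.
    geometry = [columns] if isinstance(columns, str) else list(columns)

    missing = [c for c in geometry if c not in all_columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")

    if primary is not None:
        if primary not in geometry:
            raise ValueError(f"primary {primary!r} is not one of {geometry}")
        # Move the primary geometry column to the front, keep the others in order.
        geometry = [primary] + [c for c in geometry if c != primary]

    # Non-geometry columns keep their original relative order.
    rest = [c for c in all_columns if c not in geometry]
    return geometry + rest
```

For the `geo_legacy.parquet` schema used in this issue, `projection_order(["geo_bin", "c1", "c2"], "geo_bin")` keeps `geo_bin` first, followed by `c1` and `c2`.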
   
   ## Example usage
   
   ```python
   # geo_legacy.parquet schema
   # - geo_bin: Binary (payload is WKB)
   # - c1: Int32
   # - c2: Int32
   
   df = sd.read_parquet("/data/geo_legacy.parquet")
   
   df = df.with_geometry(
       columns="geo_bin",
       crs=4326,
       validate=True,
       primary="geo_bin",
   )
   
   df.to_parquet("geo_geoparquet.parquet")
   ```
   
   ## Implementation
   
   Internally, this simply adds a projection expression for each geometry 
column, e.g. `ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326)`.
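
As a rough sketch of that projection, the SQL could be generated along these lines. The helper names `crs_to_srid` and `geometry_projection_sql` are hypothetical, and several things here are assumptions rather than proposed behaviour: only integer SRIDs and `"EPSG:<code>"` strings are handled, each converted column keeps its original name, and `validate` and `primary` are left out.

```python
from typing import Optional, Sequence, Union


def crs_to_srid(crs: Union[int, str]) -> int:
    """Hypothetical helper: map 4326 or "EPSG:4326" to an integer SRID."""
    if isinstance(crs, int):
        return crs
    authority, _, code = crs.partition(":")
    if authority.upper() != "EPSG" or not code.isdigit():
        raise ValueError(f"unsupported CRS identifier: {crs!r}")
    return int(code)


def geometry_projection_sql(
    view: str,
    wkb_columns: Sequence[str],
    crs: Optional[Union[int, str]] = None,
) -> str:
    """Hypothetical helper: build the SELECT that converts WKB columns to geometry."""
    exprs = []
    for name in wkb_columns:
        expr = f'ST_GeomFromWKB("{name}")'
        if crs is not None:
            # Attach the SRID only when a CRS was requested.
            expr = f"ST_SetSRID({expr}, {crs_to_srid(crs)})"
        exprs.append(f'{expr} AS "{name}"')

    # Geometry expressions come first, then every other column of the view.
    excluded = ", ".join(f'"{c}"' for c in wkb_columns)
    select_list = ",\n  ".join(exprs + [f"* EXCLUDE ({excluded})"])
    return f'SELECT\n  {select_list}\nFROM "{view}"'
```

The resulting string has the same shape as the query in the Motivation section, so it could be passed to `sd.sql(...)` once the DataFrame has been registered with `to_view()`.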
   

