GitHub user Robinlovelace edited a discussion: Observations from R and Python 
benchmarks: performance bottlenecks and optimization ideas for sedona-db

I have benchmarked the sedona-db R and Python interfaces against established 
tools such as geopandas and sf (R). The full reproducible setup and current 
results are here: 
[Robinlovelace/geobench](https://github.com/Robinlovelace/geobench).

The benchmarks show impressive performance for Sedona. I hope the results are 
of interest, and I think some of the observations below could point the way to 
speed-ups in the interfaces. I'm quite new to the project, so I'm sharing these 
observations as a discussion rather than bombarding the project with issues. 
I've already opened one issue, linked in section 4 below.

  1. Python: The Shapely Deserialization Bottleneck
  In my Python benchmarks, I noticed a significant drop in performance when 
collecting query results using .to_pandas().

   * Observation: .to_pandas() triggers a conversion from internal Arrow/WKB 
     binary geometries into Python shapely objects. For a 100k-point dataset, 
     this conversion becomes the dominant cost.
   * Workaround: By bypassing .to_pandas() and staying in the Arrow/Polars 
ecosystem, I achieved a ~5.5x speedup.
   * Implementation: See the 
     [bench_sedona_polars.py](https://github.com/Robinlovelace/geobench/blob/main/scripts/bench_sedona_polars.py) 
     script.
   * Question: Would the team be open to adding a native .to_polars() method 
(or similar) to keep geometries in efficient binary format?
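
To make the cost in the observation above concrete, here is a minimal, standard-library-only sketch. It is not sedona-db or shapely code: the `Point` class is a hypothetical stand-in for a per-row Python geometry object, and `encode_points`, `bounds_via_objects` and `bounds_from_buffer` are illustrative helper names. It assumes 2D points stored as little-endian WKB in one contiguous buffer, and contrasts materializing one object per row with working on the packed bytes directly.

```python
import struct

# WKB layout for a 2D point, little-endian: 1-byte byte-order flag,
# uint32 geometry type (1 = Point), then two float64 coordinates.
# That is 21 bytes per point, with no padding because of the "<" prefix.
POINT_WKB = struct.Struct("<BIdd")


def encode_points(coords):
    """Pack (x, y) pairs into one contiguous buffer of WKB points."""
    return b"".join(POINT_WKB.pack(1, 1, x, y) for x, y in coords)


class Point:
    """Hypothetical stand-in for a per-row Python geometry object."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def bounds_via_objects(buf):
    """Materialize one Python object per row, then compute the bounding box."""
    points = [Point(x, y) for _, _, x, y in POINT_WKB.iter_unpack(buf)]
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def bounds_from_buffer(buf):
    """Compute the same bounding box in one pass over the packed bytes."""
    xmin = ymin = float("inf")
    xmax = ymax = float("-inf")
    for _, _, x, y in POINT_WKB.iter_unpack(buf):
        xmin, xmax = min(xmin, x), max(xmax, x)
        ymin, ymax = min(ymin, y), max(ymax, y)
    return xmin, ymin, xmax, ymax


coords = [(i * 0.5, -i * 0.25) for i in range(1000)]
buf = encode_points(coords)
print(len(buf), bounds_from_buffer(buf) == bounds_via_objects(buf))
```

Both functions return the same result. The difference is that the first allocates an object for every geometry before doing any work, which is the pattern that a binary-preserving export (Arrow or Polars) would avoid.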

  2. R: Direct File Ingestion (GDAL/OGR)
  I found that the R interface currently lacks an equivalent to the Python 
sd.read_pyogrio(). 

   * Current Path: To benchmark loading from disk, I have to read via 
     sf::st_read() and then pipe to sd_to_view(). This materializes heavy R sf 
     objects in memory before they are converted back to Arrow for Sedona.
   * Opportunity: Implementing an sd_read_gdal() or sd_read_pyogrio() 
     equivalent in the R interface (similar to the path in my [R sedonadb 
     script](https://github.com/Robinlovelace/geobench/blob/main/scripts/bench_sedona_r.R)) 
     would allow direct file-to-engine ingestion via Arrow C Streams, 
     drastically reducing startup overhead.

  3. Roadmap: Complex Linestring Operations
  I am curious about the roadmap for parallelized linestring operations in the 
native engine. Specifically:
   * Does the engine currently support (or plan to support) high-performance 
line merges and line splits (ST_LineMerge, ST_Subdivide)?
   * Given the scaling I see in spatial joins, I'm wondering if Sedona could be 
positioned to become the go-to tool for large-scale topological network 
cleaning.

  4. Optimizer Bug (Issue #477)
  I also encountered a specific PhysicalOptimizer schema mismatch error when 
joining Pandas-sourced tables of different sizes. I have documented this in 
detail in [sedona-db Issue #477](https://github.com/apache/sedona-db/issues/477).

  I’d love to hear the community’s thoughts on these observations and how I can 
best contribute to testing these high-performance paths!


GitHub link: https://github.com/apache/sedona/discussions/2576
