GitHub user Robinlovelace edited a discussion: Observations from R and Python benchmarks: performance bottlenecks and optimization ideas for sedona-db
I have tested the sedona-db R and Python interfaces against established tools such as geopandas and sf (R). The full reproducible setup and current results are here: [Robinlovelace/geobench](https://github.com/Robinlovelace/geobench). The benchmarks show impressive performance for Sedona.

I hope the results are of interest, and I think some of the observations below could point the way to speed-ups in the interfaces. I'm quite new to the project, so I'm sharing these observations as a discussion rather than bombarding the project with issues. I have already opened one issue, linked in section 4 below.

1. Python: The Shapely Deserialization Bottleneck

In my Python benchmarks, I noticed a significant drop in performance when collecting query results using .to_pandas().

* Observation: .to_pandas() triggers a conversion from internal Arrow/WKB binary geometries into Python shapely objects. For a 100k point dataset, this conversion becomes the dominant cost.
* Workaround: By bypassing .to_pandas() and staying in the Arrow/Polars ecosystem, I achieved a ~5.5x speedup.
* Implementation: See the [bench_sedona_polars.py](https://github.com/Robinlovelace/geobench/blob/main/scripts/bench_sedona_polars.py) script.
* Question: Would the team be open to adding a native .to_polars() method (or similar) that keeps geometries in an efficient binary format?

2. R: Direct File Ingestion (GDAL/OGR)

I found that the R interface currently lacks an equivalent to the Python sd.read_pyogrio().

* Current Path: To benchmark loading from disk, I have to read via sf::st_read() and then pipe to sd_to_view(). This materializes heavy R sf objects in memory before they are converted back to Arrow for Sedona.
* Opportunity: Implementing an sd_read_gdal() or sd_read_pyogrio() equivalent in the R interface (similar to the path in my [R sedonadb](https://github.com/Robinlovelace/geobench/blob/main/scripts/bench_sedona_r.R) benchmark script) would allow direct file-to-engine ingestion via Arrow C Streams, drastically reducing startup overhead.
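
To make the Python observation in section 1 concrete, here is a minimal standard-library sketch of the two collection strategies. It does not use sedona-db, shapely, Arrow, or Polars. The `Point` dataclass is a hypothetical stand-in for a per-row geometry object, and the helper names (`make_point_wkb`, `bbox_via_objects`, `bbox_via_binary`) are made up for illustration. The only assumption about the data is that each geometry is a little-endian 2D WKB point.

```python
import struct
from dataclasses import dataclass

# Little-endian 2D WKB point: byte order flag (1), geometry type (1 = Point), x, y.
POINT_WKB = struct.Struct("<BIdd")


@dataclass
class Point:
    """Hypothetical stand-in for a per-row geometry object (e.g. a shapely Point)."""
    x: float
    y: float


def make_point_wkb(x, y):
    """Encode one point as 21 bytes of WKB."""
    return POINT_WKB.pack(1, 1, x, y)


def bbox_via_objects(wkb_rows):
    """Eager path: build one Python object per row, then compute on the objects."""
    points = [Point(*POINT_WKB.unpack(row)[2:]) for row in wkb_rows]
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def bbox_via_binary(wkb_rows):
    """Binary path: read coordinates straight from the WKB bytes, no objects kept."""
    xmin = ymin = float("inf")
    xmax = ymax = float("-inf")
    for row in wkb_rows:
        _, _, x, y = POINT_WKB.unpack(row)
        xmin, xmax = min(xmin, x), max(xmax, x)
        ymin, ymax = min(ymin, y), max(ymax, y)
    return xmin, ymin, xmax, ymax


rows = [make_point_wkb(float(i), float(-i)) for i in range(5)]
print(bbox_via_objects(rows))
print(bbox_via_binary(rows))
```

Both functions return the same bounding box. The difference is that the eager path allocates an object for every row before doing any work, which is the kind of per-row cost the observation attributes to .to_pandas(). Timing the two on a larger input (for example with `timeit`) would show how the gap behaves on a given machine.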
3. Roadmap: Complex Linestring Operations

I am curious about the roadmap for parallelized linestring operations in the native engine. Specifically:

* Does the engine currently support (or plan to support) high-performance line merges and line splits (ST_LineMerge, ST_Subdivide)?
* Given the scaling I see in spatial joins, I'm wondering if Sedona could be positioned to become the go-to tool for large-scale topological network cleaning.

4. Optimizer Bug (Issue #477)

I also encountered a specific PhysicalOptimizer schema mismatch error when joining Pandas-sourced tables of different sizes. I have documented this in detail here: sedona-db Issue #477 (https://github.com/apache/sedona-db/issues/477).

I’d love to hear the community’s thoughts on these observations and how I can best contribute to testing these high-performance paths!

GitHub link: https://github.com/apache/sedona/discussions/2576
