GitHub user paleolimbot added a comment to the discussion: Observations from R and Python benchmarks: performance bottlenecks and optimization ideas for sedona-db
Very cool! Thank you for putting together these benchmarks! I think you're right about these bullets:

- We can possibly make SedonaDB->geopandas faster.
- We really do need to be able to read GDAL/OGR via R. I'll open a ticket for this and see if I can squeeze it in...it's more complicated than Python since we don't have pyogrio to help.
- We'd love to add ST_LineMerge and ST_Subdivide. These are "just" GEOS functions and have already been merged to georust/geos ( https://github.com/georust/geos/blob/47afbad2483e489911ddb456417808340e9342c3/src/geometry.rs#L2789-L2801 ). I'll open tickets for these.
- I'll fix the schema mismatch issue this week 🙂

Reading GeoPackages and converting outside the Arrow universe are always going to be slower than GeoParquet + staying inside the Arrow universe, and part of SedonaDB is strengthening those ecosystems to the point that those operations don't have to happen (i.e., we also want to make SedonaDB->geopandas/sf conversions and reading `.gpkg` files unnecessary most of the time by making sure we support the next step).

If I'm reading these correctly, the benchmarks are one or more `.gpkg` reads, followed by some operation, with a collect back into the various existing frameworks. I think the reason `sedonadb-sf` appears so fast is that you're using `sd_collect()`, which doesn't actually produce `sf` objects but something closer to a zero-copy ALTREP wrapper around the array (a `geoarrow_vctr`, to be precise). If you changed `sd_collect()` to `st_as_sf()`, I think you'd see something more similar to `sedonadb-geopandas`. I'm not sure why `sedonadb-polars` isn't identical to `sedonadb-sf` for the spatial_join benchmark (I would have expected those results to be identical). I think that geopandas caches the spatial index, and I'm not sure you have a totally "fresh" GeoDataFrame for each iteration of your benchmark (or, alternatively, this might be a case where Python/R string handling shines over Arrow, since there are a lot of repeated strings in the output).
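For what it's worth, the spatial-index caching I'm guessing at is easy to see from Python. A minimal sketch, assuming geopandas and shapely 2.x are installed; the data here is made up purely for illustration:

```python
import geopandas as gpd
from shapely.geometry import Point

# A small GeoDataFrame of points (contents are illustrative only)
gdf = gpd.GeoDataFrame(
    {"id": range(100)},
    geometry=[Point(x, 0.0) for x in range(100)],
)

idx_first = gdf.sindex   # first access builds the spatial index
idx_second = gdf.sindex  # later accesses return the cached index

# The same SpatialIndex object comes back, so repeated benchmark
# iterations on the same GeoDataFrame skip the index-build cost.
assert idx_first is idx_second
```

If the benchmark reuses one GeoDataFrame across iterations, only the first join pays for building the index, which could flatter the geopandas numbers relative to engines that rebuild per query.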
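As an aside on the ST_LineMerge bullet above: the underlying GEOS operation is already exposed in Python via shapely 2.x, which is a handy way to preview the semantics SedonaDB would get (a sketch of the shapely API, not SedonaDB's):

```python
import shapely
from shapely import MultiLineString

# Two touching segments that GEOS can merge into one LineString
parts = MultiLineString([[(0, 0), (1, 1)], [(1, 1), (2, 2)]])
merged = shapely.line_merge(parts)

assert merged.geom_type == "LineString"
print(merged)  # LINESTRING (0 0, 1 1, 2 2)
```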
16 polygons x 100k points is pretty small, and I'm pleased that SedonaDB adds little enough overhead that performance on that sort of microbenchmark is still reasonable.

> how I can best contribute to testing these high-performance paths!

Continuing to kick the tires and write about it is fantastic! Knowing that there's interest in SedonaDB for R is helpful (it's a bit of a side project for the other SedonaDB work I do, and it's motivating to know that anybody actually plans on using it 🙂 ). I'm not sure I will ever get to writing a GeocompX variant, but in theory that's what we're trying to provide with SedonaDB, and it's a great blueprint for the stuff that SedonaDB should be able to do at some point.

GitHub link: https://github.com/apache/sedona/discussions/2576#discussioncomment-15402640
