GitHub user paleolimbot added a comment to the discussion: Observations from R and Python benchmarks: performance bottlenecks and optimization ideas for sedona-db
Very cool! Thank you for putting together these benchmarks! I think you're right about these bullets:

- We can possibly make SedonaDB->geopandas faster.
- We really do need to be able to read GDAL/OGR via R. I'll open a ticket for this and see if I can squeeze it in...it's more complicated than Python since we don't have pyogrio to help.
- We'd love to add ST_LineMerge and ST_Subdivide. These are "just" GEOS functions and have already been merged to georust/geos ( https://github.com/georust/geos/blob/47afbad2483e489911ddb456417808340e9342c3/src/geometry.rs#L2789-L2801 ). I'll open tickets for these.
- I'll fix the schema mismatch issue this week 🙂

Reading GeoPackages and converting outside the Arrow universe are always going to be slower than GeoParquet + staying inside the Arrow universe, and part of SedonaDB is strengthening those ecosystems to the point that those operations don't have to happen (i.e., we also want to make SedonaDB->geopandas/sf conversions and reading `.gpkg` files unnecessary most of the time by making sure we support the next step).

If I'm reading these correctly, the benchmarks are one or more `.gpkg` reads, followed by some operation, with a collect back into the various existing frameworks. I think the reason `sedonadb-sf` appears so fast is that you're using `sd_collect()`, which doesn't actually produce `sf` objects but something closer to a zero-copy ALTREP wrapper around the array (a `geoarrow_vctr`, to be precise). If you changed `sd_collect()` to `st_as_sf()`, I think you'd see something more similar to `sedonadb-geopandas`. I'm not sure why `sedonadb-polars` isn't identical to `sedonadb-sf` for the spatial_join benchmark (I would have expected those results to be identical). I think that geopandas caches the spatial index, and I'm not sure you have a totally "fresh" GeoDataFrame for each iteration of your benchmark (or, alternatively, this might be a case where Python/R string handling shines over Arrow, since there are a lot of repeated strings in the output).
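For what it's worth, the spatial-index caching I'm guessing at is easy to see from Python. A minimal sketch, assuming geopandas and shapely 2.x are installed; the data here is made up purely for illustration:

```python
import geopandas as gpd
from shapely.geometry import Point

# A small GeoDataFrame of points (contents are illustrative only)
gdf = gpd.GeoDataFrame(
    {"id": range(100)},
    geometry=[Point(x, 0.0) for x in range(100)],
)

idx_first = gdf.sindex   # first access builds the spatial index
idx_second = gdf.sindex  # later accesses return the cached index

# The same SpatialIndex object comes back, so repeated benchmark
# iterations on the same GeoDataFrame skip the index-build cost.
assert idx_first is idx_second
```

If the benchmark reuses one GeoDataFrame across iterations, only the first join pays for building the index, which could flatter the geopandas numbers relative to engines that rebuild per query.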
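As an aside on the ST_LineMerge bullet above: the underlying GEOS operation is already exposed in Python via shapely 2.x, which is a handy way to preview the semantics SedonaDB would get (a sketch of the shapely API, not SedonaDB's):

```python
import shapely
from shapely import MultiLineString

# Two touching segments that GEOS can merge into one LineString
parts = MultiLineString([[(0, 0), (1, 1)], [(1, 1), (2, 2)]])
merged = shapely.line_merge(parts)

assert merged.geom_type == "LineString"
print(merged)  # LINESTRING (0 0, 1 1, 2 2)
```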
16 polygons x 100k points is pretty small, and I'm pleased that SedonaDB adds little enough overhead that performance on that sort of microbenchmark is still reasonable.

> how I can best contribute to testing these high-performance paths!

Continuing to kick the tires and write about it is fantastic! Knowing that there's interest in SedonaDB for R is helpful (it's a bit of a side project for the other SedonaDB work I do, and it's motivating to know that anybody actually plans on using it 🙂 ). I'm not sure I will ever get to writing a GeocompX variant, but in theory that's what we're trying to provide with SedonaDB, and it's a great blueprint for the stuff that SedonaDB should be able to do at some point.

GitHub link: https://github.com/apache/sedona/discussions/2576#discussioncomment-15402640
