On 20/07/2025 at 13:27, Michael Smith via gdal-dev wrote:
Using GDAL 3.11.3:
I have a dataset (Geometry: Point, Feature Count: 15546949) in Parquet
format (written using GDAL from an Oracle source). When doing a spatial
query using the GeoParquet driver, I see it accessing almost all the
row groups of the dataset (PARQUET: 155/156 row groups selected) with
a spatial filter fetching 12000 of the 15M points, and it takes
0m18.794s. When accessing via ADBC and libduckdb, it takes 0m7.102s
(but it also uses 7x the CPU and about 10x the memory, judging from top).
I then rewrote the dataset using -lco SORT_BY_BBOX=YES. Then the Parquet
driver accesses PARQUET: 9/238 row groups selected, and the time drops
to 0m1.412s. Using ADBC and libduckdb, the performance doesn’t change.
For proper performance with GDAL, is SORT_BY_BBOX=YES always needed?
Yes, unless your features are already spatially sorted. It is a bit
strange that you don't see improvements with the ADBC driver, as it does
push the spatial filter bbox in the SQL request, so that is perhaps a
limitation in how DuckDB itself deals with such filters.
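
To illustrate why spatial ordering matters for row-group selection, here is
a small, self-contained sketch. It does not use GDAL: it assumes uniformly
random synthetic points, a hypothetical row group size of 1000 features, and
a Z-order (Morton) key as a stand-in for whatever ordering SORT_BY_BBOX
actually applies. The function names are made up for this example. It only
shows the general effect: when consecutive features are scattered, nearly
every row group's bounding box overlaps a small query window, so almost
nothing can be skipped.

```python
import random


def morton_key(ix, iy, bits=16):
    """Interleave the bits of two grid indices into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key


def row_group_bboxes(points, group_size):
    """Split points into consecutive row groups and return each group's bbox."""
    boxes = []
    for start in range(0, len(points), group_size):
        chunk = points[start:start + group_size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def groups_selected(boxes, query):
    """Count row groups whose bbox intersects the query bbox."""
    qxmin, qymin, qxmax, qymax = query
    return sum(
        1
        for (xmin, ymin, xmax, ymax) in boxes
        if xmin <= qxmax and xmax >= qxmin and ymin <= qymax and ymax >= qymin
    )


# Synthetic data (assumption): 100,000 random points in the unit square.
random.seed(0)
points = [(random.random(), random.random()) for _ in range(100_000)]
query = (0.40, 0.40, 0.43, 0.43)  # small spatial filter window
group_size = 1_000                # hypothetical row group size

# Features in arbitrary order: each row group spans almost the whole extent.
boxes = row_group_bboxes(points, group_size)
unsorted_hits = groups_selected(boxes, query)

# Features ordered along a Z-order curve: each row group is spatially compact.
sorted_points = sorted(
    points,
    key=lambda p: morton_key(int(p[0] * 65535), int(p[1] * 65535)),
)
sorted_boxes = row_group_bboxes(sorted_points, group_size)
sorted_hits = groups_selected(sorted_boxes, query)

print(f"unsorted: {unsorted_hits}/{len(boxes)} row groups selected")
print(f"sorted:   {sorted_hits}/{len(sorted_boxes)} row groups selected")
```

The exact counts depend on the data and the row group size, but the sorted
layout should select far fewer row groups than the unsorted one, which is
the same pattern as the 155/156 versus 9/238 figures reported above.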
--
Michael Smith
RSGIS Center – ERDC CRREL NH
US Army Corps
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
http://www.spatialys.com
My software is free, but my time generally not.