Using GDAL 3.11.3:

 

I have a Parquet dataset (Geometry: Point, Feature Count: 15546949) that was 
written with GDAL from an Oracle source. When I run a spatial query through 
GDAL's Parquet (GeoParquet) driver, it accesses almost all the row groups of 
the dataset (PARQUET: 155/156 row groups selected), even though the spatial 
filter fetches only 12000 of the 15M points, and the query takes 0m18.794s. 
Going through ADBC and libduckdb instead, it takes 0m7.102s, but that path also 
uses about 7x the CPU and about 10x the memory (judging from top). 

 

I then rewrote the dataset using -lco SORT_BY_BBOX=YES. With that, the Parquet 
driver reports PARQUET: 9/238 row groups selected, and the time drops to 
0m1.412s. With ADBC and libduckdb, the performance doesn't change. 
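
To illustrate why the row group counts above differ so much, here is a minimal 
sketch in plain Python, with no GDAL involved. It assumes the driver can skip a 
row group only when that group's bounding box misses the spatial filter. The 
synthetic uniform points, the group size of 1000, the query window, and the 
Morton (Z-order) key are all assumptions made for illustration; the Morton key 
is only a stand-in for a spatial sort and is not claimed to be the ordering 
that SORT_BY_BBOX=YES produces.

```python
import random


def row_group_bboxes(points, group_size):
    """Split points into consecutive row groups and return each group's bbox."""
    boxes = []
    for start in range(0, len(points), group_size):
        chunk = points[start:start + group_size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def intersects(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def selected_groups(points, group_size, query):
    """Count row groups whose bbox intersects the query window."""
    boxes = row_group_bboxes(points, group_size)
    return sum(1 for box in boxes if intersects(box, query))


def morton_key(point, bits=10):
    """Interleave the bits of the quantized x and y (both assumed in [0, 1))."""
    cells = 1 << bits
    x = min(int(point[0] * cells), cells - 1)
    y = min(int(point[1] * cells), cells - 1)
    key = 0
    for bit in range(bits):
        key |= ((x >> bit) & 1) << (2 * bit)
        key |= ((y >> bit) & 1) << (2 * bit + 1)
    return key


random.seed(0)
points = [(random.random(), random.random()) for _ in range(100_000)]
query = (0.40, 0.40, 0.43, 0.43)  # hypothetical spatial filter window
group_size = 1_000
total_groups = len(row_group_bboxes(points, group_size))

# Insertion order is random here, so every group spans nearly the full extent.
unsorted_hits = selected_groups(points, group_size, query)

# After a spatial sort, each group covers a compact area.
sorted_points = sorted(points, key=morton_key)
sorted_hits = selected_groups(sorted_points, group_size, query)

print(f"unsorted: {unsorted_hits}/{total_groups} row groups selected")
print(f"sorted:   {sorted_hits}/{total_groups} row groups selected")
```

In the unsorted case nearly every group's bounding box covers the whole extent, 
so the filter cannot rule any of them out; after sorting, only the few groups 
near the query window remain candidates. 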

 

For proper spatial query performance with GDAL, is SORT_BY_BBOX=YES always needed?

 

 

-- 

Michael Smith

RSGIS Center – ERDC CRREL NH

US Army Corps

 

_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev
