Dan,
No, you didn't do anything obviously wrong. I'm not sure that, in
ArrowDataset mode, libarrow actually uses row group statistics to skip
row groups, which might cause it to ingest the whole files.
You could also try tuning the config options at
https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/parquet/ogrparquetdatasetlayer.cpp#L522-L558
Do you observe a similar difference if you work with just a single file,
like
/vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/part-00000-5466202d-8cdf-48e5-9aee-886c73dafc5f-c000.zstd.parquet
?
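To make the point about row group statistics concrete, here is a toy Python sketch of what pruning on min/max statistics means. It is not GDAL or libarrow code: the RowGroupStats class, the prune_row_groups function and the min/max values are all made up for illustration. A reader that applies this kind of check only needs to fetch the row groups whose [min, max] range can contain the filtered value. A reader that does not has to download and scan everything.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics of one column in one row group (toy model)."""
    index: int
    min_value: str
    max_value: str


def may_contain(stats: RowGroupStats, value: str) -> bool:
    """An equality filter can only match if min <= value <= max."""
    return stats.min_value <= value <= stats.max_value


def prune_row_groups(groups, value):
    """Return the indexes of the row groups that still have to be read."""
    return [g.index for g in groups if may_contain(g, value)]


# Hypothetical statistics for the "region" column of three row groups.
groups = [
    RowGroupStats(0, "AR-A", "FR-75"),
    RowGroupStats(1, "GB-ENG", "US-NY"),
    RowGroupStats(2, "US-OH", "US-WY"),
]

# Only row groups whose range can hold 'US-VT' need to be downloaded.
to_read = prune_row_groups(groups, "US-VT")
print(to_read)
```

Pruning like this only pays off when the data is reasonably sorted or clustered on the filtered column. If every row group spans the whole range of values, nothing can be skipped.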
Even
On 28/08/2024 at 18:45, Daniel Baston via gdal-dev wrote:
Hello,
I'm trying to use ogr2ogr with an attribute filter to pull 14 polygons
from Overture maps. Running the following command with CPL_DEBUG=ON
tells me that "PARQUET: Attribute filter fully translated to Arrow"
yet it takes about 7 minutes to complete, and appears to download
quite a bit of data:
ogr2ogr /tmp/vt.geojson \
  "PARQUET:/vsis3/overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area" \
  -select "id,division_id,names.primary" \
  -where "subtype='county' AND country='US' AND region='US-VT'"
Have I made a mistake in my ogr2ogr invocation? For comparison,
running what I believe to be an equivalent query in DuckDB takes about
10 seconds:
SELECT
id,
division_id,
names.primary,
ST_GeomFromWKB(geometry) as geometry
FROM
read_parquet('s3://overturemaps-us-west-2/release/2024-08-20.0/theme=divisions/type=division_area/*',
hive_partitioning=1)
WHERE
subtype = 'county'
AND country = 'US'
AND region = 'US-VT';
I am using GDAL master (e09d07a7) and libarrow 16.1.
Thanks,
Dan
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
http://www.spatialys.com
My software is free, but my time generally not.