Wow, very cool. Yeah, the source had been written with DuckDB, so I believe the 
metadata is present.

 

I’ll check out master once this is merged.

 

Thanks so much!

 

Mike

 

-- 

Michael Smith

RSGIS Center – ERDC CRREL NH

US Army Corps

 

 

From: gdal-dev <[email protected]> on behalf of Even Rouault via 
gdal-dev <[email protected]>
Reply-To: Even Rouault <[email protected]>
Date: Sunday, December 28, 2025 at 6:56 PM
To: <[email protected]>
Subject: Re: [gdal-dev] gdal parquet and hive partitioning

 

Both issues below should now be fixed per 
https://github.com/OSGeo/gdal/pull/13606 . It turns out that what caused GDAL to 
probe all files even when _metadata is present is perhaps completely different 
from the cause of the Python reproducer in the apache/arrow issue below.
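
For reference, the kind of pyarrow read that _metadata is supposed to make cheap 
looks roughly like this. This is a sketch only, not the reproducer from the 
apache/arrow issue; the local path, the "hive" partitioning flavour and the 
filter column are illustrative, borrowed from this thread:

import pyarrow.dataset as ds

# Build the dataset from the _metadata summary file. The intent of that file
# is to provide the schema and the list of data files up front, so that
# pyarrow does not have to open every data_*.parquet under the root.
dataset = ds.parquet_dataset(
    "overture-buildings/_metadata",
    partitioning="hive",
)

# Partition pruning: only fragments under country=US should be scanned.
table = dataset.to_table(filter=ds.field("country") == "US")
print(table.num_rows)

Whether a read like this actually avoids touching every file is exactly what the 
apache/arrow issue is about.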

On 28/12/2025 at 16:48, Even Rouault via gdal-dev wrote:

Hi Mike,

the problem is likely twofold:

- "gdal vector partition" doesn't write the "_metadata" file that contains the 
schema and the path to the actual .parquet files

- but even if it did, I cannot manage to convince libarrow/libparquet not to 
probe all files. I'm not sure if I'm missing something in the API or if that's a 
fundamental limitation of the library. I've filed 
https://github.com/apache/arrow/issues/48671 about that. I've considered 
implementing a workaround on the GDAL side but I couldn't come up with anything.
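
For the first point, the usual pyarrow recipe for producing such a summary file 
is roughly the following. This is a sketch with made-up column names and a local 
path, assuming a reasonably recent pyarrow; it is not what "gdal vector 
partition" currently does:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["US", "US", "AL"],
    "height": [10.0, 12.5, 7.0],
})

# Write the hive-partitioned layout and collect per-file footer metadata.
collector = []
pq.write_to_dataset(
    table,
    root_path="overture-buildings",
    partition_cols=["country"],
    metadata_collector=collector,
)

# The summary file is written against the physical schema, i.e. without the
# partition column, which only exists in the directory names.
data_schema = pa.schema([f for f in table.schema if f.name != "country"])
pq.write_metadata(data_schema, "overture-buildings/_metadata",
                  metadata_collector=collector)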

Your best workaround is to directly access 
"/vsis3/bucket/overture/20251217/overture-buildings/country=US".

Even

On 28/12/2025 at 13:26, Michael Smith via gdal-dev wrote:
I know that GDAL can write Parquet data with hive partitioning using "gdal 
vector partition", but after doing so, can GDAL do partition elimination on 
read when a WHERE/attribute filter is specified on the partition key?
 
I was trying to do a pipeline with:
gdal vector pipeline ! read "/vsis3/bucket/overture/20251217/overture-buildings/" ! filter --bbox -117.486117584442,33.9156194185775,-117.333055544584,33.9745995301481 --where "country='US'" ! write -f parquet /tmp/test1.parquet --progress --overwrite
 
but with CPL_DEBUG on, I see it scanning all the partitions rather than just 
querying the country=US partition.
 
S3: Downloading 0-1605631 
(https://bucket.s3.us-east-1.amazonaws.com/overture/20251217/overture-buildings/country%3DAI/data_0.parquet)...
S3: Got response_code=206
S3: Downloading 0-16383999 
(https://bucket.s3.us-east-1.amazonaws.com/overture/20251217/overture-buildings/country%3DAL/data_2.parquet)...
S3: Got response_code=206
S3: Downloading 0-16383999 
(https://bucket.s3.us-east-1.amazonaws.com/overture/20251217/overture-buildings/country%3DAL/data_3.parquet)...
S3: Got response_code=206
S3: Downloading 16384000-32767999 
(https://bucket.s3.us-east-1.amazonaws.com/overture/20251217/overture-buildings/country%3DAL/data_2.parquet)...
S3: Got response_code=206
S3: Downloading 16384000-29741378 
(https://bucket.s3.us-east-1.amazonaws.com/overture/20251217/overture-buildings/country%3DAL/data_3.parquet)...
....
 
 
 
-- 
http://www.spatialys.com
My software is free, but my time generally not.


_______________________________________________
gdal-dev mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/gdal-dev