Hi Michael,

I've also noticed that the ADBC / Arrow interface of libduckdb seems to be less efficient than their native API. I have no idea whether there is a fundamental cause or whether it is "just" an implementation issue that could be improved (on their side).

In particular, I had the impression that getting an Arrow stream for "SELECT * FROM 'the_filename'", as used internally by the driver, seemed to trigger ingestion of the whole file. Or maybe just the first row group, but that might already be too much.
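
For reference, this is the kind of comparison I have in mind, sketched with the duckdb Python package (the file name is just a placeholder, not the actual Overture file): with the native API one can pull Arrow record batches incrementally, which is what I would hope the ADBC/Arrow stream path does as well.

import duckdb

con = duckdb.connect()
# Native API: fetch Arrow record batches incrementally; ideally only the
# row groups needed for the batches actually consumed get decoded.
con.execute("SELECT * FROM 'places.parquet'")   # placeholder file name
reader = con.fetch_record_batch(65536)          # pyarrow.RecordBatchReader
first = reader.read_next_batch()
print(first.num_rows)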

Note too that the driver itself asks for Arrow streams a couple of times when geometries are detected, because it rewrites the SQL to use ST_AsWKB() on the geometry columns. Otherwise, when the spatial extension is loaded, geometries come back in duckdb_spatial's own encoding, and I didn't bother writing a parser for that custom encoding (ADBC support in GDAL is an unsponsored effort).
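
To illustrate what that rewrite buys us, here is a minimal sketch with the duckdb Python package and the spatial extension (installing the extension needs network access the first time; the query is made up for the example):

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial")
con.execute("LOAD spatial")
# Without the rewrite, a GEOMETRY column comes back in duckdb_spatial's own
# binary encoding; wrapping it in ST_AsWKB() yields standard WKB blobs that
# OGR already knows how to decode.
tbl = con.execute("SELECT ST_AsWKB(ST_Point(1, 2)) AS geom").fetch_arrow_table()
print(tbl.schema)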

So perhaps, to get the most out of duckdb, a dedicated driver should be written.

Regarding the lack of geometry in your use case, I'm not sure what the cause is. I believe duckdb_spatial is a bit stricter / less lax than the OGR GeoParquet driver in recognizing GeoParquet. At least, older OvertureMaps releases were only loosely compliant with GeoParquet.
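
If you want to see what the file actually declares, something like this quick sketch with pyarrow (file name is a placeholder; run it against a local copy) shows whether the file-level "geo" key that GeoParquet readers look for is present and what it says about the geometry column:

import json
import pyarrow.parquet as pq

# GeoParquet stores its metadata under the "geo" key of the Parquet
# file-level key/value metadata.
md = pq.read_schema("overture-places.parquet").metadata or {}
geo = md.get(b"geo")
print(json.loads(geo) if geo else "no GeoParquet 'geo' metadata found")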

With https://github.com/OSGeo/gdal/pull/11536 applied, the following works (although much slower than we'd like it to run):

$ ogrinfo ADBC: -oo SQL="SELECT * FROM 's3://overturemaps-us-west-2/release/2024-12-18.0/theme=places/type=place/part-00000-9b3cb01a-46a1-4378-9e77-baca19283b5a-c000.zstd.parquet' LIMIT 1" -al

INFO: Open of `ADBC:'
      using driver `ADBC' successful.

Layer name: part-00000-9b3cb01a-46a1-4378-9e77-baca19283b5a-c000.zstd
Geometry: Point
Feature Count: 1
Extent: (-179.999992, -84.996332) - (-0.001674, 44.999998)
Layer SRS WKT:
GEOGCRS["WGS 84",
[ ... snip ... ]
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Geometry Column = geometry
id: String (0.0)
[ ... snip ... ]
type: String (0.0)
OGRFeature(part-00000-9b3cb01a-46a1-4378-9e77-baca19283b5a-c000.zstd):0
  id (String) = 08ff39bac830c5900361ff7fe23acab8
  version (Integer) = 0
  sources (String(JSON)) = [{"property":"","dataset":"meta","record_id":"1150855701606590","update_time":"2024-09-10T00:00:00.000Z","confidence":null}]
  names.primary (String) = KK Beauty Shop 2
  categories.primary (String) = shopping
  categories.alternate (StringList) = (1:cosmetic_and_beauty_supplies)
  confidence (Real) = 0.265179677819083
  websites (StringList) = (null)
  socials (StringList) = (1:https://www.facebook.com/1150855701606590)
  emails (StringList) = (null)
  phones (StringList) = (1:+959765858258)
  brand.wikidata (String) = (null)
  brand.names.primary (String) = (null)
  addresses (String(JSON)) = [{"freeform":"အမှတ်(၂၁),ပွဲစားလမ်း(အောက်လမ်း)၊ ကြည့်မြင်တိုင်","locality":"Yangon","postcode":"11101","region":null,"country":"MM"}]
  theme (String) = places
  type (String) = place
  POINT (-179.13203 -84.5792175)

Even

On 21/12/2024 at 21:39, Michael Smith via gdal-dev wrote:
Using gdal-master conda packages, trying to use the new ADBC driver for 
libduckdb integration, I’m able to connect to a parquet dataset (only if it has 
the parquet extension) but the geometry is not being recognized.
Seems to take a long time to load compared with duckdb. So, I must be doing 
something wrong.
Note private s3 bucket.


CPL_DEBUG=on ogrinfo ADBC:"s3://private-bucket/overture-base/overture-places.parquet" -oo ADBC_DRIVER=libduckdb
  -oo PRELUDE_STATEMENTS="INSTALL httpfs" -oo PRELUDE_STATEMENTS="load httpfs"
  -oo PRELUDE_STATEMENTS="INSTALL parquet" -oo PRELUDE_STATEMENTS="load parquet"
  -oo PRELUDE_STATEMENTS="install aws" -oo PRELUDE_STATEMENTS="load aws"
  -oo PRELUDE_STATEMENTS="CREATE SECRET ( TYPE S3,PROVIDER CREDENTIAL_CHAIN)"
GDAL: On-demand registering 
/Users/rdcrlmds/mambaforge/envs/gdalmaster/lib/gdalplugins/ogr_ADBC.dylib using 
RegisterOGRADBC.
GDAL: GDALOpen(ADBC:s3://private-bucket/overture-base/overture-places.parquet, 
this=0x13a70a000) succeeds as ADBC.
INFO: Open of `ADBC:s3://private-bucket/overture-base/overture-places.parquet'
       using driver `ADBC' successful.
OGR: GetLayerCount() = 1

1: overture-places (None)
GDAL: GDALClose(ADBC:s3://private-bucket/overture-base/overture-places.parquet, 
this=0x13a70a000)
GDAL: In GDALDestroy - unloading GDAL shared library.


time CPL_DEBUG=on ogrinfo ADBC:"s3://private-bucket/overture-base/overture-places.parquet" -oo ADBC_DRIVER=libduckdb
  -oo PRELUDE_STATEMENTS="INSTALL spatial" -oo PRELUDE_STATEMENTS="load spatial"
  -oo PRELUDE_STATEMENTS="INSTALL httpfs" -oo PRELUDE_STATEMENTS="load httpfs"
  -oo PRELUDE_STATEMENTS="INSTALL parquet" -oo PRELUDE_STATEMENTS="load parquet"
  -oo PRELUDE_STATEMENTS="install aws" -oo PRELUDE_STATEMENTS="load aws"
  -oo PRELUDE_STATEMENTS="CREATE SECRET ( TYPE S3,PROVIDER CREDENTIAL_CHAIN)"
GDAL: On-demand registering 
/Users/rdcrlmds/mambaforge/envs/gdalmaster/lib/gdalplugins/ogr_ADBC.dylib using 
RegisterOGRADBC.
GDAL: GDALOpen(ADBC:s3://private-bucket/overture-base/overture-places.parquet, 
this=0x129e15350) succeeds as ADBC.
INFO: Open of `ADBC:s3://private-bucket/overture-base/overture-places.parquet'
       using driver `ADBC' successful.
OGR: GetLayerCount() = 1

1: overture-places (None)
GDAL: GDALClose(ADBC:s3://private-bucket/overture-base/overture-places.parquet, 
this=0x129e15350)
GDAL: In GDALDestroy - unloading GDAL shared library.
CPL_DEBUG=on ogrinfo  -oo ADBC_DRIVER=libduckdb -oo  -oo  -oo  -oo  -oo  -oo
90.25s user 22.43s system 41% cpu 4:29.75 total


--
http://www.spatialys.com
My software is free, but my time generally not.
Butcher of all kinds of standards, open or closed formats. At the end, this is 
just about bytes.
Mood of the day: "Of course, one can jump up and down on one's chair like a young goat, saying: standards! standards! standards! But it leads nowhere and means nothing." ~ dixit De Gaulle
