paleolimbot commented on issue #686: URL: https://github.com/apache/sedona-db/issues/686#issuecomment-4000874632
Here's one that's related to the blog post I wrote on counting buildings by zipcode a while back. I used census tracts here because that's the first shapefile I found: ```python # curl -L https://www2.census.gov/geo/tiger/GENZ2024/shp/cb_2024_us_tract_5m.zip -o tracts.shp.zip # curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/microsoft-buildings_point.parquet -o buildings.parquet import sedona.db sd = sedona.db.connect() sd.options.interactive = True sd.options.memory_limit = "unlimited" sd.read_pyogrio("tracts.shp.zip").to_view("tracts") sd.read_parquet("buildings.parquet").to_view("buildings") # 6.2s sd.sql(""" SELECT t."GEOIDFQ", COUNT(b.geometry) AS building_count FROM tracts AS t LEFT JOIN buildings AS b ON ST_Contains(t.wkb_geometry, ST_Transform(b.geometry, 'epsg:4269')) GROUP BY t."GEOIDFQ" ORDER BY building_count DESC """).to_memtable().to_view("result", overwrite=True) ``` I'm not sure what DuckDB is doing differently here but DuckDB (1.5.0 pre) takes 90s to do this: ```python import duckdb duckdb.load_extension("spatial") duckdb.sql(""" CREATE OR REPLACE VIEW tracts AS SELECT * FROM ST_Read('tracts.shp.zip') """) duckdb.sql(""" SELECT t."GEOIDFQ", COUNT(b.geometry) AS building_count FROM tracts AS t LEFT JOIN "buildings.parquet" AS b ON ST_Contains(t.geom, ST_Transform(b.geometry, 'epsg:4326', 'epsg:4269')) GROUP BY t."GEOIDFQ" ORDER BY building_count DESC """).to_table("results") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
