This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/sedona-spatialbench.git
The following commit(s) were added to refs/heads/main by this push:
new cad4066 [DOCS] Fix broken Q2, wrap code for readability in queries
notebook, add contributors guide (#30)
cad4066 is described below
commit cad4066d9bb48d69e6e61a33e5856e32361e3845
Author: Kelly-Ann Dolor <[email protected]>
AuthorDate: Mon Sep 22 22:38:08 2025 -0700
[DOCS] Fix broken Q2, wrap code for readability in queries notebook, add
contributors guide (#30)
* fixing broken query2, wrapping code for readability
* running nbconvert
* removing output
---
CONTRIBUTING.md | 9 +-
CONTRIBUTING.md => docs/contributors-guide.md | 25 +-
docs/index.md | 2 +-
docs/queries.ipynb | 425 +++++++++++-------
docs/queries.md | 612 ++++++++++++++++++++++++++
docs/requirements.txt | 14 +-
mkdocs.yml | 2 +-
7 files changed, 925 insertions(+), 164 deletions(-)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 1a327bc..4d1292b 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,7 +6,7 @@ free to open an issue.
## Our Development Process
-To contribute, please find an existing issue or open a new one. Claiming the
issue you are working on helps us better track progress.
+To contribute, please find [an existing GitHub issue or open a new
one](https://github.com/apache/sedona-spatialbench/issues). Claiming the issue
you are working on helps us better track progress.
## Pull Requests
@@ -14,10 +14,9 @@ We actively welcome your pull requests.
1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
-3. If you've changed APIs, update the related documentation.
-4. Ensure the standard tests and conformance tests are passing.
-5. Ensure your code follows Rust best practices and addresses all lints from
clippy.
-6. Open your pull request and wait for a review and approval.
+3. Ensure the standard tests and conformance tests are passing.
+4. Ensure your code follows Rust best practices and addresses all lints from
clippy.
+5. Open your pull request and wait for a review and approval.
## Filing Issues
diff --git a/CONTRIBUTING.md b/docs/contributors-guide.md
similarity index 57%
copy from CONTRIBUTING.md
copy to docs/contributors-guide.md
index 1a327bc..507ba63 100644
--- a/CONTRIBUTING.md
+++ b/docs/contributors-guide.md
@@ -1,8 +1,27 @@
-# Contributing to SpatialBench
+---
+title: Contributors Guide
+---
+
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
We want to make contributing to SpatialBench as easy and transparent as
possible. If you have suggestions to improve this contribution guide, feel
-free to open an issue.
+free to [open an issue in
GitHub](https://github.com/apache/sedona-spatialbench/issues).
## Our Development Process
@@ -32,4 +51,4 @@ style, consult [Effective
Rust](https://www.lurklurk.org/effective-rust/title-pa
## License
By contributing to SpatialBench, you agree that your contributions will be
licensed
-under the `LICENSE` file in the root directory of this source tree.
+under the `LICENSE` file in the root directory of this source tree.
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 993be0e..dbcfeb1 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,5 +1,5 @@
---
-title: Sedona SpatialBench
+title: SpatialBench
---
<!---
diff --git a/docs/queries.ipynb b/docs/queries.ipynb
index d5ea175..fb8691b 100644
--- a/docs/queries.ipynb
+++ b/docs/queries.ipynb
@@ -5,13 +5,59 @@
"id": "cf23cfd2-c45c-4ca3-9dcb-70541f3cc47d",
"metadata": {},
"source": [
- "# SpatialBench Queries"
+ "# Run the SpatialBench Queries"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a8438568",
+ "metadata": {},
+ "source": [
+ "This notebook contains the queries that make up the SpatialBench
benchmark.\n",
+ "\n",
+ "SpatialBench is a benchmark for assessing geospatial SQL analytics query
performance across database systems. The queries represent common, real-world
geospatial analytics tasks and are designed to test a wide range of spatial
functions and join conditions.\n",
+ "\n",
+ "The benchmark uses a realistic but synthetic, transportation-themed
dataset to ensure the queries reflect practical use cases. By running these
queries, you can evaluate and compare the relative performance of different
spatial query engines in a consistent and unbiased manner."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "09929a96",
+ "metadata": {},
+ "source": [
+ "## Before you start\n",
+ "Before running this notebook, ensure that you have installed the packages
in the `requirements.txt` file: "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dbf738f1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install -r ~/sedona-spatialbench/docs/requirements.txt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "93369f9a",
+ "metadata": {},
+ "source": [
+ "Additionally, install the SpatialBench CLI and generate the synthetic
data on your machine:\n",
+ "\n",
+ "```\n",
+ "# SpatialBench CLI\n",
+ "cargo install --path ./spatialbench-cli\n",
+ "# Generate the benchmarking data to the sf1-parquet directory\n",
+ "spatialbench-cli -s 1 --format=parquet --output-dir sf1-parquet\n",
+ "```"
]
},
{
"cell_type": "code",
- "execution_count": 22,
- "id": "9c1fe889-d035-4ed9-85a6-f7a8ecb891ad",
+ "execution_count": 34,
+ "id": "e6b1f354",
"metadata": {},
"outputs": [],
"source": [
@@ -20,7 +66,7 @@
},
{
"cell_type": "code",
- "execution_count": 23,
+ "execution_count": 35,
"id": "96cf6ad5-e106-48f7-9097-9f3f740e9e7f",
"metadata": {},
"outputs": [],
@@ -30,17 +76,17 @@
},
{
"cell_type": "code",
- "execution_count": 24,
+ "execution_count": 36,
"id": "16a147f4-4a12-4050-a1bb-934834b4a6bc",
"metadata": {},
"outputs": [],
"source": [
-
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/building.parquet\").to_view(\"building\")\n",
-
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/customer.parquet\").to_view(\"customer\")\n",
-
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/driver.parquet\").to_view(\"driver\")\n",
-
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/trip.parquet\").to_view(\"trip\")\n",
-
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/vehicle.parquet\").to_view(\"vehicle\")\n",
-
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/zone.parquet\").to_view(\"zone\")"
+
"sd.read_parquet(f\"../sf1-parquet/building.parquet\").to_view(\"building\")\n",
+
"sd.read_parquet(f\"../sf1-parquet/customer.parquet\").to_view(\"customer\")\n",
+
"sd.read_parquet(f\"../sf1-parquet/driver.parquet\").to_view(\"driver\")\n",
+ "sd.read_parquet(f\"../sf1-parquet/trip.parquet\").to_view(\"trip\")\n",
+
"sd.read_parquet(f\"../sf1-parquet/vehicle.parquet\").to_view(\"vehicle\")\n",
+ "sd.read_parquet(f\"../sf1-parquet/zone.parquet\").to_view(\"zone\")"
]
},
{
@@ -64,7 +110,7 @@
},
{
"cell_type": "code",
- "execution_count": 25,
+ "execution_count": 37,
"id": "f121f799-08f9-4679-a772-fb5f5470986f",
"metadata": {},
"outputs": [
@@ -72,26 +118,36 @@
"name": "stdout",
"output_type": "stream",
"text": [
-
"┌───────────┬────────────────┬──────────────┬──────────────────────────────┬──────────────────────┐\n",
- "│ t_tripkey ┆ pickup_lon ┆ pickup_lat ┆ t_pickuptime
┆ distance_to_center │\n",
- "│ int64 ┆ float64 ┆ float64 ┆ timestamp(millisecond,
none) ┆ float64 │\n",
-
"╞═══════════╪════════════════╪══════════════╪══════════════════════════════╪══════════════════════╡\n",
- "│ 1451371 ┆ -111.791052127 ┆ 34.826733457 ┆ 1998-08-12T06:47:01
┆ 0.05243333056935386 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 2047835 ┆ -111.706967009 ┆ 34.883889472 ┆ 1992-04-08T07:36:09
┆ 0.055865062714050374 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 3936870 ┆ -111.827619221 ┆ 34.882950924 ┆ 1998-11-10T13:32:07
┆ 0.06792427838042854 │\n",
-
"└───────────┴────────────────┴──────────────┴──────────────────────────────┴──────────────────────┘\n"
+
"┌───────────┬────────────────┬──────────────┬─────────────────────┬──────────────────────┐\n",
+ "│ t_tripkey ┆ pickup_lon ┆ pickup_lat ┆ t_pickuptime ┆
distance_to_center │\n",
+ "│ int64 ┆ float64 ┆ float64 ┆ timestamp ┆
float64 │\n",
+
"╞═══════════╪════════════════╪══════════════╪═════════════════════╪══════════════════════╡\n",
+ "│ 1451371 ┆ -111.791052127 ┆ 34.826733457 ┆ 1998-08-12T06:47:01 ┆
0.05243333056935387 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 2047835 ┆ -111.706967009 ┆ 34.883889472 ┆ 1992-04-08T07:36:09 ┆
0.055865062714050374 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 3936870 ┆ -111.827619221 ┆ 34.882950924 ┆ 1998-11-10T13:32:07 ┆
0.06792427838042854 │\n",
+
"└───────────┴────────────────┴──────────────┴─────────────────────┴──────────────────────┘\n"
]
}
],
"source": [
"sd.sql(\"\"\"\n",
"SELECT\n",
- " t.t_tripkey, ST_X(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lon,
ST_Y(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lat, t.t_pickuptime,\n",
- " ST_Distance(ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromText('POINT
(-111.7610 34.8697)')) AS distance_to_center\n",
+ " t.t_tripkey,\n",
+ " ST_X(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lon,\n",
+ " ST_Y(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lat,\n",
+ " t.t_pickuptime,\n",
+ " ST_Distance(\n",
+ " ST_GeomFromWKB(t.t_pickuploc),\n",
+ " ST_GeomFromText('POINT (-111.7610 34.8697)')\n",
+ " ) AS distance_to_center\n",
"FROM trip t\n",
- "WHERE ST_DWithin(ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromText('POINT
(-111.7610 34.8697)'), 0.45) -- 50km radius around Sedona center\n",
+ "WHERE ST_DWithin(\n",
+ " ST_GeomFromWKB(t.t_pickuploc),\n",
+ " ST_GeomFromText('POINT (-111.7610 34.8697)'),\n",
+ " 0.45 -- 50km radius around Sedona center in degrees\n",
+ ")\n",
"ORDER BY distance_to_center ASC, t.t_tripkey ASC\n",
"\"\"\").show(3)"
]
@@ -116,7 +172,7 @@
},
{
"cell_type": "code",
- "execution_count": 26,
+ "execution_count": 38,
"id": "cc219172-0cfb-4f60-a728-e4ec5cc5c21e",
"metadata": {},
"outputs": [
@@ -137,7 +193,15 @@
"sd.sql(\"\"\"\n",
"SELECT COUNT(*) AS trip_count_in_coconino_county\n",
"FROM trip t\n",
- "WHERE ST_Intersects(ST_GeomFromWKB(t.t_pickuploc), (SELECT
ST_GeomFromWKB(z.z_boundary) FROM zone z WHERE z.z_name = 'Coconino County'
LIMIT 1))\n",
+ "WHERE ST_Intersects(\n",
+ " ST_GeomFromWKB(t.t_pickuploc),\n",
+ " (\n",
+ " SELECT ST_GeomFromWKB(z.z_boundary)\n",
+ " FROM zone z\n",
+ " WHERE z.z_name = 'Coconino County'\n",
+ " LIMIT 1\n",
+ " )\n",
+ ")\n",
"\"\"\").show(3)"
]
},
@@ -161,7 +225,7 @@
},
{
"cell_type": "code",
- "execution_count": 27,
+ "execution_count": 39,
"id": "eb5d6347-47dd-48af-adb5-929838954ac4",
"metadata": {},
"outputs": [
@@ -169,30 +233,36 @@
"name": "stdout",
"output_type": "stream",
"text": [
-
"┌──────────────────────┬─────────────┬───────────────────┬─────────────────────┬───────────────────┐\n",
- "│ pickup_month ┆ total_trips ┆ avg_distance ┆
avg_duration ┆ avg_fare │\n",
- "│ timestamp(milliseco… ┆ int64 ┆ decimal128(19, 9) ┆
duration(milliseco… ┆ decimal128(19, 9) │\n",
-
"╞══════════════════════╪═════════════╪═══════════════════╪═════════════════════╪═══════════════════╡\n",
- "│ 1992-04-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1
hours 23 … ┆ 0.000075000 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 1992-07-01T00:00:00 ┆ 1 ┆ 0.000010000 ┆ 0 days 0
hours 58 … ┆ 0.000040000 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 1994-02-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1
hours 23 … ┆ 0.000050000 │\n",
-
"└──────────────────────┴─────────────┴───────────────────┴─────────────────────┴───────────────────┘\n"
+
"┌─────────────────────┬─────────────┬──────────────┬─────────────────────────────────┬─────────────┐\n",
+ "│ pickup_month ┆ total_trips ┆ avg_distance ┆
avg_duration ┆ avg_fare │\n",
+ "│ timestamp ┆ int64 ┆ decimal128 ┆
duration ┆ decimal128 │\n",
+
"╞═════════════════════╪═════════════╪══════════════╪═════════════════════════════════╪═════════════╡\n",
+ "│ 1992-04-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1 hours 23
mins 47.000 … ┆ 0.000075000 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 1992-07-01T00:00:00 ┆ 1 ┆ 0.000010000 ┆ 0 days 0 hours 58
mins 58.000 … ┆ 0.000040000 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 1994-02-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1 hours 23
mins 50.000 … ┆ 0.000050000 │\n",
+
"└─────────────────────┴─────────────┴──────────────┴─────────────────────────────────┴─────────────┘\n"
]
}
],
"source": [
"sd.sql(\"\"\"\n",
"SELECT\n",
- " DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,
COUNT(t.t_tripkey) AS total_trips,\n",
- " AVG(t.t_distance) AS avg_distance, AVG(t.t_dropofftime -
t.t_pickuptime) AS avg_duration,\n",
+ " DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,\n",
+ " COUNT(t.t_tripkey) AS total_trips,\n",
+ " AVG(t.t_distance) AS avg_distance,\n",
+ " AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration,\n",
" AVG(t.t_fare) AS avg_fare\n",
"FROM trip t\n",
"WHERE ST_DWithin(\n",
" ST_GeomFromWKB(t.t_pickuploc),\n",
- " ST_GeomFromText('POLYGON((-111.9060 34.7347, -111.6160 34.7347,
-111.6160 35.0047, -111.9060 35.0047, -111.9060 34.7347))'), -- 10km bounding
box around Sedona\n",
- " 0.045 -- Additional 5km buffer\n",
+ " ST_GeomFromText('POLYGON((\n",
+ " -111.9060 34.7347, -111.6160 34.7347,\n",
+ " -111.6160 35.0047, -111.9060 35.0047,\n",
+ " -111.9060 34.7347\n",
+ " ))'), -- Bounding box around Sedona\n",
+ " 0.045 -- Additional 5km buffer in degrees\n",
")\n",
"GROUP BY pickup_month\n",
"ORDER BY pickup_month\n",
@@ -220,7 +290,7 @@
},
{
"cell_type": "code",
- "execution_count": 28,
+ "execution_count": 40,
"id": "f8dcf328-a7f2-49be-9383-afa11766f871",
"metadata": {},
"outputs": [
@@ -230,7 +300,7 @@
"text": [
"┌───────────┬─────────────────────────────────┬────────────┐\n",
"│ z_zonekey ┆ z_name ┆ trip_count │\n",
- "│ int64 ┆ utf8view ┆ int64 │\n",
+ "│ int64 ┆ utf8 ┆ int64 │\n",
"╞═══════════╪═════════════════════════════════╪════════════╡\n",
"│ 65008 ┆ Ndélé ┆ 35 │\n",
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
@@ -243,15 +313,22 @@
],
"source": [
"sd.sql(\"\"\"\n",
- "SELECT z.z_zonekey, z.z_name, COUNT(*) AS trip_count\n",
+ "SELECT\n",
+ " z.z_zonekey,\n",
+ " z.z_name,\n",
+ " COUNT(*) AS trip_count\n",
"FROM\n",
" zone z\n",
" JOIN (\n",
" SELECT t.t_pickuploc\n",
" FROM trip t\n",
" ORDER BY t.t_tip DESC, t.t_tripkey ASC\n",
- " LIMIT 1000 -- Replace 1000 with x (how many top tips you want)\n",
- " ) top_trips ON ST_Within(ST_GeomFromWKB(top_trips.t_pickuploc),
ST_GeomFromWKB(z.z_boundary))\n",
+ " LIMIT 1000\n",
+ " ) top_trips\n",
+ " ON ST_Within(\n",
+ " ST_GeomFromWKB(top_trips.t_pickuploc),\n",
+ " ST_GeomFromWKB(z.z_boundary)\n",
+ " )\n",
"GROUP BY z.z_zonekey, z.z_name\n",
"ORDER BY trip_count DESC, z.z_zonekey ASC\n",
"\"\"\").show(3)"
@@ -278,7 +355,7 @@
},
{
"cell_type": "code",
- "execution_count": 29,
+ "execution_count": 41,
"id": "8c446f9a-3734-4a8d-87a8-0d9d6d34fd5c",
"metadata": {},
"outputs": [
@@ -286,30 +363,35 @@
"name": "stdout",
"output_type": "stream",
"text": [
-
"┌───────────┬────────────────────┬────────────────────────────┬────────────────────┬───────────────┐\n",
- "│ c_custkey ┆ customer_name ┆ pickup_month ┆
monthly_travel_hul ┆ dropoff_count │\n",
- "│ int64 ┆ utf8view ┆ timestamp(millisecond, no… ┆
l_area… ┆ int64 │\n",
-
"╞═══════════╪════════════════════╪════════════════════════════╪════════════════════╪═══════════════╡\n",
- "│ 12049 ┆ Customer#000012049 ┆ 1997-10-01T00:00:00 ┆
22290.72518508209 ┆ 18 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 12091 ┆ Customer#000012091 ┆ 1997-08-01T00:00:00 ┆
18901.827089935185 ┆ 18 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 13231 ┆ Customer#000013231 ┆ 1994-09-01T00:00:00 ┆
24930.673726244015 ┆ 18 │\n",
-
"└───────────┴────────────────────┴────────────────────────────┴────────────────────┴───────────────┘\n"
+
"┌───────────┬────────────────────┬─────────────────────┬────────────────────┬───────────────┐\n",
+ "│ c_custkey ┆ customer_name ┆ pickup_month ┆
monthly_travel_hul ┆ dropoff_count │\n",
+ "│ int64 ┆ utf8 ┆ timestamp ┆ l_area…
┆ int64 │\n",
+
"╞═══════════╪════════════════════╪═════════════════════╪════════════════════╪═══════════════╡\n",
+ "│ 25975 ┆ Customer#000025975 ┆ 1992-02-01T00:00:00 ┆
34941.303419053635 ┆ 10 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 12061 ┆ Customer#000012061 ┆ 1997-03-01T00:00:00 ┆
34607.53871953154 ┆ 14 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 21418 ┆ Customer#000021418 ┆ 1993-08-01T00:00:00 ┆
34465.32323910264 ┆ 9 │\n",
+
"└───────────┴────────────────────┴─────────────────────┴────────────────────┴───────────────┘\n"
]
}
],
"source": [
"sd.sql(\"\"\"\n",
"SELECT\n",
- " c.c_custkey, c.c_name AS customer_name,\n",
+ " c.c_custkey,\n",
+ " c.c_name AS customer_name,\n",
" DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,\n",
- " ST_Area(ST_ConvexHull(ST_Collect(ST_GeomFromWKB(t.t_dropoffloc)))) AS
monthly_travel_hull_area,\n",
+ " ST_Area(\n",
+ " ST_ConvexHull(ST_Collect(ST_GeomFromWKB(t.t_dropoffloc)))\n",
+ " ) AS monthly_travel_hull_area,\n",
" COUNT(*) as dropoff_count\n",
- "FROM trip t JOIN customer c ON t.t_custkey = c.c_custkey\n",
+ "FROM trip t\n",
+ "JOIN customer c\n",
+ " ON t.t_custkey = c.c_custkey\n",
"GROUP BY c.c_custkey, c.c_name, pickup_month\n",
- "HAVING dropoff_count > 5 -- Only include repeat customers for meaningful
hulls\n",
- "ORDER BY dropoff_count DESC, c.c_custkey ASC\n",
+ "HAVING dropoff_count > 5 -- Only include repeat customers\n",
+ "ORDER BY monthly_travel_hull_area DESC, c.c_custkey ASC\n",
"\"\"\").show(3)"
]
},
@@ -332,7 +414,7 @@
},
{
"cell_type": "code",
- "execution_count": 30,
+ "execution_count": 42,
"id": "f168c302-5187-45f9-8d87-427bbd68c942",
"metadata": {},
"outputs": [
@@ -340,28 +422,40 @@
"name": "stdout",
"output_type": "stream",
"text": [
-
"┌───────────┬─────────────────┬───────────────┬───────────────────┬────────────────────────────────┐\n",
- "│ z_zonekey ┆ z_name ┆ total_pickups ┆ avg_distance ┆
avg_duration │\n",
- "│ int64 ┆ utf8view ┆ int64 ┆ decimal128(19, 9) ┆
duration(millisecond) │\n",
-
"╞═══════════╪═════════════════╪═══════════════╪═══════════════════╪════════════════════════════════╡\n",
- "│ 30084 ┆ Coconino County ┆ 541 ┆ 0.000085323 ┆ 0
days 1 hours 45 mins 16.591… │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 30083 ┆ Yavapai County ┆ 292 ┆ 0.000076643 ┆ 0
days 1 hours 36 mins 43.647… │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 29488 ┆ Gila County ┆ 39 ┆ 0.000065641 ┆ 0
days 1 hours 16 mins 59.769… │\n",
-
"└───────────┴─────────────────┴───────────────┴───────────────────┴────────────────────────────────┘\n"
+
"┌───────────┬─────────────────┬───────────────┬──────────────┬────────────────────────────────────┐\n",
+ "│ z_zonekey ┆ z_name ┆ total_pickups ┆ avg_distance ┆
avg_duration │\n",
+ "│ int64 ┆ utf8 ┆ int64 ┆ decimal128 ┆
duration │\n",
+
"╞═══════════╪═════════════════╪═══════════════╪══════════════╪════════════════════════════════════╡\n",
+ "│ 30084 ┆ Coconino County ┆ 541 ┆ 0.000030406 ┆ 0 days 1
hours 45 mins 16.591 secs │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 30083 ┆ Yavapai County ┆ 292 ┆ 0.000027157 ┆ 0 days 1
hours 36 mins 43.647 secs │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 29488 ┆ Gila County ┆ 39 ┆ 0.000021282 ┆ 0 days 1
hours 16 mins 59.769 secs │\n",
+
"└───────────┴─────────────────┴───────────────┴──────────────┴────────────────────────────────────┘\n"
]
}
],
"source": [
"sd.sql(\"\"\"\n",
"SELECT\n",
- " z.z_zonekey, z.z_name,\n",
- " COUNT(t.t_tripkey) AS total_pickups, AVG(t.t_totalamount) AS
avg_distance,\n",
+ " z.z_zonekey,\n",
+ " z.z_name,\n",
+ " COUNT(t.t_tripkey) AS total_pickups,\n",
+ " AVG(t.t_distance) AS avg_distance, -- Corrected from t_totalamount\n",
" AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration\n",
"FROM trip t, zone z\n",
- "WHERE ST_Intersects(ST_GeomFromText('POLYGON((-112.2110 34.4197,
-111.3110 34.4197, -111.3110 35.3197, -112.2110 35.3197, -112.2110 34.4197))'),
ST_GeomFromWKB(z.z_boundary)) -- 50km bounding box around Sedona\n",
- " AND ST_Within(ST_GeomFromWKB(t.t_pickuploc),
ST_GeomFromWKB(z.z_boundary))\n",
+ "WHERE ST_Intersects(\n",
+ " ST_GeomFromText('POLYGON((\n",
+ " -112.2110 34.4197, -111.3110 34.4197,\n",
+ " -111.3110 35.3197, -112.2110 35.3197,\n",
+ " -112.2110 34.4197\n",
+ " ))'), -- Bounding box around Sedona\n",
+ " ST_GeomFromWKB(z.z_boundary)\n",
+ " )\n",
+ " AND ST_Within(\n",
+ " ST_GeomFromWKB(t.t_pickuploc),\n",
+ " ST_GeomFromWKB(z.z_boundary)\n",
+ " )\n",
"GROUP BY z.z_zonekey, z.z_name\n",
"ORDER BY total_pickups DESC, z.z_zonekey ASC\n",
"\"\"\").show(3)"
@@ -388,7 +482,7 @@
},
{
"cell_type": "code",
- "execution_count": 31,
+ "execution_count": 43,
"id": "3a8cddde-c04b-49d4-ab69-3d75d9a535ff",
"metadata": {},
"outputs": [
@@ -398,13 +492,13 @@
"text": [
"┌───────────┬─────────────────────┬────────────────────┬──────────────────────┐\n",
"│ t_tripkey ┆ reported_distance_m ┆ line_distance_m ┆
detour_ratio │\n",
- "│ int64 ┆ decimal128(15, 5) ┆ float64 ┆ float64
│\n",
+ "│ int64 ┆ decimal128 ┆ float64 ┆ float64
│\n",
"╞═══════════╪═════════════════════╪════════════════════╪══════════════════════╡\n",
- "│ 4688563 ┆ 0.00010 ┆ 11111.126052681648 ┆
8.99998789734414e-9 │\n",
+ "│ 4688563 ┆ 0.00010 ┆ 11111.114941555596 ┆
8.999996897341038e-9 │\n",
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 2380123 ┆ 0.00010 ┆ 11111.126095065882 ┆
8.999987863013003e-9 │\n",
+ "│ 2380123 ┆ 0.00010 ┆ 11111.114983939786 ┆
8.999996863009868e-9 │\n",
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 3077131 ┆ 0.00010 ┆ 11111.126138581423 ┆
8.99998782776551e-9 │\n",
+ "│ 3077131 ┆ 0.00010 ┆ 11111.115027455284 ┆
8.999996827762339e-9 │\n",
"└───────────┴─────────────────────┴────────────────────┴──────────────────────┘\n"
]
}
@@ -420,7 +514,7 @@
" ST_GeomFromWKB(t.t_pickuploc),\n",
" ST_GeomFromWKB(t.t_dropoffloc)\n",
" )\n",
- " ) / 0.000009 AS line_distance_m -- 1 meter = 0.000009 degree\n",
+ " ) * 111111 AS line_distance_m -- Approx. meters per degree\n",
" FROM trip t\n",
")\n",
"SELECT\n",
@@ -429,7 +523,10 @@
" t.line_distance_m,\n",
" t.reported_distance_m / NULLIF(t.line_distance_m, 0) AS
detour_ratio\n",
"FROM trip_lengths t\n",
- "ORDER BY detour_ratio DESC NULLS LAST, reported_distance_m DESC,
t_tripkey ASC\n",
+ "ORDER BY\n",
+ " detour_ratio DESC NULLS LAST,\n",
+ " reported_distance_m DESC,\n",
+ " t_tripkey ASC\n",
"\"\"\").show(3)"
]
},
@@ -452,7 +549,7 @@
},
{
"cell_type": "code",
- "execution_count": 32,
+ "execution_count": 44,
"id": "58ac6e6a-2e3d-4dd7-bb03-1b16f55fc186",
"metadata": {},
"outputs": [
@@ -460,16 +557,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "┌───────────────┬──────────┬─────────────────────┐\n",
- "│ b_buildingkey ┆ b_name ┆ nearby_pickup_count │\n",
- "│ int64 ┆ utf8view ┆ int64 │\n",
- "╞═══════════════╪══════════╪═════════════════════╡\n",
- "│ 3779 ┆ linen ┆ 42 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 19135 ┆ misty ┆ 36 │\n",
- "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 4416 ┆ sienna ┆ 26 │\n",
- "└───────────────┴──────────┴─────────────────────┘\n"
+ "┌───────────────┬────────┬─────────────────────┐\n",
+ "│ b_buildingkey ┆ b_name ┆ nearby_pickup_count │\n",
+ "│ int64 ┆ utf8 ┆ int64 │\n",
+ "╞═══════════════╪════════╪═════════════════════╡\n",
+ "│ 3779 ┆ linen ┆ 42 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 19135 ┆ misty ┆ 36 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 4416 ┆ sienna ┆ 26 │\n",
+ "└───────────────┴────────┴─────────────────────┘\n"
]
}
],
@@ -505,7 +602,7 @@
},
{
"cell_type": "code",
- "execution_count": 34,
+ "execution_count": 45,
"id": "baae3710-6008-47cf-a31f-db2813f9fca3",
"metadata": {},
"outputs": [
@@ -532,22 +629,20 @@
" SELECT b_buildingkey AS id, ST_GeomFromWKB(b_boundary) AS geom\n",
" FROM building\n",
"),\n",
- " b2 AS (\n",
- " SELECT b_buildingkey AS id, ST_GeomFromWKB(b_boundary) AS geom\n",
- " FROM building\n",
- " ),\n",
- " pairs AS (\n",
- " SELECT\n",
- " b1.id AS building_1,\n",
- " b2.id AS building_2,\n",
- " ST_Area(b1.geom) AS area1,\n",
- " ST_Area(b2.geom) AS area2,\n",
- " ST_Area(ST_Intersection(b1.geom, b2.geom)) AS overlap_area\n",
- " FROM b1\n",
- " JOIN b2\n",
- " ON b1.id < b2.id\n",
- " AND ST_Intersects(b1.geom, b2.geom)\n",
- " )\n",
+ "b2 AS (\n",
+ " SELECT b_buildingkey AS id, ST_GeomFromWKB(b_boundary) AS geom\n",
+ " FROM building\n",
+ "),\n",
+ "pairs AS (\n",
+ " SELECT\n",
+ " b1.id AS building_1,\n",
+ " b2.id AS building_2,\n",
+ " ST_Area(b1.geom) AS area1,\n",
+ " ST_Area(b2.geom) AS area2,\n",
+ " ST_Area(ST_Intersection(b1.geom, b2.geom)) AS overlap_area\n",
+ " FROM b1\n",
+ " JOIN b2 ON b1.id < b2.id AND ST_Intersects(b1.geom, b2.geom)\n",
+ ")\n",
"SELECT\n",
" building_1,\n",
" building_2,\n",
@@ -555,10 +650,9 @@
" area2,\n",
" overlap_area,\n",
" CASE\n",
- " WHEN overlap_area = 0 THEN 0.0\n",
" WHEN (area1 + area2 - overlap_area) = 0 THEN 1.0\n",
" ELSE overlap_area / (area1 + area2 - overlap_area)\n",
- " END AS iou\n",
+ " END AS iou\n",
"FROM pairs\n",
"ORDER BY iou DESC, building_1 ASC, building_2 ASC\n",
"\"\"\").show(3)"
@@ -584,7 +678,7 @@
},
{
"cell_type": "code",
- "execution_count": 35,
+ "execution_count": 46,
"id": "bd264220-b2dd-49a1-a8f9-0267ffb1371c",
"metadata": {},
"outputs": [
@@ -592,25 +686,33 @@
"name": "stdout",
"output_type": "stream",
"text": [
-
"┌───────────┬────────────────┬────────────────────────────────────┬───────────────────┬───────────┐\n",
- "│ z_zonekey ┆ pickup_zone ┆ avg_duration ┆
avg_distance ┆ num_trips │\n",
- "│ int64 ┆ utf8view ┆ duration(millisecond) ┆
decimal128(19, 9) ┆ int64 │\n",
-
"╞═══════════╪════════════════╪════════════════════════════════════╪═══════════════════╪═══════════╡\n",
- "│ 31558 ┆ Benewah County ┆ 4 days 13 hours 3 mins 34.000 secs ┆
0.002180000 ┆ 2 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 119540 ┆ Kreis Unna ┆ 2 days 4 hours 52 mins 44.000 secs ┆
0.001050000 ┆ 1 │\n",
-
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤\n",
- "│ 59928 ┆ Ndhiwa ┆ 2 days 4 hours 19 mins 39.000 secs ┆
0.001040000 ┆ 1 │\n",
-
"└───────────┴────────────────┴────────────────────────────────────┴───────────────────┴───────────┘\n"
+
"┌───────────┬────────────────┬────────────────────────────────────┬──────────────┬───────────┐\n",
+ "│ z_zonekey ┆ pickup_zone ┆ avg_duration ┆
avg_distance ┆ num_trips │\n",
+ "│ int64 ┆ utf8 ┆ duration ┆
decimal128 ┆ int64 │\n",
+
"╞═══════════╪════════════════╪════════════════════════════════════╪══════════════╪═══════════╡\n",
+ "│ 31558 ┆ Benewah County ┆ 4 days 13 hours 3 mins 34.000 secs ┆
0.002180000 ┆ 2 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 119540 ┆ Kreis Unna ┆ 2 days 4 hours 52 mins 44.000 secs ┆
0.001050000 ┆ 1 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 59928 ┆ Ndhiwa ┆ 2 days 4 hours 19 mins 39.000 secs ┆
0.001040000 ┆ 1 │\n",
+
"└───────────┴────────────────┴────────────────────────────────────┴──────────────┴───────────┘\n"
]
}
],
"source": [
"sd.sql(\"\"\"\n",
"SELECT\n",
- " z.z_zonekey, z.z_name AS pickup_zone, AVG(t.t_dropofftime -
t.t_pickuptime) AS avg_duration,\n",
- " AVG(t.t_distance) AS avg_distance, COUNT(t.t_tripkey) AS num_trips\n",
- "FROM zone z LEFT JOIN trip t ON ST_Within(ST_GeomFromWKB(t.t_pickuploc),
ST_GeomFromWKB(z.z_boundary))\n",
+ " z.z_zonekey,\n",
+ " z.z_name AS pickup_zone,\n",
+ " AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration,\n",
+ " AVG(t.t_distance) AS avg_distance,\n",
+ " COUNT(t.t_tripkey) AS num_trips\n",
+ "FROM\n",
+ " zone z\n",
+ " LEFT JOIN trip t\n",
+ " ON ST_Within(\n",
+ " ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromWKB(z.z_boundary)\n",
+ " )\n",
"GROUP BY z.z_zonekey, z.z_name\n",
"ORDER BY avg_duration DESC NULLS LAST, z.z_zonekey ASC\n",
"\"\"\").show(3)"
@@ -635,7 +737,7 @@
},
{
"cell_type": "code",
- "execution_count": 36,
+ "execution_count": 47,
"id": "64bf9632-b4ca-4862-9443-41e90b82bee4",
"metadata": {},
"outputs": [
@@ -657,8 +759,16 @@
"SELECT COUNT(*) AS cross_zone_trip_count\n",
"FROM\n",
" trip t\n",
- " JOIN zone pickup_zone ON ST_Within(ST_GeomFromWKB(t.t_pickuploc),
ST_GeomFromWKB(pickup_zone.z_boundary))\n",
- " JOIN zone dropoff_zone ON ST_Within(ST_GeomFromWKB(t.t_dropoffloc),
ST_GeomFromWKB(dropoff_zone.z_boundary))\n",
+ " JOIN zone pickup_zone\n",
+ " ON ST_Within(\n",
+ " ST_GeomFromWKB(t.t_pickuploc),\n",
+ " ST_GeomFromWKB(pickup_zone.z_boundary)\n",
+ " )\n",
+ " JOIN zone dropoff_zone\n",
+ " ON ST_Within(\n",
+ " ST_GeomFromWKB(t.t_dropoffloc),\n",
+ " ST_GeomFromWKB(dropoff_zone.z_boundary)\n",
+ " )\n",
"WHERE pickup_zone.z_zonekey != dropoff_zone.z_zonekey\n",
"\"\"\").show(3)"
]
@@ -682,18 +792,42 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 48,
"id": "0878af06-c518-44f3-9b45-83651cc37d04",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+
"┌───────────┬─────────────────────────────────┬───────────────┬───────────────┬────────────────────┐\n",
+ "│ t_tripkey ┆ t_pickuploc ┆ b_buildingkey ┆
building_name ┆ distance_to_buildi │\n",
+ "│ int64 ┆ binary ┆ int64 ┆
utf8 ┆ ng… │\n",
+
"╞═══════════╪═════════════════════════════════╪═══════════════╪═══════════════╪════════════════════╡\n",
+ "│ 1 ┆ 01010000009f3c318dd43735405930… ┆ 15870 ┆ purple
┆ 0.984633987957188 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 1 ┆ 01010000009f3c318dd43735405930… ┆ 6800 ┆ ghost
┆ 1.205725156670704 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 1 ┆ 01010000009f3c318dd43735405930… ┆ 8384 ┆
lavender ┆ 1.4195012994942622 │\n",
+
"└───────────┴─────────────────────────────────┴───────────────┴───────────────┴────────────────────┘\n"
+ ]
+ }
+ ],
"source": [
"sd.sql(\"\"\"\n",
"WITH trip_with_geom AS (\n",
- " SELECT t_tripkey, t_pickuploc, ST_GeomFromWKB(t_pickuploc) as
pickup_geom\n",
+ " SELECT\n",
+ " t_tripkey,\n",
+ " t_pickuploc,\n",
+ " ST_GeomFromWKB(t_pickuploc) as pickup_geom\n",
" FROM trip\n",
"),\n",
"building_with_geom AS (\n",
- " SELECT b_buildingkey, b_name, b_boundary, ST_GeomFromWKB(b_boundary)
as boundary_geom\n",
+ " SELECT\n",
+ " b_buildingkey,\n",
+ " b_name,\n",
+ " b_boundary,\n",
+ " ST_GeomFromWKB(b_boundary) as boundary_geom\n",
" FROM building\n",
")\n",
"SELECT\n",
@@ -702,24 +836,17 @@
" b.b_buildingkey,\n",
" b.b_name AS building_name,\n",
" ST_Distance(t.pickup_geom, b.boundary_geom) AS
distance_to_building\n",
- "FROM trip_with_geom t JOIN building_with_geom b\n",
- "ON ST_KNN(t.pickup_geom, b.boundary_geom, 5, FALSE)\n",
- "ORDER BY distance_to_building ASC, b.b_buildingkey ASC\n",
+ "FROM trip_with_geom t\n",
+ "JOIN building_with_geom b\n",
+ " ON ST_KNN(t.pickup_geom, b.boundary_geom, 5, FALSE)\n",
+ "ORDER BY t.t_tripkey ASC, distance_to_building ASC, b.b_buildingkey
ASC\n",
"\"\"\").show(3)"
]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6bdfeffb-5015-4f34-ae73-a8e59af7d4f4",
- "metadata": {},
- "outputs": [],
- "source": []
}
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": ".venv (3.13.3)",
"language": "python",
"name": "python3"
},
@@ -733,7 +860,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.12.4"
+ "version": "3.13.3"
}
},
"nbformat": 4,
diff --git a/docs/queries.md b/docs/queries.md
new file mode 100644
index 0000000..7cc4741
--- /dev/null
+++ b/docs/queries.md
@@ -0,0 +1,612 @@
+# Run the SpatialBench Queries
+
+This notebook contains the queries that make up the SpatialBench benchmark.
+
+SpatialBench is a benchmark for assessing geospatial SQL analytics query
performance across database systems. The queries represent common, real-world
geospatial analytics tasks and are designed to test a wide range of spatial
functions and join conditions.
+
+The benchmark uses a realistic but synthetic, transportation-themed dataset to
ensure the queries reflect practical use cases. By running these queries, you
can evaluate and compare the relative performance of different spatial query
engines in a consistent and unbiased manner.
+
+## Before you start
+
+Before running this notebook, ensure that you have installed the packages in
the `requirements.txt` file:
+
+
+```python
+%pip install -r ~/sedona-spatialbench/docs/requirements.txt
+```
+
+ ...
+ ...
+ Note: you may need to restart the kernel to use updated packages.
+
+
+Additionally, install the SpatialBench CLI and generate the synthetic data on
your machine:
+
+```
+# SpatialBench CLI
+cargo install --path ./spatialbench-cli
+# Generate the benchmarking data to the sf1-parquet directory
+spatialbench-cli -s 1 --format=parquet --output-dir sf1-parquet
+```
+
+
+```python
+import sedona.db
+```
+
+
+```python
+sd = sedona.db.connect()
+```
+
+
+```python
+sd.read_parquet(f"../sf1-parquet/building.parquet").to_view("building")
+sd.read_parquet(f"../sf1-parquet/customer.parquet").to_view("customer")
+sd.read_parquet(f"../sf1-parquet/driver.parquet").to_view("driver")
+sd.read_parquet(f"../sf1-parquet/trip.parquet").to_view("trip")
+sd.read_parquet(f"../sf1-parquet/vehicle.parquet").to_view("vehicle")
+sd.read_parquet(f"../sf1-parquet/zone.parquet").to_view("zone")
+```
+
+## Q1: Find trips starting within 50km of Sedona city center, ordered by
distance
+
+**Real-life scenario:** Identify and rank trips by proximity to a city center
for urban planning and transportation analysis.
+
+This query finds all taxi or rideshare trips that started within 50 kilometers
of downtown Sedona, Arizona. For each qualifying trip, it shows the trip ID,
pickup coordinates, pickup time, and calculates the exact distance from the
pickup location to Sedona's city center. The results are sorted to show the
trips that picked up closest to downtown Sedona first, making it easy to see
which rides originated nearest to the city center.
+
+**Spatial query characteristics tested:**
+
+1. Distance-based spatial filtering (ST_DWithin)
+2. Distance calculation to a fixed point
+3. Coordinate extraction (ST_X, ST_Y)
+4. Ordering by spatial distance
+
+
+```python
+sd.sql("""
+SELECT
+ t.t_tripkey,
+ ST_X(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lon,
+ ST_Y(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lat,
+ t.t_pickuptime,
+ ST_Distance(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromText('POINT (-111.7610 34.8697)')
+ ) AS distance_to_center
+FROM trip t
+WHERE ST_DWithin(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromText('POINT (-111.7610 34.8697)'),
+ 0.45 -- 50km radius around Sedona center in degrees
+)
+ORDER BY distance_to_center ASC, t.t_tripkey ASC
+""").show(3)
+```
+
+
┌───────────┬────────────────┬──────────────┬─────────────────────┬──────────────────────┐
+ │ t_tripkey ┆ pickup_lon ┆ pickup_lat ┆ t_pickuptime ┆
distance_to_center │
+ │ int64 ┆ float64 ┆ float64 ┆ timestamp ┆
float64 │
+
╞═══════════╪════════════════╪══════════════╪═════════════════════╪══════════════════════╡
+ │ 1451371 ┆ -111.791052127 ┆ 34.826733457 ┆ 1998-08-12T06:47:01 ┆
0.05243333056935387 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 2047835 ┆ -111.706967009 ┆ 34.883889472 ┆ 1992-04-08T07:36:09 ┆
0.055865062714050374 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 3936870 ┆ -111.827619221 ┆ 34.882950924 ┆ 1998-11-10T13:32:07 ┆
0.06792427838042854 │
+
└───────────┴────────────────┴──────────────┴─────────────────────┴──────────────────────┘
+
+
+## Q2: Count trips starting within Coconino County (Arizona) zone
+
+**Real-life scenario:** Count all trips originating within a specific
administrative boundary (county) for regional transportation statistics.
+
+This query counts how many taxi or rideshare trips started within Coconino
County, Arizona. It does this by checking if each trip's pickup location falls
inside the county's geographic boundaries. The result is a simple count showing
the total number of trips that originated anywhere within Coconino County.
+
+**Spatial query characteristics tested:**
+
+1. Point-in-polygon spatial filtering (ST_Intersects)
+2. Subquery with spatial geometry selection
+3. Simple aggregation on spatially filtered data
+
+
+```python
+sd.sql("""
+SELECT COUNT(*) AS trip_count_in_coconino_county
+FROM trip t
+WHERE ST_Intersects(
+ ST_GeomFromWKB(t.t_pickuploc),
+ (
+ SELECT ST_GeomFromWKB(z.z_boundary)
+ FROM zone z
+ WHERE z.z_name = 'Coconino County'
+ LIMIT 1
+ )
+)
+""").show(3)
+```
+
+ ┌───────────────────────────────┐
+ │ trip_count_in_coconino_county │
+ │ int64 │
+ ╞═══════════════════════════════╡
+ │ 541 │
+ └───────────────────────────────┘
+
+
+## Q3: Monthly trip statistics within a 15km radius of the Sedona city center
+
+**Real-life scenario:** Track monthly travel trends and performance metrics in
a metropolitan area with seasonal analysis.
+
+This query analyzes taxi and rideshare trip patterns around Sedona, Arizona,
by grouping trips into monthly summaries. It looks at all trips that started
within a 15-kilometer area around Sedona (a 10km box plus 5km buffer) and
calculates key statistics for each month, including total number of trips,
average trip distance, average trip duration, and average fare. The results are
organized chronologically by month, allowing you to see seasonal trends and
changes in ride patterns over tim [...]
+
+**Spatial query characteristics tested:**
+
+1. Distance-based spatial filtering (ST_DWithin) with buffer
+2. Temporal grouping (monthly aggregation)
+3. Multiple statistical aggregations on spatially filtered data
+
+
+```python
+sd.sql("""
+SELECT
+ DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,
+ COUNT(t.t_tripkey) AS total_trips,
+ AVG(t.t_distance) AS avg_distance,
+ AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration,
+ AVG(t.t_fare) AS avg_fare
+FROM trip t
+WHERE ST_DWithin(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromText('POLYGON((
+ -111.9060 34.7347, -111.6160 34.7347,
+ -111.6160 35.0047, -111.9060 35.0047,
+ -111.9060 34.7347
+ ))'), -- Bounding box around Sedona
+ 0.045 -- Additional 5km buffer in degrees
+)
+GROUP BY pickup_month
+ORDER BY pickup_month
+""").show(3)
+```
+
+
┌─────────────────────┬─────────────┬──────────────┬─────────────────────────────────┬─────────────┐
+ │ pickup_month ┆ total_trips ┆ avg_distance ┆
avg_duration ┆ avg_fare │
+ │ timestamp ┆ int64 ┆ decimal128 ┆ duration
┆ decimal128 │
+
╞═════════════════════╪═════════════╪══════════════╪═════════════════════════════════╪═════════════╡
+ │ 1992-04-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1 hours 23
mins 47.000 … ┆ 0.000075000 │
+
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 1992-07-01T00:00:00 ┆ 1 ┆ 0.000010000 ┆ 0 days 0 hours 58
mins 58.000 … ┆ 0.000040000 │
+
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 1994-02-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1 hours 23
mins 50.000 … ┆ 0.000050000 │
+
└─────────────────────┴─────────────┴──────────────┴─────────────────────────────────┴─────────────┘
+
+
+## Q4: Zone distribution of top 1000 trips by tip amount
+
+**Real-life scenario:** Analyze the geographic distribution of high-value
trips (by tip amount) to understand premium service areas.
+
+This query identifies which neighborhoods or zones produced the most generous
tippers by analyzing the top 1000 highest-tipping trips. It first finds the
1000 trips with the largest tips, then determines which geographic zones or
neighborhoods those pickup locations fall within, and counts how many of these
high-tip trips originated from each area. The results show a ranking of zones
by the number of big tippers they produced, helping identify the most lucrative
pickup areas for drivers [...]
+
+**Spatial query characteristics tested:**
+
+1. Subquery with ordering and limiting
+2. Point-in-polygon spatial join (ST_Within)
+3. Aggregation on spatially joined results
+4. Multi-step query with spatial filtering and grouping
+
+
+```python
+sd.sql("""
+SELECT
+ z.z_zonekey,
+ z.z_name,
+ COUNT(*) AS trip_count
+FROM
+ zone z
+ JOIN (
+ SELECT t.t_pickuploc
+ FROM trip t
+ ORDER BY t.t_tip DESC, t.t_tripkey ASC
+ LIMIT 1000
+ ) top_trips
+ ON ST_Within(
+ ST_GeomFromWKB(top_trips.t_pickuploc),
+ ST_GeomFromWKB(z.z_boundary)
+ )
+GROUP BY z.z_zonekey, z.z_name
+ORDER BY trip_count DESC, z.z_zonekey ASC
+""").show(3)
+```
+
+ ┌───────────┬─────────────────────────────────┬────────────┐
+ │ z_zonekey ┆ z_name ┆ trip_count │
+ │ int64 ┆ utf8 ┆ int64 │
+ ╞═══════════╪═════════════════════════════════╪════════════╡
+ │ 65008 ┆ Ndélé ┆ 35 │
+ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 147530 ┆ 乐山市 ┆ 27 │
+ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 150276 ┆ 锡林郭勒盟 ᠰᠢᠯᠢ ᠶᠢᠨ ᠭᠣᠣᠯ ᠠᠶᠢᠮᠠᠭ ┆ 19 │
+ └───────────┴─────────────────────────────────┴────────────┘
+
+
+## Q5: Monthly travel patterns for repeat customers (convex hull of dropoff
locations)
+
+**Real-life scenario:** Analyze the geographic spread of travel patterns for
frequent customers to understand their mobility behavior.
+
+This query analyzes the monthly travel patterns of frequent customers by
measuring how much geographic area they cover with their trips. For each
customer who took more than five trips in a month, it calculates the size of
the "travel hull" - the area enclosed by connecting all their dropoff locations
that month. The results reveal which customers have the most expansive travel
patterns, helping to identify power users who cover large geographic areas
versus those who stick to smaller, l [...]
+
+**Spatial query characteristics tested:**
+
+1. Spatial aggregation (ST_Collect/ARRAY_AGG)
+2. Convex hull computation (ST_ConvexHull)
+3. Area calculation on complex geometries
+4. Temporal and customer-based grouping with spatial operations
+
+
+```python
+sd.sql("""
+SELECT
+ c.c_custkey,
+ c.c_name AS customer_name,
+ DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,
+ ST_Area(
+ ST_ConvexHull(ST_Collect(ST_GeomFromWKB(t.t_dropoffloc)))
+ ) AS monthly_travel_hull_area,
+ COUNT(*) as dropoff_count
+FROM trip t
+JOIN customer c
+ ON t.t_custkey = c.c_custkey
+GROUP BY c.c_custkey, c.c_name, pickup_month
+HAVING dropoff_count > 5 -- Only include repeat customers
+ORDER BY monthly_travel_hull_area DESC, c.c_custkey ASC
+""").show(3)
+```
+
+
┌───────────┬────────────────────┬─────────────────────┬────────────────────┬───────────────┐
+ │ c_custkey ┆ customer_name ┆ pickup_month ┆
monthly_travel_hul ┆ dropoff_count │
+ │ int64 ┆ utf8 ┆ timestamp ┆ l_area…
┆ int64 │
+
╞═══════════╪════════════════════╪═════════════════════╪════════════════════╪═══════════════╡
+ │ 25975 ┆ Customer#000025975 ┆ 1992-02-01T00:00:00 ┆
34941.303419053635 ┆ 10 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 12061 ┆ Customer#000012061 ┆ 1997-03-01T00:00:00 ┆
34607.53871953154 ┆ 14 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 21418 ┆ Customer#000021418 ┆ 1993-08-01T00:00:00 ┆
34465.32323910264 ┆ 9 │
+
└───────────┴────────────────────┴─────────────────────┴────────────────────┴───────────────┘
+
+
+## Q6: Zone statistics for trips within a 50km radius of the Sedona city center
+
+**Real-life scenario:** Analyze trip patterns in zones within a metropolitan
area around a specific city center.
+
+This query analyzes ride activity across all neighborhoods and zones within a
50-kilometer area around Sedona, Arizona. It identifies which zones had the
most pickup activity by counting total trips that started in each region. Also,
it calculates the average trip cost and duration for rides originating from
each zone. The results are ranked by pickup volume, showing which neighborhoods
or areas generate the most ride demand and their typical trip characteristics
within the greater Sedon [...]
+
+**Spatial query characteristics tested:**
+
+1. Polygon containment check (ST_Contains) with bounding box
+2. Point-in-polygon spatial join (ST_Within)
+
+
+```python
+sd.sql("""
+SELECT
+ z.z_zonekey,
+ z.z_name,
+ COUNT(t.t_tripkey) AS total_pickups,
+ AVG(t.t_distance) AS avg_distance, -- Corrected from t_totalamount
+ AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration
+FROM trip t, zone z
+WHERE ST_Intersects(
+ ST_GeomFromText('POLYGON((
+ -112.2110 34.4197, -111.3110 34.4197,
+ -111.3110 35.3197, -112.2110 35.3197,
+ -112.2110 34.4197
+ ))'), -- Bounding box around Sedona
+ ST_GeomFromWKB(z.z_boundary)
+ )
+ AND ST_Within(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromWKB(z.z_boundary)
+ )
+GROUP BY z.z_zonekey, z.z_name
+ORDER BY total_pickups DESC, z.z_zonekey ASC
+""").show(3)
+```
+
+
┌───────────┬─────────────────┬───────────────┬──────────────┬────────────────────────────────────┐
+ │ z_zonekey ┆ z_name ┆ total_pickups ┆ avg_distance ┆
avg_duration │
+ │ int64 ┆ utf8 ┆ int64 ┆ decimal128 ┆
duration │
+
╞═══════════╪═════════════════╪═══════════════╪══════════════╪════════════════════════════════════╡
+ │ 30084 ┆ Coconino County ┆ 541 ┆ 0.000030406 ┆ 0 days 1
hours 45 mins 16.591 secs │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 30083 ┆ Yavapai County ┆ 292 ┆ 0.000027157 ┆ 0 days 1
hours 36 mins 43.647 secs │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 29488 ┆ Gila County ┆ 39 ┆ 0.000021282 ┆ 0 days 1
hours 16 mins 59.769 secs │
+
└───────────┴─────────────────┴───────────────┴──────────────┴────────────────────────────────────┘
+
+
+## Q7: Detect potential route detours by comparing reported vs. geometric
distances
+
+**Real-life scenario:** Identify suspicious trips where the reported distance
significantly exceeds the straight-line distance, potentially indicating fare
manipulation.
+
+This query analyzes how much taxi and rideshare trips deviate from the most
direct route by comparing the actual reported trip distance to the
straight-line distance between pickup and dropoff points. It calculates a
"detour ratio" that shows how much longer the actual route was compared to
flying in a straight line. For example, a ratio of 1.5 means the trip was 50%
longer than the direct path. The results are sorted to show the trips with the
highest detour ratios first, helping ident [...]
+
+**Spatial query characteristics tested:**
+
+1. Line geometry creation (ST_MakeLine)
+2. Length calculation (ST_Length)
+3. Coordinate system conversion and distance calculations
+4. Ratio-based filtering on geometric vs. reported measurements
+
+
+```python
+sd.sql("""
+WITH trip_lengths AS (
+ SELECT
+ t.t_tripkey,
+ t.t_distance AS reported_distance_m,
+ ST_Length(
+ ST_MakeLine(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromWKB(t.t_dropoffloc)
+ )
+ ) * 111111 AS line_distance_m -- Approx. meters per degree
+ FROM trip t
+)
+SELECT
+ t.t_tripkey,
+ t.reported_distance_m,
+ t.line_distance_m,
+ t.reported_distance_m / NULLIF(t.line_distance_m, 0) AS detour_ratio
+FROM trip_lengths t
+ORDER BY
+ detour_ratio DESC NULLS LAST,
+ reported_distance_m DESC,
+ t_tripkey ASC
+""").show(3)
+```
+
+
┌───────────┬─────────────────────┬────────────────────┬──────────────────────┐
+ │ t_tripkey ┆ reported_distance_m ┆ line_distance_m ┆ detour_ratio
│
+ │ int64 ┆ decimal128 ┆ float64 ┆ float64
│
+
╞═══════════╪═════════════════════╪════════════════════╪══════════════════════╡
+ │ 4688563 ┆ 0.00010 ┆ 11111.114941555596 ┆
8.999996897341038e-9 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 2380123 ┆ 0.00010 ┆ 11111.114983939786 ┆
8.999996863009868e-9 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 3077131 ┆ 0.00010 ┆ 11111.115027455284 ┆
8.999996827762339e-9 │
+
└───────────┴─────────────────────┴────────────────────┴──────────────────────┘
+
+
+## Q8: Count nearby pickups for each building within a 500m radius
+
+**Real-life scenario:** Count how many trips start within 500 meters of each
building.
+
+This query identifies which buildings generate the most taxi and rideshare
pickup activity by counting trips that started within 500 meters of each
building. It analyzes the relationship between specific buildings (like hotels,
shopping centers, airports, or office buildings) and ride demand in their
immediate vicinity. The results are ranked to show which buildings are the
biggest trip generators, helping identify key pickup hotspots and understand
how different types of buildings drive [...]
+
+**Spatial query characteristics tested:**
+
+1. Distance spatial join between points and polygons
+2. Aggregation on spatial join result
+
+
+```python
+sd.sql("""
+SELECT b.b_buildingkey, b.b_name, COUNT(*) AS nearby_pickup_count
+FROM trip t
+JOIN building b
+ON ST_DWithin(ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromWKB(b.b_boundary),
0.0045) -- ~500m
+GROUP BY b.b_buildingkey, b.b_name
+ORDER BY nearby_pickup_count DESC, b.b_buildingkey ASC
+""").show(3)
+```
+
+ ┌───────────────┬────────┬─────────────────────┐
+ │ b_buildingkey ┆ b_name ┆ nearby_pickup_count │
+ │ int64 ┆ utf8 ┆ int64 │
+ ╞═══════════════╪════════╪═════════════════════╡
+ │ 3779 ┆ linen ┆ 42 │
+ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 19135 ┆ misty ┆ 36 │
+ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 4416 ┆ sienna ┆ 26 │
+ └───────────────┴────────┴─────────────────────┘
+
+
+## Q9: Building Conflation (duplicate/overlap detection via IoU)
+
+**Real-life scenario:** Detect duplicate or overlapping building footprints in
GIS datasets to identify data quality issues.
+
+This query identifies overlapping buildings by calculating how much their
footprints intersect with each other. For every pair of buildings that touch or
overlap, it measures the total area of each building and the area where they
overlap, then calculates an "Intersection over Union" (IoU) score that ranges
from 0 to 1. The results are sorted by IoU score to show the most significantly
overlapping building pairs first, which could help identify data quality
issues, adjacent structures, o [...]
+
+**Spatial query characteristics tested:**
+
+1. Self-join with spatial intersection (ST_Intersects)
+2. Area calculations (ST_Area)
+3. Geometric intersection operations (ST_Intersection)
+4. Complex geometric ratio calculations (IoU - Intersection over Union)
+
+
+```python
+sd.sql("""
+WITH b1 AS (
+ SELECT b_buildingkey AS id, ST_GeomFromWKB(b_boundary) AS geom
+ FROM building
+),
+b2 AS (
+ SELECT b_buildingkey AS id, ST_GeomFromWKB(b_boundary) AS geom
+ FROM building
+),
+pairs AS (
+ SELECT
+ b1.id AS building_1,
+ b2.id AS building_2,
+ ST_Area(b1.geom) AS area1,
+ ST_Area(b2.geom) AS area2,
+ ST_Area(ST_Intersection(b1.geom, b2.geom)) AS overlap_area
+ FROM b1
+ JOIN b2 ON b1.id < b2.id AND ST_Intersects(b1.geom, b2.geom)
+)
+SELECT
+ building_1,
+ building_2,
+ area1,
+ area2,
+ overlap_area,
+ CASE
+ WHEN (area1 + area2 - overlap_area) = 0 THEN 1.0
+ ELSE overlap_area / (area1 + area2 - overlap_area)
+ END AS iou
+FROM pairs
+ORDER BY iou DESC, building_1 ASC, building_2 ASC
+""").show(3)
+```
+
+
┌────────────┬────────────┬───┬───────────────────────┬────────────────────┐
+ │ building_1 ┆ building_2 ┆ … ┆ overlap_area ┆ iou
│
+ │ int64 ┆ int64 ┆ ┆ float64 ┆ float64
│
+
╞════════════╪════════════╪═══╪═══════════════════════╪════════════════════╡
+ │ 2285 ┆ 15719 ┆ … ┆ 2.3709162946727276e-6 ┆ 0.9056816071717889
│
+
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 7562 ┆ 18534 ┆ … ┆ 5.855106543747764e-6 ┆ 0.8450437137796769
│
+
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 2285 ┆ 13658 ┆ … ┆ 1.9770693222933237e-6 ┆ 0.737899157380637
│
+
└────────────┴────────────┴───┴───────────────────────┴────────────────────┘
+
+
+## Q10: Zone statistics for trips starting within each zone
+
+**Real-life scenario:** Analyze trip patterns and performance metrics for each
administrative zone (like city districts or neighborhoods).
+
+This query analyzes trip patterns across all geographic zones by calculating
average trip duration, distance, and volume for rides originating from each
area. It uses a left join to include all zones in the results, even those with
no pickup activity, showing which neighborhoods generate longer trips on
average versus shorter local rides. The results are sorted by average trip
duration to identify zones where people tend to take longer journeys, which
could indicate more isolated areas, [...]
+
+**Spatial query characteristics tested:**
+
+1. Point-in-polygon spatial join (ST_Within)
+2. Aggregation with multiple metrics (average duration, distance, count)
+3. LEFT JOIN to include zones with no trips
+
+
+```python
+sd.sql("""
+SELECT
+ z.z_zonekey,
+ z.z_name AS pickup_zone,
+ AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration,
+ AVG(t.t_distance) AS avg_distance,
+ COUNT(t.t_tripkey) AS num_trips
+FROM
+ zone z
+ LEFT JOIN trip t
+ ON ST_Within(
+ ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromWKB(z.z_boundary)
+ )
+GROUP BY z.z_zonekey, z.z_name
+ORDER BY avg_duration DESC NULLS LAST, z.z_zonekey ASC
+""").show(3)
+```
+
+
┌───────────┬────────────────┬────────────────────────────────────┬──────────────┬───────────┐
+ │ z_zonekey ┆ pickup_zone ┆ avg_duration ┆
avg_distance ┆ num_trips │
+ │ int64 ┆ utf8 ┆ duration ┆
decimal128 ┆ int64 │
+
╞═══════════╪════════════════╪════════════════════════════════════╪══════════════╪═══════════╡
+ │ 31558 ┆ Benewah County ┆ 4 days 13 hours 3 mins 34.000 secs ┆
0.002180000 ┆ 2 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 119540 ┆ Kreis Unna ┆ 2 days 4 hours 52 mins 44.000 secs ┆
0.001050000 ┆ 1 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 59928 ┆ Ndhiwa ┆ 2 days 4 hours 19 mins 39.000 secs ┆
0.001040000 ┆ 1 │
+
└───────────┴────────────────┴────────────────────────────────────┴──────────────┴───────────┘
+
+
+## Q11: Count trips that cross between different zones
+
+**Real-life scenario:** Identify inter-district or inter-city trips to
understand cross-boundary travel patterns.
+
+This query counts how many trips crossed zone boundaries by starting in one
geographic zone and ending in a different zone. It identifies each trip's
pickup and dropoff zones, then filters to only include trips where the pickup
zone is different from the dropoff zone. The result shows the total number of
inter-zone trips, helping measure how much travel occurs between different
neighborhoods, districts, or areas rather than staying within the same local
zone.
+
+**Spatial query characteristics tested:**
+
+1. Multiple point-in-polygon spatial joins
+2. Filtering based on spatial relationship results
+
+
+```python
+sd.sql("""
+SELECT COUNT(*) AS cross_zone_trip_count
+FROM
+ trip t
+ JOIN zone pickup_zone
+ ON ST_Within(
+ ST_GeomFromWKB(t.t_pickuploc),
+ ST_GeomFromWKB(pickup_zone.z_boundary)
+ )
+ JOIN zone dropoff_zone
+ ON ST_Within(
+ ST_GeomFromWKB(t.t_dropoffloc),
+ ST_GeomFromWKB(dropoff_zone.z_boundary)
+ )
+WHERE pickup_zone.z_zonekey != dropoff_zone.z_zonekey
+""").show(3)
+```
+
+ ┌───────────────────────┐
+ │ cross_zone_trip_count │
+ │ int64 │
+ ╞═══════════════════════╡
+ │ 176391 │
+ └───────────────────────┘
+
+
+## Q12: Find five nearest buildings to each trip pickup location using KNN join
+**Real-life scenario:** Identify the closest landmarks or buildings to each
trip start point for location context and navigation.
+
+This query finds the 5 closest buildings to each trip pickup location using
spatial nearest neighbor analysis. For every trip, it identifies the five
buildings that are geographically closest to where the passenger was picked up
and calculates the exact distance to each of those buildings. The results show
which buildings are most commonly near pickup points, helping understand the
relationship between trip origins and nearby landmarks, businesses, or
residential structures that might in [...]
+
+**Spatial query characteristics tested:**
+
+1. K-nearest neighbor (KNN) spatial join
+2. Distance calculations between points and polygons
+3. Ranking and limiting results based on spatial proximity
+
+
+```python
+sd.sql("""
+WITH trip_with_geom AS (
+ SELECT
+ t_tripkey,
+ t_pickuploc,
+ ST_GeomFromWKB(t_pickuploc) as pickup_geom
+ FROM trip
+),
+building_with_geom AS (
+ SELECT
+ b_buildingkey,
+ b_name,
+ b_boundary,
+ ST_GeomFromWKB(b_boundary) as boundary_geom
+ FROM building
+)
+SELECT
+ t.t_tripkey,
+ t.t_pickuploc,
+ b.b_buildingkey,
+ b.b_name AS building_name,
+ ST_Distance(t.pickup_geom, b.boundary_geom) AS distance_to_building
+FROM trip_with_geom t
+JOIN building_with_geom b
+ ON ST_KNN(t.pickup_geom, b.boundary_geom, 5, FALSE)
+ORDER BY t.t_tripkey ASC, distance_to_building ASC, b.b_buildingkey ASC
+""").show(3)
+```
+
+
┌───────────┬─────────────────────────────────┬───────────────┬───────────────┬────────────────────┐
+ │ t_tripkey ┆ t_pickuploc ┆ b_buildingkey ┆
building_name ┆ distance_to_buildi │
+ │ int64 ┆ binary ┆ int64 ┆ utf8
┆ ng… │
+
╞═══════════╪═════════════════════════════════╪═══════════════╪═══════════════╪════════════════════╡
+ │ 1 ┆ 01010000009f3c318dd43735405930… ┆ 15870 ┆ purple
┆ 0.984633987957188 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 1 ┆ 01010000009f3c318dd43735405930… ┆ 6800 ┆ ghost
┆ 1.205725156670704 │
+
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
+ │ 1 ┆ 01010000009f3c318dd43735405930… ┆ 8384 ┆ lavender
┆ 1.4195012994942622 │
+
└───────────┴─────────────────────────────────┴───────────────┴───────────────┴────────────────────┘
+
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 13a8918..e90aba0 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,14 +1,18 @@
+apache-sedona[db]
+jupyter
+jupyterlab
mike
mkdocs-git-revision-date-localized-plugin
mkdocs-glightbox
mkdocs-macros-plugin
mkdocs-material
+mkdocs-jupyter
mkdocstrings[python]
+notebook
nbconvert
pyproj
ruff
-mkdocs-jupyter
-jupyter
-notebook
-jupyterlab
-apache-sedona[db]
\ No newline at end of file
+
+
+
+
diff --git a/mkdocs.yml b/mkdocs.yml
index d2bd5ef..e2fb203 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -22,7 +22,7 @@ nav:
- SpatialBench: index.md
- Overview and Methodology: overview-methodology.md
- Datasets and Generators: datasets-generators.md
- - Queries: queries.ipynb
+ - Run the SpatialBench Queries: queries.md
- Blog: "https://sedona.apache.org/latest/blog/"
- Community: "https://sedona.apache.org/latest/community/contact/"
- Apache Software Foundation: "https://sedona.apache.org/latest/asf/asf/"