Copilot commented on code in PR #26:
URL:
https://github.com/apache/sedona-spatialbench/pull/26#discussion_r2370306361
##########
docs/queries.ipynb:
##########
@@ -0,0 +1,741 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "cf23cfd2-c45c-4ca3-9dcb-70541f3cc47d",
+ "metadata": {},
+ "source": [
+ "# SpatialBench Queries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "9c1fe889-d035-4ed9-85a6-f7a8ecb891ad",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sedona.db"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "96cf6ad5-e106-48f7-9097-9f3f740e9e7f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sd = sedona.db.connect()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "16a147f4-4a12-4050-a1bb-934834b4a6bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/building.parquet\").to_view(\"building\")\n",
+
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/customer.parquet\").to_view(\"customer\")\n",
+
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/driver.parquet\").to_view(\"driver\")\n",
+
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/trip.parquet\").to_view(\"trip\")\n",
+
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/vehicle.parquet\").to_view(\"vehicle\")\n",
+
"sd.read_parquet(f\"/Users/matthewpowers/data/sf1-parquet/zone.parquet\").to_view(\"zone\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5e243bca-ca97-4f72-bc72-46ace7495019",
+ "metadata": {},
+ "source": [
+ "## Q1: Find trips starting within 50km of Sedona city center, ordered by
distance\n",
+ "\n",
+ "**Real-life scenario:** Identify and rank trips by proximity to a city
center for urban planning and transportation analysis.\n",
+ "\n",
+ "This query finds all taxi or rideshare trips that started within 50
kilometers of downtown Sedona, Arizona. For each qualifying trip, it shows the
trip ID, pickup coordinates, pickup time, and calculates the exact distance
from the pickup location to Sedona's city center. The results are sorted to
show the trips that picked up closest to downtown Sedona first, making it easy
to see which rides originated nearest to the city center.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Distance-based spatial filtering (ST_DWithin)\n",
+ "2. Distance calculation to a fixed point\n",
+ "3. Coordinate extraction (ST_X, ST_Y)\n",
+ "4. Ordering by spatial distance"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "f121f799-08f9-4679-a772-fb5f5470986f",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+
"┌───────────┬────────────────┬──────────────┬──────────────────────────────┬──────────────────────┐\n",
+ "│ t_tripkey ┆ pickup_lon ┆ pickup_lat ┆ t_pickuptime
┆ distance_to_center │\n",
+ "│ int64 ┆ float64 ┆ float64 ┆ timestamp(millisecond,
none) ┆ float64 │\n",
+
"╞═══════════╪════════════════╪══════════════╪══════════════════════════════╪══════════════════════╡\n",
+ "│ 1451371 ┆ -111.791052127 ┆ 34.826733457 ┆ 1998-08-12T06:47:01
┆ 0.05243333056935386 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 2047835 ┆ -111.706967009 ┆ 34.883889472 ┆ 1992-04-08T07:36:09
┆ 0.055865062714050374 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 3936870 ┆ -111.827619221 ┆ 34.882950924 ┆ 1998-11-10T13:32:07
┆ 0.06792427838042854 │\n",
+
"└───────────┴────────────────┴──────────────┴──────────────────────────────┴──────────────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "SELECT\n",
+ " t.t_tripkey, ST_X(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lon,
ST_Y(ST_GeomFromWKB(t.t_pickuploc)) AS pickup_lat, t.t_pickuptime,\n",
+ " ST_Distance(ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromText('POINT
(-111.7610 34.8697)')) AS distance_to_center\n",
+ "FROM trip t\n",
+ "WHERE ST_DWithin(ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromText('POINT
(-111.7610 34.8697)'), 0.45) -- 50km radius around Sedona center\n",
+ "ORDER BY distance_to_center ASC, t.t_tripkey ASC\n",
+ "\"\"\").show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d1e2a458-fc72-4a21-889e-9099f9ad6fb7",
+ "metadata": {},
+ "source": [
+ "## Q2: Count trips starting within Coconino County (Arizona) zone\n",
+ "\n",
+ "**Real-life scenario:** Count all trips originating within a specific
administrative boundary (county) for regional transportation statistics.\n",
+ "\n",
+ "This query counts how many taxi or rideshare trips started within
Coconino County, Arizona. It does this by checking if each trip's pickup
location falls inside the county's geographic boundaries. The result is a
simple count showing the total number of trips that originated anywhere within
Coconino County.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Point-in-polygon spatial filtering (ST_Intersects)\n",
+ "2. Subquery with spatial geometry selection\n",
+ "3. Simple aggregation on spatially filtered data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "cc219172-0cfb-4f60-a728-e4ec5cc5c21e",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "┌───────────────────────────────┐\n",
+ "│ trip_count_in_coconino_county │\n",
+ "│ int64 │\n",
+ "╞═══════════════════════════════╡\n",
+ "│ 541 │\n",
+ "└───────────────────────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "SELECT COUNT(*) AS trip_count_in_coconino_county\n",
+ "FROM trip t\n",
+ "WHERE ST_Intersects(ST_GeomFromWKB(t.t_pickuploc), (SELECT
ST_GeomFromWKB(z.z_boundary) FROM zone z WHERE z.z_name = 'Coconino County'
LIMIT 1))\n",
+ "\"\"\").show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "998daae7-31e4-4e93-823d-70d4cee00605",
+ "metadata": {},
+ "source": [
+ "## Q3: Monthly trip statistics within a 15km radius of the Sedona city
center\n",
+ "\n",
+ "**Real-life scenario:** Track monthly travel trends and performance
metrics in a metropolitan area with seasonal analysis.\n",
+ "\n",
+ "This query analyzes taxi and rideshare trip patterns around Sedona,
Arizona, by grouping trips into monthly summaries. It looks at all trips that
started within a 15-kilometer area around Sedona (a 10km box plus 5km buffer)
and calculates key statistics for each month, including total number of trips,
average trip distance, average trip duration, and average fare. The results are
organized chronologically by month, allowing you to see seasonal trends and
changes in ride patterns over time in the Sedona area.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Distance-based spatial filtering (ST_DWithin) with buffer\n",
+ "2. Temporal grouping (monthly aggregation)\n",
+ "3. Multiple statistical aggregations on spatially filtered data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "eb5d6347-47dd-48af-adb5-929838954ac4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+
"┌──────────────────────┬─────────────┬───────────────────┬─────────────────────┬───────────────────┐\n",
+ "│ pickup_month ┆ total_trips ┆ avg_distance ┆
avg_duration ┆ avg_fare │\n",
+ "│ timestamp(milliseco… ┆ int64 ┆ decimal128(19, 9) ┆
duration(milliseco… ┆ decimal128(19, 9) │\n",
+
"╞══════════════════════╪═════════════╪═══════════════════╪═════════════════════╪═══════════════════╡\n",
+ "│ 1992-04-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1
hours 23 … ┆ 0.000075000 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 1992-07-01T00:00:00 ┆ 1 ┆ 0.000010000 ┆ 0 days 0
hours 58 … ┆ 0.000040000 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 1994-02-01T00:00:00 ┆ 2 ┆ 0.000020000 ┆ 0 days 1
hours 23 … ┆ 0.000050000 │\n",
+
"└──────────────────────┴─────────────┴───────────────────┴─────────────────────┴───────────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "SELECT\n",
+ " DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,
COUNT(t.t_tripkey) AS total_trips,\n",
+ " AVG(t.t_distance) AS avg_distance, AVG(t.t_dropofftime -
t.t_pickuptime) AS avg_duration,\n",
+ " AVG(t.t_fare) AS avg_fare\n",
+ "FROM trip t\n",
+ "WHERE ST_DWithin(\n",
+ " ST_GeomFromWKB(t.t_pickuploc),\n",
+ " ST_GeomFromText('POLYGON((-111.9060 34.7347, -111.6160 34.7347,
-111.6160 35.0047, -111.9060 35.0047, -111.9060 34.7347))'), -- 10km bounding
box around Sedona\n",
+ " 0.045 -- Additional 5km buffer\n",
+ ")\n",
+ "GROUP BY pickup_month\n",
+ "ORDER BY pickup_month\n",
+ "\"\"\").show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b516ed52-8cf2-4c17-bc33-00002e22b287",
+ "metadata": {},
+ "source": [
+ "## Q4: Zone distribution of top 1000 trips by tip amount\n",
+ "\n",
+ "**Real-life scenario:** Analyze the geographic distribution of high-value
trips (by tip amount) to understand premium service areas.\n",
+ "\n",
+ "This query identifies which neighborhoods or zones produced the most
generous tippers by analyzing the top 1000 highest-tipping trips. It first
finds the 1000 trips with the largest tips, then determines which geographic
zones or neighborhoods those pickup locations fall within, and counts how many
of these high-tip trips originated from each area. The results show a ranking
of zones by the number of big tippers they produced, helping identify the most
lucrative pickup areas for drivers seeking high-tip rides.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Subquery with ordering and limiting\n",
+ "2. Point-in-polygon spatial join (ST_Within)\n",
+ "3. Aggregation on spatially joined results\n",
+ "4. Multi-step query with spatial filtering and grouping"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "f8dcf328-a7f2-49be-9383-afa11766f871",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "┌───────────┬─────────────────────────────────┬────────────┐\n",
+ "│ z_zonekey ┆ z_name ┆ trip_count │\n",
+ "│ int64 ┆ utf8view ┆ int64 │\n",
+ "╞═══════════╪═════════════════════════════════╪════════════╡\n",
+ "│ 65008 ┆ Ndélé ┆ 35 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 147530 ┆ 乐山市 ┆ 27 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 150276 ┆ 锡林郭勒盟 ᠰᠢᠯᠢ ᠶᠢᠨ ᠭᠣᠣᠯ ᠠᠶᠢᠮᠠᠭ ┆ 19 │\n",
+ "└───────────┴─────────────────────────────────┴────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "SELECT z.z_zonekey, z.z_name, COUNT(*) AS trip_count\n",
+ "FROM\n",
+ " zone z\n",
+ " JOIN (\n",
+ " SELECT t.t_pickuploc\n",
+ " FROM trip t\n",
+ " ORDER BY t.t_tip DESC, t.t_tripkey ASC\n",
+ " LIMIT 1000 -- Replace 1000 with x (how many top tips you want)\n",
+ " ) top_trips ON ST_Within(ST_GeomFromWKB(top_trips.t_pickuploc),
ST_GeomFromWKB(z.z_boundary))\n",
+ "GROUP BY z.z_zonekey, z.z_name\n",
+ "ORDER BY trip_count DESC, z.z_zonekey ASC\n",
+ "\"\"\").show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ad294223-742c-4b6b-a9ec-036723e67a1f",
+ "metadata": {},
+ "source": [
+ "## Q5: Monthly travel patterns for repeat customers (convex hull of
dropoff locations)\n",
+ "\n",
+ "**Real-life scenario:** Analyze the geographic spread of travel patterns
for frequent customers to understand their mobility behavior. \n",
+ "\n",
+ "This query analyzes the monthly travel patterns of frequent customers by
measuring how much geographic area they cover with their trips. For each
customer who took more than five trips in a month, it calculates the size of
the \"travel hull\" - the area enclosed by connecting all their dropoff
locations that month. The results reveal which customers have the most
expansive travel patterns, helping to identify power users who cover large
geographic areas versus those who stick to smaller, local areas.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Spatial aggregation (ST_Collect/ARRAY_AGG)\n",
+ "2. Convex hull computation (ST_ConvexHull)\n",
+ "3. Area calculation on complex geometries\n",
+ "4. Temporal and customer-based grouping with spatial operations"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "8c446f9a-3734-4a8d-87a8-0d9d6d34fd5c",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+
"┌───────────┬────────────────────┬────────────────────────────┬────────────────────┬───────────────┐\n",
+ "│ c_custkey ┆ customer_name ┆ pickup_month ┆
monthly_travel_hul ┆ dropoff_count │\n",
+ "│ int64 ┆ utf8view ┆ timestamp(millisecond, no… ┆
l_area… ┆ int64 │\n",
+
"╞═══════════╪════════════════════╪════════════════════════════╪════════════════════╪═══════════════╡\n",
+ "│ 12049 ┆ Customer#000012049 ┆ 1997-10-01T00:00:00 ┆
22290.72518508209 ┆ 18 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 12091 ┆ Customer#000012091 ┆ 1997-08-01T00:00:00 ┆
18901.827089935185 ┆ 18 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 13231 ┆ Customer#000013231 ┆ 1994-09-01T00:00:00 ┆
24930.673726244015 ┆ 18 │\n",
+
"└───────────┴────────────────────┴────────────────────────────┴────────────────────┴───────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "SELECT\n",
+ " c.c_custkey, c.c_name AS customer_name,\n",
+ " DATE_TRUNC('month', t.t_pickuptime) AS pickup_month,\n",
+ " ST_Area(ST_ConvexHull(ST_Collect(ST_GeomFromWKB(t.t_dropoffloc)))) AS
monthly_travel_hull_area,\n",
+ " COUNT(*) as dropoff_count\n",
+ "FROM trip t JOIN customer c ON t.t_custkey = c.c_custkey\n",
+ "GROUP BY c.c_custkey, c.c_name, pickup_month\n",
+ "HAVING dropoff_count > 5 -- Only include repeat customers for meaningful
hulls\n",
+ "ORDER BY dropoff_count DESC, c.c_custkey ASC\n",
+ "\"\"\").show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8501daf5-5e4a-46d5-86c0-c3f6ea8673ee",
+ "metadata": {},
+ "source": [
+ "## Q6: Zone statistics for trips within a 50km radius of the Sedona city
center\n",
+ "\n",
+ "**Real-life scenario:** Analyze trip patterns in zones within a
metropolitan area around a specific city center.\n",
+ "\n",
+ "This query analyzes ride activity across all neighborhoods and zones
within a 50-kilometer area around Sedona, Arizona. It identifies which zones
had the most pickup activity by counting total trips that started in each
region. Also, it calculates the average trip cost and duration for rides
originating from each zone. The results are ranked by pickup volume, showing
which neighborhoods or areas generate the most ride demand and their typical
trip characteristics within the greater Sedona region.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Polygon containment check (ST_Contains) with bounding box\n",
+ "2. Point-in-polygon spatial join (ST_Within)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "f168c302-5187-45f9-8d87-427bbd68c942",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+
"┌───────────┬─────────────────┬───────────────┬───────────────────┬────────────────────────────────┐\n",
+ "│ z_zonekey ┆ z_name ┆ total_pickups ┆ avg_distance ┆
avg_duration │\n",
+ "│ int64 ┆ utf8view ┆ int64 ┆ decimal128(19, 9) ┆
duration(millisecond) │\n",
+
"╞═══════════╪═════════════════╪═══════════════╪═══════════════════╪════════════════════════════════╡\n",
+ "│ 30084 ┆ Coconino County ┆ 541 ┆ 0.000085323 ┆ 0
days 1 hours 45 mins 16.591… │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 30083 ┆ Yavapai County ┆ 292 ┆ 0.000076643 ┆ 0
days 1 hours 36 mins 43.647… │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 29488 ┆ Gila County ┆ 39 ┆ 0.000065641 ┆ 0
days 1 hours 16 mins 59.769… │\n",
+
"└───────────┴─────────────────┴───────────────┴───────────────────┴────────────────────────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "SELECT\n",
+ " z.z_zonekey, z.z_name,\n",
+ " COUNT(t.t_tripkey) AS total_pickups, AVG(t.t_totalamount) AS
avg_distance,\n",
+ " AVG(t.t_dropofftime - t.t_pickuptime) AS avg_duration\n",
+ "FROM trip t, zone z\n",
+ "WHERE ST_Intersects(ST_GeomFromText('POLYGON((-112.2110 34.4197,
-111.3110 34.4197, -111.3110 35.3197, -112.2110 35.3197, -112.2110 34.4197))'),
ST_GeomFromWKB(z.z_boundary)) -- 50km bounding box around Sedona\n",
+ " AND ST_Within(ST_GeomFromWKB(t.t_pickuploc),
ST_GeomFromWKB(z.z_boundary))\n",
+ "GROUP BY z.z_zonekey, z.z_name\n",
+ "ORDER BY total_pickups DESC, z.z_zonekey ASC\n",
+ "\"\"\").show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fe58f65b-1403-4f91-81a3-e81784bfd3b4",
+ "metadata": {},
+ "source": [
+ "## Q7: Detect potential route detours by comparing reported vs. geometric
distances\n",
+ "\n",
+ "**Real-life scenario:** Identify suspicious trips where the reported
distance significantly exceeds the straight-line distance, potentially
indicating fare manipulation.\n",
+ "\n",
+ "This query analyzes how much taxi and rideshare trips deviate from the
most direct route by comparing the actual reported trip distance to the
straight-line distance between pickup and dropoff points. It calculates a
\"detour ratio\" that shows how much longer the actual route was compared to
flying in a straight line. For example, a ratio of 1.5 means the trip was 50%
longer than the direct path. The results are sorted to show the trips with the
highest detour ratios first, helping identify routes that took significant
detours due to traffic, road layouts, or other factors.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Line geometry creation (ST_MakeLine)\n",
+ "2. Length calculation (ST_Length)\n",
+ "3. Coordinate system conversion and distance calculations\n",
+ "4. Ratio-based filtering on geometric vs. reported measurements"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "3a8cddde-c04b-49d4-ab69-3d75d9a535ff",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+
"┌───────────┬─────────────────────┬────────────────────┬──────────────────────┐\n",
+ "│ t_tripkey ┆ reported_distance_m ┆ line_distance_m ┆
detour_ratio │\n",
+ "│ int64 ┆ decimal128(15, 5) ┆ float64 ┆ float64
│\n",
+
"╞═══════════╪═════════════════════╪════════════════════╪══════════════════════╡\n",
+ "│ 4688563 ┆ 0.00010 ┆ 11111.126052681648 ┆
8.99998789734414e-9 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 2380123 ┆ 0.00010 ┆ 11111.126095065882 ┆
8.999987863013003e-9 │\n",
+
"├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 3077131 ┆ 0.00010 ┆ 11111.126138581423 ┆
8.99998782776551e-9 │\n",
+
"└───────────┴─────────────────────┴────────────────────┴──────────────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "WITH trip_lengths AS (\n",
+ " SELECT\n",
+ " t.t_tripkey,\n",
+ " t.t_distance AS reported_distance_m,\n",
+ " ST_Length(\n",
+ " ST_MakeLine(\n",
+ " ST_GeomFromWKB(t.t_pickuploc),\n",
+ " ST_GeomFromWKB(t.t_dropoffloc)\n",
+ " )\n",
+ " ) / 0.000009 AS line_distance_m -- 1 meter = 0.000009 degree\n",
+ " FROM trip t\n",
+ ")\n",
+ "SELECT\n",
+ " t.t_tripkey,\n",
+ " t.reported_distance_m,\n",
+ " t.line_distance_m,\n",
+ " t.reported_distance_m / NULLIF(t.line_distance_m, 0) AS
detour_ratio\n",
+ "FROM trip_lengths t\n",
+ "ORDER BY detour_ratio DESC NULLS LAST, reported_distance_m DESC,
t_tripkey ASC\n",
+ "\"\"\").show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3040b853-100b-4c9d-8447-5b9bf4c4a895",
+ "metadata": {},
+ "source": [
+ "## Q8: Count nearby pickups for each building within a 500m radius\n",
+ "\n",
+ "**Real-life scenario:** Count how many trips start within 500 meters of
each building.\n",
+ "\n",
+ "This query identifies which buildings generate the most taxi and
rideshare pickup activity by counting trips that started within 500 meters of
each building. It analyzes the relationship between specific buildings (like
hotels, shopping centers, airports, or office buildings) and ride demand in
their immediate vicinity. The results are ranked to show which buildings are
the biggest trip generators, helping identify key pickup hotspots and
understand how different types of buildings drive transportation demand.\n",
+ "\n",
+ "**Spatial query characteristics tested:**\n",
+ "\n",
+ "1. Distance spatial join between points and polygons\n",
+ "2. Aggregation on spatial join result"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "58ac6e6a-2e3d-4dd7-bb03-1b16f55fc186",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "┌───────────────┬──────────┬─────────────────────┐\n",
+ "│ b_buildingkey ┆ b_name ┆ nearby_pickup_count │\n",
+ "│ int64 ┆ utf8view ┆ int64 │\n",
+ "╞═══════════════╪══════════╪═════════════════════╡\n",
+ "│ 3779 ┆ linen ┆ 42 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 19135 ┆ misty ┆ 36 │\n",
+ "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n",
+ "│ 4416 ┆ sienna ┆ 26 │\n",
+ "└───────────────┴──────────┴─────────────────────┘\n"
+ ]
+ }
+ ],
+ "source": [
+ "sd.sql(\"\"\"\n",
+ "SELECT b.b_buildingkey, b.b_name, COUNT(*) AS nearby_pickup_count\n",
+ "FROM trip t\n",
+ "JOIN building b\n",
+ "ON ST_DWithin(ST_GeomFromWKB(t.t_pickuploc),
ST_GeomFromWKB(b.b_boundary), 0.0045) -- ~500m\n",
Review Comment:
The distance threshold 0.0045 (representing ~500m) is a magic number that
lacks clear documentation of the unit conversion. This makes the query
difficult to understand and modify. Consider defining this as a named constant
or adding clearer documentation about the coordinate system and units.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]