(sedona) branch master updated: [DOCS] sedona distance (#1986)

jiayu Wed, 11 Jun 2025 03:01:55 -0700

This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git



The following commit(s) were added to refs/heads/master by this push:
     new fb9c495f46 [DOCS] sedona distance (#1986)
fb9c495f46 is described below

commit fb9c495f46950b37dfadc94cf325eac21c9e081e
Author: Matthew Powers <[email protected]>
AuthorDate: Wed Jun 11 05:52:40 2025 -0400

    [DOCS] sedona distance (#1986)
    
    * [DOCS] add page on distance computations with spark
    
    * lint
    
    * make concepts pages specific to spark
---
 docs/image/tutorial/concepts/distance1.png      | Bin 0 -> 67423 bytes
 docs/image/tutorial/concepts/distance2.png      | Bin 0 -> 33315 bytes
 docs/image/tutorial/concepts/distance3.png      | Bin 0 -> 36178 bytes
 docs/image/tutorial/concepts/distance4.png      | Bin 0 -> 42298 bytes
 docs/image/tutorial/concepts/distance5.png      | Bin 0 -> 31026 bytes
 docs/tutorial/concepts/clustering-algorithms.md |   4 +-
 docs/tutorial/concepts/distance-spark.md        | 280 ++++++++++++++++++++++++
 docs/tutorial/concepts/spatial-joins.md         |  18 +-
 mkdocs.yml                                      |   1 +
 9 files changed, 292 insertions(+), 11 deletions(-)

diff --git a/docs/image/tutorial/concepts/distance1.png 
b/docs/image/tutorial/concepts/distance1.png
new file mode 100644
index 0000000000..4dd92363c0
Binary files /dev/null and b/docs/image/tutorial/concepts/distance1.png differ
diff --git a/docs/image/tutorial/concepts/distance2.png 
b/docs/image/tutorial/concepts/distance2.png
new file mode 100644
index 0000000000..e9760ed217
Binary files /dev/null and b/docs/image/tutorial/concepts/distance2.png differ
diff --git a/docs/image/tutorial/concepts/distance3.png 
b/docs/image/tutorial/concepts/distance3.png
new file mode 100644
index 0000000000..e3b292e06e
Binary files /dev/null and b/docs/image/tutorial/concepts/distance3.png differ
diff --git a/docs/image/tutorial/concepts/distance4.png 
b/docs/image/tutorial/concepts/distance4.png
new file mode 100644
index 0000000000..660a6433db
Binary files /dev/null and b/docs/image/tutorial/concepts/distance4.png differ
diff --git a/docs/image/tutorial/concepts/distance5.png 
b/docs/image/tutorial/concepts/distance5.png
new file mode 100644
index 0000000000..1499aa0f2d
Binary files /dev/null and b/docs/image/tutorial/concepts/distance5.png differ
diff --git a/docs/tutorial/concepts/clustering-algorithms.md 
b/docs/tutorial/concepts/clustering-algorithms.md
index 460e252ad6..d7bf792961 100644
--- a/docs/tutorial/concepts/clustering-algorithms.md
+++ b/docs/tutorial/concepts/clustering-algorithms.md
@@ -17,7 +17,7 @@
  under the License.
  -->
 
-# Apache Sedona Clustering Algorithms
+# Apache Sedona Clustering Algorithms with Apache Spark
 
 Clustering algorithms group similar data points into “clusters.”  Apache 
Sedona can run clustering algorithms on large geometric datasets.
 
@@ -28,7 +28,7 @@ Note that the term cluster is overloaded here:
 
 This page uses “cluster” to refer to the output of a clustering algorithm.
 
-## Clustering with DBSCAN
+## Clustering with DBSCAN and Spark
 
 This page explains how to use Apache Sedona to perform density-based spatial 
clustering of applications with noise (“DBSCAN”).
 
diff --git a/docs/tutorial/concepts/distance-spark.md 
b/docs/tutorial/concepts/distance-spark.md
new file mode 100644
index 0000000000..52c52e144c
--- /dev/null
+++ b/docs/tutorial/concepts/distance-spark.md
@@ -0,0 +1,280 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Compute distance with Sedona and Apache Spark
+
+This post explains how to compute the distance between two points or geometric 
objects using Apache Sedona and Apache Spark.
+
+You will learn how to compute the distance on a two-dimensional Cartesian 
plane and how to calculate distance for geospatial data, taking into account 
the curvature of the Earth.
+
+Let’s start with an example of how to compute the distance between two points in a two-dimensional Cartesian plane.
+
+## Distance between two points with Spark and Sedona
+
+Suppose you have four points and would like to compute the distance between 
`point_a` and `point_b` and the distance between `point_c` and `point_d`.
+
+![distance between points](../../image/tutorial/concepts/distance1.png)
+
+Let’s create a DataFrame with these points.
+
+```python
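+from shapely.geometry import Point
+
+# `sedona` is assumed to be an existing Sedona-enabled Spark session
+df = sedona.createDataFrame([
+    (Point(2, 3), Point(6, 4)),
+    (Point(6, 2), Point(9, 2)),
+], ["start", "end"])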
+```
+
+The `start` and `end` columns both have the `geometry` type.
+
+Now use the `ST_Distance` function to compute the distance between the points.
+
+```python
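+from pyspark.sql.functions import col
+from sedona.sql.st_functions import ST_Distance
+
+# Planar (Euclidean) distance, in the units of the input coordinates
+df.withColumn(
+    "distance",
+    ST_Distance(col("start"), col("end"))
+).show()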
+```
+
+Here are the results:
+
+```
++-----------+-----------+-----------------+
+|      start|        end|         distance|
++-----------+-----------+-----------------+
+|POINT (2 3)|POINT (6 4)|4.123105625617661|
+|POINT (6 2)|POINT (9 2)|              3.0|
++-----------+-----------+-----------------+
+```
+
+`ST_Distance` returns the planar (Euclidean) distance between the two points, in the same units as the input coordinates.  For the first row, that is sqrt((6 - 2)^2 + (4 - 3)^2) = sqrt(17), or about 4.123.
+
+## Distance between two longitude/latitude points with Spark and Sedona
+
+Let’s create two longitude/latitude points and compute the distance between 
them.  Start by creating a DataFrame with the longitude and latitude values.
+
+```python
+seattle = Point(-122.335167, 47.608013)
+new_york = Point(-73.935242, 40.730610)
+sydney = Point(151.2, -33.9)
+df = sedona.createDataFrame(
+    [
+        (seattle, new_york),
+        (seattle, sydney),
+    ],
+    ["place1", "place2"],
+)
+```
+
+Let’s compute the distance between these points now:
+
+```python
+df.withColumn(
+    "st_distance_sphere",
+    ST_DistanceSphere(col("place1"), col("place2"))
+).show()
+```
+
+Here are the results:
+
+```
++--------------------+--------------------+--------------------+
+|              place1|              place2|  st_distance_sphere|
++--------------------+--------------------+--------------------+
+|POINT (-122.33516...|POINT (-73.935242...|  3870075.7867602874|
+|POINT (-122.33516...| POINT (151.2 -33.9)|1.2473172370818963E7|
++--------------------+--------------------+--------------------+
+```
+
+We use the `ST_DistanceSphere` function to calculate the distance, taking into 
account the Earth's curvature.  The function returns the distance in meters.
+
+Let’s see how to compute the distance between two points with a spheroid model 
of the Earth.
+
+## Compute distance between points with a spheroid with Spark and Sedona
+
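+Let’s use the same points from the previous section, but compute the distance using a spheroid model of the world.  The `select` call below assumes the DataFrame also has `place1_name` and `place2_name` label columns, which the DataFrame created in the previous section does not include.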
+
+```python
+res = df.withColumn(
+    "st_distance_spheroid",
+    ST_DistanceSpheroid(col("place1"), col("place2"))
+)
+res.select("place1_name", "place2_name", "st_distance_spheroid").show()
+```
+
+Here are the results:
+
+```
++-----------+-----------+--------------------+
+|place1_name|place2_name|st_distance_spheroid|
++-----------+-----------+--------------------+
+|    seattle|   new_york|  3880173.4858397646|
+|    seattle|     sydney|1.2456531875384018E7|
++-----------+-----------+--------------------+
+```
+
+The `ST_DistanceSpheroid` function returns the distance in meters between the two locations.  The spheroid computation yields results close to those of the sphere model (within about 0.3% for these two pairs).  Expect the spheroid function to return results that are slightly more accurate because it accounts for the Earth's flattening.
+
+## Distance between two geometric objects with Spark and Sedona
+
+Let’s take a look at how to compute the distance between a linestring and a 
polygon.  Suppose you have the following objects:
+
+![distance between objects](../../image/tutorial/concepts/distance2.png)
+
+The distance between two geometries is the minimum Euclidean distance between any point of one geometry and any point of the other.
+
+Let’s compute the distance:
+
+```python
+res = df.withColumn(
+    "distance",
+    ST_Distance(col("geom1"), col("geom2"))
+)
+```
+
+Now, take a look at the results:
+
+```
++---+---+--------+
+|id1|id2|distance|
++---+---+--------+
+|a  |b  |2.0     |
++---+---+--------+
+```
+
+You can readily see the minimum distance between the two objects in the graph.
+
+## Three-dimensional minimum Cartesian distance
+
+Let’s take a look at how to compute the distance between two points, factoring 
in the elevation of the points.
+
+We will examine the distance between someone standing on top of the Empire 
State Building and someone at sea level.
+
+Let’s create the DataFrame:
+
+```python
+empire_state_ground = Point(-73.9857, 40.7484, 0)
+empire_state_top = Point(-73.9857, 40.7484, 380)
+df = sedona.createDataFrame([
+    (empire_state_ground, empire_state_top),
+], ["point_a", "point_b"])
+```
+
+Now compute the distance and the 3D distance between the points:
+
+```python
+res = df.withColumn(
+    "distance",
+    ST_Distance(col("point_a"), col("point_b"))
+).withColumn(
+    "3d_distance",
+    ST_3DDistance(col("point_a"), col("point_b"))
+)
+```
+
+Take a look at the results:
+
+```
++--------------------+--------------------+--------+-----------+
+|             point_a|             point_b|distance|3d_distance|
++--------------------+--------------------+--------+-----------+
+|POINT (-73.9857 4...|POINT (-73.9857 4...|     0.0|      380.0|
++--------------------+--------------------+--------+-----------+
+```
+
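+`ST_Distance` ignores the z coordinate, so it returns 0.0 for two points with the same x and y.  `ST_3DDistance` includes the z coordinate and returns 380.0.  Both functions are Cartesian, so the 3D result is only meaningful here because the two points share the same longitude and latitude and differ only in elevation.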
+
+## Compute Frechet distance with Spark and Sedona
+
+Let’s create a Sedona DataFrame with the following linestrings:
+
+![frechet distance](../../image/tutorial/concepts/distance3.png)
+
+Here’s how to create the Sedona DataFrame:
+
+```python
+from shapely.geometry import LineString
+
+a = LineString([(1, 1), (1, 3), (2, 4)])
+b = LineString([(1.1, 1), (1.1, 3), (3, 4)])
+c = LineString([(7, 1), (7, 3), (6, 4)])
+df = sedona.createDataFrame([
+    (a, "a", b, "b"),
+    (a, "a", c, "c"),
+], ["geometry1", "geometry1_id", "geometry2", "geometry2_id"])
+```
+
+Compute the Frechet distance:
+
+```python
+res = df.withColumn(
+    "frechet_distance",
+    ST_FrechetDistance(col("geometry1"), col("geometry2"))
+)
+```
+
+Now view the results:
+
+```
+res.select("geometry1_id", "geometry2_id", "frechet_distance").show()
+
++------------+------------+----------------+
+|geometry1_id|geometry2_id|frechet_distance|
++------------+------------+----------------+
+|           a|           b|             1.0|
+|           a|           c|             6.0|
++------------+------------+----------------+
+```
+
+This image visualizes the distances so you have a better intuition for the 
algorithm:
+
+![frechet distance](../../image/tutorial/concepts/distance4.png)
+
+## Compute the max distance between geometries with Spark and Sedona
+
+Suppose you have the following geometric objects:
+
+![distance geometric objects](../../image/tutorial/concepts/distance5.png)
+
+Here’s how to compute the max distance between some of these geometries.  Run 
the computations:
+
+```python
+res = df.withColumn(
+    "max_distance",
+    ST_MaxDistance(col("geom1"), col("geom2"))
+)
+```
+
+Now view the results:
+
+```
+res.select("id1", "id2", "max_distance").show(truncate=False)
+
++---+---+-----------------+
+|id1|id2|max_distance     |
++---+---+-----------------+
+|a  |b  |8.246211251235321|
+|a  |c  |7.615773105863909|
++---+---+-----------------+
+```
+
+`ST_MaxDistance` returns the largest distance between any point of one geometry and any point of the other.
+
+## Conclusion
+
+Sedona supports several kinds of distance computation: planar Euclidean distance, distance on spherical and spheroidal models of the Earth, 3D distance that factors in elevation, and measures between whole geometries such as the Frechet distance and the max distance.
+
+Ensure you use the distance function that best suits your analysis.
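
As a rough cross-check of the `ST_DistanceSphere` figures in distance-spark.md above, the great-circle distance can be approximated in plain Python with the haversine formula. This is a sketch under assumptions, not Sedona's implementation: the helper name `haversine_m` and the mean Earth radius of 6,371,008 m are chosen here for illustration, so small differences from the values in the table are to be expected.

```python
import math

# Assumed mean Earth radius in meters; Sedona's internal constant may differ.
EARTH_RADIUS_M = 6_371_008.0


def haversine_m(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_M):
    """Great-circle distance in meters between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius * math.asin(math.sqrt(a))


# Same (longitude, latitude) pairs as in the documentation example
seattle = (-122.335167, 47.608013)
new_york = (-73.935242, 40.730610)
sydney = (151.2, -33.9)

print(haversine_m(*seattle, *new_york))
print(haversine_m(*seattle, *sydney))
```

Both printed values should land close to the `st_distance_sphere` column shown in the diff, which is a quick way to confirm that the inputs are in longitude/latitude order and that the result is in meters.
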
diff --git a/docs/tutorial/concepts/spatial-joins.md 
b/docs/tutorial/concepts/spatial-joins.md
index e58f30ee3e..68caec5ce3 100644
--- a/docs/tutorial/concepts/spatial-joins.md
+++ b/docs/tutorial/concepts/spatial-joins.md
@@ -17,13 +17,13 @@
  under the License.
  -->
 
-# Apache Sedona Spatial Joins
+# Apache Sedona Spatial Joins with Spark
 
 This post explains how to perform spatial joins with Apache Sedona. You will 
learn about the different types of spatial joins and how to run them 
efficiently.
 
 This page provides basic examples that clearly illustrate the key conceptual 
points of spatial joins.  It also elaborates on spatial join concepts for 
real-world-sized datasets and highlights key performance enhancements.
 
-## Spatial join within
+## Spatial join within using Spark
 
 Look at the following graph containing three points and two polygons.  
`point_b` is within `polygon_y`, `point_c` is within `polygon_x`, and `point_a` 
isn’t within any polygon.
 
@@ -102,7 +102,7 @@ Here’s the same result:
 +--------+----------+
 ```
 
-## Spatial join crosses
+## Spatial join crosses with Spark
 
 Look at the following graph containing one polygon and two lines.  `line_a` 
and `line_b` cross `polygon_x`.  `line_c` does not cross `polygon_x`.
 
@@ -132,7 +132,7 @@ Here is the result:
 
 A spatial join with `ST_Crosses` lets us identify the lines that cross the 
polygon.
 
-## Spatial join with touches
+## Spatial join with touches using Spark
 
 Suppose you have a polygon and two lines.  `line_a` does not touch the 
polygon, and `line_b` does touch the polygon.  See the following diagram:
 
@@ -188,7 +188,7 @@ Here’s the result of the join:
 
 Now, let’s look at running a join to see if points are within a polygon.
 
-## Spatial join overlaps
+## Spatial join overlaps with Spark
 
 The following diagram shows two polygons and a few shapes.  `polygon_a` 
overlaps `polygon_x`.  Neither `line_b`, `line_c`, or `point_d` overlap with 
`polygon_y` or `polygon_x`.
 
@@ -217,7 +217,7 @@ Here is the result:
 +--------+----------+
 ```
 
-## Spatial join K-nearest neighbors (KNN spatial join)
+## Spatial join K-nearest neighbors (KNN spatial join) with Spark
 
 Suppose you have tables with addresses and coffee shop locations.  You’d like 
to find the two nearest coffee shops to each address.
 
@@ -278,7 +278,7 @@ Here’s a visualization of the results:
 
 You can easily see the coffee shops that are closest to each address.
 
-## Spatial distance join
+## Spatial distance join with Spark
 
 Look at the following graph, which shows a point and different transit 
stations. Let’s perform a spatial join to find all the transit stations within 
2.5 units of the point.
 
@@ -335,7 +335,7 @@ Sedona is an excellent tool for finding locations within a 
certain distance from
 
 Sedona uses the Euclidean distance between two objects, so the distance is expressed in the units of the CRS of the original coordinates. To operate directly on WGS84 coordinates with distances in meters, you should use `ST_DistanceSphere`, `ST_DistanceSpheroid`, or `ST_DWithin(useSpheroid = true)`.
 
-## Spatial range join
+## Spatial range join with Spark
 
 All joins triggered by `ST_Intersects`, `ST_Contains`, `ST_Within`, 
`ST_DWithin`, `ST_Touches`, and `ST_Crosses` are considered a range join.  This 
section illustrates another range join, but we've already covered several range 
joins on this page.
 
@@ -393,7 +393,7 @@ Here are the results:
 
 Range joins are helpful in many practical applications.
 
-## Spatial join optimizations
+## Spatial join optimizations for Sedona and Apache Spark
 
 You can optimize spatial joins by using better file formats, indexing your 
data, or optimizing your queries.
 
diff --git a/mkdocs.yml b/mkdocs.yml
index e94215f65c..d86ed8f82d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -71,6 +71,7 @@ nav:
           - Concepts:
               - Spatial Joins: tutorial/concepts/spatial-joins.md
               - Clustering Algorithms: 
tutorial/concepts/clustering-algorithms.md
+              - Distance: tutorial/concepts/distance-spark.md
           - Map visualization SQL app:
               - Scala/Java: tutorial/viz.md
               - Use Apache Zeppelin: tutorial/zeppelin.md
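
The Frechet results in distance-spark.md above (1.0 and 6.0) can be reproduced with the discrete Frechet distance, which only considers the vertices of each linestring. The sketch below is an illustration under that assumption. The helper name `discrete_frechet` is hypothetical, and `ST_FrechetDistance` is not guaranteed to agree with it on other inputs.

```python
import math


def discrete_frechet(p, q):
    """Discrete Frechet distance between two vertex sequences of (x, y) tuples."""
    n, m = len(p), len(q)
    # ca[i][j] holds the discrete Frechet distance between p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                # Only q can advance along the first row
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                # Only p can advance along the first column
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Cheapest way to reach (i, j): advance p, q, or both
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Vertex lists of the linestrings a, b, and c from the documentation example
a = [(1, 1), (1, 3), (2, 4)]
b = [(1.1, 1), (1.1, 3), (3, 4)]
c = [(7, 1), (7, 3), (6, 4)]

print(discrete_frechet(a, b))
print(discrete_frechet(a, c))
```

For `a` and `b`, the final vertices (2, 4) and (3, 4) must be paired and are 1 unit apart, which sets the result. For `a` and `c`, the first vertices (1, 1) and (7, 1) must be paired and are 6 units apart.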
