This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new 16caf71556 [DOCS] Add spatial left join docs solution (#2561)
16caf71556 is described below
commit 16caf71556f5ca4fefc8a2c54b4ed20ece5da87d
Author: Alexey <[email protected]>
AuthorDate: Tue Dec 16 19:44:23 2025 +0100
[DOCS] Add spatial left join docs solution (#2561)
---
docs/api/sql/Optimizer.md | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/docs/api/sql/Optimizer.md b/docs/api/sql/Optimizer.md
index aa74e2b4bc..ca14054b1e 100644
--- a/docs/api/sql/Optimizer.md
+++ b/docs/api/sql/Optimizer.md
@@ -132,7 +132,7 @@ DistanceJoin pointshape1#12: geometry, pointshape2#33:
geometry, 2.0, true
Spark SQL Example for meter-based geodesic distance `ST_DistanceSpheroid`
(works for `ST_DistanceSphere` too):
-*Less than a certain distance==*
+*==Less than a certain distance==*
```sql
SELECT *
@@ -140,7 +140,7 @@ FROM pointdf1, pointdf2
WHERE ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) < 2
```
-*Less than or equal to a certain distance==*
+*==Less than or equal to a certain distance==*
```sql
SELECT *
@@ -151,6 +151,47 @@ WHERE
ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) <= 2
!!!warning
If you use `ST_DistanceSpheroid` or `ST_DistanceSphere` as the
predicate, the unit of the distance is meter. Currently, distance join with
geodesic distance calculators work best for point data. For non-point data, it
only considers their centroids.
+## Spatial Left Join
+
+Introduction: Perform a left join using the spatial performance of a range or
distance join.
+This allows to find geometries from A and B matching the join criteria while
also keeping those entries from A that do not match any geometry in B.
+
+Range and distance joins ==do not support== a LEFT JOIN as below:
+
+```sql
+SELECT a.*, b.* FROM a
+LEFT JOIN b ON ST_INTERSECTS(a.geometry, b.geometry)
+```
+
+This will lead to a **BroadcastIndexJoin** which can become very inefficient
with two large datasets.
+Otherwise, the **BroadcastNestedLoopJoin** is triggered which is the slowest
option.
+
+In order to make use of Sedona's spatial join performance, it is possible to
produce the result of a left join by combining an INNER JOIN with a LEFT JOIN.
+
+1. With the inner join, we collect the ID from the left side and all the
required columns from the right side (consider the result as **A'**)
+2. In the second step, we combine the left side A with the result of the inner
join **A'**.
+ All the entries of A are kept as they are, while the entries of right side
B are forwarded through **A'**.
+
+```sql
+WITH inner_join AS (
+ SELECT
+ dfA.a_id
+ , dfB.b_id
+ FROM dfA, dfB
+ WHERE ST_INTERSECTS(dfA.geometry, dfB.geometry)
+)
+
+SELECT
+ dfA.*,
+ inner_join.b_id
+FROM dfA
+LEFT JOIN inner_join
+ ON dfA.a_id = inner_join.a_id;
+```
+
+!!!note
+ One can define this strategy as stored procedure or a DBT macro to
simplify the repeated code.
+
## Broadcast index join
Introduction: Perform a range join or distance join but broadcast one of the
sides of the join. This maintains the partitioning of the non-broadcast side
and doesn't require a shuffle.