petern48 commented on issue #2570:
URL: https://github.com/apache/sedona/issues/2570#issuecomment-3700841643

   Hmm, ok. So we can get the matches easily with the traditional spatial join, 
but the challenge lies in finding all of the non-matches on the left side in a 
scalable way (i.e. we can't use the same approach implemented in 
`BroadcastIndexJoinExec`). Spark traditionally will try a SortMergeJoin if the 
join key is linearly sortable, but spatial data is not, so that doesn't work 
for us. It instead falls back on a `BroadcastNestedLoopJoinExec`.
   
   I think at a theoretical level, there isn't a better way to extend support 
to outer joins aside from the workaround mentioned in the 
[PR](https://github.com/apache/sedona/pull/2561): post-filter the results after 
the spatial inner join. I think it would technicall possible to implement that 
natively in the query optimizer, but two complications come to mind:
   
   1. We can't assume an id already exists in the table (which is needed to 
determine if there are no matches across all partitions. We could technically 
first generate a 
[monotonically_increasing_id()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.monotonically_increasing_id.html)
 column, though that would be expensive too (sorting required). Not sure how 
easy it would be to auto-detect an unique id column from code.
   
   2. This would be a logical plan rewrite based on the physical plan, since we 
don't want to modify behavior for cases small enough to use 
`BroadcastIndexJoinExec`.
   
   ```
   Generate PhysicalPlan from LogicalPlan (normal behavior)
   If JoinExec is not BroadcastIndexJoinExec,
       Then perform the above logical plan rewrite.
       Then, go back and regenerate the physical plan from the new Logical Plan
   ```
   
   Anyways, I think that implementation is more complicated than it is worth, 
for now. Didn't think about it too much when I first created the issue. Closing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to