zhangfengcdt commented on code in PR #362:
URL: https://github.com/apache/sedona-db/pull/362#discussion_r2578237245
##########
rust/sedona-spatial-join/src/build_index.rs:
##########
@@ -33,7 +33,72 @@ use crate::{
spatial_predicate::SpatialPredicate,
};
-pub(crate) async fn build_index(
+/// Synchronous version of build_index that doesn't spawn tasks
+/// Used in execution contexts without async runtime support (e.g.,
Spark/Comet JNI)
+pub async fn build_index_sync(
Review Comment:
@Kontinuation Yes, I think the function name might be a bit confusing, I can
rename it to `build_index_sequential` to better convey that it avoids task
spawning.
So basically, the reason we need this (instead of using build_index
directly) is because the limitation with JoinSet::spawn() in JNI contexts: when
running in a Spark/Comet JNI context, JoinSet::spawn() fails because:
1. The tokio runtime in JNI contexts may be single-threaded or have
limited threading capabilities
2. JoinSet::spawn() requires a multi-threaded runtime to spawn new tasks
Even with single-partition DataFusion plans (which is typical for Comet),
collect_all still calls JoinSet::spawn():
```
// In collect_all:
let mut join_set = JoinSet::new();
for ... {
join_set.spawn(async move { ... }); // <- This fails in JNI contexts
}
join_set.join_all().await; // Wait for all tasks
```
The issue isn't about parallelism benefits (single partition means nothing
to parallelize anyway), but about JoinSet::spawn() itself not working in the
JNI execution environment.
When I call `build_index` directly in JNI context, it hanged forever. I
guess this might be caused by a deadlock by JoinSet::spawn().
In a JNI context:
1. When JoinSet::spawn() is called, it schedules new tasks to run on the
runtime's worker threads
2. Then join_all().await blocks the current thread waiting for those
spawned tasks
3. But if the current thread IS the only worker thread (or all workers are
blocked), no one can execute the spawned tasks
4. Deadlock: the current thread waits for tasks that need the current
thread to execute
It would be ideal to reuse build_index if there is a workaround for it, but
I also think using a clean `build_index_sequential` makes sense because this is
called in a distributed framework and the parallelism is provided and dictated
by the distributed schedulings themselves.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]