parthchandra commented on PR #3753:
URL:
https://github.com/apache/datafusion-comet/pull/3753#issuecomment-4119586835
Claude analysis (worth reading):
PR #3753: Panic Root Cause Analysis
Summary
The PR bumps the jni crate from 0.21.1 to 0.22.4 and adapts the code to
the new API. The CI shows widespread failures: fast crashes on macOS (7-15
minutes) and timeouts on Linux (6 hours, suggesting hangs/deadlocks). Miri also
fails.
The root cause is in native/jni-bridge/src/lib.rs, in the get_env()
function.
---
The Bug: `get_env()` Creates a Forbidden AttachGuard<'static>
The PR's `get_env()` implementation:
```
pub fn get_env() -> CometResult<AttachGuard<'static>> {
unsafe {
let java_vm = JAVA_VM.get_unchecked();
let mut scope = jni::ScopeToken::default();
let guard = java_vm
.attach_current_thread_guard(Default::default, &mut scope)
.map_err(...)?;
Ok(std::mem::transmute::<AttachGuard<'_>,
AttachGuard<'static>>(guard))
}
}
```
jni 0.22.4's AttachGuard documentation is unambiguous:
▎ 2. An AttachGuard MUST NOT be given a 'static lifetime (e.g. by boxing
or moving into a static variable).
▎ Beware of making it too easy to repeatedly materialize access to a
mutable Env... If this happens the API will panic at runtime.
▎ The Drop implementation will panic if a guard is not dropped in the same
order that it was created, relative to other guards (LIFO order).
The attach_current_thread_guard docs also say:
▎ IMPORTANTLY: Never give the returned guard a 'static lifetime by
creating a 'static ScopeToken.
The PR does exactly this: creates a local ScopeToken, gets back an
AttachGuard<'scope> tied to that local variable, then transmutes to
AttachGuard<'static>. This is explicitly documented as forbidden.
---
Why This Causes Panics
jni 0.22.4 added runtime enforcement via two mechanisms:
Mechanism 1: `Env::assert_top()`
Every call to` Env::new()` runs:
```
assert_eq!(
self.level,
JavaVM::thread_attach_guard_level(),
"attempt to create a new local reference using an Env<'not_top> that
is not \
associated with the top JNI stack frame.\nThis implies that there is
more than \
one mutable Env in scope."
);
```
Every `AttachGuard::from_unowned ` call increments a thread-local
`THREAD_GUARD_NEST_LEVEL`. The Env inside the guard stores this level at
creation time. Any JNI call that creates a new local reference (like
find_class, new_string, etc.) re-checks that the env's stored level still
equals the current thread level. If another guard was created after this
env and hasn't been dropped yet, the check fails and panics.
Mechanism 2: LIFO drop assertion
The Drop impl panics if guards aren't dropped in LIFO order. With 'static
guards that can escape their creation scope, drop ordering is not guaranteed.
---
The Specific Trigger
The try_unwrap_or_throw function calls env1.with_env(f), which internally
creates an AttachGuard at level 1 on the thread. Inside callback f (which runs
DataFusion operators), various places call `JVMClasses::get_env()`, which
creates ANOTHER guard at level 2.
This is fine as long as the level-2 guard is dropped before the level-1
with_env guard. But because `get_env()` returns `AttachGuard<'static>`, the
compiler cannot enforce this. If in any code path the level-2 guard survives
longer (e.g. across an await point, stored in an async future, or just a LIFO
violation), the resulting panic messages are:
- assert_top: "attempt to create a new local reference using an
Env<'not_top>"
- thread_guard_level_pop with level=0: "Spuriously dropped more
AttachGuards than were known to exist"
---
Why Miri Fails
Miri flags the transmute as UB. AttachGuard<'scope> has a lifetime tied to
the local scope: ScopeToken. Transmuting to 'static creates a value whose
lifetime claim is longer than what it was created with. Even though ScopeToken
is a zero-sized type (no actual memory corruption),
Miri's strict lifetime analysis flags it.
---
The Fix
The correct approach in jni 0.22.4 is to use the closure-based APIs, which
guarantee LIFO semantics automatically. The get_env() pattern of returning an
AttachGuard is incompatible with jni 0.22.4's safety model.
The recommended migration path (from jni 0.22.4 docs) is to use
`attach_current_thread_with_config` or `attach_current_thread_for_scope` and
pass `&mut Env` through the call chain, rather than having every function call
`get_env()` independently. Alternatively, use the already-existing
`with_env` pattern (which `try_unwrap_or_throw` correctly adopted) and
thread `&mut Env` as a parameter to helper functions that need JNI access.
Looking at how update_metrics(env: &mut Env, `...),`
`convert_datatype_arrays(env, ...)`, etc. are already threaded through the call
chain in
this PR's other changes, extending that pattern consistently would be the
right fix.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]