parthchandra commented on PR #3753:
URL: 
https://github.com/apache/datafusion-comet/pull/3753#issuecomment-4119586835

   Claude analysis (worth reading):
     PR #3753: Panic Root Cause Analysis
   
     Summary
   
     The PR bumps the jni crate from 0.21.1 to 0.22.4 and adapts the code to 
the new API. The CI shows widespread failures: fast crashes on macOS (7-15 
minutes) and timeouts on Linux (6 hours, suggesting hangs/deadlocks). Miri also 
fails.
   
     The root cause is in native/jni-bridge/src/lib.rs, in the get_env() 
function.
   
     ---
     The Bug: `get_env()` Creates a Forbidden AttachGuard<'static>
   
     The PR's `get_env()` implementation:
   
   ```
     pub fn get_env() -> CometResult<AttachGuard<'static>> {
         unsafe {
             let java_vm = JAVA_VM.get_unchecked();
             let mut scope = jni::ScopeToken::default();
             let guard = java_vm
                 .attach_current_thread_guard(Default::default, &mut scope)
                 .map_err(...)?;
             Ok(std::mem::transmute::<AttachGuard<'_>, 
AttachGuard<'static>>(guard))
         }
     }
   ```
   
     jni 0.22.4's AttachGuard documentation is unambiguous:
   
     ▎ 2. An AttachGuard MUST NOT be given a 'static lifetime (e.g. by boxing 
or moving into a static variable).
   
     ▎ Beware of making it too easy to repeatedly materialize access to a 
mutable Env... If this happens the API will panic at runtime.
   
     ▎ The Drop implementation will panic if a guard is not dropped in the same 
order that it was created, relative to other guards (LIFO order).
   
     The attach_current_thread_guard docs also say:
   
     ▎ IMPORTANTLY: Never give the returned guard a 'static lifetime by 
creating a 'static ScopeToken.
   
     The PR does exactly this: creates a local ScopeToken, gets back an 
AttachGuard<'scope> tied to that local variable, then transmutes to 
AttachGuard<'static>. This is explicitly documented as forbidden.
   
     ---
     Why This Causes Panics
   
     jni 0.22.4 added runtime enforcement via two mechanisms:
   
     Mechanism 1: `Env::assert_top()`
   
     Every call to` Env::new()` runs:
     ```
   assert_eq!(
         self.level,
         JavaVM::thread_attach_guard_level(),
         "attempt to create a new local reference using an Env<'not_top> that 
is not \
          associated with the top JNI stack frame.\nThis implies that there is 
more than \
          one mutable Env in scope."
     );
   ```
   
     Every `AttachGuard::from_unowned ` call increments a thread-local 
`THREAD_GUARD_NEST_LEVEL`. The Env inside the guard stores this level at 
creation time. Any JNI call that creates a new local reference (like 
find_class, new_string, etc.) re-checks that the env's stored level still
     equals the current thread level. If another guard was created after this 
env and hasn't been dropped yet, the check fails and panics.
   
     Mechanism 2: LIFO drop assertion
   
     The Drop impl panics if guards aren't dropped in LIFO order. With 'static 
guards that can escape their creation scope, drop ordering is not guaranteed.
   
     ---
     The Specific Trigger
   
     The try_unwrap_or_throw function calls env1.with_env(f), which internally 
creates an AttachGuard at level 1 on the thread. Inside callback f (which runs 
DataFusion operators), various places call `JVMClasses::get_env()`, which 
creates ANOTHER guard at level 2.
   
     This is fine as long as the level-2 guard is dropped before the level-1 
with_env guard. But because `get_env()` returns `AttachGuard<'static>`, the 
compiler cannot enforce this. If in any code path the level-2 guard survives 
longer (e.g. across an await point, stored in an async future, or just a LIFO 
violation), the resulting panic messages are:
     - assert_top: "attempt to create a new local reference using an 
Env<'not_top>"
     - thread_guard_level_pop with level=0: "Spuriously dropped more 
AttachGuards than were known to exist"
   
     ---
     Why Miri Fails
   
     Miri flags the transmute as UB. AttachGuard<'scope> has a lifetime tied to 
the local scope: ScopeToken. Transmuting to 'static creates a value whose 
lifetime claim is longer than what it was created with. Even though ScopeToken 
is a zero-sized type (no actual memory corruption),
     Miri's strict lifetime analysis flags it.
   
     ---
     The Fix
   
     The correct approach in jni 0.22.4 is to use the closure-based APIs, which 
guarantee LIFO semantics automatically. The get_env() pattern of returning an 
AttachGuard is incompatible with jni 0.22.4's safety model.
   
     The recommended migration path (from jni 0.22.4 docs) is to use 
`attach_current_thread_with_config` or `attach_current_thread_for_scope` and 
pass `&mut Env` through the call chain, rather than having every function call 
`get_env()` independently. Alternatively, use the already-existing
     `with_env` pattern (which `try_unwrap_or_throw` correctly adopted) and 
thread `&mut Env` as a parameter to helper functions that need JNI access. 
Looking at how update_metrics(env: &mut Env, `...),` 
`convert_datatype_arrays(env, ...)`, etc. are already threaded through the call 
chain in
     this PR's other changes, extending that pattern consistently would be the 
right fix. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to