This is an automated email from the ASF dual-hosted git repository.
mbutrovich pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new 6aa577b34 docs: Add docs about global singletons to development guide (#3809)
6aa577b34 is described below
commit 6aa577b34ae066452cebc14382c8aa6e8bb332b7
Author: Matt Butrovich <[email protected]>
AuthorDate: Fri Mar 27 16:09:39 2026 -0400
docs: Add docs about global singletons to development guide (#3809)
---
docs/source/contributor-guide/development.md | 68 ++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
diff --git a/docs/source/contributor-guide/development.md b/docs/source/contributor-guide/development.md
index b83f3174d..47f034c97 100644
--- a/docs/source/contributor-guide/development.md
+++ b/docs/source/contributor-guide/development.md
@@ -101,6 +101,74 @@ The runtime is created once per executor JVM in a `Lazy<Runtime>` static:
| Storing `JNIEnv` in an operator | **No** | `JNIEnv` is thread-specific |
| Capturing state at plan creation time | Yes | Runs on executor thread, store in struct |
+## Global singletons
+
+Comet code runs in both the driver and executor JVM processes, and different parts of the
+codebase run in each. Global singletons have **process lifetime**: they are created once and
+never dropped until the JVM exits. Since multiple Spark jobs, queries, and tasks share the same
+process, this makes it difficult to reason about what state a singleton holds and whether it is
+still valid.
+
+### How to recognize them
+
+**Rust:** `static` variables using `OnceLock`, `LazyLock`, `OnceCell`, `Lazy`, or `lazy_static!`:
+
+```rust
+static TOKIO_RUNTIME: OnceLock<Runtime> = OnceLock::new();
+static TASK_SHARED_MEMORY_POOLS: Lazy<Mutex<HashMap<i64, PerTaskMemoryPool>>> = Lazy::new(..);
+```
+
+**Java:** `static` fields, especially mutable collections:
+
+```java
+private static final HashMap<Long, HashMap<Long, ScalarSubquery>> subqueryMap = new HashMap<>();
+```
+
+**Scala:** `object` declarations (companion objects are JVM singletons) holding mutable state:
+
+```scala
+object MyCache {
+ private val cache = new ConcurrentHashMap[String, Value]()
+}
+```
+
+### Why they are dangerous
+
+- **Credential staleness.** A singleton caching an authenticated client will hold stale
+  credentials after token rotation, causing silent failures mid-job.
+- **Unbounded growth.** A cache keyed by file path or configuration grows with every query
+  but never shrinks. Over hours of process uptime this becomes a memory leak.
+- **Cross-job contamination.** Different Spark jobs on the same process may use different
+  configurations. A singleton initialized by the first job silently serves wrong state to
+  subsequent jobs.
+- **Testing difficulty.** Global state persists across test cases, making tests
+ order-dependent.
+
+### When a singleton is acceptable
+
+Some state genuinely has process lifetime:
+
+| Singleton                                     | Why it is safe                                      |
+| --------------------------------------------- | --------------------------------------------------- |
+| `TOKIO_RUNTIME`                               | One runtime per executor, no configuration variance |
+| `JAVA_VM` / `JVM_CLASSES`                     | One JVM per process, set once at JNI load           |
+| `OperatorRegistry` / `ExpressionRegistry`     | Immutable after initialization                      |
+| Compiled `Regex` patterns (`LazyLock<Regex>`) | Stateless and immutable                             |
+
+### When to avoid a singleton
+
+If any of these apply, do **not** use a global singleton:
+
+- The state depends on configuration that can vary between jobs or queries
+- The state holds credentials or authenticated connections that will not be refreshed or invalidated when they expire
+- The state grows proportionally to the number of queries or files processed
+- The state needs cleanup or refresh during process lifetime
+
+Instead, scope state to the plan or task by adding the cache as a field in an existing session or context object.
+
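Editor's illustration, not part of the commit: a minimal sketch of scoping a cache to a task instead of a `static`. The `TaskContext` type, its fields, and the stand-in computation are all hypothetical and do not come from the Comet codebase. The sketch only shows the shape of the advice above, where the cache is a field that is dropped with its owner.

```rust
use std::collections::HashMap;

/// Hypothetical per-task context; not a type from the Comet codebase.
struct TaskContext {
    /// Configuration captured when the task's plan is created.
    batch_size: usize,
    /// Cache owned by the task, so it is dropped together with the context.
    parsed_paths: HashMap<String, usize>,
}

impl TaskContext {
    fn new(batch_size: usize) -> Self {
        Self { batch_size, parsed_paths: HashMap::new() }
    }

    /// Returns the cached value for `path`, computing it on first use.
    /// The computation is a stand-in: path length times batch size.
    fn lookup(&mut self, path: &str) -> usize {
        let batch_size = self.batch_size;
        *self
            .parsed_paths
            .entry(path.to_string())
            .or_insert_with(|| path.len() * batch_size)
    }
}

fn main() {
    // Two tasks with different configurations keep separate caches,
    // so neither can observe a value computed under the other's settings.
    let mut first = TaskContext::new(2);
    let mut second = TaskContext::new(10);
    assert_eq!(first.lookup("/data/a"), 14);
    assert_eq!(second.lookup("/data/a"), 70);
}
```

Because each context owns its map, the cache cannot outlive the task or grow across queries, which addresses the unbounded-growth and cross-job concerns listed earlier.
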
+If a singleton is truly needed, add a comment explaining why `static` is the right lifetime,
+whether the cache is bounded, and how credential refresh is handled (if applicable).
+
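Editor's illustration, not part of the commit: one possible shape for such a comment, using only the standard library. The `RESERVED_WORDS` static and the `is_reserved` helper are hypothetical names chosen for the example, and the comment layout is an assumption rather than a project convention.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

/// Reserved words used when validating identifiers (hypothetical example).
///
/// Lifetime: `static` is appropriate because the set is a fixed list that is
/// identical for every job, query, and task in the process.
/// Bounded: yes, three entries, immutable after initialization.
/// Credentials: none held, so no refresh is needed.
static RESERVED_WORDS: OnceLock<HashSet<&'static str>> = OnceLock::new();

/// Returns true if `word` is in the process-wide reserved set.
fn is_reserved(word: &str) -> bool {
    RESERVED_WORDS
        .get_or_init(|| HashSet::from(["select", "from", "where"]))
        .contains(word)
}

fn main() {
    assert!(is_reserved("select"));
    assert!(!is_reserved("comet"));
}
```
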
## Development Setup
1. Make sure `JAVA_HOME` is set and points to a JDK listed in the [support matrix](../user-guide/latest/installation.md)