andygrove commented on issue #4327:
URL:
https://github.com/apache/datafusion-comet/issues/4327#issuecomment-4479558592
## Root cause
Every recent failing run of
`spark-sql-auto-sql_core-1/spark-4.0.2-jdk21` aborts on the same assertion:
```
[info] ParquetFileFormatV{1,2}Suite (or OrcSourceV{1,2}Suite) *** ABORTED ***
[info] "There are 1 possibly leaked file streams."
(SharedSparkSession.scala:189)
```
This is Spark's `DebugFilesystem.assertNoOpenStreams()` in
`SharedSparkSession.afterEach`, retried via `eventually` for ~10s.
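The retry behaviour can be pictured with a small model. This is a minimal sketch, not ScalaTest's `eventually` or Spark's `DebugFilesystem`: `retryUntil` and the `assertNoOpenStreams` helper below are hypothetical stand-ins, and the timeout values are illustrative only.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object EventuallySketch {
  // Hypothetical stand-in for an `eventually`-style retry: re-run `check`
  // until it stops throwing, or rethrow the last failure once the deadline passes.
  def retryUntil[T](timeoutMs: Long, intervalMs: Long)(check: => T): T = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    @tailrec
    def loop(): T = Try(check) match {
      case Success(value) => value
      case Failure(e) if System.nanoTime() >= deadline => throw e
      case Failure(_) =>
        Thread.sleep(intervalMs)
        loop()
    }
    loop()
  }

  // Hypothetical leak check: fails while any stream is still counted as open.
  def assertNoOpenStreams(openStreams: () => Int): Unit = {
    val n = openStreams()
    if (n != 0) {
      throw new IllegalStateException(s"There are $n possibly leaked file streams.")
    }
  }
}
```

In this model a stream that closes late still passes as long as it closes before the deadline; the suite aborts only when the count is still non-zero when the retry window ends.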
### Why only Spark 4.0.2 / JDK 21
These V1/V2 Parquet and ORC source suites are flaky on JDK 21 even in
apache/spark's own CI. Spark's workaround is `DEDICATED_JVM_SBT_TESTS`, which
forks a separate JVM per listed suite. Comet's workflow
(`.github/workflows/spark_sql_test.yml:199`) sets this var, but only for
`spark-short == '4.0'`.
### Why the workaround silently fails
The workflow also unconditionally sets `SERIAL_SBT_TESTS: "1"` at line 191
(added in #4285 to fix OOM on standard runners). In Spark 4.0.2's
`project/SparkBuild.scala`:
```scala
if (!sys.env.contains("SERIAL_SBT_TESTS")) {
  allProjects.foreach(enable(SparkParallelTestGrouping.settings))
}
```
When `SERIAL_SBT_TESTS` is set, `SparkParallelTestGrouping.settings` is
never installed, and that block is the only consumer of
`DEDICATED_JVM_SBT_TESTS`. The env var is read into a set that is never used;
SBT falls back to a single forked JVM and runs every suite there.
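The interaction can be modelled in a few lines. This is a simplified sketch of the gating described above, not Spark's build code: the object and function names are hypothetical, the comma-separated format of `DEDICATED_JVM_SBT_TESTS` is an assumption, and the suite names in the usage note are shortened for illustration.

```scala
object DedicatedJvmGatingSketch {
  // Suites requested via DEDICATED_JVM_SBT_TESTS (assumed comma-separated),
  // or the empty set when the variable is unset.
  def requestedSuites(env: Map[String, String]): Set[String] =
    env.get("DEDICATED_JVM_SBT_TESTS")
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)
      .getOrElse(Set.empty[String])

  // Model of the gate: the requested set only has an effect when the
  // parallel-grouping settings are installed, i.e. SERIAL_SBT_TESTS is unset.
  def effectiveDedicatedSuites(env: Map[String, String]): Set[String] =
    if (!env.contains("SERIAL_SBT_TESTS")) requestedSuites(env)
    else Set.empty[String]
}
```

With only `DEDICATED_JVM_SBT_TESTS -> "ParquetFileFormatV1Suite,OrcSourceV1Suite"` in the map, `effectiveDedicatedSuites` returns both names; adding `SERIAL_SBT_TESTS -> "1"` makes it return the empty set, which mirrors the silent failure.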
Evidence from a failing log (run
[26004020697](https://github.com/apache/datafusion-comet/actions/runs/26004020697/job/76441382377)):
- thread-leak warning mentions
`readingParquetFooters-ForkJoinPool-12260-worker-1` (high pool counter = many
prior suites in this JVM)
- unrelated suites (`AnalysisConfOverrideSuite`,
`TPCDSModifiedPlanStabilityWithStatsSuite`, state-store suites) appear in the
same JVM right before these Parquet/ORC suites
- no SBT fork-restart markers between unrelated suites
So all ~11,600 tests share one JVM. By the time these flaky suites run,
enough state has accumulated that the leak check's `eventually` loop times out.
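The pool-counter evidence above can be checked against other logs with a small helper. This is a rough sketch: `PoolCounterSketch` is a hypothetical name, and the regex assumes thread names of the form `ForkJoinPool-<n>-worker-<m>`, as in the thread name quoted from the failing log.

```scala
object PoolCounterSketch {
  // Matches names such as "readingParquetFooters-ForkJoinPool-12260-worker-1"
  // and captures the pool counter (12260 in that example).
  private val PoolThread = """ForkJoinPool-(\d+)-worker-\d+""".r

  // Returns every pool counter found in the given log lines, in order.
  def poolCounters(logLines: Seq[String]): Seq[Int] =
    logLines.flatMap { line =>
      PoolThread.findAllMatchIn(line).map(_.group(1).toInt).toList
    }
}
```

A counter in the thousands, as in the failing run, is consistent with many earlier suites having created pools in the same JVM; a freshly forked JVM would be expected to show small values.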
### Options
In Spark's build the two env vars are mutually exclusive:
`DEDICATED_JVM_SBT_TESTS` only takes effect when `SERIAL_SBT_TESTS` is unset.
Either:
1. Drop `SERIAL_SBT_TESTS=1` for the 4.0.2/JDK21 row so the dedicated-JVM
workaround actually fires (likely requires raising the heap to avoid
reintroducing the OOM fixed in #4285).
2. Move the four flaky suites into their own CI step/job for the 4.0.2/JDK21
matrix entry, achieving isolation outside SBT.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]