Re: [I] Frequent CI failures for Spark 4.0.2 / JDK 21 [datafusion-comet]

via GitHub Mon, 18 May 2026 09:10:20 -0700


andygrove commented on issue #4327:
URL: 
https://github.com/apache/datafusion-comet/issues/4327#issuecomment-4479558592


   ## Root cause
   
   Every recent failing run of 
`spark-sql-auto-sql_core-1/spark-4.0.2-jdk21` aborts on the same assertion:
   
   ```
   [info] ParquetFileFormatV{1,2}Suite (or OrcSourceV{1,2}Suite) *** ABORTED ***
   [info]   "There are 1 possibly leaked file streams." (SharedSparkSession.scala:189)
   ```
   
   This is Spark's `DebugFilesystem.assertNoOpenStreams()` in 
`SharedSparkSession.afterEach`, retried via `eventually` for ~10s.
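
The retry-until-timeout behaviour can be sketched in plain Scala. This is a simplified model, not Spark's or ScalaTest's code: `LeakCheckSketch`, its `open`/`close` registry, and this hand-rolled `eventually` are hypothetical stand-ins for `DebugFilesystem` and ScalaTest's `eventually`.

```scala
import scala.concurrent.duration._

object LeakCheckSketch {
  // Hypothetical stand-in for the open streams that DebugFilesystem tracks.
  private val openStreams = scala.collection.mutable.Set.empty[String]

  def open(name: String): Unit = openStreams.synchronized { openStreams += name }
  def close(name: String): Unit = openStreams.synchronized { openStreams -= name }

  // Fails while any stream is still registered as open.
  def assertNoOpenStreams(): Unit = {
    val n = openStreams.synchronized(openStreams.size)
    if (n > 0) {
      throw new IllegalStateException(s"There are $n possibly leaked file streams.")
    }
  }

  // Re-runs `check` until it passes; rethrows the last failure once the
  // deadline has passed. Assumption: a simplified model of `eventually`.
  def eventually(timeout: FiniteDuration, interval: FiniteDuration)(check: => Unit): Unit = {
    val deadline = timeout.fromNow
    var passed = false
    while (!passed) {
      try {
        check
        passed = true
      } catch {
        case e: IllegalStateException =>
          if (deadline.isOverdue()) throw e
          Thread.sleep(interval.toMillis)
      }
    }
  }
}
```

Under this model, a call such as `LeakCheckSketch.eventually(10.seconds, 2.seconds)(LeakCheckSketch.assertNoOpenStreams())` passes only if every stream is closed within the window (the 10-second timeout follows the text; the 2-second interval is an illustrative value). A stream that is closed late by another thread is tolerated, while one that stays open past the deadline aborts the suite.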
   
   ### Why only Spark 4.0.2 / JDK 21
   
   These V1/V2 Parquet and ORC source suites are flaky on JDK 21 even in 
apache/spark's own CI. Spark's workaround is `DEDICATED_JVM_SBT_TESTS`, which 
forks a separate JVM per listed suite. Comet's workflow 
(`.github/workflows/spark_sql_test.yml:199`) sets this var, but only for 
`spark-short == '4.0'`.
   
   ### Why the workaround silently fails
   
   The workflow also unconditionally sets `SERIAL_SBT_TESTS: "1"` at line 191 
(added in #4285 to fix OOM on standard runners). In Spark 4.0.2's 
`project/SparkBuild.scala`:
   
   ```scala
   if (!sys.env.contains("SERIAL_SBT_TESTS")) {
     allProjects.foreach(enable(SparkParallelTestGrouping.settings))
   }
   ```
   
   When `SERIAL_SBT_TESTS` is set, `SparkParallelTestGrouping.settings` is 
never installed, and that block is the only consumer of 
`DEDICATED_JVM_SBT_TESTS`. The env var is read into a set that is never used; 
SBT falls back to a single forked JVM and runs every suite there.
   
   Evidence from a failing log (run 
[26004020697](https://github.com/apache/datafusion-comet/actions/runs/26004020697/job/76441382377)):
   - thread-leak warning mentions 
`readingParquetFooters-ForkJoinPool-12260-worker-1` (high pool counter = many 
prior suites in this JVM)
   - unrelated suites (`AnalysisConfOverrideSuite`, 
`TPCDSModifiedPlanStabilityWithStatsSuite`, state-store suites) appear in the 
same JVM right before these Parquet/ORC suites
   - no SBT fork-restart markers between unrelated suites
   
   So all ~11,600 tests share one JVM. By the time these flaky suites run, 
enough state has accumulated to make the leak-detection `eventually` loop time out.
   
   ### Options
   
   The two env vars are mutually exclusive by Spark's design. Either:
   
   1. Drop `SERIAL_SBT_TESTS=1` for the 4.0.2/JDK21 row so the dedicated-JVM 
workaround actually fires (likely requires raising the heap to avoid OOM 
regression).
   2. Move the four flaky suites into their own CI step/job for the 4.0.2/JDK21 
matrix entry, achieving isolation outside SBT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

