This is now the fourth time this week that my build job has landed on a node with broken JVM processes left hanging around by surefire tests gone awry. (Culprits: Accumulo, Reef, and Sling.) Because of the way Linux enforces process limits on systemd-based boxes, my tasks are getting killed even though there is plenty of CPU and memory: these leftover surefire tests have spawned so many threads that everything else fails.
Folks: if you aren’t running in a Docker container (which makes it extremely easy to clean up as well as to enforce a sub-5k process limit), please add a Post Action on your Jenkins job to blow away any of your tasks that are still hanging around. At this point, I feel like I have no choice but to start nuking any long-running java processes (except agent/slave.jar and the Datadog stuff that infra runs) before startup just so I can get a build. :(
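
For anyone wondering what such a cleanup could look like, here is a rough sketch in Python of a script a post-build step might call. It is only an illustration, not something infra provides: the function names, the keep-list markers, and the age threshold are all my own assumptions, and it assumes a procps-style `ps` that supports the `etimes` column. Adjust the keep-list to whatever your node actually runs before using anything like it.

```python
import os
import signal
import subprocess

# Hypothetical keep-list: substrings of command lines that must survive
# the cleanup (the Jenkins agent jar and infra's Datadog processes).
KEEP_MARKERS = ("agent.jar", "slave.jar", "datadog")


def select_stale_pids(ps_output, min_age_seconds=0, keep_markers=KEEP_MARKERS):
    """Pick java PIDs to kill from `ps -o pid=,etimes=,args=` output."""
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, elapsed seconds, full command line
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue  # blank or malformed line
        pid, age, args = int(fields[0]), int(fields[1]), fields[2]
        if age < min_age_seconds:
            continue  # too young, probably belongs to a running build
        if os.path.basename(args.split()[0]) != "java":
            continue  # only JVMs are of interest here
        if any(marker in args for marker in keep_markers):
            continue  # leave the agent and monitoring alone
        stale.append(pid)
    return stale


def kill_stale(user, min_age_seconds=0):
    """Kill leftover java processes owned by `user`; return the PIDs signalled."""
    result = subprocess.run(
        ["ps", "-u", user, "-o", "pid=,etimes=,args="],
        capture_output=True, text=True, check=False,
    )
    killed = []
    for pid in select_stale_pids(result.stdout, min_age_seconds):
        if pid == os.getpid():
            continue
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or not ours to kill
    return killed
```

A post-build step could then call something like `kill_stale(getpass.getuser())`. The selection logic is kept separate from the killing so it can be checked against canned `ps` output first; matching on substrings of the command line is crude, so treat the keep-list as a starting point only.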