> On Jan 3, 2019, at 3:11 AM, Bertrand Delacretaz <bdelacre...@apache.org>
> wrote:
>
> Hi,
>
> On Fri, Dec 21, 2018 at 10:53 PM Allen Wittenauer
> <a...@effectivemachines.com.invalid> wrote:
>
>> ...Culprits: Accumulo, Reef, and Sling.
>
> Sling has a few hundred modules, if you have more specific info on
> which are problematic please let us know so we have a better chance of
> fixing that.
I gave up and wrote a (relatively simple) preamble for our jobs that
kills any long-running processes still hanging around in the workspace
directories. The output gets logged to the console log, e.g.:
==============
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 24952 0.0 0.0 3476248 96 ? Sl 2018 23:32
/home/jenkins/tools/java/latest1.7/bin/java -Xmx512m -Xms256m
-Djava.awt.headless=true -XX:MaxPermSize=256m -Xss256k -jar
/home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefirebooter8344429480529768484.jar
/home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire6624482576438364006tmp
/home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire_44678967186117353271tmp
Killing 24952 ***
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 53339 0.0 0.4 30068248 462472 ? Sl 2018 3:23
/usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -jar
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefirebooter4295922957398927030.jar
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire8873399700577323873tmp
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire_09146430567560271463tmp
Killing 53339 ***
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 53381 1.5 2.4 13640196 2447672 ? Sl 2018 72:48
/usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar
-p 42022 -Dsling.run.modes=author,notshared
Killing 53381 ***
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 55854 105 2.4 13638076 2422584 ? Sl 2018 4967:06
/usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar
-p 38732 -Dsling.run.modes=publish,notshared
Killing 55854 ***
===========
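For anyone wanting to try something similar, here is a minimal Python sketch
of that kind of preamble. It is not the actual script. It assumes a Linux
node whose ps understands the procps-ng 'etimes' keyword (elapsed seconds),
reuses the workspace path seen in the log above, and the names find_stale
and kill_stale are made up for illustration.

```python
import os
import signal
import subprocess

# Assumptions: workspace path taken from the log above; 24 hour cutoff.
WORKSPACE = "/home/jenkins/jenkins-slave/workspace"
MAX_AGE_SECONDS = 24 * 60 * 60


def find_stale(ps_output, workspace=WORKSPACE, max_age=MAX_AGE_SECONDS):
    """Return (pid, age_seconds, command) for old processes whose
    command line mentions the workspace directory."""
    stale = []
    for line in ps_output.splitlines():
        # Columns: pid, elapsed seconds, full command line.
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # header row or unparsable line
        pid, age, command = int(fields[0]), int(fields[1]), fields[2]
        if age > max_age and workspace in command:
            stale.append((pid, age, command))
    return stale


def kill_stale(workspace=WORKSPACE, max_age=MAX_AGE_SECONDS):
    """List the current user's processes, log the stale ones, kill them."""
    ps_output = subprocess.run(
        ["ps", "-u", str(os.getuid()), "-o", "pid,etimes,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, age, command in find_stale(ps_output, workspace, max_age):
        if pid == os.getpid():
            continue  # never shoot ourselves
        print(f"{command}\nKilling {pid} ***")
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # process exited between ps and kill
```

Calling kill_stale() as the first build step is the idea. Matching on the
command line is a simplification: a process started with relative paths
from inside the workspace would be missed.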
BTW, I hope people realize that Surefire doesn’t actually report all
unit test failures. It assumes that every unit test will write an XML
file. If a test gets stuck, or any number of other things keep that file
from being written, it won’t get reported as a failure. That’s why Maven
jobs absolutely need a post-action that checks for stuck tests (and then
kills them so they don’t hang around eating resources). Hint: running the
build in a Docker container makes that post-action much more foolproof.
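One possible shape for the checking half of such a post-action, as a rough
Python sketch: compare the test classes that should have run against the
XML files that were actually written. It assumes the usual Maven layout
(src/test/java, target/surefire-reports/TEST-<class>.xml) and Surefire's
default include patterns; a project that overrides either would need
adjusting. The function names are hypothetical.

```python
from pathlib import Path


def expected_test_classes(module_dir):
    """Fully qualified test class names, guessed from source file names."""
    root = Path(module_dir) / "src" / "test" / "java"
    names = set()
    for path in root.rglob("*.java"):
        stem = path.stem
        # Assumed default includes: Test*, *Test, *Tests, *TestCase
        if stem.startswith("Test") or stem.endswith(("Test", "Tests", "TestCase")):
            relative = path.relative_to(root).with_suffix("")
            names.add(".".join(relative.parts))
    return names


def reported_test_classes(module_dir):
    """Class names that have a TEST-<class>.xml report."""
    reports = Path(module_dir) / "target" / "surefire-reports"
    prefix, suffix = "TEST-", ".xml"
    return {p.name[len(prefix):-len(suffix)] for p in reports.glob("TEST-*.xml")}


def missing_reports(module_dir):
    """Test classes with no XML report: candidates for hung or crashed tests."""
    return sorted(expected_test_classes(module_dir) - reported_test_classes(module_dir))
```

A non-empty result from missing_reports() would be the signal to fail the
build. Abstract base classes and deliberately excluded tests will show up
as false positives, so a real check needs an exclusion list.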
I’m also growing more and more suspicious of some of the tuning on the
build nodes. I have a hunch that other systemd bits beyond pid limits need to
get changed since it doesn’t appear that all node resources are actually
available to the ‘jenkins’ user. But I can’t pin down exactly which ones they
are. I do know that since adding the ‘kill processes > 24 hours’ code, my own
jobs have failed due to “can’t exec” errors only once.