> On Jan 3, 2019, at 3:11 AM, Bertrand Delacretaz <bdelacre...@apache.org> 
> wrote:
> 
> Hi,
> 
> On Fri, Dec 21, 2018 at 10:53 PM Allen Wittenauer
> <a...@effectivemachines.com.invalid> wrote:
> 
>> ...Culprits: Accumulo, Reef, and Sling.
> 
> Sling has a few hundred modules, if you have more specific info on
> which are problematic please let us know so we have a better chance of
> fixing that.

        I gave up and wrote a (relatively simple) preamble for our jobs that 
shoots any long-running processes still hanging around in the workspace 
directories. The output gets logged in the console log.

e.g.:

============== 

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
jenkins  24952  0.0  0.0 3476248   96 ?        Sl    2018  23:32 /home/jenkins/tools/java/latest1.7/bin/java -Xmx512m -Xms256m -Djava.awt.headless=true -XX:MaxPermSize=256m -Xss256k -jar /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefirebooter8344429480529768484.jar /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire6624482576438364006tmp /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire_44678967186117353271tmp

Killing 24952 ***

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
jenkins  53339  0.0  0.4 30068248 462472 ?     Sl    2018   3:23 /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefirebooter4295922957398927030.jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire8873399700577323873tmp /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire_09146430567560271463tmp

Killing 53339 ***

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
jenkins  53381  1.5  2.4 13640196 2447672 ?    Sl    2018  72:48 /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar -p 42022 -Dsling.run.modes=author,notshared

Killing 53381 ***

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
jenkins  55854  105  2.4 13638076 2422584 ?    Sl    2018 4967:06 /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar -p 38732 -Dsling.run.modes=publish,notshared

Killing 55854 ***

===========
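
        For anyone who wants to try something similar, here is a minimal 
sketch of that kind of pre-amble. It is not the script that produced the 
output above; it only illustrates the idea. The function names, the use of 
SIGKILL, the 24-hour threshold as a constant, and matching on the workspace 
path in the command line are all assumptions, and the `etimes` column is a 
procps-ng feature that other ps builds may not have.

```python
import os
import signal
import subprocess

# Assumptions for illustration: threshold and path are taken from this thread,
# not from any real job configuration.
MAX_AGE_SECONDS = 24 * 60 * 60
WORKSPACE = "/home/jenkins/jenkins-slave/workspace"


def find_stale(ps_output, workspace=WORKSPACE, max_age=MAX_AGE_SECONDS):
    """Return (pid, age_seconds, command) for old processes that mention the workspace."""
    stale = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header line and anything malformed
        pid, age, command = int(parts[0]), int(parts[1]), parts[2]
        if age > max_age and workspace in command and pid != os.getpid():
            stale.append((pid, age, command))
    return stale


def kill_stale(user="jenkins"):
    """Log and kill stale workspace processes owned by the given user."""
    # etimes is the elapsed time in seconds (procps-ng).
    out = subprocess.run(
        ["ps", "-u", user, "-o", "pid,etimes,args"],
        capture_output=True, text=True, check=False,
    ).stdout
    for pid, age, command in find_stale(out):
        print(f"{pid} (running {age // 3600}h): {command}")
        print(f"Killing {pid} ***")
        try:
            os.kill(pid, signal.SIGKILL)  # assumption: no graceful SIGTERM first
        except (ProcessLookupError, PermissionError) as exc:
            print(f"  could not kill {pid}: {exc}")
```

        Calling kill_stale() as the first step of a job would log each 
match to the console before killing it. Keeping the parsing in find_stale() 
separate from the killing makes it easy to dry-run against captured ps output 
first.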

        BTW, I hope people realize that surefire doesn’t actually report all 
unit test failures.  It assumes that every unit test will write an XML file.  
If a unit test gets stuck, or any number of other things go wrong, no file is 
written and nothing gets reported as a failure.  That’s why maven jobs 
absolutely need a post-action to check for these things (and then kill the 
leftover processes so they don’t hang around eating resources).  Hint: running 
in a docker container makes that post-action much more fool-proof.

        I’m also growing more and more suspicious of some of the tuning on the 
build nodes.  I have a hunch that other systemd bits beyond pid limits need to 
get changed, since it doesn’t appear that all node resources are actually 
available to the ‘jenkins’ user.  But I can’t pin down exactly which ones they 
are.  I do know that since adding the ‘kill processes > 24 hours’ code, my own 
jobs have failed due to “can’t exec” errors only once.
