Is there a way to check the status of a project? I would like to help improve things and have already made some changes, but I need a way to see that what I'm doing is helping.
Chris

On 03.01.19, 16:10, "Allen Wittenauer" <a...@effectivemachines.com.INVALID> wrote:

> On Jan 3, 2019, at 3:11 AM, Bertrand Delacretaz <bdelacre...@apache.org> wrote:
>
> Hi,
>
> On Fri, Dec 21, 2018 at 10:53 PM Allen Wittenauer
> <a...@effectivemachines.com.invalid> wrote:
>
>> ...Culprits: Accumulo, Reef, and Sling.
>
> Sling has a few hundred modules, if you have more specific info on
> which are problematic please let us know so we have a better chance of
> fixing that.

I gave up and wrote a (relatively simple) preamble for our jobs to shoot any long-running processes that are still hanging out in the workspace directories. Output gets logged in the console log, e.g.:

==============
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 24952 0.0 0.0 3476248 96 ? Sl 2018 23:32 /home/jenkins/tools/java/latest1.7/bin/java -Xmx512m -Xms256m -Djava.awt.headless=true -XX:MaxPermSize=256m -Xss256k -jar /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefirebooter8344429480529768484.jar /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire6624482576438364006tmp /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire_44678967186117353271tmp
Killing 24952
***
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 53339 0.0 0.4 30068248 462472 ? Sl 2018 3:23 /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefirebooter4295922957398927030.jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire8873399700577323873tmp /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire_09146430567560271463tmp
Killing 53339
***
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 53381 1.5 2.4 13640196 2447672 ? Sl 2018 72:48 /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar -p 42022 -Dsling.run.modes=author,notshared
Killing 53381
***
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
jenkins 55854 105 2.4 13638076 2422584 ? Sl 2018 4967:06 /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar -p 38732 -Dsling.run.modes=publish,notshared
Killing 55854
***
===========

BTW, I hope people realize that Surefire doesn’t actually report all unit test failures. It assumes that a unit test will write an XML file. If the unit test gets stuck, or any number of other things happen, it won’t get reported as a failure. That’s why Maven jobs absolutely need a post-action to check for these processes (and then kill them so they don’t hang around eating resources). Hint: running in a Docker container makes the post-action required for this much more foolproof.

I’m also growing more and more suspicious of some of the tuning on the build nodes. I have a hunch that other systemd bits beyond pid limits need to get changed, since it doesn’t appear that all node resources are actually available to the ‘jenkins’ user. But I can’t pin down exactly which ones they are. I do know that since adding the ‘kill processes > 24 hours’ code, my own jobs have failed due to “can’t exec” errors only once.
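
The preamble script itself was not posted, so the following is only a rough sketch of the "kill processes > 24 hours" idea, written in Python rather than whatever the real jobs use. It assumes a Linux `ps` (procps) that supports the `etimes` column (elapsed seconds); the function names `find_stale` and `kill_stale`, the dry-run default, and the choice of SIGKILL are all hypothetical. Only the `jenkins` user and the workspace path are taken from the log above.

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 24 * 60 * 60  # the "> 24 hours" threshold from the message
WORKSPACE = "/home/jenkins/jenkins-slave/workspace/"  # path seen in the log above


def find_stale(ps_output, user="jenkins", workspace=WORKSPACE,
               max_age=MAX_AGE_SECONDS):
    """Return (pid, command) pairs for old processes that reference the workspace.

    Expects one process per line: PID, elapsed seconds, owner, full command line.
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # blank line or a process without a visible command line
        pid, elapsed, owner, command = fields
        if not (pid.isdigit() and elapsed.isdigit()):
            continue  # not a data row
        if owner == user and int(elapsed) > max_age and workspace in command:
            stale.append((int(pid), command))
    return stale


def kill_stale(dry_run=True):
    """Log stale processes; only send SIGKILL when dry_run is False."""
    # Separate -o options with empty headers avoid ambiguous header parsing.
    ps_output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "user=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, command in find_stale(ps_output):
        print(f"{'Would kill' if dry_run else 'Killing'} {pid}: {command}")
        if not dry_run and pid != os.getpid():
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited between listing and killing
```

Calling `kill_stale()` only prints what would be killed; `kill_stale(dry_run=False)` actually sends the signal. Matching on the workspace path in the command line is a simplification, since a stuck process started with relative paths would be missed.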