> On Jul 23, 2018, at 3:04 PM, Joan Touzet <woh...@apache.org> wrote:
>
> This is why we switched to Docker for ASF Jenkins CI. By pre-building our
> Docker container images for CI, we take control over the build environment
> in a very proactive way, reducing Infra's investment to just keeping the
> build nodes up, running, and with sufficient disk space.
All of the projects I've been involved with have been using Docker-based
builds for a few years now. Experience there has shown that, to ease
debugging (especially since the Jenkins machines are so finicky),
information from inside the container needs to be available after the
container exits. As a result, Apache Yetus (which is used to control the
majority of builds for projects like Hadoop and HBase) will specifically
mount key directories from the workspace inside the container so that they
are readable after the build finishes. Otherwise one spends a significant
amount of time head scratching as to why stuff failed on the Jenkins build
servers but not locally. (There's a rough sketch of the mount-and-cleanup
pattern at the bottom of this mail.)

It's also worth pointing out that "just use Docker" only works if one is
building on Linux. That isn't an option on Windows, which is why a 'one
size fits all' policy for all jobs isn't really going to work. Performance
on the Windows machines is pretty awful (I'm fairly certain it's IO), so
any time savings there is huge. (For comparison, the last time I looked, a
Hadoop Linux full build + full analysis took 12 hours, while a Windows
full build + partial analysis took 19 hours… a 7 hour difference with
stuff turned off!)

> It also means that, once a build is done, there is no mess on the Jenkins
> build node to clean up - just a regular `docker rm` or `docker rmi` is
> sufficient to restore disk space. Infra is already running these aggressively,
> since if a build hangs due to an unresponsive docker daemon or network
> failure, our post-run script to clean up after ourselves may never run.

Apache Yetus pretty much manages the docker repos for the 'Hadoop' queue
machines, since it runs so frequently. It happily deletes stale images
after a time, as well as killing any stuck containers that are still
running after a shorter period of time. This way 'docker build' commands
can benefit from cache re-use but still get forced into full rebuilds
periodically. I enabled the docker-cleanup functionality as part of the
precommit-admin job in January as well, so it's been working alongside
whatever extra docker bits the INFRA team has been using on the non-Hadoop
nodes.

> We don't put everything into saved artefacts either, but we have built a
> simple Apache CouchDB-based database to which we upload any artefacts we
> want to save for development purposes only.

… and where does this DB run? Also, it's not so much about the finished
artifacts as it is about the state of the workspace post-build. If no jars
get built, then we want to know what happened.

> We had this issue too - which is why we build under a `/tmp` directory
> inside the Docker container to avoid one build trashing another build's
> workspace directory via the multi-node sync mechanism.

Apache Yetus-based builds mount a dir inside the container. It's
relatively expensive to rebuild the repo for large projects; for Hadoop,
this takes in the 5-10 minute range. That may not seem like a lot, but
given the number of build jobs per day, it adds up very quickly. The
quicker the big jobs run, the more cycles are available for everyone and
the faster contributors get feedback on their patches.
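For anyone who hasn't seen this pattern before, here's a rough shell
sketch of the general idea. To be clear, this is NOT the actual Yetus
code; the image name, paths, and age thresholds are all invented for
illustration.

```sh
#!/bin/sh
# Rough sketch only -- NOT the actual Apache Yetus implementation.
# Image name, paths, and thresholds below are invented for illustration.

WORKSPACE="${WORKSPACE:-$PWD}"   # Jenkins exports WORKSPACE on its build nodes

# Mount key workspace directories into the container so that logs and
# reports written inside it are still readable after the container exits.
docker run --rm \
  -v "${WORKSPACE}/sourcedir:/src" \
  -v "${WORKSPACE}/patchprocess:/tmp/patchprocess" \
  example/build-image:latest \
  /src/dev-support/run-build.sh

# Kill any container that has been running long enough to be presumed stuck.
docker ps --format '{{.ID}} {{.RunningFor}}' | while read -r id age; do
  case "$age" in
    *hour*|*day*) docker rm -f "$id" ;;   # crude: anything over ~an hour
  esac
done

# Remove images older than a week so `docker build` still benefits from
# layer caching day-to-day, but is forced into a full rebuild periodically.
docker image prune --all --force --filter "until=168h"
```

The point of the two different timeouts is that stuck containers get
reaped fairly quickly, while images stick around long enough for cache
re-use to pay off.

[Ofc,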