> On Jan 27, 2020, at 6:37 PM, Andriy Redko <drr...@gmail.com> wrote:
>
> Thanks a lot for looking into it. From the CXF perspective, I have seen that
> many CXF builds have been aborted
> because the connection with the master is lost (I don't have exact builds to
> point to since we keep only the last 3),
> which could probably explain the hanging builds.
This is almost always because whatever is running on the two executors
has exhausted the system's resources. This ends up starving the Jenkins
slave.jar, thus causing the disconnect. (It's extremely important to
understand that Jenkins' implementation here is sort of brain dead: the
slave.jar runs as the SAME USER as the jobs being executed, so a runaway job
starves the very process that keeps the node connected. It's an idiotic
implementation, but it is what it is.)
Anyway, in my experience, if all or most of one type of job is failing
and the node appears to have crashed, there is a good chance that job is
the cause. So it would be great if someone could spend the effort to profile
the CXF jobs to see what their actual resource consumption is.
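A rough way to do that profiling, assuming shell access to the node and GNU
ps (the user name, sample interval, and log path are just placeholders):

    # Sample everything running as the jenkins user every 30s while a
    # CXF build is in flight; review the log after the next hang.
    while true; do
        date
        ps -u jenkins -o pid,ppid,pcpu,pmem,rss,nlwp,etime,comm \
            --sort=-rss | head -20
        echo '---'
        sleep 30
    done >> /tmp/jenkins-usage.log

The nlwp column (thread count) is worth watching too, since per-user task
limits count threads, not just processes.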
FWIW, we had this problem with Hadoop, HBase, and others on the
'Hadoop' label nodes. The answer was to:
a) always run our jobs in containers that could be easily killed (freestyle
Jenkins jobs that do 'docker run' generally can't be killed, despite what the
UI says, because the signal never reaches the container; see the sketch after
this list)
b) give those containers hard resource limits (also in the sketch below)
c) increase the resources that systemd is allowed to give the jenkins user
(see the drop-in example below)
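For (a) and (b), a minimal sketch of what such a shell build step could look
like, assuming a Linux node with Docker; the image name, limits, and timeout
are illustrative, not what we actually ran on the Hadoop nodes:

    # Give the container a predictable name so it can be killed even if
    # Jenkins loses track of the job.
    CONTAINER="build-${JOB_NAME}-${BUILD_NUMBER}"

    # Kill the container ourselves on abort; Jenkins' own signal never
    # reaches processes inside the container.
    trap 'docker kill "${CONTAINER}" 2>/dev/null' EXIT INT TERM

    # Hard caps on memory, CPU, and process count so a runaway build
    # cannot starve the slave.jar on the host.
    docker run --name "${CONTAINER}" --rm \
        --memory=8g --memory-swap=8g \
        --cpus=2 \
        --pids-limit=4096 \
        my-build-image:latest \
        timeout 2h mvn -B clean verify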
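For (c), the exact knob depends on the systemd version; one way to raise the
per-user caps is a drop-in on the user's slice (910 is a hypothetical UID for
the jenkins user, and the limit value is illustrative):

    # /etc/systemd/system/user-910.slice.d/90-jenkins.conf
    [Slice]
    TasksMax=16384

    $ systemctl daemon-reload

The default TasksMax on user slices is low enough that a thread-happy Maven
build can hit it long before the box runs out of memory.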
After doing that, the number of failures on the Hadoop nodes dropped
dramatically.