> On Jan 27, 2020, at 6:37 PM, Andriy Redko <drr...@gmail.com> wrote:
> 
> Thanks a lot for looking into it. From the CXF perspective, I have seen 
> that many CXF builds have been aborted because the connection with the 
> master is lost (I don't have exact builds to point to, since we keep only 
> the last 3), which could probably explain the hanging builds.


        This is almost always because whatever is running on the two executors 
has exhausted the system's resources, which ends up starving the Jenkins 
slave.jar and causes the disconnect.  (It's extremely important to 
understand that Jenkins' implementation here is sort of brain dead: the 
slave.jar runs as the SAME USER as the jobs being executed.  This is an idiotic 
design, but it is what it is.)
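
        You can see this for yourself (a quick check, assuming the agent 
runs as a 'jenkins' user):

    # The agent (slave.jar) and every build it spawns show up under the
    # same account, so a per-user limit or OOM kill can take the agent
    # down along with the build:
    ps -o pid,ppid,rss,cmd -u jenkins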

        Anyway, in my experience, if all or most of one type of job fails in a 
way that makes the node appear to have crashed, there is a good chance that job 
is the cause.  So it would be great if someone could spend the effort to profile 
the CXF jobs to see what their actual resource consumption is.
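
        For example (a minimal sketch assuming a Linux node; the Maven 
invocation is just a placeholder for whatever the CXF job actually runs):

    # GNU time (not the shell builtin) reports peak RSS, page faults, etc.
    /usr/bin/time -v mvn verify 2> build-resources.log

    # Or sample CPU and memory for the jenkins user's processes every 30s
    # for the duration of the run (the -U filter needs a newer sysstat):
    pidstat -u -r -h -U jenkins 30 > pidstat.log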

        FWIW, we had this problem with Hadoop, HBase, and others on the 
'Hadoop' label nodes. The answer was to:

a) always run our jobs in containers that can be easily killed (freestyle 
Jenkins jobs that do 'docker run' generally can't be killed, despite what the 
UI says, because the signal never reaches the container);
b) give those containers resource limits; 
c) increase the resources that systemd is allowed to give the jenkins user 
(a rough sketch of (a)-(c) follows below).
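
        For illustration, here is roughly what that looks like (a sketch 
only; the image name, resource numbers, and file paths are assumptions, 
not what the Hadoop jobs actually use):

    # (a) + (b): run the build in a named container with hard resource
    # caps, so it can always be found and killed even if the Jenkins job
    # loses track of it. BUILD_TAG is the standard Jenkins-provided
    # unique identifier for the run.
    docker run --name "build-${BUILD_TAG}" \
        --memory 8g --memory-swap 8g --cpus 4 \
        my-build-image ./run-build.sh

    # A cleanup step that actually reaches the container (docker kill
    # sends SIGKILL to PID 1 inside it), unlike the Jenkins abort button:
    docker kill "build-${BUILD_TAG}" 2>/dev/null || true

    # (c): on a reasonably recent systemd, raise the per-user limits via
    # a drop-in, e.g. /etc/systemd/system/user-.slice.d/limits.conf:
    #   [Slice]
    #   TasksMax=16384
    # then run 'systemctl daemon-reload'. (Older systems use
    # UserTasksMax= in /etc/systemd/logind.conf instead.)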

        After doing that, the number of failures on the Hadoop nodes dropped 
dramatically. 
