Re: broken builds taking up resources

2020-01-28 Thread Chris Lambertus
> On Jan 28, 2020, at 8:22 PM, Allen Wittenauer wrote: [snip] > [1] - The best on-prem solution I came up with (before I moved my $DAYJOB > stuff to cloud) was to run each executor in a VM on the box. That VM would > also have a regularly scheduled job that would cause it to wipe itsel

Re: broken builds taking up resources

2020-01-28 Thread Allen Wittenauer
> On Jan 28, 2020, at 8:02 PM, Chris Lambertus wrote: > > > Allen, can you elaborate on what a “proper” implementation is? As far as I > know, this is baked into jenkins. We could raise process limits for the > jenkins user, but these situations only tend to arise when a build has gone >

Re: broken builds taking up resources

2020-01-28 Thread Chris Lambertus
> On Jan 27, 2020, at 10:52 PM, Allen Wittenauer wrote: > > > >> On Jan 27, 2020, at 6:37 PM, Andriy Redko wrote: >> >> Thanks a lot for looking into it. From the CXF perspective, I have seen that >> many CXF builds have been aborted >> because the connection with master was lost (do

Re: broken builds taking up resources

2020-01-28 Thread Ismael Juma
FYI, Apache Kafka builds take 3 to 4 hours to run currently (build, unit and integration tests). Thanks, Ismael On Wed, Jan 22, 2020 at 4:55 PM Chris Lambertus wrote: > Folks, > > Over the last week or so we have received many reports of broken builds > due to nodes out of resources. As noted i

Re: broken builds taking up resources

2020-01-27 Thread Allen Wittenauer
> On Jan 27, 2020, at 10:52 PM, Allen Wittenauer wrote: > > This is almost always because whatever is running on the two executors > has suffocated the system resources. ... and before I forget, a reminder: Java threads take up a file descriptor. Hadoop's unit tests were firing u

Re: broken builds taking up resources

2020-01-27 Thread Allen Wittenauer
> On Jan 27, 2020, at 6:37 PM, Andriy Redko wrote: > > Thanks a lot for looking into it. From the CXF perspective, I have seen that > many CXF builds have been aborted > because the connection with master was lost (I don't have exact builds to > point to, since we keep only the last 3), > which could

Re: broken builds taking up resources

2020-01-27 Thread Andriy Redko
Hi Chris, Thanks a lot for looking into it. From the CXF perspective, I have seen that many CXF builds have been aborted because the connection with master was lost (I don't have exact builds to point to, since we keep only the last 3), which could probably explain the hanging builds. Best Regards,

Re: broken builds taking up resources

2020-01-26 Thread Alexey Markevich
Hello, I checked PR [1] and found many build failures on H41 during 'git' command execution [2]. The last failure was an OOM [3]. Finally, the build succeeded on H29 [4]. 1. https://github.com/apache/cxf/pull/631 2. https://builds.apache.org/job/CXF-Trunk-PR/1389/console 3. https://builds.apache.

Re: broken builds taking up resources

2020-01-26 Thread Chris Lambertus
Another incident of CXF build junk sticking around on H40, although in that case, the machine appears to be hosed because of broken container jobs from 2019, with over 1100(!) processes identical to: containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/mob

Re: broken builds taking up resources

2020-01-26 Thread Chris Lambertus
Here is some data from H24, which also contains many broken CXF jobs (not Karaf) from Jan 22. The builds on H41 use karaf artifacts, but they were CXF builds, not karaf builds as previously noted. Copying dev@CXF since this build seems to be related to ongoing node problems. Additionally, ther

Re: broken builds taking up resources

2020-01-26 Thread Gavin McDonald
Hi, On Sun, Jan 26, 2020 at 8:48 AM Mike Jumper wrote: > It would be nice if Jenkins could be configured to recognize when a node is > unusable due to lack of resources and automatically take it offline. > That would be a feature request with the people that write the Jenkins software. Gav...

Re: broken builds taking up resources

2020-01-26 Thread Mike Jumper
It would be nice if Jenkins could be configured to recognize when a node is unusable due to lack of resources and automatically take it offline. - Mike On Fri, Jan 24, 2020, 13:57 Chris Thistlethwaite wrote: > Here is some data from H41, which was rebooted last night and ran out of > threads to

Re: broken builds taking up resources

2020-01-24 Thread Chris Thistlethwaite
Here is some data from H41, which was rebooted last night and ran out of threads today. https://paste.apache.org/lkmpq In this case it looks like Karaf was still stuck/broken even though there were no builds running on H41 at the time I investigated. -Chris T. #asfinfra On 1/24/20 4:26 AM,

Re: broken builds taking up resources

2020-01-24 Thread Mick Semb Wever
> Is there some way we can improve the visibility into disk usage on the > build nodes? How full they are? And what projects are taking up space? > Does jenkins provide this info? Or could infra dump a `du …` report > somewhere? There are two Jenkins plugins that help with this situation

Re: broken builds taking up resources

2020-01-23 Thread Joan Touzet
On 2020-01-23 4:50, Chesnay Schepler wrote: On 23/01/2020 10:19, Thomas Bouron wrote: On Thu, 23 Jan 2020 at 08:56, Robert Munteanu wrote: On Wed, 2020-01-22 at 17:53 -0800, Chris Lambertus wrote: Additionally, orphaned docker jobs are causing major resource contention. I will be adding a we

Re: broken builds taking up resources

2020-01-23 Thread Zoran Regvart
Hi Chris, On Thu, Jan 23, 2020 at 1:55 AM Chris Lambertus wrote: > I will be implementing a system to kill jenkins processes based on duration > of run. My initial feeling is to kill any single process which has been > running for longer than one hour real-time. Can you provide some details on

Re: broken builds taking up resources

2020-01-23 Thread Chesnay Schepler
On 23/01/2020 10:19, Thomas Bouron wrote: On Thu, 23 Jan 2020 at 08:56, Robert Munteanu wrote: On Wed, 2020-01-22 at 17:53 -0800, Chris Lambertus wrote: Additionally, orphaned docker jobs are causing major resource contention. I will be adding a weekly job to docker system prune --all && servi

Re: broken builds taking up resources

2020-01-23 Thread Thomas Bouron
On Thu, 23 Jan 2020 at 08:56, Robert Munteanu wrote: > On Wed, 2020-01-22 at 17:53 -0800, Chris Lambertus wrote: > > Additionally, orphaned docker jobs are causing major resource > > contention. I will be adding a weekly job to docker system prune --all > > && service docker restart. > > +1, it's

Re: broken builds taking up resources

2020-01-23 Thread Robert Munteanu
On Wed, 2020-01-22 at 17:53 -0800, Chris Lambertus wrote: > Additionally, orphaned docker jobs are causing major resource > contention. I will be adding a weekly job to docker system prune --all > && service docker restart. +1, it's easy to get this wrong. It would be great if you could also docume

Re: broken builds taking up resources

2020-01-22 Thread Mick Semb Wever
The Cassandra dtest builds take ~12 hours. The unit tests take over an hour. We are looking into parallelising these, but work hasn't started on that yet. We recently parallelised a number of the unit test builds, and added pipeline builds, and subsequently builds have been crashing with full disks. Ye

Re: broken builds taking up resources

2020-01-22 Thread Martin Stockhammer
Hi, our average build time for the main Archiva build job is about 1 hour on the Apache build servers. We have a timeout of 2h configured in our pipeline. So, one hour is too short for us, and we would appreciate it if you would consider increasing your kill timeout to a higher value. Regards M

Re: broken builds taking up resources

2020-01-22 Thread Josh Fischer
Hi, The Heron project has a build that will last for about 2 hours and 40 minutes on average. It is a single Jenkins job that spins up two different docker containers consecutively. We only run this job to generate artifacts for a release. You can see the job here: https://builds.apache.org/job

Re: broken builds taking up resources

2020-01-22 Thread Chris Lambertus
> On Jan 22, 2020, at 4:55 PM, Chris Lambertus wrote: > > Folks, > > Over the last week or so we have received many reports of broken builds due > to nodes out of resources. As noted in INFRA-19751, builds appear to fail yet > continue to run, using up all available resources on a build nod

broken builds taking up resources

2020-01-22 Thread Chris Lambertus
Folks, Over the last week or so we have received many reports of broken builds due to nodes out of resources. As noted in INFRA-19751, builds appear to fail yet continue to run, using up all available resources on a build node. I will be implementing a system to kill jenkins processes based on