On 19/04/19 20:20, Jaromil wrote:
> On Fri, 19 Apr 2019, Daniel Reurich wrote:
>
>> Hi,
>>
>> ci.devuan.org - our jenkins server is currently down. This is due to a
>> reboot failure after a kernel update that I installed.
>
> this intervention was not planned nor communicated; it also was on an
> old infrastructure to which we have no stable reach, because it is
> maintained by nextime and therefore needing extra coordination
> measures to ensure interventions.

Indeed it wasn't planned. There is more background information, which I
omitted for expediency in order to get the message to nextime so he
could attend to this outage; explaining all the details was not the
biggest priority.

I was working on that server because I had discovered that all source
build jobs would fail consistently after between 4 and 6 seconds with a
killed process. These jobs run on the master node, i.e. on this server,
as the jenkins user. I had discussed the issue with parazyd the day
before, but he could offer no answers as to the consistent build
failures across all the source jobs I'd tried. I had also discovered
during the process that su-ing to the jenkins user resulted in the
session being closed almost immediately. In both cases there was
nothing appearing in the logs to indicate OOM or other limitations were
being hit.

In order to rule out bugs, and based on info I was gleaning from the
jenkins forum, I began upgrading the OS first, and then jenkins and all
the jenkins plug-ins. All these upgrades went smoothly (and solved a
number of security vulnerabilities along the way).

> I am uncomfortable knowing that any one of the caretakers can act
> unilaterally on such issues, raising risks of emergency interventions
> which then affect everyone's schedule.

You may be uncomfortable, jaromil, but the fact of the matter is I
needed to rebuild the debian-installer package. Incidentally, the last
build that had been attempted on the CI was a couple of weeks ago, the
23rd of March I think. With KatolaZ gone, I'm the only other regular
package builder these days. Also, as far as I'm aware, I'm pretty much
the only person who has been hands-on with that server in any
meaningful way, particularly with respect to maintenance and support
for it.
Given that my particular domain within Devuan has been heavily oriented
toward the build system, I think it's reasonable that when it's broken
I don't need to wait for a full committee's approval to fix it,
particularly given that it was an urgent issue and essentially all
builds were broken.

> we do need to coordinate on these tasks and find periods in which
> everyone affected / responsible for the infrastructure bit is
> available.

In normal circumstances, yes, I agree that is reasonable. This wasn't
routine maintenance. This was problem solving where I'd spent many
hours over 2 days working on the issue before deciding a reboot was a
reasonable next move.

> I went a long way yesterday urging nextime to help, he is just packing
> today for a trip offline for the coming two weeks and the situation is
> very uncomfortable as works were scheduled and still pending also for
> the DNS administration access. He will do his best today to fix that
> so we can rotate the DNS on a new machine.

Thank you for this. I do appreciate your efforts and also nextime's.

> after that, we should take the occasion to rebuild the CI with better
> criteria, since the old setup was suboptimal. at dyne we (well, mostly
> parazyd) already set up two more building farms CIs (one for DECODE and
> one for maemo-leste) and have fixed a number of issues. Therefore I
> kindly ask parazyd and ralph and evilham for their availability
> setting up a new CI machine on the ganeti network, where parazyd can
> install and plan a new jenkins instance, which I understand won't cost
> him too much time since he has a well documented and replicable
> procedure for that now.

I agree, and I'm happy to work with whoever is interested in getting it
back up and running as soon as we can.

> meanwhile we can simply consider the CI unavailable for the period of
> Easter, which I hope you all manage to enjoy. we needed to fix this
> bit anyway so let's be constructive and do it without letting rush take
> over quality.

That's a reasonable suggestion. But I also have more time flexibility
over Easter than in my normal week. So if there is an opportunity to
restore service on the original server I'd be happy to do so. But I
definitely don't want to continue relying on infra where we can't have
full control.

Regards,
Daniel

--
Daniel Reurich
Centurion Computer Technology (2005) Ltd.
021 797 722
_______________________________________________
Dng mailing list
Dng@lists.dyne.org
https://mailinglists.dyne.org/cgi-bin/mailman/listinfo/dng