We are now operating at full capacity again. It also turns out we have 12
workers rather than 11, so anywhere I said 33, read 36 :)
Some items remain TBD, but the rest is done and got us back on our feet
again:

On Wed, Jan 12, 2022 at 05:48:04PM +0100, Julian Andres Klode wrote:
>
> # Pending work
>
> - Move /var/snap/lxd/common out of /srv (where the lxd storage pool lives);
>   this will likely require slightly increasing the '/' disk size.
>
> - Investigate further where the 30s timeout in lxd comes from and how
>   to prevent that (or just ignore it, but see next item)

2x TBD

>
> - Investigate where the stuck instances came from and why they were not
>   cleaned up. Is it possible for us to check which instances should be
>   running and then remove all other ones from the workers? Right now
>   we just do a basic time check

There were no errors logged. I saw mentions of exit code -15, but nothing
concrete. However, we now have a new cleanup that only keeps as many
containers as a worker needs and deletes everything else older than one
hour (a rough sketch of that policy follows at the end of this mail).

>
> - The node lxd-armhf10 needs to finish its redeployment once the
>   lxd images exist again
>
> - The node lxd-armhf9 needs to be redeployed to solve the disk I/O
>   issue
>
> - Both lxd-armhf10 and lxd-armhf9 will have to be re-enabled with
>   the new IPs in the mojo service bundle

Those 3 redeployments have happened.

>
> - We should really redeploy all the lxd workers to have clean workers
>   again

TBD; we still need to figure out partitioning for /var/snap/lxd/common,
but it does not seem urgent right now.

--
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en
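P.S. For illustration, here is a minimal Python sketch of the cleanup
policy described above. This is not the deployed autopkgtest-cloud code;
the NEEDED constant, the use of the lxc CLI, and the field handling are
assumptions made for the example.

#!/usr/bin/env python3
# Illustrative sketch only: keep the newest NEEDED containers on a worker,
# delete any surplus container that is older than one hour.
import json
import subprocess
from datetime import datetime, timedelta

NEEDED = 3                      # assumed number of containers a worker needs
MAX_AGE = timedelta(hours=1)    # surplus containers older than this are removed

def lxc_instances():
    # List all instances on this worker as JSON.
    out = subprocess.check_output(["lxc", "list", "--format", "json"])
    return json.loads(out)

def created(inst):
    # lxd reports timestamps like "2022-01-12T17:48:04.123456789Z";
    # trim to whole seconds before parsing.
    return datetime.strptime(inst["created_at"][:19], "%Y-%m-%dT%H:%M:%S")

def cleanup():
    instances = sorted(lxc_instances(), key=created, reverse=True)
    now = datetime.utcnow()
    # Keep the NEEDED newest instances; delete older surplus past MAX_AGE.
    for inst in instances[NEEDED:]:
        if now - created(inst) > MAX_AGE:
            subprocess.run(["lxc", "delete", "--force", inst["name"]],
                           check=False)

if __name__ == "__main__":
    cleanup()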