Summary:
Many toolforge jobs just now failed and/or were interrupted. Everything
is now either back online or coming rapidly back online.
Explanation:
In response to a completely unrelated cloud-vps infra issue
(https://phabricator.wikimedia.org/T383583) I needed to migrate many VMs
to new hypervisors. I used an updated openstack command for the
migration, which turned out to unexpectedly reboot the migrated VMs as
part of the migration.
Among those VMs rebooted was the toolforge NFS server. This caused the
inevitable storm of disconnections, failed mounts, and filesystem
lock-ups. This outage was prolonged by the fact that the reboots /also/
temporarily broke puppet which would have otherwise immediately brought
the NFS server back online.
After staged restarts of various services everything began to recover.
As usual, we are now rebooting tools nfs workers in order to reset any
locked up nfs file handles.
Sorry for the outage! Please follow up here or online if you see any
continuing issues. The outage itself is documented as
https://phabricator.wikimedia.org/T383625
_______________________________________________
Cloud-announce mailing list -- cloud-annou...@lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/