Summary:

Many toolforge jobs just now failed and/or were interrupted. Everything is now either back online or coming rapidly back online.


Explanation:

In response to a completely unrelated cloud-vps infra issue (https://phabricator.wikimedia.org/T383583) I needed to migrate many VMs to new hypervisors. I used an updated openstack command for the migration, which turned out to unexpectedly reboot the migrated VMs as part of the migration.

Among those VMs rebooted was the toolforge NFS server. This caused the inevitable storm of disconnections, failed mounts, and filesystem lock-ups. This outage was prolonged by the fact that the reboots /also/ temporarily broke puppet which would have otherwise immediately brought the NFS server back online.

After staged restarts of various services everything began to recover. As usual, we are now rebooting tools nfs workers in order to reset any locked up nfs file handles.


Sorry for the outage! Please follow up here or online if you see any continuing issues. The outage itself is documented as https://phabricator.wikimedia.org/T383625

_______________________________________________
Cloud-announce mailing list -- cloud-annou...@lists.wikimedia.org
List information: 
https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: 
https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/

Reply via email to