[Cloud] [Cloud-announce] toolforge outage just now

Andrew Bogott Mon, 13 Jan 2025 13:05:54 -0800

Summary:

Many toolforge jobs just now failed and/or were interrupted. Everything is now either back online or coming rapidly back online.



Explanation:

In response to a completely unrelated cloud-vps infra issue (https://phabricator.wikimedia.org/T383583) I needed to migrate many VMs to new hypervisors. I used an updated openstack command for the migration, which turned out to unexpectedly reboot the migrated VMs as part of the migration.

Among those VMs rebooted was the toolforge NFS server. This caused the inevitable storm of disconnections, failed mounts, and filesystem lock-ups. This outage was prolonged by the fact that the reboots /also/ temporarily broke puppet which would have otherwise immediately brought the NFS server back online.

After staged restarts of various services everything began to recover. As usual, we are now rebooting tools nfs workers in order to reset any locked up nfs file handles.

Sorry for the outage! Please follow up here or online if you see any continuing issues. The outage itself is documented as https://phabricator.wikimedia.org/T383625


_______________________________________________
Cloud-announce mailing list -- [email protected]
List information: 
https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
_______________________________________________
Cloud mailing list -- [email protected]
List information: 
https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
