I have evacuated almost all the VMs on cloudvirt1024 and cloudvirt1018.
In theory these are all now up and running on different hardware.

The list of affected VMs can be found below in the context for this
message. Instances previously hosted on cloudvirt1024 should be up and
running and largely unaffected by the move.

Nearly every instance that was on cloudvirt1018 has suffered some degree
of disk corruption. For the most part I've repaired them enough to
allow logins, but I recommend that you check them extensively before
relying on data integrity there. In some cases you may find part or all
of your misplaced files in /lost+found.

Two VMs are still mid-copy, due to being enormous... I'm going to leave
them to continue the evacuation overnight. They are
'mwoffliner5.mwoffliner.eqiad.wmflabs' and
'pub2.wikiapiary.eqiad.wmflabs'. They may come up during the night but
I recommend against logging into them or restarting them until I've had
a chance to run disk repair on them in the morning.

If you have specific issues with VMs from cloudvirt1018, feel free to
seek help or advice in #wikimedia-cloud on IRC -- I expect Arturo will
appear there in a few hours.

-Andrew

On 2/13/19 1:50 PM, Andrew Bogott wrote:
Now cloudvirt1024 is dying in earnest, so VMs hosted there will be
down for a while as well. This is, as far as anyone can tell, just a
stupid coincidence.

So far it appears that we are going to be able to rescue /most/ things
without significant data loss. For now, though, there's going to be
plenty more downtime.

VMs on cloudvirt1024 are:
| 8113d2c5-6788-43f6-beeb-123b0b717af3 | drmf-beta | math |
| 169b3260-4f7e-43dc-94c2-e699308a3426 | ecmabot | webperf |
| 29e875e3-15d5-4f74-9716-c0025c2ea098 | encoding02 | video |
| 1b2b8b50-d463-4b7f-a3a9-6363eeb3ca8b | encoding03 | video |
| 5421f938-7a11-499c-bc6a-534da1f4e27d | hafnium | rcm |
| 041d42b9-df36-4176-9f5d-a508989bbebc | hound-app-01 | hound |
| 6149375b-8a08-4f03-882a-6fc0f5f77499 | integration-slave-docker-1044 | integration |
| 4d64b032-d93a-4a8c-a7e5-569c17e5063f | integration-slave-docker-1046 | integration |
| ad48959a-9eb9-46a9-bec4-a2bf23cdf655 | integration-slave-docker-1047 | integration |
| 21644632-0972-448f-83d0-b76f9d1d28e0 | ldfclient-new | wikidata-query |
| c2a30fe0-2c87-4b01-be53-8e2a3d0f40a7 | math-docker | math |
| df8f17fb-03fe-4725-b9cf-3d9fe76f4654 | mediawiki2latex | collection-alt-renderer |
| d73f36e6-7534-4910-9a6e-64a6b9088d1e | neon | rcm |
| 2d035965-ba53-41b3-b6ef-d2ebbe50656a | novaadminmadethis | quotatest |
| c84f61c0-4fd2-47a5-b6ab-dd6b5ea98d41 | ores-puppetmaster-01 | ores |
| 585bb328-8078-4437-b076-9e555683e27d | ores-sentinel-01 | ores |
| 0538bfed-d7b5-4751-9431-8feecbaf78c0 | oxygen | rcm |
| e8090d9e-7529-46a9-b1e1-c4ba523a2898 | packaging | thumbor |
| c7fe4663-7f2b-4d23-a79b-1a2e01c80d93 | twlight-prod | twl |
| 2370b38f-7a65-4ccf-a635-7a2fa5e12b3e | twlight-staging | twl |
| 464577c6-86f0-42f9-9c49-86f9ec9a0210 | twlight-tracker | twl |
| 5325322d-a57e-4a9b-85b7-37643f03bfea | wikidata-misc | wikidata-dev |

On 2/13/19 11:23 AM, Andrew Bogott wrote:
Here's the latest:

cloudvirt1018 is up and running, and many of its VMs are fine. Many
other VMs are corrupted and won't start up. Some of those VMs will
probably be lost for good, but we're still investigating rescue options.

In the meantime, if your VM is up and you can access it then you're
in luck! If not, stay tuned.

-Andrew

On 2/13/19 9:15 AM, Andrew Bogott wrote:
I spoke too soon -- we're still working on this. Most of these VMs
will remain down in the meantime.

Sorry for the outage!

On 2/13/19 8:21 AM, Andrew Bogott wrote:
We don't fully understand what happened, but after Giovanni
performed a classic "turning it off and on again" things are now
running without warnings. The VMs listed below are now coming back
online and everything should be back up shortly.

We'll probably replace some of this hardware anyway, out of an
abundance of caution, but that's unlikely to produce further
downtime. With luck, this is the last you'll hear about this.

-Andrew

On 2/13/19 7:25 AM, Andrew Bogott wrote:
We're currently experiencing a mysterious hardware failure in our
datacenter -- three different SSDs failed overnight, two of them
in cloudvirt1018 and one of them in cloudvirt1024. The VMs on
1018 are down entirely. We may move those on 1024 to another host
shortly in order to guard against additional drive failure.

There's some possibility that we will experience permanent data
loss on cloudvirt1018, but everyone is working hard to avoid this.

The following VMs are on cloudvirt1018:
a11y | reading-web-staging
abogott-scapserver | testlabs
af-puppetdb01 | automation-framework
api | openocr
asdf | quotatest
bastion-eqiad1-02 | bastion
clm-test-01 | community-labs-monitoring
compiler1002 | puppet-diffs
cyberbot-exec-iabot-01 | cyberbot
deployment-db03 | deployment-prep
deployment-db04 | deployment-prep
deployment-memc05 | deployment-prep
deployment-pdfrender02 | deployment-prep
deployment-sca01 | deployment-prep
design-lsg3 | design
eventmetrics-dev01 | eventmetrics
fridolin | catgraph
gtirloni-puppetmaster-01 | testlabs
hadoop-master-3 | analytics
ign | ign2commons
integration-castor03 | integration
integration-slave-docker-1017 | integration
integration-slave-docker-1033 | integration
integration-slave-docker-1038 | integration
integration-slave-jessie-1003 | integration
integration-slave-jessie-android | integration
k8s-master-01 | general-k8s
k8s-node-03 | general-k8s
k8s-node-05 | general-k8s
k8s-node-06 | general-k8s
kdc | analytics
labstash-jessie1 | logging
language-mleb-legacy | language
login-test | catgraph
lsg-01 | design
mathosphere | math
mc-clusterA-1 | test-twemproxy
mwoffliner5 | mwoffliner
novaadminmadethis-4 | quotatest
ntp-01 | cloudinfra
ntp-02 | cloudinfra
ogvjs-testing | ogvjs-integration
phragile-pro | phragile
planet-hotdog | planet
pub2 | wikiapiary
puppenmeister | planet
puppet-compiler-v4-other | testlabs
puppet-compiler-v4-tools | testlabs
quarry-beta-01 | quarry
signwriting-swis | signwriting
signwriting-swserver | signwriting
social-tools3 | social-tools
striker-deploy04 | striker
striker-puppet01 | striker
t166878 | otrs
togetherjs | visualeditor
tools-sgebastion-06 | tools
tools-sgeexec-0902 | tools
tools-sgeexec-0903 | tools
tools-sgewebgrid-generic-0901 | tools
tools-sgewebgrid-lighttpd-0901 | tools
ve-font | design
wikibase1 | sciencesource
wikicitevis-prod | wikicitevis
wikifarm | pluggableauth
women-in-red | globaleducation
_______________________________________________
Wikimedia Cloud Services announce mailing list
cloud-annou...@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce