I guess my only concern is the infinite loop we might get ourselves into, both when there is an ongoing 503 outage on the metadata server and when some bug we might have introduced in this change keeps us retrying.
There is also the fact that we could be hammering an already busy server every second (if there is no retry parameter in the 503 response), for the entire time it keeps answering "I'm busy, sorry, try again later; oh, back so soon?". I'm imagining a fleet of VMs booting up and doing that to a server that is already suffering. I don't have an xkcd link, but the phrase "beating a dead horse" came to mind.

I do agree that a 503 (or even a 502, which we are not treating here AFAIK) is different from a 500, and I admit I mixed the two at times in my reasoning.

Did you have any specific reason not to implement an escape hatch for this infinite loop? Is it extra risk? Was it deemed unlikely, on the assumption that clouds would "never" have the metadata service return 503 for too long? Or did you consider an instance stuck in a retry loop better than one that has booted, but that nobody can likely log into?

I appreciate the extra logging. In fact, I thought such logging (when the IMDS server returns 503) would already be quite visible on the console even without this update, so you would be able to check the console logs for why you can't ssh in and see that the metadata server failed *once* and cloud-init gave up.
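Just to make concrete what I mean by an escape hatch, here is a rough sketch, not cloud-init's actual url_helper code: a bounded retry loop that honors Retry-After when the 503 provides one, backs off exponentially when it does not, and eventually gives up loudly instead of looping forever. The function name, the endpoint URL, the overall deadline and the sleep cap are all made-up values for illustration.

    import time
    import requests  # illustrative only; cloud-init has its own url_helper

    def fetch_with_escape_hatch(url, total_deadline=300, max_sleep=30):
        """Retry on 502/503, but give up after `total_deadline` seconds."""
        start = time.monotonic()
        attempt = 0
        while True:
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                resp = None
            if resp is not None and resp.status_code not in (502, 503):
                # Success or a non-retryable status; let the caller decide.
                return resp

            # Honor Retry-After if the server sent one (seconds form only
            # in this sketch), otherwise back off exponentially.
            retry_after = resp.headers.get("Retry-After") if resp is not None else None
            if retry_after and retry_after.isdigit():
                sleep_for = min(int(retry_after), max_sleep)
            else:
                sleep_for = min(2 ** attempt, max_sleep)

            elapsed = time.monotonic() - start
            if elapsed + sleep_for > total_deadline:
                raise TimeoutError(
                    "metadata service still unavailable after "
                    f"{elapsed:.0f}s; giving up"
                )
            time.sleep(sleep_for)
            attempt += 1

With something along these lines, an instance that never gets metadata would eventually surface a clear failure on the console instead of retrying silently forever, which is exactly the trade-off I was asking about above.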