I guess my only concern is the infinite loop we might get ourselves
into, both when there is an ongoing 503 outage on the metadata server
and when some bug we might have introduced in this change keeps us
retrying forever.

There is also the fact that we could be hammering a busy server every
second (if there was no retry parameter in the 503 response), for the
entire time it is returning an "I'm busy, sorry, try again later, oh,
back so soon?". I'm imagining a fleet of VMs booting up and doing that
to a server that is already suffering. I don't have an xkcd link, but
the thought "beating a dead horse" came to mind.

I do agree a 503 (or even a 502, which we are not handling here AFAIK)
is different from a 500, and I admit I sometimes mixed the two up in my
reasoning.
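
Just to make sure we mean the same thing, the split I have in mind is
roughly this (my own toy classification, not what the change encodes):

# 503 (and arguably 502) usually mean "temporarily unavailable, try
# again shortly", while a generic 500 is not something a retry is
# likely to fix.
RETRYABLE_STATUSES = {502, 503}   # transient availability/gateway issues
NON_RETRYABLE_STATUSES = {500}    # generic server error; retrying rarely helps

def should_retry(status: int) -> bool:
    return status in RETRYABLE_STATUSES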

Did you have any specific reason not to implement an escape hatch for
this infinite loop? Is it extra risk? Was it deemed unlikely, on the
assumption that clouds would "never" have the metadata service return
503 for too long? Or did you consider an instance stuck in a retry loop
better than one that has booted, but that nobody can likely log into?
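
By "escape hatch" I mean something along these lines (a hypothetical
sketch; the names and the 10-minute cap are mine, not anything taken
from the change): keep retrying 503s, but stop after an overall
deadline so the instance is not stuck in the loop forever.

import time

class TransientMetadataError(Exception):
    """Stand-in for 'the metadata server answered 503'."""

MAX_WAIT_SECONDS = 600  # illustrative cap, not a value from the change

def retry_until_deadline(fetch, max_wait=MAX_WAIT_SECONDS):
    start = time.monotonic()
    while True:
        try:
            return fetch()
        except TransientMetadataError:
            if time.monotonic() - start > max_wait:
                # The escape hatch: give up and let boot continue
                # (degraded) rather than loop forever against a dead
                # metadata server.
                raise
            time.sleep(1)

Whether a degraded boot is actually preferable to a stuck one is, of
course, exactly the trade-off I'm asking about above.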

I appreciate the extra logging. In fact, I thought such logging (when
the IMDS server returns 503) would already be very visible in the
console even without this update, so you would be able to check the
console logs for why you can't SSH in and see that the metadata server
failed *once* and cloud-init gave up.

https://bugs.launchpad.net/bugs/2094858

Title:
  Cloud-init fails on AWS if IMDSv2 returns a 503 error.
