This is a very well-written SRU. Thank you! I have three review
points:

1. Since this involves impact to a specific Internet service, please
document sign-off from the operators of api.snapcraft.io that they are
happy with this change. I expect it may triple service load, and
specifically at the times when the service is already struggling under
load. Plucky isn't released yet, so they're likely only going to notice
a significant difference in production (if there is any) when this
SRU lands :)

2. I think there's an additional regression risk here: users for whom it
already _always_ fails will now take even longer to fail. This might
affect automated air-gapped deployments, for example, where
api.snapcraft.io might currently time out once, but will now have to
time out three times, plus six seconds of backoff. Such an environment
*should* explicitly and immediately reject the connection, but in
practice firewalls are often not configured to do that. Could this tip
such an automated deployment over the edge if it itself has a timeout
for completion? How long is the connection timeout? If short, then this
probably isn't significant; if long (e.g. minutes), then it could be. It
isn't a big deal for someone affected to fix this by extending their own
timeout or (better) by not triggering lxd when it is bound to fail
anyway, but it might be infuriating to deal with in an area that is
already frustrating to some of our users, where the expectation is that
an SRU won't regress things further.
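
To put rough numbers on that worst case (a sketch only: the three
attempts and the six seconds of backoff come from the description above,
but the two connect-timeout values are purely illustrative assumptions,
not measured from the actual client):

```shell
attempts=3
backoff=3   # seconds between attempts; two gaps give the "plus six seconds"

# Try a short and a long hypothetical TCP connect timeout (seconds).
for connect_timeout in 10 130; do
  total=$(( attempts * connect_timeout + (attempts - 1) * backoff ))
  echo "connect timeout ${connect_timeout}s -> worst-case failure after ${total}s"
done
```

With a short connect timeout the regression is half a minute or so; with
a timeout in the minutes range it stretches towards seven minutes, which
is the kind of delta that could plausibly trip an outer deployment
timeout.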

3. Test Plan: could you perhaps simulate this problem with
api.snapcraft.io? For example, redirect it in /etc/hosts and then use
`nc -l -p 443 </dev/null` or similar? It's not exactly the same, but if
it's easy and good enough, that would be better than a test that isn't
certain to exercise the retry path, should the real service happen to
be working at the time of the test.
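
Concretely, the simulation could look something like this (a sketch
only: it uses traditional netcat's `-l -p` syntax as in the suggestion
above — OpenBSD netcat would be `nc -l 443` — and assumes any lxd
invocation triggers the shim's store access; adjust as needed):

```shell
# Point api.snapcraft.io at localhost, keeping a backup of /etc/hosts.
sudo cp /etc/hosts /etc/hosts.bak
echo '127.0.0.1 api.snapcraft.io' | sudo tee -a /etc/hosts

# Listen on 443 but never send anything, so the client hangs in the TLS
# handshake and the retry/timeout path is exercised deterministically.
sudo sh -c 'nc -l -p 443 </dev/null' &
listener=$!

# Run the shim and time how long it takes to give up.
time lxd version || true

# Clean up.
sudo kill "$listener"
sudo mv /etc/hosts.bak /etc/hosts
```

One caveat: traditional nc exits after the first connection, so later
retries would see connection refused rather than a hang; wrapping the
nc in a loop would keep the hang behaviour for every attempt.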

Of these, 2 gives me reason for hesitation, since this is the kind of
change that has affected me in the real world before. I'm on the fence
as to whether it needs mitigating or not. Please could you consider this
scenario, maybe try to measure the impact, and report your thoughts?

1 and 3 are OK to be resolved before release to -updates rather than
blocking now.

Apart from that, I've reviewed the current uploads in Noble and
Oracular and everything else looks fine; once the above is resolved I'd
be happy to accept from the queues without re-review. I see bug tasks
open for Focal and Jammy, but no uploads for them, so those are not
reviewed.


** Changed in: lxd-installer (Ubuntu Noble)
       Status: New => Incomplete

** Changed in: lxd-installer (Ubuntu Oracular)
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2100564

Title:
  lxd-installer shim fails to install with snapstore error

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lxd-installer/+bug/2100564/+subscriptions

