Hi Chris!

I think your todo list looks accurate.

On the question of cron jobs, here are the answers as we understand them
upstream:

What happens if the user runs two cron jobs?

Answer 0: probably nothing. "certbot renew" is designed to be run as
often as you like, and is normally a no-op.

Answer 1: with some small probability, the user might have two "certbot
renew" commands executing at the same time. In that case, it would be
fairly common for one or both of them to fail with an error, which
would produce cron email. The baseline probability of this collision is
about one in 5,000 cert renewals if the hour in the user's cron job is
uniform-random, and one in 200 cert renewals if they picked the same
two hours (noon and midnight) that are baked into Debian's cron job.
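
For concreteness, here is the back-of-envelope arithmetic behind those
two numbers. The ~9-second "busy window" per renew run is my own rough
guess, and the uniform-random start-second model is an assumption, but
with those inputs the result lands close to the 1-in-200 and 1-in-5,000
figures:

    # Rough model for the Answer 1 numbers.  Both jobs fire at a
    # uniform-random second within a one-hour slot; two runs collide if
    # their start times are closer together than the time a run stays busy.
    HOUR = 3600   # seconds in the randomised slot
    BUSY = 9      # assumed seconds a "certbot renew" run stays busy (my guess)

    # Both jobs in the same hour slot (the Debian noon/midnight case):
    p_same_slot = 2 * BUSY / HOUR        # = 1/200
    # If the second job's hour is uniform-random over the day, the slots
    # only coincide about 1 time in 24:
    p_random_hour = p_same_slot / 24     # ~ 1/4800, i.e. roughly 1/5,000

    print(f"same hour slot:      1 in {1 / p_same_slot:,.0f}")
    print(f"uniform-random hour: 1 in {1 / p_random_hour:,.0f}")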

Answer 2: with some much smaller probability, two overlapping "certbot
renew" commands could experience a race condition in writing cert
lineage files in /etc/letsencrypt/archive, or symlinks in
/etc/letsencrypt/live. These would cause a configuration problem (certs
and privkeys don't match) about 75-80% of the time. I just measured
this race condition window on an AWS tiny instance, and estimate that a
cert-writing race might happen about once every 36,000 cert renewals if
the cron job hours line up, or once every 864,000 renewals if users
have cron jobs at uniform-random hours.
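
To see why an overlapping write ends up mismatched so often, here is a
toy enumeration. The four-symlink update sequence and the uniform
interleaving are my assumptions (this is not Certbot's actual code
path), so the ~60% it prints is only illustrative; the 75-80% above is
the measured figure:

    from itertools import combinations

    # the four live/ symlinks each renew run refreshes, in order
    LINKS = ["cert.pem", "chain.pem", "fullchain.pem", "privkey.pem"]
    N = len(LINKS)

    def last_writers(a_slots):
        """For each symlink, report which run (A or B) updated it last,
        given the time slots run A's updates occupy; run B gets the
        remaining slots, in order."""
        b_slots = sorted(set(range(2 * N)) - set(a_slots))
        return ["A" if a > b else "B" for a, b in zip(sorted(a_slots), b_slots)]

    interleavings = list(combinations(range(2 * N), N))
    mixed = sum(1 for slots in interleavings
                if len(set(last_writers(slots))) > 1)
    print(f"{mixed}/{len(interleavings)} interleavings leave the symlinks "
          f"pointing at a mix of the two runs ({mixed / len(interleavings):.0%})")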

It is, however, possible that there are other race conditions in some
of our plugins (apache, nginx) that are more likely to occur.

We have a few mitigation options:

Mitigation 0: write a patch to add locking to Certbot 0.10.2 / 0.10.3.
This would add a new dependency on python-filelock, and we'd have to
make a choice about how much field testing we want for this patch before
SRUing it.
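
To make that concrete, here is roughly the shape the locking would
take, assuming the python-filelock API (filelock.FileLock). This is a
sketch of the approach, not the actual patch, and the lock path is
illustrative:

    from filelock import FileLock, Timeout

    LOCK_PATH = "/tmp/certbot-demo.lock"   # illustrative; not Certbot's real path

    def run_renewal():
        # stand-in for the existing renew logic
        print("renewing certificates...")

    def locked_renew():
        lock = FileLock(LOCK_PATH)
        try:
            with lock.acquire(timeout=10):   # wait up to 10 s for a concurrent run
                run_renewal()
        except Timeout:
            # a second overlapping run exits quietly instead of racing on
            # /etc/letsencrypt/archive and /etc/letsencrypt/live
            print("another certbot instance holds the lock; skipping this run")

    if __name__ == "__main__":
        locked_renew()

A second overlapping run then skips cleanly, which should also avoid
the spurious cron email described in Answer 1.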

Mitigation 1: change the cron job, which picks random times in the
hours after noon and midnight, to a systemd timer that runs at two
uniform-random hours, or a cron job with two hours that are less likely
to be chosen by sysadmins. We can probably use LE server-side data to
pick the two least common hours.
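
For the cron variant, the change mostly amounts to picking the hours
once at install time instead of hard-coding noon and midnight. A
hypothetical sketch of what a postinst-style helper could generate (the
helper and the exact line format are mine):

    import random

    def random_cron_line():
        # pick one uniform-random hour in the first half of the day and
        # mirror it 12 hours later, keeping the twice-daily schedule
        hour = random.randrange(12)
        minute = random.randrange(60)
        return f"{minute} {hour},{hour + 12} * * * root certbot renew"

    print(random_cron_line())   # e.g. "37 5,17 * * * root certbot renew"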

Mitigation 2: study the plugin code to ensure that problematic race
conditions are really as rare as we think. We could probably tolerate a
temporary failure risk of one in a million cert renewals on the subset
of systems that have two cron processes and whose admin ignored the
notice about it -- hard disks fail faster than that.

I think the upstream team favours mitigation 0 :)
