Re: [Acme] Practical concerns of draft-ietf-acme-ari

Matthew Holt Fri, 21 Jul 2023 14:00:18 -0700

Hi all,

Thank you for the constructive discussion -- I'm glad others are seeing
this. 😅


Aaron, thank you especially for the thoughtful reply and engaging in the
discussion.

Replying inline:

> I'm confused by the statement that "with ARI the window is reduced to just
> a few minutes, hours, or days". ... I'd love to hear suggestions for ways
> that the server could suggest a renewal time that doesn't run up against
> this push/pull between wanting to smooth traffic without making clients
> nervous.
>

Right; sorry; to clarify, this refers to the implicit hard deadline of
certificate expiration.

I simply do not think there is a way to offer a wider renewal window than
the full lifetime of the certificate by offering a narrower renewal window.
I know that sounds silly, but since "backoff and retry" is the One Way to
reliably get a certificate in case of a problem, more time means more
chances for success.
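
To put rough numbers on that, here is a minimal sketch that counts how many
attempts fit before a deadline under capped exponential backoff. The initial
delay, the cap, and the function name are assumptions picked for
illustration; they do not come from the draft or from any particular client.

```python
from datetime import timedelta

def attempts_before_deadline(window,
                             first_delay=timedelta(minutes=1),
                             max_delay=timedelta(hours=6)):
    """Count renewal attempts that fit in `window` with capped backoff.

    The first attempt happens at the start of the window. Each later
    attempt waits twice as long as the previous one, up to `max_delay`.
    Both default delays are illustrative assumptions.
    """
    attempts = 0
    elapsed = timedelta(0)
    delay = first_delay
    while elapsed <= window:
        attempts += 1                      # try to renew now
        elapsed += delay                   # wait before the next try
        delay = min(delay * 2, max_delay)  # double the wait, up to the cap
    return attempts

if __name__ == "__main__":
    for days in (60, 2):
        print(days, "days:", attempts_before_deadline(timedelta(days=days)))
```

Under these assumed delays, a window of a couple of days leaves an order of
magnitude fewer attempts than a 60-day window.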

(The ACME client not choosing to use the ARI window does not mean the
client doesn't trust the server; it really means it doesn't trust other
clients.)

And I'll reiterate: if we know a certificate is going to be revoked, we
might as well stop trusting it. (There are very weird implications here,
especially if ARI remains unauthenticated, with regard to cert trust,
uptime, and monitors/scanners.) (Yes, I know it might not be revocation; but
now there's a field to link to a reason/explanation, so we can infer
something from that.)

> Frankly, Let's Encrypt is even considering bigger carrots, such as "your
> subscriber account can only get short-lived certs if we've seen it request
> ARI endpoints", or "your renewal requests bypass all rate limits if they're
> made within the ARI suggested window"
>

I think I'd be OK with those carrots. They don't forbid an early renewal if
a client has a reason (e.g. server migration), but they do offer stronger
guarantees of renewal within the window. This seems like a really good
incentive as long as the client implementation is possible without being
noisy and complex.

> This is not true. Explicitly, by the spec, the renewal window changing
> means nothing. The situations you list are the motivations for writing the
> spec in the first place, but they are not the only motivations for changing
> the window in any given case.
>

Ok, but *in practice* it is true.

> In fact, Let's Encrypt is currently considering adding random jitter to the
> renewal window every time it is requested, specifically to prevent
> interpretations like this, and to naturally even-out renewal spikes through
> Brownian motion.
>

I've already accounted for this in plans for my own clients by only looking
for changes of the window that exceed a threshold. We can tolerate some
jitter. So a little jitter will not cause an immediate renewal, but a
significant change will.
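
A minimal sketch of that threshold check, assuming a window is represented
as a (start, end) pair of timezone-aware datetimes. The 12-hour threshold
and both function names are hypothetical; a real client would tune the
threshold to however much jitter the CA actually applies.

```python
from datetime import datetime, timedelta, timezone

def window_shift(old_window, new_window):
    """Return the larger absolute shift of the two window edges."""
    return max(abs(new_window[0] - old_window[0]),
               abs(new_window[1] - old_window[1]))

def is_significant_change(old_window, new_window,
                          threshold=timedelta(hours=12)):
    """True if the window moved by more than `threshold` (an assumed value)."""
    return window_shift(old_window, new_window) > threshold

if __name__ == "__main__":
    utc = timezone.utc
    old = (datetime(2023, 9, 1, tzinfo=utc), datetime(2023, 9, 3, tzinfo=utc))
    # A few minutes of jitter on each edge: tolerated.
    jittered = (old[0] + timedelta(minutes=20), old[1] - timedelta(minutes=5))
    # The whole window pulled forward by weeks: treated as a real signal.
    moved = (old[0] - timedelta(days=30), old[1] - timedelta(days=30))
    print(is_significant_change(old, jittered))
    print(is_significant_change(old, moved))
```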

> This assumes that the adjusted window will always be later in the lifetime
> of the certificate than before. There is no reason to make this assumption.
> A CA adjusting suggested windows in order to smooth out a load spike would
> be wise to shift 50% of renewal windows *earlier*.
>

Good point, I should probably clarify that I'm talking about those 50% of
windows that are adjusted to be later.

> Happy to be proven wrong.
>

No, you're right -- rate limits are not strictly a spec concern. However,
maybe it would be fitting for a "considerations" section to recommend that
ACME servers make the ARI window as close to 100% reliable as possible, so
that clients will be incentivized to honor it and keep the spec relevant.

> What do you mean by "enforced"? Deny newOrder requests that appear to be
> renewals but fall outside the suggested window?
>

Yep. Or maybe finalizeOrder requests. Not saying I like this idea. But it's
one way to avoid the "optional doesn't get implemented" problem.

> Personally, my ideal would be to say "the ARI url is the Certificate URL
> concatenated with /ari". Unfortunately we can't do that, because there's
> nothing to prevent the URL provided by an Order from having query
> parameters, in which case appending a new path component would be
> incorrect. ... This leads to the question of: what should we use to
> uniquely identify the certificate instead?
>

Is the concern with appending '/ari' that it will result in an invalid URL
(e.g. ...?a=b/ari)? The URL could be parsed and reconstructed. (Caveats
noted: URL parsing is hard, but most languages do it anyway.)
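
For illustration, a minimal sketch of "parse and reconstruct" using Python's
standard library. The host, path, and helper name are made up. It only
shows the mechanics and does not address whether the query string is part
of the certificate's identity.

```python
from urllib.parse import urlsplit, urlunsplit

def append_path_segment(url, segment):
    """Append `segment` to the URL's path, leaving any query string in place."""
    parts = urlsplit(url)
    new_path = parts.path.rstrip("/") + "/" + segment
    return urlunsplit(parts._replace(path=new_path))

if __name__ == "__main__":
    cert_url = "https://ca.example/acme/cert/abc?a=b"  # hypothetical URL
    print(cert_url + "/ari")                       # naive: lands in the query
    print(append_path_segment(cert_url, "ari"))    # path extended, query kept
```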

Or is this infeasible because we need to make the assumption that the cert
is uniquely identified by the path alone and not a query string? I guess in
my mind this seems like the more sensible blocker as I'm not sure what to
do about that.

That said, I too like the idea of using the cert URL. So instead of
manipulating the certificate URL, what if the renewalInfo URL given in the
directory is "the one" for ARI, and the cert URL was specified as a
URL-encoded query param? (When I first started reading the ARI spec, that's
more what I was expecting.) I propose the QS because even though we could
put the cert URL as part of the path in an escaped form, some web servers
may not parse this properly (e.g. %2F is troublesome in particular because
URL normalization is often a necessary security measure and this can be
indistinguishable from / after normalizing, unless done carefully; see
https://github.com/caddyserver/caddy/pull/4948).
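
A rough sketch of the query-string idea, assuming the directory's
renewalInfo URL has no query string of its own. The endpoint path and the
"cert" parameter name are hypothetical; nothing in the draft defines them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical values: neither the endpoint path nor the "cert" parameter
# name comes from the draft.
RENEWAL_INFO_URL = "https://ca.example/acme/renewal-info"

def renewal_info_url(cert_url, base=RENEWAL_INFO_URL):
    """Build the ARI URL by passing the certificate URL as a query parameter."""
    # urlencode percent-encodes "/", "?", "=" and "&" inside the value, so a
    # certificate URL that carries its own query string survives intact.
    return base + "?" + urlencode({"cert": cert_url})

def cert_url_from_request(url):
    """Server side: recover the original certificate URL from the query."""
    return parse_qs(urlsplit(url).query)["cert"][0]

if __name__ == "__main__":
    cert_url = "https://ca.example/acme/cert/abc?a=b"
    ari_url = renewal_info_url(cert_url)
    print(ari_url)
    print(cert_url_from_request(ari_url) == cert_url)
```

The escaped slashes end up in the query component rather than the path, so
path normalization on the server never has to interpret them.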

> I don't understand how this approach (A) helps solve the issues you
> identified above. In order to get up-to-date information, the same number
> of requests still need to be made, it's just that now they're newOrder
> requests instead of renewalInfo requests.
>

Well, for one, this approach can greatly reduce complexity in clients,
since it uses existing renewal flows. No need to go out-of-band to schedule
renewals. It's basically just part of the existing timers/loops/schedulers.
You simply adjust your timer/sleep/whatever, instead of needing to start
all new ones and synchronize ARI routines with renewal routines.
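
As a sketch of "adjust your timer": the existing renewal loop only needs to
turn a Retry-After value into its next wake-up time. The function name is
made up, and the handling of a date in the past (renew now) is my own
assumption about what a client would want.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def next_attempt_time(retry_after, now):
    """Turn a Retry-After header value into the next renewal attempt time.

    Retry-After is either a number of seconds or an HTTP-date. A date in
    the past is clamped to `now`, meaning "attempt immediately".
    """
    value = retry_after.strip()
    if value.isdigit():
        return now + timedelta(seconds=int(value))
    return max(parsedate_to_datetime(value), now)

if __name__ == "__main__":
    now = datetime(2023, 7, 21, 21, 0, tzinfo=timezone.utc)
    print(next_attempt_time("3600", now))
    print(next_attempt_time("Fri, 21 Jul 2023 22:00:00 GMT", now))
```

The renewal scheduler then sleeps until the returned time and retries the
same newOrder flow, with no separate ARI polling routine to keep in sync.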

As for solving issues, although I did say "OCSP CertID," it actually could
be whatever improved/easier identifier we end up using. The point is that
it's a way to identify the certificate and distinguish this request as
"ARI-enabled" so to speak. As for the number of requests: it's true this
doesn't reduce them, but as you say, there's not really much hope in doing
that anyway due to the 24 hour constraint. There are still advantages to
this approach that make it appealing: simpler clients, simpler servers
(less infrastructure), less complexity all around. (I'm glad you at least
like the new field in the newOrder requests. I agree that it could be
beneficial regardless.)

> On the one hand, I'm in complete agreement, it would be great to have a
> "batch" endpoint that returns suggested windows for all certificates
> associated with a given account, or matching some other criteria. On the
> other hand, there's a reason that Let's Encrypt diverges from RFC8555 and
> does not implement the "orders" field on account objects: endpoints which
> serve unboundedly-large documents and require paging are difficult to
> implement correctly on both the server and client side, and can quickly
> lead to disruptive database queries.
>

Ok, this is interesting. I totally understand the difficulty with paging.
(I wonder if populating the "orders" field could be done optionally, like
if the client specifically requests that it be populated somehow. Then the
server load is still greatly reduced and provides useful info. But I
digress.)

I should clarify though that what I'm suggesting with (B) does *not*
involve enumerating many results and paging through a DB. That would only
be needed in the worst case, where there is no simple way to describe the
affected certificates and they just need to be listed by ID/URL. But I
believe what we've seen in the past suggests that most certificates could
be expressed in terms of a date range, account fingerprint, etc -- some
simple notation that allows the *client* to compute whether it is relevant
to them. This endpoint could even be a static resource.
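
To make the idea concrete, here is a rough sketch of what such a resource
and the client-side check might look like. The JSON layout, the field names
("affected", "notBefore", "serials"), and the serial value are all invented
for illustration; no such format exists in the draft.

```python
import json
from datetime import datetime, timezone

# Hypothetical static document served by the CA. Each rule describes a set
# of certificates either by a NotBefore range or by explicit serials.
NOTICE = json.loads("""
{
  "affected": [
    {"notBefore": {"from": "2023-07-01T00:00:00+00:00",
                   "to":   "2023-07-03T00:00:00+00:00"}},
    {"serials": ["04a1b2c3"]}
  ]
}
""")

def cert_is_affected(notice, serial, not_before):
    """Client side: decide locally whether one managed cert matches a rule."""
    for rule in notice["affected"]:
        if serial in rule.get("serials", []):
            return True
        date_range = rule.get("notBefore")
        if date_range:
            start = datetime.fromisoformat(date_range["from"])
            end = datetime.fromisoformat(date_range["to"])
            if start <= not_before <= end:
                return True
    return False

if __name__ == "__main__":
    issued = datetime(2023, 7, 2, 12, 0, tzinfo=timezone.utc)
    print(cert_is_affected(NOTICE, "0badc0de", issued))
```

One fetch of the document covers every certificate the client manages; the
per-certificate work is a local comparison rather than an HTTP request.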

I would like for us to revisit (B) and maybe see if we can make that work.
Because if ARI is definitely going to go forward at scale with the intent
of reducing network congestion, well... we need to reduce network
congestion rather than increase it.

Maybe there's even some combination of (A) and (B), for example utilizing a
new field on newOrders in combination with a batch ARI endpoint.

Thanks for your consideration --
Matt

On Wed, Jul 19, 2023 at 4:06 PM Aaron Gable <[email protected]> wrote:

> Hi Matt,
>
> Agreed with Tim, receiving practical feedback from implementers of the
> draft standard is very useful. I'll put my thoughts, comments, and
> questions in-line.
>
> On Fri, Jun 23, 2023 at 9:21 AM Matthew Holt <[email protected]> wrote:
>
>>
>> With respect to ARI, ACME servers and clients have conflicts of interest.
>> The ACME client's goal is to keep the site up (with renewed and unrevoked
>> certificates); the optimal way to do this is to start renewing early and
>> retry often. The ACME server's goal is to keep the service up; the optimal
>> way to do this is to suppress clients that overload your capacity.
>> Obviously, these two goals are in opposition with each other. Proactive
>> clients can spike demand, which can cause service interruptions. But
>> service interruptions make clients more paranoid to retry even more often
>> until it works, and so on. ARI narrows the timeframe in which a conforming
>> client can retry failed renewals, which reduces reliability more as time
>> goes on. Without ARI, this window is a reasonable ~60 days. With ARI,
>> however, the window is reduced to just a few minutes, hours, or days. The
>> less time until expiration, the less hope there is to renew the cert in
>> time. As the draft currently stands, this is in the server's interest, but
>> not the client's.
>>
>
> I'm confused by the statement that "with ARI the window is reduced to just
> a few minutes, hours, or days". The draft spec clearly states that the
> client should renew during the window if it can, but that any time after
> the window is also acceptable: "if the selected time is in the past,
> attempt renewal immediately". The renewal window only becomes reduced to a
> few minutes, hours, or days if the ACME server shifts the suggested renewal
> window that far. Which, sure, is possible, but is clearly against the
> server's best interest as well: if the ACME server can't provide continuity
> of business to their Subscribers, then their Subscribers will go elsewhere
> for certificates.
>
> Can this be improved? Absolutely, I'm certain of it. I'd love to hear
> suggestions for ways that the server could suggest a renewal time that
> doesn't run up against this push/pull between wanting to smooth traffic
> without making clients nervous. Unfortunately, I don't believe either of
> the suggestions at the bottom of the message actually addresses this point
> (more on that below).
>
>
>> 1) It is optional. No one will implement this. OK, some clients will --
>> but I can say with authority from years of experience that optional
>> restrictions are not typically favored. Very little mainstream software
>> follows best practices to a tee.
>>
>
> Yep, optional features are difficult to incentivize. I think there's one
> obvious carrot to incentivize client adoption: "if you implement ARI, your
> certs will be renewed *before* they're revoked in the next mass revocation
> incident". Continuity of business can be a powerful motivator. Frankly,
> Let's Encrypt is even considering bigger carrots, such as "your subscriber
> account can only get short-lived certs if we've seen it request ARI
> endpoints", or "your renewal requests bypass all rate limits if they're
> made within the ARI suggested window". We don't know if we'll dangle either
> of those carrots, but it's clear that there are ways to incentivize
> adoption.
>
>
>> 2) A narrower renewal timeframe makes clients less reliable. In theory it
>> should make them *more* reliable since it smooths out traffic, thus
>> improving CA availability. But this assumes that most clients actually
>> implement and follow ARI. Since it's optional, I don't see that happening.
>> Especially since most ACME clients are still running as static cron jobs
>> like it's 2015...
>>
>> I'm sure ARI doesn't really change in the nominal case, which is 99.9..9%
>> of the time. In fact, Let's Encrypt's ARI seems to correspond with when my
>> clients attempt renewals on their own anyway. (So in that sense, ARI is
>> actually useless 99.9..9% of the time?)
>>
>> But when a renewal window does change, what does that mean? Well,
>> something is wrong. Either the certificate is being revoked, or the CA
>> anticipates downtime or availability issues.
>>
>
> This is not true. Explicitly, by the spec, the renewal window changing
> means nothing. The situations you list are the motivations for writing the
> spec in the first place, but they are not the only motivations for changing
> the window in any given case. In fact, Let's Encrypt is currently
> considering adding random jitter to the renewal window every time it is
> requested, specifically to prevent interpretations like this, and to
> naturally even-out renewal spikes through Brownian motion.
>
>
>> If we wait until the (adjusted) window to start renewing, we run
>> ourselves closer to the imminently-impending revocation or the expiration
>> of the certificate, lowering our chances of a successful renewal.
>>
>
> This assumes that the adjusted window will always be later in the lifetime
> of the certificate than before. There is no reason to make this assumption.
> A CA adjusting suggested windows in order to smooth out a load spike would
> be wise to shift 50% of renewal windows *earlier*. Waiting to renew until
> a time that is earlier than when you would have renewed anyway does not
> make things riskier.
>
>
>> 1) Many CAs enforce rate limits. If clients are to honor ARI windows, we
>> would need a guarantee that the first successful cert within the ARI window
>> will be allowed regardless of relevant rate limits. Because ARI restricts a
>> client's ability to spread out renewals when managing certificates in bulk
>> with respect to rate limits, the rate limits must NOT be a blocker when
>> honoring ARI.
>>
>
> I like this idea. We hope and plan to implement this regardless, as I
> suggested above with regards to it being a carrot that we can dangle to
> incentivize client adoption. However, I don't believe it is something that
> can be reasonably specified in an IETF RFC: rate limits are not part of the
> ACME protocol, they're an internal detail of ACME server implementations.
> Happy to be proven wrong.
>
>
>> 2) If ARI were actually enforced, some concerns would be resolved... for
>> example, we can have assurances that other ACME clients are doing the same,
>> thus improving CA availability. It would essentially be the CA scheduling
>> each individual certificate for each ACME client instance -- that's quite a
>> powerful idea, as long as availability is guaranteed (which it's not).
>>
>
> What do you mean by "enforced"? Deny newOrder requests that appear to be
> renewals but fall outside the suggested window?
>
>
>> 3) ARI does not scale well. Some ACME clients manage 10K+ certificates,
>> and in that case the client would have to check the ARI for at least 24
>> certificates per hour to get through them in a month. Deferring to the
>> Retry-After header may result in insufficient throughput. The current
>> expectation or convention is to check every certificate every 6-12 hours,
>> or tens of thousands of checks per day. One endpoint per certificate
>> multiple times per day is quite saturating. This is a considerable burden
>> for both ACME clients and servers. I would like to explore options that do
>> not involve 2+ HTTP requests per certificate.
>>
>
> Totally agreed, we don't love the heavy-polling nature of ARI as it stands
> either. It's a lot of requests, and that's a large part of why we've
> striven to keep the response size so small. The original version of this
> was just a single timestamp. It's grown to two timestamps and an optional
> URL thanks to community feedback, but I'd be happy to reduce the response
> size again if we decide that prioritizing efficiency is more important than
> prioritizing third-party certificate monitoring tools.
>
> Unfortunately, I don't currently have a different approach that I love.
> The 24-hour revocation timeline enforced by the BRs for certain kinds of
> revocations means that clients should be checking at least once every 24
> hours, regardless of mechanism. I'll comment more on your specific
> proposals to address this below.
>
>> 4) Crafting the URL is convoluted. As Peter Cooper described it, "The core
>> issue is that the URL you need to construct is based on an OCSP structure
>> identifying the certificate, which requires taking one's existing
>> certificate and parsing out the serial number and issuer, and also taking
>> the intermediate certificate that signed it and getting its public key too.
>> So rather than just, like, using the fingerprint of the existing leaf or
>> something similarly simple that a lot of tooling can already give you, one
>> needs to really dig into both the leaf, and the intermediate, and hash
>> various pieces thereof, and then take all that to build a new ASN.1
>> structure." Why are we striving for near-parity with an OCSP request?? This
>> should be orthogonal to OCSP, right?
>>
>
> This is great feedback. We picked this request format specifically because
> we thought it would be easy. It's good to know that we were wrong, and to
> investigate what other request formats would work better.
>
> Allow me to provide a little bit of context for how we arrived at using
> the OCSP CertID structure:
>
> We need a way to uniquely identify the certificate in question. ACME has
> one mechanism for doing so already: the URL provided by a finalized Order.
> Personally, my ideal would be to say "the ARI url is the Certificate URL
> concatenated with /ari". Unfortunately we can't do that, because there's
> nothing to prevent the URL provided by an Order from having query
> parameters, in which case appending a new path component would be
> incorrect. So, we could follow ACME's example, and provide a second
> "renewalInfo" URL in finalized Orders as well. Unfortunately, this a) means
> that clients have to persist this URL in order to use it, and b) clients
> which did not persist the URL (either ephemeral clients, or third-party
> certificate monitoring clients) cannot construct the URL at all.
>
> So we need a way to uniquely identify a certificate which can be
> constructed from the certificate itself. The serial seems like an obvious
> candidate. However, serials are only required to be unique on a per-issuer
> basis, and a single ACME server may issue from multiple issuer
> certificates. It turns out that OCSP already has a solution for this:
> combine the serial with a unique identifier of the issuer. And OCSP's
> solution even comes with algorithm agility for how the unique identifier of
> the issuer is computed! That's nice. So we took OCSP's request format,
> stripped away the pieces not pertaining to identifying a single
> certificate, et voila, the CertID.
>
> We believed this would be easy because many ACME clients are written in
> languages or running in environments that already have access to robust
> OCSP libraries. I wrote the first version of this
> <https://github.com/letsencrypt/boulder/blob/73b72e8fa2d852a40753926c34f38313a7db083d/wfe2/wfe_test.go#L3517-L3538>
> (constructing an OCSP request, parsing it, extracting the relevant
> parameters, and serializing them into a CertID) in a few minutes. Again,
> it's useful to know that we were wrong.
>
> This leads to the question of: what should we use to uniquely identify the
> certificate instead? Certainly we could go with the "fingerprint" or
> "thumbprint" (a sha256 hash of DER bytes or PEM encoding, depending on who
> you ask, of the certificate) if people think that is sufficiently simple,
> easy to specify, unique, and future-proof. We could also go with "just the
> Serial", and force existing ACME servers to choose between either keeping
> serials unique across all issuers they represent, or splitting the server
> into multiple servers which each represent just a single issuer. Or we
> could return to the "url in the Order object" approach we started with. I'm
> curious what path forward people think is best.
>
>
>> 5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET
>> request is not authenticated. Even if the information is not strictly
>> sensitive, I can totally see some browsers or tools using ARI as a signal
>> that a certificate is being revoked, and thus can no longer be trusted, and
>> thus block a site before a server even sees that it needs to renew its
>> cert. I could be incorrect, but can't the information needed to obtain ARI
>> be scraped from CT logs? If so, I think a global ARI monitor/database
>> is inevitable, and that has interesting implications that I don't know have
>> been fully realized.
>>
>
> Yes, as mentioned above, this was a design goal as a result of community
> feedback. See this early discussion
> <https://mailarchive.ietf.org/arch/msg/acme/szDHa5z6qRiAtmeC2ohrePPoBjU/>
> for context. Again, this is a design goal that I'd be willing to compromise
> if there are sufficient reasons to do so, but I don't think that argument
> has been fully articulated as of yet.
>
>
>> All in all, the current ARI spec feels a little rushed. I'm hoping Let's
>> Encrypt's production deployment is meant to help gather feedback about ARI
>> before finalizing it, rather than to solidify it. Can we revisit both its
>> fundamentals and practical implications too?
>>
>
> Yes, the IETF process is about "rough consensus and running code". We
> can't finalize the spec until something is running. Let's Encrypt's
> deployment, and our encouragement of client adoption, is so that we can
> receive precisely this kind of feedback before the draft becomes an RFC.
>
>
>> I would like to explore some alternatives to the current draft. I can
>> think of two approaches that might address these concerns:
>>
>> A) Instead of a totally separate flow to obtain ARI, simply utilize a
>> Retry-After header in the flow of existing ACME responses. Upon finalizing
>> an order, the ACME server can respond with a Retry-After header which acts
>> as the current-draft Retry-After header for ARI responses. The client then
>> attempts renewal at/after the Retry-After time, but with the OCSP CertID
>> added to the NewOrder object; this indicates to the ACME server that the
>> client is asking if now is a good time to renew the certificate indicated
>> by the CertID. If it's not a good time, the ACME server can reply as such,
>> with another Retry-After, and the client then waits and repeats, until the
>> server actually issues the certificate. If the client needs the certificate
>> immediately, simply omit the CertID from the NewOrder and the normal,
>> "non-ARI" flow is assumed. This is backwards-compatible and requires no
>> additional infrastructure or endpoints.
>>
>
> I don't understand how this approach helps solve the issues you identified
> above. In order to get up-to-date information, the same number of requests
> still need to be made, it's just that now they're newOrder requests instead
> of renewalInfo requests. The unique identifier included in the request is
> no easier to construct. The Retry-After timestamp changing might still
> cause selfish clients to stop providing the CertID and renew right now.
>
> Now, I *am* a fan of adding a field to newOrder requests which uniquely
> identifies the cert being replaced. If such a field is populated, the CA
> would treat it the same as if the client had made a POST request to mark
> the certificate as replaced (Section 4.2 of the current draft). This has
> many nice effects, like letting the CA track renewals explicitly (instead
> of attempting to identify them with heuristics), letting renewal requests
> bypass rate limits, and more. I just don't think it elegantly replaces the
> renewalInfo endpoint itself.
>
>
>> B) If we do need a separate flow for some reason, I would like to see a
>> single endpoint containing a static JSON resource that describes all the
>> active certificates that need early renewal, rather than one
>> tediously-crafted URL per certificate. Certificates can be described by
>> their NotBefore or NotAfter dates, serial numbers, or other relevant
>> attributes. For example, if just a few certs with certain serials were
>> misissued, those serials could be enumerated at this endpoint. Or if a mass
>> revocation is happening, the timeframe of NotBefore dates could be listed,
>> and ACME clients can simply check against the certs they manage with those
>> dates, and replace them. You can represent millions of certificates in,
>> like, 85 bytes this way. And it's way less work for clients and servers.
>> And lastly, drop the "window" idea -- certificates described by this
>> endpoint should be renewed ASAP: try to renew immediately, then back off
>> and retry, for reasons described above (once we know the future is
>> uncertain and/or revocation is imminent, current certs can't be trusted
>> and/or clients must try to preserve their sites' uptime).
>>
>
> On the one hand, I'm in complete agreement, it would be great to have a
> "batch" endpoint that returns suggested windows for all certificates
> associated with a given account, or matching some other criteria. On the
> other hand, there's a reason that Let's Encrypt diverges from RFC8555 and
> does not implement the "orders" field on account objects: endpoints which
> serve unboundedly-large documents and require paging are difficult to
> implement correctly on both the server and client side, and can quickly
> lead to disruptive database queries.
>
>
>> And finally, I want to bring attention to the longer-term prospects for
>> ARI: it's quite possible that ARI will become irrelevant before it is
>> widely adopted by most clients. This itself may discourage adoption. As
>> stated above, ARI has two primary use cases: revocation and traffic
>> smoothing. As we push for shorter certificate lifetimes, revocation should
>> become irrelevant. And traffic smoothing will perhaps become a natural
>> consequence as clients are renewing more frequently anyway. We all know
>> revocation and long-lived certificates are broken, so I'd rather WebPKI
>> developers focus our energy on the ACTUAL goal: short-lived certificates.
>> We should not be focusing our ecosystem resources on infrastructure that
>> acts as a band-aid for a broken leg.
>>
>
> This is an interesting point. ARI was first conceived
> <https://bugzilla.mozilla.org/show_bug.cgi?id=1619179#c7> as a way to
> improve business continuity across mass revocation events, and grew from
> there. The idea that 10-day certs might be a reality, and that revocation
> would be wholly optional for them, was almost unimaginable at that time.
> But even today, the reality is that CAs such as Let's Encrypt will likely
> have to support revocation for a very long time to come: migrating the
> whole world to 10-day certs will not happen overnight. So I think that this
> work is worthwhile, even if other solutions are also on the horizon.
>
> Thanks,
> Aaron
>
>


-- 
This message is confidential and no license is granted for disclosure or
dissemination in whole or in part by any recipients without written
permission from the sender.

_______________________________________________
Acme mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/acme
