dev@trafficserver.apache.org:

Context
This design review concerns proxy.config.http.negative_revalidating_lifetime
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
and
is the result of investigating the feature for the following issue:

https://github.com/apache/trafficserver/issues/7425

Feature Overview
The ATS negative revalidation feature implements the option to serve stale
cached responses in the circumstance that the server is not reachable for
revalidation. This can happen due to connectivity issues against the
origin or the origin's service being down resulting in 5xx responses.

As a clarification, do not conflate negative revalidation with the negative
caching feature. While they sound alike, the two features are orthogonal to
each other. The ATS negative caching feature allows for caching of negative
HTTP responses (such as 404 "Not Found"). By contrast, as summarized above,
negative revalidation allows for serving some cached response that is stale
when the server is not reachable for revalidation.

negative_revalidating_lifetime issues
That's the negative revalidation feature: serve stale content when the
server is not accessible. This generally works as expected. The problem
lies in determining whether the stale resource is fresh enough via the
negative_revalidating_lifetime configuration for this feature. The
intention of this lifetime configuration was to provide a way for the user
to specify how old a stale resource can be when negative revalidation takes
place before it will no longer be served from the cache. Looking at the
documentation and the code for it, the *expected behavior* seemed to have
been the following:


   1. A client requests an item. The server responds with, say, a 200 OK
   and the ATS proxy caches the response.
   2. At some point in the future, a client requests this item again.
   3. ATS inspects the item and determines that it is stale.
   4. ATS attempts to validate its cached resource against the origin.
   5. Either a connection cannot be made to the origin or it receives a 5xx
   in response to its validation request.
   6. If negative revalidation is enabled, ATS compares the age of the
   stale resource against negative_revalidating_lifetime. It will then do one
   of two things:
      1. If the age of the cached resource is younger than
      negative_revalidating_lifetime, consider it "fresh enough" and
reply to the
      origin with the cached response.
      2. If the age is older than negative_revalidating_lifetime, reply
      with a 5xx response.

In reality step 6 does not happen. The stale resource is never re-evaluated
as fresh or still stale against negative_revalidating_lifetime. Rather, the
determining factor of whether the resource is served is
proxy.config.http.cache.max_stale_age
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age>.
That is, if its age is less than max_stale_age, it is used; otherwise it is
not. In addition, an incorrectly calculated Expires header field is added
which is set to negative_revalidating_lifetime seconds from the current
time rather than negative_revalidating_lifetime seconds from the resources
calculated generation time.

These negative_revalidating_lifetime issues may very well be day zero bugs.
Looking at git, it goes back to at least 2009 in a commit
<https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06>
titled "Initial commit" (presumably when we transitioned source code
repositories). It has not worked for at least a very long time and people
have been relying, knowingly or not, upon max_stale_age to limit how stale
resources can be before they are used for negative revalidation.

This design review evaluates the options for how to handle
negative_revalidating_lifetime.

Options
Option 1: Document the Current Behavior
The first option is to leave the feature as is and document how
negative_revalidating_lifetime currently behaves.

Advantages:

   - This requires the least amount of work as it only involves a
   documentation change.
   - This will effectively avoid user confusion about
   negative_revalidating_lifetime so that no further issues will be filed and
   time wasted when it does not behave as expected.

Disadvantages

   - negative_revalidating_lifetime, as it currently behaves, would be hard
   to explain without embarrassing the project. We would have to say something
   like:

The value of negative_revalidating_lifetime is used to calculate an Expires
> header with a field value of negative_revalidating_lifetime seconds in the
> future from the time the cached resource is served. This header is added in
> the cached response. No other behavior is influenced by this configuration.
> The age limit enforced for such stale resources is configured
> via proxy.config.http.cache.max_stale_age.


This likely raises more questions than it answers. In reality, it would
probably be most helpful to steer users away from this feature and toward
max_stale_age.


Option 2: Fix the Issues
Alternatively, we can fix the issues surrounding
negative_revalidating_lifetime. Looking at the code, it seems that the
Expires header, which is currently incorrectly calculated, really isn't
needed. Rather we can make the negative_revalidating_lifetime conditionally
a part of the HttpTransact::is_stale_cache_response_returnable()
<https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943>
function.

In addition to max_stale_age, which is currently taken into account there,
we can also compare against negative_revalidating_lifetime to see whether
the resource can be served (but only when negative revalidation is being
used...the function is used in other contexts).

Advantages:

   - This addresses both the known issues with
   negative_revalidating_lifetime in that the Expires header is no longer
   added and the negative_revalidating_lifetime is effectively used in
   determining whether the item is fresh enough for the response.
   - The fix doesn't look all that hard to implement.

Disadvantages:

   - Implementing the feature adds various complexities to ATS. It
   complicates the documentation, it complicates the code, and will require
   tests to verify it works correctly. None of this complexity is large if the
   feature is worth it, and I'm happy to do it. But I have my doubts that it
   is worth it. Keep in mind that negative_revalidating_lifetime has never
   been a working feature in ATS and this has caused problems for no one. If
   its absence had caused problems, we would know about it because people
   would complain. In fact, the only reported issue with the feature is the
   confusion it caused by existing in the document in the first place. Also,
   keep in mind that this feature would only be useful, if it is at all, to
   *extend* the lifetime of these cached resources past max_stale_age. Both
   of those values would be used and we would want to use the max of each.
   Currently max_stale_age defaults to 7 days. How many users will want
   max_stale_age to be one value for general max-age computations, but use a
   different, larger, value for negative revalidation? I'm guessing not many.
   (Alternatively, I suppose we could use only negative_revalidating_lifetime
   instead of using both it and max_stale_age in these situations, but that
   would potentially be a breaking change for current configurations in which
   negative_revalidation would now only happen for the default 30 minutes of
   negative_revalidating_lifetime rather than the default 7 days of
   max_stale_age.)


Option 3: Remove negative_revalidating_lifetime
We can remove negative_revalidating_lifetime from the product. This would
involve:

   - Removing the documentation of negative_revalidating_lifetime
   
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
   from our records.config documentation. We would also want to explain the
   role that max_stale_age plays in limiting the usage of stale resources in
   negative revalidating circumstances.
   - Removing the addition of the faulty Expires header from HttpTransact.cc
   
<https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364>
   - Keep the parsing of negative_revalidating_lifetime, if it is used by a
   user, but add a Warning that it is defunct, deprecated, and will be removed
   in a future release.

Advantages:

   - This removes the confusion associated with having both a
   negative_revalidating_lifetime feature used for these negative revalidating
   resources and the max_stale_age feature.
   - It removes the bugs associated with negative_revalidating_lifetime
   because the feature is removed.

Disadvantages:

   - If someone would like a negative_revalidating_lifetime feature in the
   future, it will not exist and will need to be implemented.

Suggestion
I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to
remove because the feature does not currently work and seemingly has never
worked; thus nothing is being taken away from ATS. On the other hand, the
buggy behavior and confusing documentation is alleviated by it being
removed. The downside to this option is that we won't have the added
negative_revalidating_lifetime feature in the event that someone wants it
in the future. But, again, that leaves us no worse off than we are today
since our current implementation doesn't effectively implement it.

Let me know if you have other thoughts or concerns.

Thanks,
Brian

-- 
"Come to Me, all who are weary and heavy-laden, and I will
give you rest. Take My yoke upon you and learn from Me, for
I am gentle and humble in heart, and you will find rest for
your souls. For My yoke is easy and My burden is light."

    ~ Matthew 11:28-30

Reply via email to