Hi Brian,
Thank you for the detailed and excellent write up on the current behavior! 
Just to play a bit of devil's advocate, I'm wondering whether the 
`negative_revalidating_lifetime` is being interpreted incorrectly. I do admit 
it's confusing given the name and the documentation (which can both clearly use 
some improvement), but, one potential interpretation of the current behavior 
could be:
Use `negative_revalidating_lifetime` as a delta extension to the object's life 
following a request that resulted in a Origin error (connection or 5xx). IOW, 
when configured, this timeout is used to not ping the Origin again and assume 
the object is fresh in cache for that duration. Arguably, this may be a way to 
minimize the load on the Origin that may be potentially experiencing systemic 
issues. 
The reason to still cap the overall object's life to 
`proxy.config.http.cache.max_stale_age` also seems like a reasonable/desirable 
behavior to me. This is a catchall upper bound on how long one would consider 
the object in their cache is still valid beyond which it is not usable anymore 
- meaning, at that point returning an error to the client is better than 
serving that stale copy and causing other side effects.
Tldr; I think if we adjusted the documentation to explain this somehow, then 
the current behavior doesn't sound all that embarrassing to me.
PS: Just to be clear, I'm not saying we should not change the current behavior 
in favor of a better alternative, but only trying to speculate/explain the 
possible reasoning behind the behavior.
Thanks.
Sudheer
    On Wednesday, March 24, 2021, 02:13:42 PM PDT, Brian Neradt 
<brian.ner...@gmail.com> wrote:  
 
 dev@trafficserver.apache.org:

Context
This design review concerns proxy.config.http.negative_revalidating_lifetime
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
and
is the result of investigating the feature for the following issue:

https://github.com/apache/trafficserver/issues/7425

Feature Overview
The ATS negative revalidation feature implements the option to serve stale
cached responses in the circumstance that the server is not reachable for
revalidation. This can happen due to connectivity issues against the
origin or the origin's service being down resulting in 5xx responses.

As a clarification, do not conflate negative revalidation with the negative
caching feature. While they sound alike, the two features are orthogonal to
each other. The ATS negative caching feature allows for caching of negative
HTTP responses (such as 404 "Not Found"). By contrast, as summarized above,
negative revalidation allows for serving some cached response that is stale
when the server is not reachable for revalidation.

negative_revalidating_lifetime issues
That's the negative revalidation feature: serve stale content when the
server is not accessible. This generally works as expected. The problem
lies in determining whether the stale resource is fresh enough via the
negative_revalidating_lifetime configuration for this feature. The
intention of this lifetime configuration was to provide a way for the user
to specify how old a stale resource can be when negative revalidation takes
place before it will no longer be served from the cache. Looking at the
documentation and the code for it, the *expected behavior* seemed to have
been the following:


  1. A client requests an item. The server responds with, say, a 200 OK
  and the ATS proxy caches the response.
  2. At some point in the future, a client requests this item again.
  3. ATS inspects the item and determines that it is stale.
  4. ATS attempts to validate its cached resource against the origin.
  5. Either a connection cannot be made to the origin or it receives a 5xx
  in response to its validation request.
  6. If negative revalidation is enabled, ATS compares the age of the
  stale resource against negative_revalidating_lifetime. It will then do one
  of two things:
      1. If the age of the cached resource is younger than
      negative_revalidating_lifetime, consider it "fresh enough" and
reply to the
      origin with the cached response.
      2. If the age is older than negative_revalidating_lifetime, reply
      with a 5xx response.

In reality step 6 does not happen. The stale resource is never re-evaluated
as fresh or still stale against negative_revalidating_lifetime. Rather, the
determining factor of whether the resource is served is
proxy.config.http.cache.max_stale_age
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age>.
That is, if its age is less than max_stale_age, it is used; otherwise it is
not. In addition, an incorrectly calculated Expires header field is added
which is set to negative_revalidating_lifetime seconds from the current
time rather than negative_revalidating_lifetime seconds from the resources
calculated generation time.

These negative_revalidating_lifetime issues may very well be day zero bugs.
Looking at git, it goes back to at least 2009 in a commit
<https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06>
titled "Initial commit" (presumably when we transitioned source code
repositories). It has not worked for at least a very long time and people
have been relying, knowingly or not, upon max_stale_age to limit how stale
resources can be before they are used for negative revalidation.

This design review evaluates the options for how to handle
negative_revalidating_lifetime.

Options
Option 1: Document the Current Behavior
The first option is to leave the feature as is and document how
negative_revalidating_lifetime currently behaves.

Advantages:

  - This requires the least amount of work as it only involves a
  documentation change.
  - This will effectively avoid user confusion about
  negative_revalidating_lifetime so that no further issues will be filed and
  time wasted when it does not behave as expected.

Disadvantages

  - negative_revalidating_lifetime, as it currently behaves, would be hard
  to explain without embarrassing the project. We would have to say something
  like:

The value of negative_revalidating_lifetime is used to calculate an Expires
> header with a field value of negative_revalidating_lifetime seconds in the
> future from the time the cached resource is served. This header is added in
> the cached response. No other behavior is influenced by this configuration.
> The age limit enforced for such stale resources is configured
> via proxy.config.http.cache.max_stale_age.


This likely raises more questions than it answers. In reality, it would
probably be most helpful to steer users away from this feature and toward
max_stale_age.


Option 2: Fix the Issues
Alternatively, we can fix the issues surrounding
negative_revalidating_lifetime. Looking at the code, it seems that the
Expires header, which is currently incorrectly calculated, really isn't
needed. Rather we can make the negative_revalidating_lifetime conditionally
a part of the HttpTransact::is_stale_cache_response_returnable()
<https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943>
function.

In addition to max_stale_age, which is currently taken into account there,
we can also compare against negative_revalidating_lifetime to see whether
the resource can be served (but only when negative revalidation is being
used...the function is used in other contexts).

Advantages:

  - This addresses both the known issues with
  negative_revalidating_lifetime in that the Expires header is no longer
  added and the negative_revalidating_lifetime is effectively used in
  determining whether the item is fresh enough for the response.
  - The fix doesn't look all that hard to implement.

Disadvantages:

  - Implementing the feature adds various complexities to ATS. It
  complicates the documentation, it complicates the code, and will require
  tests to verify it works correctly. None of this complexity is large if the
  feature is worth it, and I'm happy to do it. But I have my doubts that it
  is worth it. Keep in mind that negative_revalidating_lifetime has never
  been a working feature in ATS and this has caused problems for no one. If
  its absence had caused problems, we would know about it because people
  would complain. In fact, the only reported issue with the feature is the
  confusion it caused by existing in the document in the first place. Also,
  keep in mind that this feature would only be useful, if it is at all, to
  *extend* the lifetime of these cached resources past max_stale_age. Both
  of those values would be used and we would want to use the max of each.
  Currently max_stale_age defaults to 7 days. How many users will want
  max_stale_age to be one value for general max-age computations, but use a
  different, larger, value for negative revalidation? I'm guessing not many.
  (Alternatively, I suppose we could use only negative_revalidating_lifetime
  instead of using both it and max_stale_age in these situations, but that
  would potentially be a breaking change for current configurations in which
  negative_revalidation would now only happen for the default 30 minutes of
  negative_revalidating_lifetime rather than the default 7 days of
  max_stale_age.)


Option 3: Remove negative_revalidating_lifetime
We can remove negative_revalidating_lifetime from the product. This would
involve:

  - Removing the documentation of negative_revalidating_lifetime
  
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
  from our records.config documentation. We would also want to explain the
  role that max_stale_age plays in limiting the usage of stale resources in
  negative revalidating circumstances.
  - Removing the addition of the faulty Expires header from HttpTransact.cc
  
<https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364>
  - Keep the parsing of negative_revalidating_lifetime, if it is used by a
  user, but add a Warning that it is defunct, deprecated, and will be removed
  in a future release.

Advantages:

  - This removes the confusion associated with having both a
  negative_revalidating_lifetime feature used for these negative revalidating
  resources and the max_stale_age feature.
  - It removes the bugs associated with negative_revalidating_lifetime
  because the feature is removed.

Disadvantages:

  - If someone would like a negative_revalidating_lifetime feature in the
  future, it will not exist and will need to be implemented.

Suggestion
I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to
remove because the feature does not currently work and seemingly has never
worked; thus nothing is being taken away from ATS. On the other hand, the
buggy behavior and confusing documentation is alleviated by it being
removed. The downside to this option is that we won't have the added
negative_revalidating_lifetime feature in the event that someone wants it
in the future. But, again, that leaves us no worse off than we are today
since our current implementation doesn't effectively implement it.

Let me know if you have other thoughts or concerns.

Thanks,
Brian

-- 
"Come to Me, all who are weary and heavy-laden, and I will
give you rest. Take My yoke upon you and learn from Me, for
I am gentle and humble in heart, and you will find rest for
your souls. For My yoke is easy and My burden is light."

    ~ Matthew 11:28-30
  

Reply via email to