Hi Brian, Thank you for the detailed and excellent write up on the current behavior! Just to play a bit of devil's advocate, I'm wondering whether the `negative_revalidating_lifetime` is being interpreted incorrectly. I do admit it's confusing given the name and the documentation (which can both clearly use some improvement), but, one potential interpretation of the current behavior could be: Use `negative_revalidating_lifetime` as a delta extension to the object's life following a request that resulted in a Origin error (connection or 5xx). IOW, when configured, this timeout is used to not ping the Origin again and assume the object is fresh in cache for that duration. Arguably, this may be a way to minimize the load on the Origin that may be potentially experiencing systemic issues. The reason to still cap the overall object's life to `proxy.config.http.cache.max_stale_age` also seems like a reasonable/desirable behavior to me. This is a catchall upper bound on how long one would consider the object in their cache is still valid beyond which it is not usable anymore - meaning, at that point returning an error to the client is better than serving that stale copy and causing other side effects. Tldr; I think if we adjusted the documentation to explain this somehow, then the current behavior doesn't sound all that embarrassing to me. PS: Just to be clear, I'm not saying we should not change the current behavior in favor of a better alternative, but only trying to speculate/explain the possible reasoning behind the behavior. Thanks. Sudheer On Wednesday, March 24, 2021, 02:13:42 PM PDT, Brian Neradt <brian.ner...@gmail.com> wrote: dev@trafficserver.apache.org:
Context This design review concerns proxy.config.http.negative_revalidating_lifetime <https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime> and is the result of investigating the feature for the following issue: https://github.com/apache/trafficserver/issues/7425 Feature Overview The ATS negative revalidation feature implements the option to serve stale cached responses in the circumstance that the server is not reachable for revalidation. This can happen due to connectivity issues against the origin or the origin's service being down resulting in 5xx responses. As a clarification, do not conflate negative revalidation with the negative caching feature. While they sound alike, the two features are orthogonal to each other. The ATS negative caching feature allows for caching of negative HTTP responses (such as 404 "Not Found"). By contrast, as summarized above, negative revalidation allows for serving some cached response that is stale when the server is not reachable for revalidation. negative_revalidating_lifetime issues That's the negative revalidation feature: serve stale content when the server is not accessible. This generally works as expected. The problem lies in determining whether the stale resource is fresh enough via the negative_revalidating_lifetime configuration for this feature. The intention of this lifetime configuration was to provide a way for the user to specify how old a stale resource can be when negative revalidation takes place before it will no longer be served from the cache. Looking at the documentation and the code for it, the *expected behavior* seemed to have been the following: 1. A client requests an item. The server responds with, say, a 200 OK and the ATS proxy caches the response. 2. At some point in the future, a client requests this item again. 3. ATS inspects the item and determines that it is stale. 4. ATS attempts to validate its cached resource against the origin. 5. Either a connection cannot be made to the origin or it receives a 5xx in response to its validation request. 6. If negative revalidation is enabled, ATS compares the age of the stale resource against negative_revalidating_lifetime. It will then do one of two things: 1. If the age of the cached resource is younger than negative_revalidating_lifetime, consider it "fresh enough" and reply to the origin with the cached response. 2. If the age is older than negative_revalidating_lifetime, reply with a 5xx response. In reality step 6 does not happen. The stale resource is never re-evaluated as fresh or still stale against negative_revalidating_lifetime. Rather, the determining factor of whether the resource is served is proxy.config.http.cache.max_stale_age <https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age>. That is, if its age is less than max_stale_age, it is used; otherwise it is not. In addition, an incorrectly calculated Expires header field is added which is set to negative_revalidating_lifetime seconds from the current time rather than negative_revalidating_lifetime seconds from the resources calculated generation time. These negative_revalidating_lifetime issues may very well be day zero bugs. Looking at git, it goes back to at least 2009 in a commit <https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06> titled "Initial commit" (presumably when we transitioned source code repositories). It has not worked for at least a very long time and people have been relying, knowingly or not, upon max_stale_age to limit how stale resources can be before they are used for negative revalidation. This design review evaluates the options for how to handle negative_revalidating_lifetime. Options Option 1: Document the Current Behavior The first option is to leave the feature as is and document how negative_revalidating_lifetime currently behaves. Advantages: - This requires the least amount of work as it only involves a documentation change. - This will effectively avoid user confusion about negative_revalidating_lifetime so that no further issues will be filed and time wasted when it does not behave as expected. Disadvantages - negative_revalidating_lifetime, as it currently behaves, would be hard to explain without embarrassing the project. We would have to say something like: The value of negative_revalidating_lifetime is used to calculate an Expires > header with a field value of negative_revalidating_lifetime seconds in the > future from the time the cached resource is served. This header is added in > the cached response. No other behavior is influenced by this configuration. > The age limit enforced for such stale resources is configured > via proxy.config.http.cache.max_stale_age. This likely raises more questions than it answers. In reality, it would probably be most helpful to steer users away from this feature and toward max_stale_age. Option 2: Fix the Issues Alternatively, we can fix the issues surrounding negative_revalidating_lifetime. Looking at the code, it seems that the Expires header, which is currently incorrectly calculated, really isn't needed. Rather we can make the negative_revalidating_lifetime conditionally a part of the HttpTransact::is_stale_cache_response_returnable() <https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943> function. In addition to max_stale_age, which is currently taken into account there, we can also compare against negative_revalidating_lifetime to see whether the resource can be served (but only when negative revalidation is being used...the function is used in other contexts). Advantages: - This addresses both the known issues with negative_revalidating_lifetime in that the Expires header is no longer added and the negative_revalidating_lifetime is effectively used in determining whether the item is fresh enough for the response. - The fix doesn't look all that hard to implement. Disadvantages: - Implementing the feature adds various complexities to ATS. It complicates the documentation, it complicates the code, and will require tests to verify it works correctly. None of this complexity is large if the feature is worth it, and I'm happy to do it. But I have my doubts that it is worth it. Keep in mind that negative_revalidating_lifetime has never been a working feature in ATS and this has caused problems for no one. If its absence had caused problems, we would know about it because people would complain. In fact, the only reported issue with the feature is the confusion it caused by existing in the document in the first place. Also, keep in mind that this feature would only be useful, if it is at all, to *extend* the lifetime of these cached resources past max_stale_age. Both of those values would be used and we would want to use the max of each. Currently max_stale_age defaults to 7 days. How many users will want max_stale_age to be one value for general max-age computations, but use a different, larger, value for negative revalidation? I'm guessing not many. (Alternatively, I suppose we could use only negative_revalidating_lifetime instead of using both it and max_stale_age in these situations, but that would potentially be a breaking change for current configurations in which negative_revalidation would now only happen for the default 30 minutes of negative_revalidating_lifetime rather than the default 7 days of max_stale_age.) Option 3: Remove negative_revalidating_lifetime We can remove negative_revalidating_lifetime from the product. This would involve: - Removing the documentation of negative_revalidating_lifetime <https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime> from our records.config documentation. We would also want to explain the role that max_stale_age plays in limiting the usage of stale resources in negative revalidating circumstances. - Removing the addition of the faulty Expires header from HttpTransact.cc <https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364> - Keep the parsing of negative_revalidating_lifetime, if it is used by a user, but add a Warning that it is defunct, deprecated, and will be removed in a future release. Advantages: - This removes the confusion associated with having both a negative_revalidating_lifetime feature used for these negative revalidating resources and the max_stale_age feature. - It removes the bugs associated with negative_revalidating_lifetime because the feature is removed. Disadvantages: - If someone would like a negative_revalidating_lifetime feature in the future, it will not exist and will need to be implemented. Suggestion I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to remove because the feature does not currently work and seemingly has never worked; thus nothing is being taken away from ATS. On the other hand, the buggy behavior and confusing documentation is alleviated by it being removed. The downside to this option is that we won't have the added negative_revalidating_lifetime feature in the event that someone wants it in the future. But, again, that leaves us no worse off than we are today since our current implementation doesn't effectively implement it. Let me know if you have other thoughts or concerns. Thanks, Brian -- "Come to Me, all who are weary and heavy-laden, and I will give you rest. Take My yoke upon you and learn from Me, for I am gentle and humble in heart, and you will find rest for your souls. For My yoke is easy and My burden is light." ~ Matthew 11:28-30