Er, sorry that didn’t sound right. Agree with you that not updating max-age seems more like a bug if we are to go with the hypothesis.
> On Mar 26, 2021, at 3:02 PM, Sudheer Vinukonda <sudheervinuko...@yahoo.com> > wrote: > > > Thanks Brian! > > Hmm, it seems to me that the traces you shared actually further confirm the > hypothesis - ie ATS serves a locally cached stale object because of negative > revalidation (and Origin connection error/5xx), yet we do *not* want the > clients receiving that copy (whether it's an actual browser/device or a CDN) > to cache it. This makes sense because the problem is between ATS and its > Origin, but, we want the clients to continue trying to refresh their copy of > the object? > > > > On Friday, March 26, 2021, 01:01:15 PM PDT, Brian Neradt > <brian.ner...@gmail.com> wrote: > > > Thank you Sudheer. This is really helpful. That at least gives some > justification for the current behavior: try to communicate the lifetime of > the object to something in the future so the upstream clients or proxies > don't think they need to revalidate their cached objects. It is > insufficient, unfortunately, since the mechanism doesn't remove any max-age > directives which would override the Expires header. Take for example these > following responses, the first being what the origin Proxy Verifier replied > with and with which the cache was populated: > > [0]: Wrote 134 bytes in an HTTP/1 response to request with key 1 with > response status 200: > - "content-length": "32" > - "cache-control": "max-age=2" > - "date": "Thu, 25 Mar 2021 19:02:17 GMT" > > Note the Date is for yesterday. It is thus immediately stale since the > max-age is for 2 seconds. > > This second is the response received by the client for this same object and > was served out of ATS's cache. Proxy Verifier was configured to reply with > a 503, thus ATS served this stale resource out of cache because of negative > revalidating since max_stale_age is 7 days (the default value), so ATS > thinks it's fresh enough: > > [0]: Received an HTTP/1 200 response for key 2 with headers: > - "content-length": "32" > - "cache-control": "max-age=2" > - "date": "Thu, 25 Mar 2021 19:02:17 GMT" > - "Expires": "Fri, 26 Mar 2021 19:44:34 GMT" > - "Connection": "keep-alive" > - "Server": "ATS/10.0.0" > > You see the problem: the Expires is set to 30 minutes in the future (the > default value of negative_revalidating_lifetime). But the Date: header is > yesterday and the max-age directive is still present and says 2 seconds. > Since the RFC specifies preference of max-age directives over the Expires > header, upstream caches will still consider this stale. s-maxage would have > this problem as well. An Age header field can also be problematic for > different reasons. > > I say this only to clarify in more detail the limitations of what is > currently done with the Expires header. None of this invalidates your > observations. I'm more comfortable with and prefer option 1 now: simply > document the current behavior. We can at least explain its rationale while > acknowledging briefly its limitations. It has been like this for years and > no one has expressed problems it has caused. I'll put up a documentation PR. > > Thanks again for your thoughts. > > Brian > > > On Thu, Mar 25, 2021 at 11:27 AM Sudheer Vinukonda > <sudheervinuko...@yahoo.com.invalid> wrote: > > > Hi Brian, > > Thank you for the detailed and excellent write up on the current behavior! > > Just to play a bit of devil's advocate, I'm wondering whether the > > `negative_revalidating_lifetime` is being interpreted incorrectly. I do > > admit it's confusing given the name and the documentation (which can both > > clearly use some improvement), but, one potential interpretation of the > > current behavior could be: > > Use `negative_revalidating_lifetime` as a delta extension to the object's > > life following a request that resulted in a Origin error (connection or > > 5xx). IOW, when configured, this timeout is used to not ping the Origin > > again and assume the object is fresh in cache for that duration. Arguably, > > this may be a way to minimize the load on the Origin that may be > > potentially experiencing systemic issues. > > The reason to still cap the overall object's life to > > `proxy.config.http.cache.max_stale_age` also seems like a > > reasonable/desirable behavior to me. This is a catchall upper bound on how > > long one would consider the object in their cache is still valid beyond > > which it is not usable anymore - meaning, at that point returning an error > > to the client is better than serving that stale copy and causing other side > > effects. > > Tldr; I think if we adjusted the documentation to explain this somehow, > > then the current behavior doesn't sound all that embarrassing to me. > > PS: Just to be clear, I'm not saying we should not change the current > > behavior in favor of a better alternative, but only trying to > > speculate/explain the possible reasoning behind the behavior. > > Thanks. > > Sudheer > > On Wednesday, March 24, 2021, 02:13:42 PM PDT, Brian Neradt < > > brian.ner...@gmail.com> wrote: > > > > dev@trafficserver.apache.org: > > > > Context > > This design review concerns > > proxy.config.http.negative_revalidating_lifetime > > < > > https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime > > > > > and > > is the result of investigating the feature for the following issue: > > > > https://github.com/apache/trafficserver/issues/7425 > > > > Feature Overview > > The ATS negative revalidation feature implements the option to serve stale > > cached responses in the circumstance that the server is not reachable for > > revalidation. This can happen due to connectivity issues against the > > origin or the origin's service being down resulting in 5xx responses. > > > > As a clarification, do not conflate negative revalidation with the negative > > caching feature. While they sound alike, the two features are orthogonal to > > each other. The ATS negative caching feature allows for caching of negative > > HTTP responses (such as 404 "Not Found"). By contrast, as summarized above, > > negative revalidation allows for serving some cached response that is stale > > when the server is not reachable for revalidation. > > > > negative_revalidating_lifetime issues > > That's the negative revalidation feature: serve stale content when the > > server is not accessible. This generally works as expected. The problem > > lies in determining whether the stale resource is fresh enough via the > > negative_revalidating_lifetime configuration for this feature. The > > intention of this lifetime configuration was to provide a way for the user > > to specify how old a stale resource can be when negative revalidation takes > > place before it will no longer be served from the cache. Looking at the > > documentation and the code for it, the *expected behavior* seemed to have > > been the following: > > > > > > 1. A client requests an item. The server responds with, say, a 200 OK > > and the ATS proxy caches the response. > > 2. At some point in the future, a client requests this item again. > > 3. ATS inspects the item and determines that it is stale. > > 4. ATS attempts to validate its cached resource against the origin. > > 5. Either a connection cannot be made to the origin or it receives a 5xx > > in response to its validation request. > > 6. If negative revalidation is enabled, ATS compares the age of the > > stale resource against negative_revalidating_lifetime. It will then do > > one > > of two things: > > 1. If the age of the cached resource is younger than > > negative_revalidating_lifetime, consider it "fresh enough" and > > reply to the > > origin with the cached response. > > 2. If the age is older than negative_revalidating_lifetime, reply > > with a 5xx response. > > > > In reality step 6 does not happen. The stale resource is never re-evaluated > > as fresh or still stale against negative_revalidating_lifetime. Rather, the > > determining factor of whether the resource is served is > > proxy.config.http.cache.max_stale_age > > < > > https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age > > >. > > That is, if its age is less than max_stale_age, it is used; otherwise it is > > not. In addition, an incorrectly calculated Expires header field is added > > which is set to negative_revalidating_lifetime seconds from the current > > time rather than negative_revalidating_lifetime seconds from the resources > > calculated generation time. > > > > These negative_revalidating_lifetime issues may very well be day zero bugs. > > Looking at git, it goes back to at least 2009 in a commit > > < > > https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06 > > > > > titled "Initial commit" (presumably when we transitioned source code > > repositories). It has not worked for at least a very long time and people > > have been relying, knowingly or not, upon max_stale_age to limit how stale > > resources can be before they are used for negative revalidation. > > > > This design review evaluates the options for how to handle > > negative_revalidating_lifetime. > > > > Options > > Option 1: Document the Current Behavior > > The first option is to leave the feature as is and document how > > negative_revalidating_lifetime currently behaves. > > > > Advantages: > > > > - This requires the least amount of work as it only involves a > > documentation change. > > - This will effectively avoid user confusion about > > negative_revalidating_lifetime so that no further issues will be filed > > and > > time wasted when it does not behave as expected. > > > > Disadvantages > > > > - negative_revalidating_lifetime, as it currently behaves, would be hard > > to explain without embarrassing the project. We would have to say > > something > > like: > > > > The value of negative_revalidating_lifetime is used to calculate an Expires > > > header with a field value of negative_revalidating_lifetime seconds in > > the > > > future from the time the cached resource is served. This header is added > > in > > > the cached response. No other behavior is influenced by this > > configuration. > > > The age limit enforced for such stale resources is configured > > > via proxy.config.http.cache.max_stale_age. > > > > > > This likely raises more questions than it answers. In reality, it would > > probably be most helpful to steer users away from this feature and toward > > max_stale_age. > > > > > > Option 2: Fix the Issues > > Alternatively, we can fix the issues surrounding > > negative_revalidating_lifetime. Looking at the code, it seems that the > > Expires header, which is currently incorrectly calculated, really isn't > > needed. Rather we can make the negative_revalidating_lifetime conditionally > > a part of the HttpTransact::is_stale_cache_response_returnable() > > < > > https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943 > > > > > function. > > > > In addition to max_stale_age, which is currently taken into account there, > > we can also compare against negative_revalidating_lifetime to see whether > > the resource can be served (but only when negative revalidation is being > > used...the function is used in other contexts). > > > > Advantages: > > > > - This addresses both the known issues with > > negative_revalidating_lifetime in that the Expires header is no longer > > added and the negative_revalidating_lifetime is effectively used in > > determining whether the item is fresh enough for the response. > > - The fix doesn't look all that hard to implement. > > > > Disadvantages: > > > > - Implementing the feature adds various complexities to ATS. It > > complicates the documentation, it complicates the code, and will require > > tests to verify it works correctly. None of this complexity is large if > > the > > feature is worth it, and I'm happy to do it. But I have my doubts that it > > is worth it. Keep in mind that negative_revalidating_lifetime has never > > been a working feature in ATS and this has caused problems for no one. If > > its absence had caused problems, we would know about it because people > > would complain. In fact, the only reported issue with the feature is the > > confusion it caused by existing in the document in the first place. Also, > > keep in mind that this feature would only be useful, if it is at all, to > > *extend* the lifetime of these cached resources past max_stale_age. Both > > of those values would be used and we would want to use the max of each. > > Currently max_stale_age defaults to 7 days. How many users will want > > max_stale_age to be one value for general max-age computations, but use a > > different, larger, value for negative revalidation? I'm guessing not > > many. > > (Alternatively, I suppose we could use only > > negative_revalidating_lifetime > > instead of using both it and max_stale_age in these situations, but that > > would potentially be a breaking change for current configurations in > > which > > negative_revalidation would now only happen for the default 30 minutes of > > negative_revalidating_lifetime rather than the default 7 days of > > max_stale_age.) > > > > > > Option 3: Remove negative_revalidating_lifetime > > We can remove negative_revalidating_lifetime from the product. This would > > involve: > > > > - Removing the documentation of negative_revalidating_lifetime > > < > > https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime > > > > > from our records.config documentation. We would also want to explain the > > role that max_stale_age plays in limiting the usage of stale resources in > > negative revalidating circumstances. > > - Removing the addition of the faulty Expires header from HttpTransact.cc > > < > > https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364 > > > > > - Keep the parsing of negative_revalidating_lifetime, if it is used by a > > user, but add a Warning that it is defunct, deprecated, and will be > > removed > > in a future release. > > > > Advantages: > > > > - This removes the confusion associated with having both a > > negative_revalidating_lifetime feature used for these negative > > revalidating > > resources and the max_stale_age feature. > > - It removes the bugs associated with negative_revalidating_lifetime > > because the feature is removed. > > > > Disadvantages: > > > > - If someone would like a negative_revalidating_lifetime feature in the > > future, it will not exist and will need to be implemented. > > > > Suggestion > > I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to > > remove because the feature does not currently work and seemingly has never > > worked; thus nothing is being taken away from ATS. On the other hand, the > > buggy behavior and confusing documentation is alleviated by it being > > removed. The downside to this option is that we won't have the added > > negative_revalidating_lifetime feature in the event that someone wants it > > in the future. But, again, that leaves us no worse off than we are today > > since our current implementation doesn't effectively implement it. > > > > Let me know if you have other thoughts or concerns. > > > > Thanks, > > Brian > > > > -- > > "Come to Me, all who are weary and heavy-laden, and I will > > give you rest. Take My yoke upon you and learn from Me, for > > I am gentle and humble in heart, and you will find rest for > > your souls. For My yoke is easy and My burden is light." > > > > ~ Matthew 11:28-30 > > > > > > > -- > "Come to Me, all who are weary and heavy-laden, and I will > give you rest. Take My yoke upon you and learn from Me, for > I am gentle and humble in heart, and you will find rest for > your souls. For My yoke is easy and My burden is light." > > ~ Matthew 11:28-30