Miles mentioned at the summit that this should check "serve stale while error" before sending a 503.
On Wed, Oct 2, 2019 at 1:47 PM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:

> You need to talk to the L7R group at the summit. What does it do if no
> upstream is available? That's the actual issue my change would address. I
> do think we should either make this change, or simply remove support for
> handling dead upstreams - the current implementation is worse than useless
> IMHO.
>
> On Wed, Oct 2, 2019 at 3:41 PM zzz <z...@apache.org> wrote:
>
> > Yes, to supplement what Sudheer pointed out: on LI Traffic's roadmap,
> > we're adopting dynamic discovery (D2)
> > <https://linkedin.github.io/rest.li/start/d2_quick_start#what-is-d2-in-a-nutshell>,
> > a client-side load-balancing strategy that can gracefully solve the
> > issue. If some origin servers are unresponsive, because of GC or other
> > issues, we measure the latency and the error rate from them, and if
> > either exceeds the specified threshold, D2 shifts load off these bad
> > servers. We can work on open sourcing the D2 solution if that sounds
> > attractive to the community.
> >
> > Sudheer Vinukonda <sudheervinuko...@yahoo.com.invalid> wrote on Tue, Oct 1, 2019 at 6:39 PM:
> >
> > > +1 on the problem :)
> > >
> > > We've often had very similar production issues too!
> > >
> > > A related issue that plays a role in all of this is server session
> > > pools, and the fact that ATS only marks a server down on a "connect
> > > failure". In particular, ATS may try to reuse an existing idle
> > > connection to an origin which is actually down (somehow the stale
> > > connection wasn't cleaned up in ATS, say the FIN was lost - this seems
> > > to happen more often than it should, unfortunately) and send the
> > > request to the origin. If that request fails (either via a TCP reset
> > > or a timeout), we don't retry at that point from the core, nor do we
> > > mark that server down. It feels like something that can (and should)
> > > be improved upon.
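The retry Sudheer is hinting at could look roughly like this - a minimal Python sketch with made-up pool/connection names (`get_idle`, `open_fresh`, `is_reused` are hypothetical, not ATS internals): if a request on a *reused* idle connection fails, retry once on a fresh connection, and only treat the failure as a real origin problem when a fresh connection also fails.

```python
class ConnectionReset(Exception):
    """Stand-in for a TCP reset / timeout on send."""


class FakeConn:
    def __init__(self, is_reused, alive):
        self.is_reused = is_reused  # True if taken from the idle pool
        self.alive = alive

    def send(self, request):
        if not self.alive:
            raise ConnectionReset()  # stale socket: origin already closed it
        return "200 OK"


class FakePool:
    def __init__(self, idle=None):
        self.idle = idle

    def get_idle(self, origin):
        conn, self.idle = self.idle, None
        return conn

    def open_fresh(self, origin):
        return FakeConn(is_reused=False, alive=True)


def send_on_pooled_connection(pool, origin, request):
    """Prefer an idle connection; if it turns out to be stale, retry once
    on a fresh connection instead of surfacing the error immediately."""
    conn = pool.get_idle(origin) or pool.open_fresh(origin)
    try:
        return conn.send(request)
    except ConnectionReset:
        if not conn.is_reused:
            raise  # a fresh connection failed: likely a real origin problem
        # The idle connection was stale (e.g. a lost FIN); one retry.
        return pool.open_fresh(origin).send(request)
```

The key design point is distinguishing "failure on first use of a reused connection" (safe to retry, says nothing about origin health) from "failure on a fresh connection" (a genuine connect-level signal that could justify marking the server down).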
> > > I haven't fully thought through what better handling would look like,
> > > but it sounds like something worth exploring as well.
> > >
> > > Also, this topic sounds like another great one to discuss during the
> > > summit.
> > >
> > > - Sudheer
> > >
> > > > On Oct 1, 2019, at 5:48 PM, Alan M. Carroll <a...@network-geographics.com> wrote:
> > > >
> > > > I would like to propose a change to how TS handles unresponsive
> > > > upstream destinations. The current logic is
> > > >
> > > > * If there is a round robin, skip elements that are down.
> > > > * Attempt to connect.
> > > > * If the connect fails, and the upstream is marked down and within
> > > >   the down server cache time, restrict the number of retries to
> > > >   down_server.retries.
> > > >
> > > > We have had some production problems recently which were made worse
> > > > by the fact that the non-responding upstream continued to be pounded
> > > > by incoming requests, which came in very fast and eventually caused
> > > > TS to fail due to too many open file descriptors.
> > > >
> > > > I would like the following logic.
> > > >
> > > > * If there is a round robin, skip ahead to find the upstream that is
> > > >   alive or has been down the longest.
> > > > * If the upstream has been marked down for less than the down server
> > > >   cache time, immediately fail (503) the request.
> > > > * If the upstream has been marked down for more than the down server
> > > >   cache time, allow the next request to attempt to connect, and
> > > >   update the fail time to block further attempts.
> > > > * If the "zombie" connect succeeds, mark the target as up (clear the
> > > >   fail time).
> > > >
> > > > The desired effect is that no upstream connection attempts are made
> > > > during the down window, and after that at most one attempt is made
> > > > per window, until a connect succeeds, at which point the throttling
> > > > is removed.
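The proposed logic can be sketched as follows. This is an illustration of the proposal, not the TS implementation: `UpstreamTarget`, `admit_request`, and the `DOWN_SERVER_CACHE_TIME` constant are hypothetical names, and real code would need locking around `fail_time`.

```python
DOWN_SERVER_CACHE_TIME = 300.0  # hypothetical down window, seconds


class UpstreamTarget:
    """Tracks down-state for one upstream target (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.fail_time = None  # None means the target is considered up

    def down_duration(self, now):
        return 0.0 if self.fail_time is None else now - self.fail_time


def select_target(targets, now):
    """Round-robin step: prefer a live target, else the longest-down one."""
    live = [t for t in targets if t.fail_time is None]
    if live:
        return live[0]
    return max(targets, key=lambda t: t.down_duration(now))


def admit_request(target, now):
    """Decide whether a request may attempt a connect.

    Returns (allowed, is_zombie). Inside the down window the request is
    failed immediately (503); past it, exactly one "zombie" probe is let
    through, and fail_time is refreshed to block further attempts until
    the probe resolves or the window expires again.
    """
    if target.fail_time is None:
        return True, False
    if now - target.fail_time < DOWN_SERVER_CACHE_TIME:
        return False, False  # still in the down window: immediate 503
    target.fail_time = now  # block other requests during this probe window
    return True, True


def on_connect_result(target, ok, is_zombie, now):
    """A successful connect clears the throttle; only a non-zombie failure
    sets the fail time (see the zombie discussion below in the thread)."""
    if ok:
        target.fail_time = None
    elif not is_zombie:
        target.fail_time = now
```

So during the window every request gets an instant 503 with no socket opened, and recovery costs at most one probe connection per window.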
> > > > The state machine for this looks like
> > > >
> > > > https://www.planttext.com/?text=RLBBJiCm4BpdAuPUEA2W2iGbuXK5a2f1GIeHZqjbqarhDRMZsA62hsTjfqr0NEBT6UzuTZU7c3tTlCRtiY1bA9uiI5bPIceIfBKeLXIV78-ZmoYwDbOqjjrKKGHmR0apX-opqPsG5TB2oa-w6a4OE81fToGe-JktEBZ0k2-PAC_YHQg5teQg4FET4EmhZx3r0CvHj4t3FkVzuzdZT7gRFU_pTxtvaCHf218yY6RjrBztHESWhIgM3BUGQQMeiDx6FOarNtD7WjmGXDKRMI1BPx0QRRrmzPqt1eQUGj7pYXegTKFGglhY3tMlOqdMmAuQtIvNnxsNi_4fSYK-MLDyLEHjCDPVy4tSwAtob0kOSqU2IGcbRTzUG3jx5_zCB8ZbjPVAQyRpUb0mTcls2qPyETmaOiwxSIF1L0_ni3A6tyX4-nLwoZKNy5-m6rMGwURcqMuVYP-sQOk2Z1doGSoH6-cPhc3WpUdjR3hO3GIXmJzCNZE-UoIw9hLmV_aF
> > > >
> > > > This is a sequence diagram showing the basic logic:
> > > >
> > > > https://www.planttext.com/?text=ZLFBReCm4BplLwp8EGIrweL35DH-GGcNjZauP4aj6KoDITI_xu8XF4eVnzsTcPqTsRlA2cFhN068r6g3NbAHmUXCXVLGL9X4aEh26gtGsnkHPJB5cCo5J3lUny18QJ-PM5RMaCWTBMu5vKLOuW9NPqDK02GHhj-OXI4-qqTDbGKkk9MCMroiDmvr5mGjOiDEkX9ED9B-hwRl-r1efcZspkWGhSn9rmwyTVoOB6P2gdO6S3QwBl59Nmj07Bbu2Ew1EmkBly6eEFKwBpe_IkfuhwkZgkcmI5yqqmXD_3YLiZQbSZfpcTkmiUFSz8DkTpBTFOu7Xtn-CJNH6zOqNX6RnEUNj-ZhkxHqs78LSc_XcVmTqKyd96CxjmSqN13Yeo9XPhdEZkZVsOvd_3M-0000
> > > >
> > > > This is a sequence diagram for a degenerate but still plausible case:
> > > >
> > > > https://www.planttext.com/?text=TP7FJeGm4CRlVOe973SYY_7YORCRVGA1Yxg70WFRM5fi1z6tTmgwYoOUc-_t_Sbqxqaw6dijAli1KGInLjs9AZsYa3LP1r7fqS6XGqCHI0_bOIlGDC3ysJCey_eldWKyqBKrvAo6g72oRLKDERftT3DMv4oHeayE63mvbFrYpuNed1q7UB9zfL1mFLmzns7WyOLjS0UF-3QYfr3ppJxOoGZMBc2v1fCaQQNIC2dJs0b8zGY3z1uzAwkOdyevQ3efmkCk96qs47SEqV2QB6WEcxzZblP5-5LUdeNhnP6bwKplbkUzHkbWtXU6pNEAbITKBYPZ2S7o1OFnyc-i1gCTQNs2ODooT9lUY3rdXY__0W00
> > > >
> > > > This can happen when the connect timeout (even for down servers) is
> > > > longer than the fail window. In such a case, a new request can come
> > > > in and be allowed to proceed while a previous attempt is still in
> > > > play.
> > > > If the upstream recovers between these two attempts, the second can
> > > > return a success before the final timeout on the first. It is for
> > > > this reason that the connect attempt needs to track whether it is a
> > > > zombie case or not - for a zombie, it must not set the fail time,
> > > > otherwise you can get a false throttle for the upstream.