Yes, as a supplement to what Sudheer pointed out: on LI Traffic's roadmap, we're adopting dynamic discovery <https://linkedin.github.io/rest.li/start/d2_quick_start#what-is-d2-in-a-nutshell> (D2), a client-side load-balancing strategy that can gracefully solve this issue. If some origin servers become unresponsive, because of GC pauses or other issues, we measure their latency and error rate, and if either exceeds the specified threshold, D2 shifts load off those bad servers. We can work on open sourcing the D2 solution if it sounds attractive to the community.
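To make the idea concrete, here is a minimal sketch of that client-side strategy. This is illustrative Python, not the actual D2/rest.li Java implementation; the class names, the rolling-window design, and the thresholds (`max_error_rate`, `max_latency_ms`) are my own assumptions, chosen only to show how per-server latency/error-rate tracking can shift load off bad servers:

```python
import random
from collections import deque


class ServerStats:
    """Rolling window of recent call outcomes for one origin server."""

    def __init__(self, window=100):
        self.samples = deque(maxlen=window)  # (latency_ms, ok) pairs

    def record(self, latency_ms, ok):
        self.samples.append((latency_ms, ok))

    @property
    def error_rate(self):
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

    @property
    def avg_latency(self):
        if not self.samples:
            return 0.0
        return sum(lat for lat, _ in self.samples) / len(self.samples)


class ClientSideBalancer:
    """Route requests only to servers below the error/latency thresholds."""

    def __init__(self, servers, max_error_rate=0.5, max_latency_ms=500):
        self.stats = {s: ServerStats() for s in servers}
        self.max_error_rate = max_error_rate
        self.max_latency_ms = max_latency_ms

    def healthy(self):
        return [s for s, st in self.stats.items()
                if st.error_rate <= self.max_error_rate
                and st.avg_latency <= self.max_latency_ms]

    def pick(self):
        candidates = self.healthy()
        # If every server looks bad, fall back to the full pool rather
        # than failing every request outright.
        return random.choice(candidates or list(self.stats))
```

The real D2 degrader is more gradual (it shifts a fraction of load rather than making a binary healthy/unhealthy cut), but the shape of the feedback loop is the same: clients observe outcomes and steer traffic away before the origin collapses.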
Sudheer Vinukonda <sudheervinuko...@yahoo.com.invalid> wrote on Tue, Oct 1, 2019, 6:39 PM:

> +1 on the problem :)
>
> We’ve often had very similar production issues too!
>
> A related issue that plays a role in all of this mechanism is server
> session pools and the fact that ATS only marks a server down on a “connect
> failure”. In particular, when ATS tries to reuse an existing idle
> connection to an origin which is actually down (somehow the stale
> connection wasn’t cleaned up in ATS, say the FIN was lost - this seems to
> happen more often than it should, unfortunately), sends the request to the
> origin, and that request fails (either via a TCP reset or a timeout), we
> don’t retry at that point from the core, nor do we mark that server down.
> It feels like something that can (should) be improved upon.
>
> I haven’t fully thought through what better handling would look like, but
> it sounds like something worth exploring as well?
>
> Also, this topic sounds like another great one to discuss during the
> summit.
>
> - Sudheer
>
> > On Oct 1, 2019, at 5:48 PM, Alan M. Carroll <a...@network-geographics.com>
> > wrote:
> >
> > I would like to propose a change to how TS handles unresponsive upstream
> > destinations. The current logic is
> >
> > * If there is a round robin, skip elements that are down.
> > * Attempt to connect.
> > * If the connect fails, and the upstream is marked down and within the
> >   down server cache time, restrict the number of retries to
> >   down_server.retries.
> >
> > We have had some production problems recently which were made worse by
> > the fact that the non-responding upstream continued to be pounded by
> > incoming requests, which came very fast and eventually caused TS to fail
> > due to too many open file descriptors.
> >
> > I would like the following logic.
> >
> > * If there is a round robin, skip ahead to find the upstream that is
> >   alive or has been down the longest.
> > * If the upstream has been marked down for less than the down server
> >   cache time, immediately fail (503) the request.
> >
> > * If the upstream has been marked down for more than the down server
> >   cache time, allow the next request to attempt to connect, and update
> >   the fail time to block further attempts.
> >
> > * If the "zombie" connect succeeds, mark the target as up (clear the
> >   fail time).
> >
> > The desired effect is that no upstream connection attempts are made
> > during the down window, and at most one attempt is made per window
> > thereafter, until a connect succeeds, at which point the throttling is
> > removed.
> >
> > The state machine for this looks like
> >
> > https://www.planttext.com/?text=RLBBJiCm4BpdAuPUEA2W2iGbuXK5a2f1GIeHZqjbqarhDRMZsA62hsTjfqr0NEBT6UzuTZU7c3tTlCRtiY1bA9uiI5bPIceIfBKeLXIV78-ZmoYwDbOqjjrKKGHmR0apX-opqPsG5TB2oa-w6a4OE81fToGe-JktEBZ0k2-PAC_YHQg5teQg4FET4EmhZx3r0CvHj4t3FkVzuzdZT7gRFU_pTxtvaCHf218yY6RjrBztHESWhIgM3BUGQQMeiDx6FOarNtD7WjmGXDKRMI1BPx0QRRrmzPqt1eQUGj7pYXegTKFGglhY3tMlOqdMmAuQtIvNnxsNi_4fSYK-MLDyLEHjCDPVy4tSwAtob0kOSqU2IGcbRTzUG3jx5_zCB8ZbjPVAQyRpUb0mTcls2qPyETmaOiwxSIF1L0_ni3A6tyX4-nLwoZKNy5-m6rMGwURcqMuVYP-sQOk2Z1doGSoH6-cPhc3WpUdjR3hO3GIXmJzCNZE-UoIw9hLmV_aF
> >
> > This is a sequence diagram showing the basic logic:
> >
> > https://www.planttext.com/?text=ZLFBReCm4BplLwp8EGIrweL35DH-GGcNjZauP4aj6KoDITI_xu8XF4eVnzsTcPqTsRlA2cFhN068r6g3NbAHmUXCXVLGL9X4aEh26gtGsnkHPJB5cCo5J3lUny18QJ-PM5RMaCWTBMu5vKLOuW9NPqDK02GHhj-OXI4-qqTDbGKkk9MCMroiDmvr5mGjOiDEkX9ED9B-hwRl-r1efcZspkWGhSn9rmwyTVoOB6P2gdO6S3QwBl59Nmj07Bbu2Ew1EmkBly6eEFKwBpe_IkfuhwkZgkcmI5yqqmXD_3YLiZQbSZfpcTkmiUFSz8DkTpBTFOu7Xtn-CJNH6zOqNX6RnEUNj-ZhkxHqs78LSc_XcVmTqKyd96CxjmSqN13Yeo9XPhdEZkZVsOvd_3M-0000
> >
> > This is a sequence diagram for a degenerate but still plausible case:
> > https://www.planttext.com/?text=TP7FJeGm4CRlVOe973SYY_7YORCRVGA1Yxg70WFRM5fi1z6tTmgwYoOUc-_t_Sbqxqaw6dijAli1KGInLjs9AZsYa3LP1r7fqS6XGqCHI0_bOIlGDC3ysJCey_eldWKyqBKrvAo6g72oRLKDERftT3DMv4oHeayE63mvbFrYpuNed1q7UB9zfL1mFLmzns7WyOLjS0UF-3QYfr3ppJxOoGZMBc2v1fCaQQNIC2dJs0b8zGY3z1uzAwkOdyevQ3efmkCk96qs47SEqV2QB6WEcxzZblP5-5LUdeNhnP6bwKplbkUzHkbWtXU6pNEAbITKBYPZ2S7o1OFnyc-i1gCTQNs2ODooT9lUY3rdXY__0W00
> >
> > This can happen when the connect timeout (even for down servers) is
> > longer than the fail window. In such a case, a new request can come in
> > and be allowed to proceed while a previous attempt is still in play. If
> > the upstream recovers between these two, the second attempt can return a
> > success before the final timeout on the first. It is for this reason that
> > the connect attempt needs to track whether it is a zombie case or not:
> > for a zombie, it must not set the fail time, otherwise you can get a
> > false throttle for the upstream.
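To check my reading of the proposed logic, here is a rough sketch in Python. This is not ATS code; `FAIL_WINDOW`, `admit`, and `on_connect_result` are illustrative names, and I am taking the down server cache time as a single constant. The key detail from the degenerate case above is in the last branch: a failed zombie probe must not touch the fail time.

```python
import time

FAIL_WINDOW = 30.0  # stand-in for the down server cache time, in seconds


class Upstream:
    def __init__(self, name):
        self.name = name
        self.fail_time = None  # None means the upstream is considered up

    def admit(self, now=None):
        """Decide whether a request may attempt a connect right now.

        Returns (allowed, zombie). allowed=False means fail fast with 503;
        zombie=True marks the single probe allowed per window.
        """
        now = time.monotonic() if now is None else now
        if self.fail_time is None:
            return True, False          # up: connect normally
        if now - self.fail_time < FAIL_WINDOW:
            return False, False         # inside the down window: 503 immediately
        # Past the window: let one "zombie" probe through, and push the
        # fail time forward to block further attempts in the meantime.
        self.fail_time = now
        return True, True

    def on_connect_result(self, ok, zombie, now=None):
        now = time.monotonic() if now is None else now
        if ok:
            self.fail_time = None       # success (zombie or not): mark up
        elif not zombie:
            self.fail_time = now        # normal failure: mark down
        # A failed zombie deliberately does NOT set fail_time: if a slow
        # zombie times out after a later request already found the upstream
        # recovered, setting it here would falsely re-throttle the upstream.
```

Under this reading, the down window sees zero connect attempts, each subsequent window sees at most one, and a late zombie timeout cannot undo a recovery observed by a newer request.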