+1 on the problem :) We’ve often had very similar production issues too!
A related issue that plays a role in all of this is server session pools and the fact that ATS only marks a server down on a “connect failure”. In particular, ATS may try to reuse an existing idle connection to an origin that is actually down (somehow the stale connection wasn’t cleaned up in ATS, say because the FIN was lost, which unfortunately seems to happen more often than it should). If ATS then sends the request on that connection and the request fails (either via a TCP reset or a timeout), the core neither retries at that point nor marks that server down. That feels like something that can (and should) be improved. I haven’t fully thought through what better handling would look like, but it sounds worth exploring as well.

Also, this topic sounds like another great one to discuss during the summit.

- Sudheer

> On Oct 1, 2019, at 5:48 PM, Alan M. Carroll <a...@network-geographics.com> wrote:
>
> I would like to propose a change to how TS handles unresponsive upstream
> destinations. The current logic is
>
> * If there is a round robin, skip elements that are down.
> * Attempt to connect.
> * If the connect fails, and the upstream is marked down and within the down
> server cache time, restrict the number of retries to down_server.retries.
>
> We have had some production problems recently which were made worse by the
> fact that the non-responding upstream continued to be pounded by incoming
> requests, which came very fast and eventually caused TS to fail due to too
> many file descriptors open.
>
> I would like the following logic.
>
> * If there is a round robin, skip ahead to find the upstream that is alive or
> has been down the longest.
>
> * If the upstream has been marked down less than the down server cache time,
> immediately fail (503) the request.
>
> * If the upstream has been marked down more than the down server cache time,
> allow the next request to attempt to connect and update the fail
> time to block further attempts.
>
> * If the "zombie" connect succeeds, mark the target as up (clear the fail
> time).
>
> The desired effect is no upstream connection attempts are made during the
> down window, and after only one attempt at most per window time, until a
> connect succeeds, at which point the throttling is removed.
>
>
> The state machine for this looks like
>
> https://www.planttext.com/?text=RLBBJiCm4BpdAuPUEA2W2iGbuXK5a2f1GIeHZqjbqarhDRMZsA62hsTjfqr0NEBT6UzuTZU7c3tTlCRtiY1bA9uiI5bPIceIfBKeLXIV78-ZmoYwDbOqjjrKKGHmR0apX-opqPsG5TB2oa-w6a4OE81fToGe-JktEBZ0k2-PAC_YHQg5teQg4FET4EmhZx3r0CvHj4t3FkVzuzdZT7gRFU_pTxtvaCHf218yY6RjrBztHESWhIgM3BUGQQMeiDx6FOarNtD7WjmGXDKRMI1BPx0QRRrmzPqt1eQUGj7pYXegTKFGglhY3tMlOqdMmAuQtIvNnxsNi_4fSYK-MLDyLEHjCDPVy4tSwAtob0kOSqU2IGcbRTzUG3jx5_zCB8ZbjPVAQyRpUb0mTcls2qPyETmaOiwxSIF1L0_ni3A6tyX4-nLwoZKNy5-m6rMGwURcqMuVYP-sQOk2Z1doGSoH6-cPhc3WpUdjR3hO3GIXmJzCNZE-UoIw9hLmV_aF
>
> This is a sequence diagram showing the basic logic
>
> https://www.planttext.com/?text=ZLFBReCm4BplLwp8EGIrweL35DH-GGcNjZauP4aj6KoDITI_xu8XF4eVnzsTcPqTsRlA2cFhN068r6g3NbAHmUXCXVLGL9X4aEh26gtGsnkHPJB5cCo5J3lUny18QJ-PM5RMaCWTBMu5vKLOuW9NPqDK02GHhj-OXI4-qqTDbGKkk9MCMroiDmvr5mGjOiDEkX9ED9B-hwRl-r1efcZspkWGhSn9rmwyTVoOB6P2gdO6S3QwBl59Nmj07Bbu2Ew1EmkBly6eEFKwBpe_IkfuhwkZgkcmI5yqqmXD_3YLiZQbSZfpcTkmiUFSz8DkTpBTFOu7Xtn-CJNH6zOqNX6RnEUNj-ZhkxHqs78LSc_XcVmTqKyd96CxjmSqN13Yeo9XPhdEZkZVsOvd_3M-0000
>
> This is a sequence diagram for a degenerate but still plausible case.
>
> https://www.planttext.com/?text=TP7FJeGm4CRlVOe973SYY_7YORCRVGA1Yxg70WFRM5fi1z6tTmgwYoOUc-_t_Sbqxqaw6dijAli1KGInLjs9AZsYa3LP1r7fqS6XGqCHI0_bOIlGDC3ysJCey_eldWKyqBKrvAo6g72oRLKDERftT3DMv4oHeayE63mvbFrYpuNed1q7UB9zfL1mFLmzns7WyOLjS0UF-3QYfr3ppJxOoGZMBc2v1fCaQQNIC2dJs0b8zGY3z1uzAwkOdyevQ3efmkCk96qs47SEqV2QB6WEcxzZblP5-5LUdeNhnP6bwKplbkUzHkbWtXU6pNEAbITKBYPZ2S7o1OFnyc-i1gCTQNs2ODooT9lUY3rdXY__0W00
>
> This can happen when the connect timeout (even for down servers) is longer
> than the fail window. In such a case, a new request can come in and be
> allowed to proceed while a previous attempt is still in play. If the upstream
> recovers between these two, it can be the case the second returns a success
> before the final timeout on the first. It is for this reason the connect
> attempt needs to track if it is a zombie case or not - for a zombie, it must
> not set the fail time otherwise you can get a false throttle for the upstream.
>
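
To make the zombie rule in the last quoted paragraph concrete, here is a rough Python sketch of the proposed throttle as I read it. This is not ATS code: `Upstream`, `DownServerThrottle`, `fail_time`, `window`, `select`, `begin` and `finish` are all names made up for illustration, `window` stands in for the down server cache time, and the round robin is reduced to a simple pick with no rotation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Upstream:
    # Hypothetical record for one upstream; not an ATS structure.
    name: str
    fail_time: Optional[float] = None  # None means "considered up"


class DownServerThrottle:
    """Illustrative model of the proposed logic, under the assumptions above."""

    def __init__(self, window: float):
        self.window = window  # stand-in for the down server cache time

    def select(self, upstreams: List[Upstream]) -> Upstream:
        # Prefer an upstream that is alive; otherwise take the one that
        # has been down the longest (smallest fail_time).
        alive = [u for u in upstreams if u.fail_time is None]
        if alive:
            return alive[0]
        return min(upstreams, key=lambda u: u.fail_time)

    def begin(self, up: Upstream, now: float) -> str:
        # Returns "normal", "zombie", or "reject" (fail the request, 503).
        if up.fail_time is None:
            return "normal"
        if now - up.fail_time < self.window:
            return "reject"
        # Window expired: let this one request through and restart the
        # window so that further requests are blocked meanwhile.
        up.fail_time = now
        return "zombie"

    def finish(self, up: Upstream, kind: str, ok: bool, now: float) -> None:
        if ok:
            up.fail_time = None  # any successful connect marks it up
        elif kind == "normal":
            up.fail_time = now  # first failure marks it down
        # A failed zombie leaves fail_time alone: begin() already set it,
        # and a late timeout must not re-throttle a recovered upstream.


if __name__ == "__main__":
    throttle = DownServerThrottle(window=10.0)
    origin = Upstream("origin-a")
    kind = throttle.begin(origin, now=0.0)
    throttle.finish(origin, kind, ok=False, now=1.0)
    print(throttle.begin(origin, now=5.0))   # inside the window
    print(throttle.begin(origin, now=11.0))  # window expired
```

Because a failed zombie never touches `fail_time`, a late timeout from an older zombie cannot re-throttle an upstream that a newer zombie has already marked up, which is the degenerate case in the last diagram.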