I would like to propose a change to how TS handles unresponsive upstream destinations. The current logic is

- If there is a round robin, skip elements that are down.
- Attempt to connect.
- If the connect fails, and the upstream is marked down and within the down server cache time, restrict the number of retries to down_server.retries.

We have had some production problems recently which were made worse by the fact that the non-responding upstream continued to be pounded by incoming requests. These arrived very fast and eventually caused TS to fail because too many file descriptors were open. I would like the following logic.

- If there is a round robin, skip ahead to find the upstream that is alive or has been down the longest.
- If the upstream has been marked down for less than the down server cache time, immediately fail (503) the request.
- If the upstream has been marked down for more than the down server cache time, allow the next request to attempt to connect and update the fail time to block further attempts.
- If the "zombie" connect succeeds, mark the target as up (clear the fail time).

The desired effect is that no upstream connection attempts are made during the down window, and that after the window at most one attempt is made per window time, until a connect succeeds, at which point the throttling is removed.
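The proposed logic can be sketched roughly as follows. This is only an illustration in Python, not TS code: `Upstream`, `select_upstream`, `admit`, the callback names, and the 60 second window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical stand-in for the down server cache time, in seconds.
DOWN_SERVER_CACHE_TIME = 60.0


@dataclass
class Upstream:
    name: str
    fail_time: Optional[float] = None  # None means the upstream is considered up


def select_upstream(targets: Sequence[Upstream]) -> Upstream:
    """Round robin step: prefer a live upstream, else the one down the longest."""
    for target in targets:
        if target.fail_time is None:
            return target
    # All are down: the smallest fail time has been down the longest.
    return min(targets, key=lambda t: t.fail_time)


def admit(target: Upstream, now: float,
          window: float = DOWN_SERVER_CACHE_TIME) -> str:
    """Return 'connect', 'fail' (immediate 503), or 'zombie' (single probe)."""
    if target.fail_time is None:
        return "connect"
    if now - target.fail_time < window:
        return "fail"
    # Window expired: let this one request probe, and move the fail time
    # forward so that further requests are blocked for another window.
    target.fail_time = now
    return "zombie"


def on_connect_success(target: Upstream) -> None:
    """Any successful connect marks the target as up."""
    target.fail_time = None


def on_connect_failure(target: Upstream, now: float, zombie: bool) -> None:
    """Only a non-zombie failure marks the target down; the zombie's fail
    time was already updated when it was admitted."""
    if not zombie:
        target.fail_time = now
```

With this shape, a down upstream sees at most one connect attempt per window, and every other request in the window is rejected without opening a socket.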
The state machine for this looks like

https://www.planttext.com/?text=RLBBJiCm4BpdAuPUEA2W2iGbuXK5a2f1GIeHZqjbqarhDRMZsA62hsTjfqr0NEBT6UzuTZU7c3tTlCRtiY1bA9uiI5bPIceIfBKeLXIV78-ZmoYwDbOqjjrKKGHmR0apX-opqPsG5TB2oa-w6a4OE81fToGe-JktEBZ0k2-PAC_YHQg5teQg4FET4EmhZx3r0CvHj4t3FkVzuzdZT7gRFU_pTxtvaCHf218yY6RjrBztHESWhIgM3BUGQQMeiDx6FOarNtD7WjmGXDKRMI1BPx0QRRrmzPqt1eQUGj7pYXegTKFGglhY3tMlOqdMmAuQtIvNnxsNi_4fSYK-MLDyLEHjCDPVy4tSwAtob0kOSqU2IGcbRTzUG3jx5_zCB8ZbjPVAQyRpUb0mTcls2qPyETmaOiwxSIF1L0_ni3A6tyX4-nLwoZKNy5-m6rMGwURcqMuVYP-sQOk2Z1doGSoH6-cPhc3WpUdjR3hO3GIXmJzCNZE-UoIw9hLmV_aF

This is a sequence diagram showing the basic logic

https://www.planttext.com/?text=ZLFBReCm4BplLwp8EGIrweL35DH-GGcNjZauP4aj6KoDITI_xu8XF4eVnzsTcPqTsRlA2cFhN068r6g3NbAHmUXCXVLGL9X4aEh26gtGsnkHPJB5cCo5J3lUny18QJ-PM5RMaCWTBMu5vKLOuW9NPqDK02GHhj-OXI4-qqTDbGKkk9MCMroiDmvr5mGjOiDEkX9ED9B-hwRl-r1efcZspkWGhSn9rmwyTVoOB6P2gdO6S3QwBl59Nmj07Bbu2Ew1EmkBly6eEFKwBpe_IkfuhwkZgkcmI5yqqmXD_3YLiZQbSZfpcTkmiUFSz8DkTpBTFOu7Xtn-CJNH6zOqNX6RnEUNj-ZhkxHqs78LSc_XcVmTqKyd96CxjmSqN13Yeo9XPhdEZkZVsOvd_3M-0000

This is a sequence diagram for a degenerate but still plausible case.

https://www.planttext.com/?text=TP7FJeGm4CRlVOe973SYY_7YORCRVGA1Yxg70WFRM5fi1z6tTmgwYoOUc-_t_Sbqxqaw6dijAli1KGInLjs9AZsYa3LP1r7fqS6XGqCHI0_bOIlGDC3ysJCey_eldWKyqBKrvAo6g72oRLKDERftT3DMv4oHeayE63mvbFrYpuNed1q7UB9zfL1mFLmzns7WyOLjS0UF-3QYfr3ppJxOoGZMBc2v1fCaQQNIC2dJs0b8zGY3z1uzAwkOdyevQ3efmkCk96qs47SEqV2QB6WEcxzZblP5-5LUdeNhnP6bwKplbkUzHkbWtXU6pNEAbITKBYPZ2S7o1OFnyc-i1gCTQNs2ODooT9lUY3rdXY__0W00

This can happen when the connect timeout (even for down servers) is longer than the fail window. In such a case, a new request can come in and be allowed to proceed while a previous attempt is still in play. If the upstream recovers between these two, the second attempt can return a success before the final timeout on the first.
It is for this reason that the connect attempt needs to track whether it is a zombie case or not. A zombie must not set the fail time when it fails; otherwise you can get a false throttle for the upstream.
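The degenerate timeline can be replayed in a few lines to show why the zombie flag matters. This is a sketch only: `UpstreamState`, `ConnectAttempt`, `replay_degenerate_case`, and the chosen times (a 10 second window with a 30 second connect timeout) are hypothetical, and the window check itself is assumed to have been done by the caller.

```python
from typing import Optional


class UpstreamState:
    def __init__(self) -> None:
        self.fail_time: Optional[float] = None  # None means up


class ConnectAttempt:
    """One connect attempt; whether it is a zombie is fixed at admit time."""

    def __init__(self, state: UpstreamState, now: float) -> None:
        self.state = state
        self.zombie = state.fail_time is not None
        if self.zombie:
            # Block further attempts for another window.
            state.fail_time = now

    def succeeded(self) -> None:
        self.state.fail_time = None

    def failed(self, now: float, track_zombie: bool) -> None:
        # With tracking, a zombie failure leaves the fail time alone.
        if track_zombie and self.zombie:
            return
        self.state.fail_time = now


def replay_degenerate_case(track_zombie: bool) -> Optional[float]:
    """Replay: window 10s, connect timeout 30s. Returns the final fail time."""
    state = UpstreamState()
    state.fail_time = 0.0                  # t=0: upstream marked down
    first = ConnectAttempt(state, 10.0)    # t=10: window expired, first zombie
    second = ConnectAttempt(state, 20.0)   # t=20: window expired again
    second.succeeded()                     # t=25: upstream recovered
    first.failed(40.0, track_zombie)       # t=40: first attempt finally times out
    return state.fail_time
```

When the zombie flag is honored the upstream stays marked up after the late timeout. When it is ignored, the stale timeout marks a healthy upstream as down again, which is the false throttle described above.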