Larger scale fix for down server marking

Alan Carroll Fri, 02 Oct 2020 08:12:09 -0700

I would like to propose a change to how TS handles unresponsive upstream
destinations. The current logic is


   - If there is a round robin, skip elements that are down.
   - Attempt to connect.
   - If the connect fails, and the upstream is marked down and within the
   down server cache time, restrict the number of retries to
   down_server.retries.

We recently had production problems that were made worse because the
non-responding upstream continued to be pounded by incoming requests. These
arrived fast enough that TS eventually failed from having too many file
descriptors open.

I would like the following logic instead.

   - If there is a round robin, skip ahead to find an upstream that is
   alive or, failing that, the one that has been down the longest.
   - If the upstream has been marked down for less than the down server
   cache time, immediately fail (503) the request.
   - If the upstream has been marked down for longer than the down server
   cache time, allow the next request to attempt to connect and update the
   fail time to block further attempts.
   - If the "zombie" connect succeeds, mark the target as up (clear the
   fail time).
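
The steps above can be sketched roughly as follows. This is an illustrative
Python sketch, not TS code: the names Upstream, select_upstream, admit and
on_connect_success, and the 300 second window, are all hypothetical, and the
candidate list is assumed to already be in round robin order starting at the
current position.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stand-in for the down server cache time, in seconds (assumed value).
DOWN_WINDOW = 300.0


@dataclass
class Upstream:
    name: str
    fail_time: Optional[float] = None  # None means the upstream is considered up


def select_upstream(candidates: List[Upstream]) -> Upstream:
    """Pick the first live upstream, else the one that has been down longest."""
    for u in candidates:
        if u.fail_time is None:
            return u
    # All are down: the smallest fail_time is the oldest failure.
    return min(candidates, key=lambda u: u.fail_time)


def admit(u: Upstream, now: float) -> str:
    """Decide what to do with a request: 'connect', 'zombie', or 'fail_503'."""
    if u.fail_time is None:
        return "connect"
    if now - u.fail_time < DOWN_WINDOW:
        return "fail_503"  # inside the down window: no connect attempt at all
    # Window expired: let this one request through and restart the window
    # so that following requests are blocked again.
    u.fail_time = now
    return "zombie"


def on_connect_success(u: Upstream) -> None:
    """A successful connect clears the fail time, removing the throttle."""
    u.fail_time = None
```

Passing the current time in explicitly keeps the sketch deterministic; a real
implementation would also need to make the check-and-update in admit atomic.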

The desired effect is that no upstream connection attempts are made during
the down window, and after that at most one attempt is made per window time
until a connect succeeds, at which point the throttling is removed.
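
To make the intended rate concrete, here is a small simulation of a steady
request stream against an upstream that stays down. It is only a sketch under
assumptions: the function name, the one request per second arrival rate and
the 300 second window are made up for illustration.

```python
from typing import Iterable, List


def connect_attempt_times(
    request_times: Iterable[float], marked_down_at: float, window: float
) -> List[float]:
    """Return the times at which a request is let through to a down upstream.

    Every other request is assumed to be failed immediately with a 503.
    """
    fail_time = marked_down_at
    attempts = []
    for now in request_times:
        if now - fail_time >= window:
            # Window expired: allow one attempt and restart the window.
            fail_time = now
            attempts.append(now)
    return attempts


# 1000 requests, one per second, upstream marked down at t=0, window of 300s.
print(connect_attempt_times(range(1000), 0, 300))  # prints [300, 600, 900]
```

Of the 1000 requests only three reach the upstream, one per window.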

The state machine for this looks like

https://www.planttext.com/?text=RLBBJiCm4BpdAuPUEA2W2iGbuXK5a2f1GIeHZqjbqarhDRMZsA62hsTjfqr0NEBT6UzuTZU7c3tTlCRtiY1bA9uiI5bPIceIfBKeLXIV78-ZmoYwDbOqjjrKKGHmR0apX-opqPsG5TB2oa-w6a4OE81fToGe-JktEBZ0k2-PAC_YHQg5teQg4FET4EmhZx3r0CvHj4t3FkVzuzdZT7gRFU_pTxtvaCHf218yY6RjrBztHESWhIgM3BUGQQMeiDx6FOarNtD7WjmGXDKRMI1BPx0QRRrmzPqt1eQUGj7pYXegTKFGglhY3tMlOqdMmAuQtIvNnxsNi_4fSYK-MLDyLEHjCDPVy4tSwAtob0kOSqU2IGcbRTzUG3jx5_zCB8ZbjPVAQyRpUb0mTcls2qPyETmaOiwxSIF1L0_ni3A6tyX4-nLwoZKNy5-m6rMGwURcqMuVYP-sQOk2Z1doGSoH6-cPhc3WpUdjR3hO3GIXmJzCNZE-UoIw9hLmV_aF

This is a sequence diagram showing the basic logic

https://www.planttext.com/?text=ZLFBReCm4BplLwp8EGIrweL35DH-GGcNjZauP4aj6KoDITI_xu8XF4eVnzsTcPqTsRlA2cFhN068r6g3NbAHmUXCXVLGL9X4aEh26gtGsnkHPJB5cCo5J3lUny18QJ-PM5RMaCWTBMu5vKLOuW9NPqDK02GHhj-OXI4-qqTDbGKkk9MCMroiDmvr5mGjOiDEkX9ED9B-hwRl-r1efcZspkWGhSn9rmwyTVoOB6P2gdO6S3QwBl59Nmj07Bbu2Ew1EmkBly6eEFKwBpe_IkfuhwkZgkcmI5yqqmXD_3YLiZQbSZfpcTkmiUFSz8DkTpBTFOu7Xtn-CJNH6zOqNX6RnEUNj-ZhkxHqs78LSc_XcVmTqKyd96CxjmSqN13Yeo9XPhdEZkZVsOvd_3M-0000

This is a sequence diagram for a degenerate but still plausible case.

https://www.planttext.com/?text=TP7FJeGm4CRlVOe973SYY_7YORCRVGA1Yxg70WFRM5fi1z6tTmgwYoOUc-_t_Sbqxqaw6dijAli1KGInLjs9AZsYa3LP1r7fqS6XGqCHI0_bOIlGDC3ysJCey_eldWKyqBKrvAo6g72oRLKDERftT3DMv4oHeayE63mvbFrYpuNed1q7UB9zfL1mFLmzns7WyOLjS0UF-3QYfr3ppJxOoGZMBc2v1fCaQQNIC2dJs0b8zGY3z1uzAwkOdyevQ3efmkCk96qs47SEqV2QB6WEcxzZblP5-5LUdeNhnP6bwKplbkUzHkbWtXU6pNEAbITKBYPZ2S7o1OFnyc-i1gCTQNs2ODooT9lUY3rdXY__0W00

This can happen when the connect timeout (even for down servers) is longer
than the fail window. In such a case, a new request can come in and be
allowed to proceed while a previous attempt is still in flight. If the
upstream recovers between the two, the second can return a success before
the first finally times out. For this reason the connect attempt needs to
track whether it is a zombie or not: a zombie must not set the fail time
when it fails, otherwise the upstream can be falsely throttled.
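
The race described here can be sketched as below. This is a hedged
illustration rather than TS code: Upstream, Attempt, is_zombie and the two
handlers are hypothetical names, and the sketch assumes the fail time was
already refreshed when the zombie request was admitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Upstream:
    fail_time: Optional[float] = None  # None means the upstream is considered up


@dataclass
class Attempt:
    is_zombie: bool  # True if admitted as a probe after the window expired


def on_connect_failure(u: Upstream, attempt: Attempt, now: float) -> None:
    """Record a failure, unless the attempt was a zombie probe."""
    # A zombie's admission already restarted the window. Setting the fail
    # time again here could overwrite a success reported by a later probe.
    if not attempt.is_zombie:
        u.fail_time = now


def on_connect_success(u: Upstream) -> None:
    """Any successful connect marks the upstream as up."""
    u.fail_time = None


# Degenerate case: two zombie probes overlap because the connect timeout is
# longer than the fail window.
upstream = Upstream(fail_time=20.0)  # window restarted when probe B was admitted
probe_a = Attempt(is_zombie=True)    # admitted earlier, still waiting to time out
on_connect_success(upstream)         # probe B connects after the upstream recovers
on_connect_failure(upstream, probe_a, now=40.0)  # probe A finally times out
print(upstream.fail_time)  # prints None: the late timeout did not re-throttle
```

Without the is_zombie check, the late timeout from the first probe would set
the fail time to 40 and block a healthy upstream for a full window.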