Re: Larger scale fix for down server marking

Sudheer Vinukonda Fri, 02 Oct 2020 09:06:33 -0700

Sounds reasonable to me!

A related aspect, for completeness, would be to also drop all the existing 
keep-alive sessions (including the “floor”) to a Server that is marked down. 
Session reuse doesn’t go through the connect logic, so we can still get the 
same client socket buildup when those sessions are picked, depending on “how” 
down the Server is. We’ve had servers that were down but did not return an 
error right away, so the failure took a very long time to detect.
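
To make the idea concrete, here is a minimal Python sketch of purging idle 
keep-alive sessions when a server is marked down. It is not TS code: 
`SessionPool`, `FakeSession`, `release`, `acquire`, and `purge` are 
hypothetical names used only for illustration.

```python
from collections import defaultdict


class FakeSession:
    """Hypothetical stand-in for an idle keep-alive session to an upstream."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class SessionPool:
    """Hypothetical pool of idle keep-alive sessions, keyed by upstream host."""

    def __init__(self):
        self._idle = defaultdict(list)

    def release(self, host, session):
        # Return a session to the pool so a later request can reuse it.
        self._idle[host].append(session)

    def acquire(self, host):
        # Reuse skips the connect logic entirely, so a down server's
        # sessions would still be handed out unless they are purged.
        sessions = self._idle.get(host)
        return sessions.pop() if sessions else None

    def purge(self, host):
        # Called when `host` is marked down: close and drop every idle session.
        sessions = self._idle.pop(host, [])
        for session in sessions:
            session.close()
        return len(sessions)
```

With something like this, the code path that marks a server down would also 
call `purge(host)`, so later requests fall through to the connect logic.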


BTW, it somehow feels like we already had this discussion over email :) Has 
anything changed since then, or am I imagining the email discussion?

 

> On Oct 2, 2020, at 8:11 AM, Alan Carroll 
> <[email protected]> wrote:
> 
> I would like to propose a change to how TS handles unresponsive upstream
> destinations. The current logic is
> 
>   - If there is a round robin, skip elements that are down.
>   - Attempt to connect.
>   - If the connect fails, and the upstream is marked down and within the
>   down server cache time, restrict the number of retries to
>   down_server.retries.
> 
> We have had some production problems recently which were made worse by the
> fact that the non-responding upstream continued to be pounded by incoming
> requests, which came very fast and eventually caused TS to fail due to too
> many file descriptors open.
> 
> I would like the following logic.
> 
>   - If there is a round robin, skip ahead to find the upstream that is
>   alive or has been down the longest.
>   - If the upstream has been marked down less than the down server cache
>   time, immediately fail (503) the request.
>   - If the upstream has been marked down more than the down server cache
>   time, allow the next request to attempt to connect and update the fail
>   time to block further attempts.
>   - If the "zombie" connect succeeds, mark the target as up (clear the
>   fail time).
> 
> The desired effect is that no upstream connection attempts are made during
> the down window, and after that at most one attempt is made per window
> time until a connect succeeds, at which point the throttling is removed.
> 
> The state machine for this looks like
> 
> https://www.planttext.com/?text=RLBBJiCm4BpdAuPUEA2W2iGbuXK5a2f1GIeHZqjbqarhDRMZsA62hsTjfqr0NEBT6UzuTZU7c3tTlCRtiY1bA9uiI5bPIceIfBKeLXIV78-ZmoYwDbOqjjrKKGHmR0apX-opqPsG5TB2oa-w6a4OE81fToGe-JktEBZ0k2-PAC_YHQg5teQg4FET4EmhZx3r0CvHj4t3FkVzuzdZT7gRFU_pTxtvaCHf218yY6RjrBztHESWhIgM3BUGQQMeiDx6FOarNtD7WjmGXDKRMI1BPx0QRRrmzPqt1eQUGj7pYXegTKFGglhY3tMlOqdMmAuQtIvNnxsNi_4fSYK-MLDyLEHjCDPVy4tSwAtob0kOSqU2IGcbRTzUG3jx5_zCB8ZbjPVAQyRpUb0mTcls2qPyETmaOiwxSIF1L0_ni3A6tyX4-nLwoZKNy5-m6rMGwURcqMuVYP-sQOk2Z1doGSoH6-cPhc3WpUdjR3hO3GIXmJzCNZE-UoIw9hLmV_aF
> 
> This is a sequence diagram showing the basic logic
> 
> https://www.planttext.com/?text=ZLFBReCm4BplLwp8EGIrweL35DH-GGcNjZauP4aj6KoDITI_xu8XF4eVnzsTcPqTsRlA2cFhN068r6g3NbAHmUXCXVLGL9X4aEh26gtGsnkHPJB5cCo5J3lUny18QJ-PM5RMaCWTBMu5vKLOuW9NPqDK02GHhj-OXI4-qqTDbGKkk9MCMroiDmvr5mGjOiDEkX9ED9B-hwRl-r1efcZspkWGhSn9rmwyTVoOB6P2gdO6S3QwBl59Nmj07Bbu2Ew1EmkBly6eEFKwBpe_IkfuhwkZgkcmI5yqqmXD_3YLiZQbSZfpcTkmiUFSz8DkTpBTFOu7Xtn-CJNH6zOqNX6RnEUNj-ZhkxHqs78LSc_XcVmTqKyd96CxjmSqN13Yeo9XPhdEZkZVsOvd_3M-0000
> 
> This is a sequence diagram for a degenerate but still plausible case.
> 
> https://www.planttext.com/?text=TP7FJeGm4CRlVOe973SYY_7YORCRVGA1Yxg70WFRM5fi1z6tTmgwYoOUc-_t_Sbqxqaw6dijAli1KGInLjs9AZsYa3LP1r7fqS6XGqCHI0_bOIlGDC3ysJCey_eldWKyqBKrvAo6g72oRLKDERftT3DMv4oHeayE63mvbFrYpuNed1q7UB9zfL1mFLmzns7WyOLjS0UF-3QYfr3ppJxOoGZMBc2v1fCaQQNIC2dJs0b8zGY3z1uzAwkOdyevQ3efmkCk96qs47SEqV2QB6WEcxzZblP5-5LUdeNhnP6bwKplbkUzHkbWtXU6pNEAbITKBYPZ2S7o1OFnyc-i1gCTQNs2ODooT9lUY3rdXY__0W00
> 
> This can happen when the connect timeout (even for down servers) is longer
> than the fail window. In such a case, a new request can come in and be
> allowed to proceed while a previous attempt is still in play. If the
> upstream recovers between these two, the second can return a success
> before the final timeout on the first. For this reason the connect attempt
> needs to track whether it is a zombie case or not: a zombie must not set
> the fail time, otherwise you can get a false throttle for the upstream.
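
For reference, the proposed logic quoted above can be sketched roughly as 
follows. This is a minimal Python sketch under stated assumptions, not TS 
code: `Upstream`, `Admit`, `DownServerThrottle`, `select`, and `fail_window` 
are hypothetical names, `fail_window` stands in for the down server cache 
time, and the round robin step is simplified to a scan over a list.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Admit(Enum):
    NORMAL = "normal"   # upstream is up, connect as usual
    ZOMBIE = "zombie"   # the single probe allowed after the window expires
    REJECT = "reject"   # still inside the down window, fail the request (503)


@dataclass
class Upstream:
    name: str
    fail_time: Optional[float] = None  # None means the upstream is up


def select(upstreams: List[Upstream]) -> Upstream:
    """Pick an alive upstream, else the one that has been down the longest."""
    for up in upstreams:
        if up.fail_time is None:
            return up
    return min(upstreams, key=lambda u: u.fail_time)


class DownServerThrottle:
    def __init__(self, fail_window: float,
                 clock: Callable[[], float] = time.monotonic):
        self.fail_window = fail_window
        self.clock = clock

    def admit(self, up: Upstream) -> Admit:
        if up.fail_time is None:
            return Admit.NORMAL
        now = self.clock()
        if now - up.fail_time < self.fail_window:
            return Admit.REJECT
        # Window expired: let this one request probe, and move the fail
        # time forward so other requests are blocked for another window.
        up.fail_time = now
        return Admit.ZOMBIE

    def on_connect_result(self, up: Upstream, kind: Admit, ok: bool) -> None:
        if ok:
            up.fail_time = None          # any success marks the target up
        elif kind is not Admit.ZOMBIE:
            up.fail_time = self.clock()  # normal failure starts the window
        # A failed zombie leaves the fail time alone: it was already set at
        # admission, and a late timeout must not re-throttle an upstream
        # that a later attempt has already marked up.
```

The last branch is the degenerate case from the third diagram: a zombie 
attempt that times out after a later attempt succeeded leaves the upstream 
marked up.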
