Miles mentioned at the summit that this should check "serve stale while error" before sending a 503.
On Wed, Oct 2, 2019 at 1:47 PM Alan Carroll <solidwallofc...@verizonmedia.com.invalid> wrote:

> You need to talk to the L7R group at the summit. What does it do if no
> upstream is available? That's the actual issue my change would address. I
> do think we should either make this change, or simply remove support for
> handling dead upstreams - the current implementation is worse than useless
> IMHO.
>
> On Wed, Oct 2, 2019 at 3:41 PM zzz <z...@apache.org> wrote:
>
> > Yes, to supplement what Sudheer pointed out: on LI Traffic's roadmap,
> > we're adopting dynamic discovery (D2)
> > <https://linkedin.github.io/rest.li/start/d2_quick_start#what-is-d2-in-a-nutshell>,
> > a client-side load-balancing strategy that can gracefully solve the
> > issue. If some origin servers are unresponsive, because of GC or other
> > issues, we measure the latency and the error rate from them, and if
> > either exceeds the specified threshold, D2 shifts load off these bad
> > servers. We can work on open sourcing the D2 solution if that sounds
> > attractive to the community.
> >
> > Sudheer Vinukonda <sudheervinuko...@yahoo.com.invalid> wrote on Tue, Oct 1, 2019 at 6:39 PM:
> >
> > > +1 on the problem :)
> > >
> > > We've often had very similar production issues too!
> > >
> > > A related issue that plays a role in all of this is server session
> > > pools, and the fact that ATS only marks a server down on a "connect
> > > failure". In particular, ATS may try to reuse an existing idle
> > > connection to an origin which is actually down (somehow the stale
> > > connection wasn't cleaned up in ATS, say the FIN was lost - this seems
> > > to happen more often than it should, unfortunately) and send the
> > > request to the origin. If that request fails (either via a TCP reset
> > > or a timeout), we don't retry at that point from the core, nor do we
> > > mark that server down. It feels like something that can (and should)
> > > be improved upon.
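The retry Sudheer is hinting at could look roughly like this - a minimal Python sketch with made-up pool/connection names (`get_idle`, `open_fresh`, `is_reused` are hypothetical, not ATS internals): if a request on a *reused* idle connection fails, retry once on a fresh connection, and only treat the failure as a real origin problem when a fresh connection also fails.

```python
class ConnectionReset(Exception):
    """Stand-in for a TCP reset / timeout on send."""


class FakeConn:
    def __init__(self, is_reused, alive):
        self.is_reused = is_reused  # True if taken from the idle pool
        self.alive = alive

    def send(self, request):
        if not self.alive:
            raise ConnectionReset()  # stale socket: origin already closed it
        return "200 OK"


class FakePool:
    def __init__(self, idle=None):
        self.idle = idle

    def get_idle(self, origin):
        conn, self.idle = self.idle, None
        return conn

    def open_fresh(self, origin):
        return FakeConn(is_reused=False, alive=True)


def send_on_pooled_connection(pool, origin, request):
    """Prefer an idle connection; if it turns out to be stale, retry once
    on a fresh connection instead of surfacing the error immediately."""
    conn = pool.get_idle(origin) or pool.open_fresh(origin)
    try:
        return conn.send(request)
    except ConnectionReset:
        if not conn.is_reused:
            raise  # a fresh connection failed: likely a real origin problem
        # The idle connection was stale (e.g. a lost FIN); one retry.
        return pool.open_fresh(origin).send(request)
```

The key design point is distinguishing "failure on first use of a reused connection" (safe to retry, says nothing about origin health) from "failure on a fresh connection" (a genuine connect-level signal that could justify marking the server down).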
> > > I haven't fully thought through what better handling would look like,
> > > but it sounds like something worth exploring as well.
> > >
> > > Also, this topic sounds like another great one to discuss during the
> > > summit.
> > >
> > > - Sudheer
> > >
> > > > On Oct 1, 2019, at 5:48 PM, Alan M. Carroll <a...@network-geographics.com> wrote:
> > > >
> > > > I would like to propose a change to how TS handles unresponsive
> > > > upstream destinations. The current logic is
> > > >
> > > > * If there is a round robin, skip elements that are down.
> > > > * Attempt to connect.
> > > > * If the connect fails, and the upstream is marked down and within
> > > >   the down server cache time, restrict the number of retries to
> > > >   down_server.retries.
> > > >
> > > > We have had some production problems recently which were made worse
> > > > by the fact that the non-responding upstream continued to be pounded
> > > > by incoming requests, which came in very fast and eventually caused
> > > > TS to fail due to too many open file descriptors.
> > > >
> > > > I would like the following logic.
> > > >
> > > > * If there is a round robin, skip ahead to find the upstream that is
> > > >   alive or has been down the longest.
> > > > * If the upstream has been marked down for less than the down server
> > > >   cache time, immediately fail (503) the request.
> > > > * If the upstream has been marked down for more than the down server
> > > >   cache time, allow the next request to attempt to connect, and
> > > >   update the fail time to block further attempts.
> > > > * If the "zombie" connect succeeds, mark the target as up (clear the
> > > >   fail time).
> > > >
> > > > The desired effect is that no upstream connection attempts are made
> > > > during the down window, and after that at most one attempt is made
> > > > per window, until a connect succeeds, at which point the throttling
> > > > is removed.
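The proposed logic can be sketched as follows. This is an illustration of the proposal, not the TS implementation: `UpstreamTarget`, `admit_request`, and the `DOWN_SERVER_CACHE_TIME` constant are hypothetical names, and real code would need locking around `fail_time`.

```python
DOWN_SERVER_CACHE_TIME = 300.0  # hypothetical down window, seconds


class UpstreamTarget:
    """Tracks down-state for one upstream target (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.fail_time = None  # None means the target is considered up

    def down_duration(self, now):
        return 0.0 if self.fail_time is None else now - self.fail_time


def select_target(targets, now):
    """Round-robin step: prefer a live target, else the longest-down one."""
    live = [t for t in targets if t.fail_time is None]
    if live:
        return live[0]
    return max(targets, key=lambda t: t.down_duration(now))


def admit_request(target, now):
    """Decide whether a request may attempt a connect.

    Returns (allowed, is_zombie). Inside the down window the request is
    failed immediately (503); past it, exactly one "zombie" probe is let
    through, and fail_time is refreshed to block further attempts until
    the probe resolves or the window expires again.
    """
    if target.fail_time is None:
        return True, False
    if now - target.fail_time < DOWN_SERVER_CACHE_TIME:
        return False, False  # still in the down window: immediate 503
    target.fail_time = now  # block other requests during this probe window
    return True, True


def on_connect_result(target, ok, is_zombie, now):
    """A successful connect clears the throttle; only a non-zombie failure
    sets the fail time (see the zombie discussion below in the thread)."""
    if ok:
        target.fail_time = None
    elif not is_zombie:
        target.fail_time = now
```

So during the window every request gets an instant 503 with no socket opened, and recovery costs at most one probe connection per window.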
> > > > The state machine for this looks like
> > > >
> > > > https://www.planttext.com/?text=RLBBJiCm4BpdAuPUEA2W2iGbuXK5a2f1GIeHZqjbqarhDRMZsA62hsTjfqr0NEBT6UzuTZU7c3tTlCRtiY1bA9uiI5bPIceIfBKeLXIV78-ZmoYwDbOqjjrKKGHmR0apX-opqPsG5TB2oa-w6a4OE81fToGe-JktEBZ0k2-PAC_YHQg5teQg4FET4EmhZx3r0CvHj4t3FkVzuzdZT7gRFU_pTxtvaCHf218yY6RjrBztHESWhIgM3BUGQQMeiDx6FOarNtD7WjmGXDKRMI1BPx0QRRrmzPqt1eQUGj7pYXegTKFGglhY3tMlOqdMmAuQtIvNnxsNi_4fSYK-MLDyLEHjCDPVy4tSwAtob0kOSqU2IGcbRTzUG3jx5_zCB8ZbjPVAQyRpUb0mTcls2qPyETmaOiwxSIF1L0_ni3A6tyX4-nLwoZKNy5-m6rMGwURcqMuVYP-sQOk2Z1doGSoH6-cPhc3WpUdjR3hO3GIXmJzCNZE-UoIw9hLmV_aF
> > > >
> > > > This is a sequence diagram showing the basic logic:
> > > >
> > > > https://www.planttext.com/?text=ZLFBReCm4BplLwp8EGIrweL35DH-GGcNjZauP4aj6KoDITI_xu8XF4eVnzsTcPqTsRlA2cFhN068r6g3NbAHmUXCXVLGL9X4aEh26gtGsnkHPJB5cCo5J3lUny18QJ-PM5RMaCWTBMu5vKLOuW9NPqDK02GHhj-OXI4-qqTDbGKkk9MCMroiDmvr5mGjOiDEkX9ED9B-hwRl-r1efcZspkWGhSn9rmwyTVoOB6P2gdO6S3QwBl59Nmj07Bbu2Ew1EmkBly6eEFKwBpe_IkfuhwkZgkcmI5yqqmXD_3YLiZQbSZfpcTkmiUFSz8DkTpBTFOu7Xtn-CJNH6zOqNX6RnEUNj-ZhkxHqs78LSc_XcVmTqKyd96CxjmSqN13Yeo9XPhdEZkZVsOvd_3M-0000
> > > >
> > > > This is a sequence diagram for a degenerate but still plausible case:
> > > >
> > > > https://www.planttext.com/?text=TP7FJeGm4CRlVOe973SYY_7YORCRVGA1Yxg70WFRM5fi1z6tTmgwYoOUc-_t_Sbqxqaw6dijAli1KGInLjs9AZsYa3LP1r7fqS6XGqCHI0_bOIlGDC3ysJCey_eldWKyqBKrvAo6g72oRLKDERftT3DMv4oHeayE63mvbFrYpuNed1q7UB9zfL1mFLmzns7WyOLjS0UF-3QYfr3ppJxOoGZMBc2v1fCaQQNIC2dJs0b8zGY3z1uzAwkOdyevQ3efmkCk96qs47SEqV2QB6WEcxzZblP5-5LUdeNhnP6bwKplbkUzHkbWtXU6pNEAbITKBYPZ2S7o1OFnyc-i1gCTQNs2ODooT9lUY3rdXY__0W00
> > > >
> > > > This can happen when the connect timeout (even for down servers) is
> > > > longer than the fail window. In such a case, a new request can come
> > > > in and be allowed to proceed while a previous attempt is still in
> > > > play.
> > > > If the upstream recovers between these two attempts, the second can
> > > > return a success before the final timeout on the first. It is for
> > > > this reason that the connect attempt needs to track whether it is a
> > > > zombie case or not - for a zombie, it must not set the fail time,
> > > > otherwise you can get a false throttle for the upstream.