Hi,

On Sat, Jun 18, 2016 at 03:41:22PM -0400, Selva Nair wrote:
> > > This is possible, but the case for progressively increasing the restart
> > > pause is not very strong. Can we get some feedback from people who serve
> > > 1000's of users?
> >
> > I would generally consider it polite behaviour... (also, it might save
> > the client from filled-up disks...) - didn't think of exponential back-off
> > yet, but that would certainly be another approach to consider.
> >
> > I have no idea how complicated it would be to implement, though.
>
> Filling up the client log with repeated retries is a concern, indeed.
>
> Although a proper implementation needs the failed reconnect count per
> ip/port combination, which we do not currently keep track of, I think a
> heuristic count may be good enough. One could use something like this:
>
> n = c->options.unsuccessful_attempts
> m = c->options.connection_list->len
> rc = n/m (a rough measure of retries per remote)
> timeout = the_default_timeout << MIN(rc, 10) (exponential, up to ~ 5000 seconds)
> throw a SIGHUP if rc exceeds some value (resets n and starts over)
>
> The retry count will be over-stated for situations like one remote name
> that resolves to many IPs, but avoiding that requires more work.. Any
> thoughts?
I'm not sure I understand the math, so let me try to reword it.

So it would increase "n" for each failed attempt, setting it to "0" once one
succeeds (so the reconnect after a session aborts would start "fresh").

Leaving out "m" for the time being, the retry time would then get scaled by
2^rc (capped at 2^10), so the initial 5s would become 10s, 20s, ... 5000s.
(This is not "timeout" as in "the unified TCP connection timeout", but the
retry timer that fires after one connection attempt is aborted.)

So, now assume we have 4 different remotes. You're scaling the exponent by
1/4 here, so the retry timer would be 5s 5s 5s 5s 10s 10s 10s 10s 20s 20s
20s 20s ... - or, phrased differently, "one round across all remotes uses
the unscaled timer, the next round uses 2^1, the third round uses 2^2".

If I understood the math right, I think this would be useful behaviour ;-) -
fast failover if multiple remotes are there, and an exponential slow-down
once all remotes have been tried. Plus, it's fairly easy to implement, as
nearly all needed values are already around.

gert
--
USENET is *not* the non-clickable part of WWW!
                                                     //www.muc.de/~gert/
Gert Doering - Munich, Germany             g...@greenie.muc.de
fax: +49-89-35655025           g...@net.informatik.tu-muenchen.de