On Thu, Jun 23, 2016 at 1:52 PM, Gert Doering <g...@greenie.muc.de> wrote:
> On Sat, Jun 18, 2016 at 03:41:22PM -0400, Selva Nair wrote:
> > > > This is possible, but the case for progressively increasing the
> > > > restart pause is not very strong. Can we get some feedback from
> > > > people who serve 1000's of users?
> > >
> > > I would generally consider it polite behaviour... (also, it might save
> > > the client from filled-up disks...) - didn't think of exponential
> > > back-off yet, but that would certainly be another approach to consider.
> > >
> > > I have no idea how complicated it would be to implement, though.
> >
> > Filling up the client log with repeated retries is a concern, indeed.
> >
> > Although a proper implementation needs the failed reconnect count per
> > ip/port combination, which we do not currently keep track of, I think a
> > heuristic count may be good enough. One could use something like this:
> >
> > n = c->options.unsuccessful_attempts
> > m = c->options.connection_list->len
> > rc = n/m (a rough measure of retries per remote)
> > timeout = the_default_timeout << MIN(rc, 10) (exponential up to ~5000
> > seconds)
> > throw a SIGHUP if rc exceeds some value (resets n and starts over)
> >
> > The retry count will be over-stated for situations like one remote name
> > that resolves to many IPs, but avoiding that requires more work. Any
> > thoughts?
>
> I'm not sure I understand the math. Let me try to reword.
>
> So it would increase "n" for each failed attempt, setting it to "0" if
> one succeeds (so the reconnect after a session aborts would start "fresh").

Yes, it gets reset to zero in initialization_sequence_completed(), or after
a SIGHUP.

> Leaving out "m" for the time being, "retry time" would then get scaled by
> 2^rc (capped by 2^10), so the initial 5s would become 10s, 20s, ... 5000s
> (this is not "timeout" as in "the unified TCP connection timeout" but the
> retry timer firing after one connection attempt is aborted).
"timeout" was a poorly chosen name; it's the startup pause time, controlled
by variables like connect_retry_seconds or restart_sleep_seconds. It gets
passed to openvpn_sleep().

> So, now we have 4 different remotes. You're scaling the exponent by 1/4
> here, so the retry timer would be
>
> 5s 5s 5s 5s 10s 10s 10s 10s 20s 20s 20s 20s ...
>
> then (or, phrased differently, "one round uses the unscaled timer, the
> next round across all remotes uses 2^1, the third round uses 2^2").
>
> If I understood the math right, I think this would be useful behaviour ;-)
> - fast failover if multiple remotes are there, exponentially slowing down
> if all remotes have been tried. Plus, fairly easy to implement as nearly
> all needed values are already around.

Yes, the book-keeping variables are already there.

To avoid aggressive slowing down, maybe we could start the scaling only
after rc has reached a threshold. Say, don't do anything until rc = 5 and
then start scaling the timer until rc reaches 15.

Selva