On Thu, Jun 23, 2016 at 1:52 PM, Gert Doering <g...@greenie.muc.de> wrote:

> On Sat, Jun 18, 2016 at 03:41:22PM -0400, Selva Nair wrote:
> > > > This is possible, but the case for progressively increasing the
> restart
> > > > pause is not very strong. Can we get some feedback from people who
> serve
> > > > 1000's of users?
> > >
> > > I would generally consider it polite behaviour... (also, it might save
> > > the client from filled-up disks...) - didn't think of exponential
> back-off
> > > yet, but that would certainly be another approach to consider.
> > >
> > > I have no idea how complicated it would be to implement, though.
> >
> > Filling up the client log with repeated retries is a concern, indeed.
> >
> > Although a proper implementation needs the failed reconnect count per
> > ip/port combination which we do not currently keep track of, I think a
> > heuristic count may be good enough. One could use something like this:
> >
> > n = c->options.unsuccessful_attempts
> > m = c->options.connection_list->len
> > rc = n/m   (a rough measure of retries per remote)
> > timeout = the_default_timeout << MIN(rc, 10)  (exponential up to ~ 5000
> > seconds).
> > throw a SIGHUP if rc exceeds some value (resets n and starts over)
> >
> > The retry count will be over-stated for situations like one remote name
> > that resolves to many IPs, but avoiding that requires more work.. Any
> > thoughts?
>
> I'm not sure I understand the math.  Let me try to reword.
>
> So it would increase "n" for each failed attempt, setting it to "0" if
> one succeeds (so the reconnect after a session aborts would start "fresh").
>

Yes, it gets reset to zero in initialization_sequence_completed(), or after a
SIGHUP.


>
> Leaving out "m" for the time being, "retry time" would then get scaled by
> 2^rc (capped by 2^10), so the initial 5s would become 10s, 20s, ... 5000s
> (this is not "timeout" as in "the unified TCP connection timeout" but the
> retry timer firing after one connection attempt is aborted).
>

"timeout" was a poorly chosen name; it's the startup pause time, controlled
by variables like
connect_retry_seconds or restart_sleep_seconds, and it gets passed to
openvpn_sleep().


>
>
> So, now we have 4 different remotes.  You're scaling the exponent by 1/4
> here, so the retry timer would be
>
>   5s 5s 5s 5s 10s 10s 10s 10s 20s 20s 20s 20s ...
>
> then (or, phrased differently, "one round uses the unscaled timer, the
> next round across all remotes uses 2^1, the third round uses 2^2").
>
>
> If I understood the math right, I think this would be useful behaviour ;-)
> - fast failover if multiple remotes are there, exponentially slowing down
> if all remotes have been tried.  Plus, fairly easy to implement as nearly
> all needed values are already around.


Yes, the book-keeping variables are already there.

To avoid slowing down too aggressively, maybe we could start the scaling only
after rc has reached a threshold: say, do nothing until rc = 5, then scale
the timer until rc reaches 15.

Selva
