On 11/20/2017 11:26 AM, Wes Hardaker wrote:
> Michael StJohns <m...@nthpermutation.com> writes:
>
> > 1 something.
>
> I think that the consensus is clearly something like that.  Are you
> (MSJ) interested in supplying a suggested final equation for it?


Ok - after thinking about it, it turns out to be fairly simple.


1) Initially, ignore the outliers - the servers that are down and will be down throughout the entire safety period.  It's probable that most of them were down during the original uptake period.

2) Assume a success rate of p per retry.  I'm going to use .01 - that is, in each retry period only 1 in 100 of the remaining resolvers completes the final query.

3) Calculate log_x(M), where M is the number of clients - arbitrarily chosen here as 10M - and where x = 1/(1-p).  Here (1-p) is the per-interval failure rate, i.e. the proportion of resolvers still waiting to complete after the previous retry interval.  log_x(M) gives the number of intervals needed to reduce the set of uncompleted resolvers to below 1, assuming each resolver completes independently with probability p per interval.

That gives you about 1603 fast retry intervals.  Setting p and M to different values gives a range of answers:


        
Number of fast retry intervals, log_x(M), by number of resolvers (M)
and probability of success per retry interval (p):

  p          10,000     100,000    1,000,000   10,000,000  100,000,000
  0.01     916.4212    1145.526     1374.632     1603.737   1832.84231
  0.05     179.5623    224.4528     269.3434    314.23397    359.12454
  0.1       87.41738   109.2717     131.1261    152.98042   174.834763
  0.15      56.67242    70.84052     85.00862   99.176728   113.344832
  0.25      32.01569    40.01961     48.02354   56.027459    64.0313822
  0.5       13.28771    16.60964     19.93157   23.253497    26.5754248
  0.9        4           5            6           7            8
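For anyone who wants to reproduce or extend the table, here's a minimal
Python sketch of the calculation (the function and variable names are
mine, purely illustrative):

    import math

    def retry_intervals(M, p):
        # Number of fast retry intervals needed to drive M resolvers,
        # each completing with probability p per interval, below 1
        # remaining: log_x(M) with x = 1/(1-p).
        return math.log(M) / math.log(1.0 / (1.0 - p))

    for p in (0.01, 0.05, 0.1, 0.15, 0.25, 0.5, 0.9):
        row = [retry_intervals(M, p) for M in (10**4, 10**5, 10**6, 10**7, 10**8)]
        print(p, ["%.2f" % n for n in row])

Running that reproduces the rows above (e.g. 1603.74 for p = .01 and
M = 10M).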


(Think of it this way.  Pretend you have 1000 resolvers and each has a 10% chance of completing in each interval.  After the first interval, 900 are left; after the second, 810; after the third, 729; and so on.  Ignoring rounding, you need about 66 retries to get down to < 1 left, which is log_1.1111(1000) = 65.6.)
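The same intuition as a tiny loop, using the numbers from the example
above:

    # 1000 resolvers, each with a 10% chance of completing per interval;
    # count intervals until fewer than 1 is still waiting.
    remaining = 1000.0
    intervals = 0
    while remaining >= 1.0:
        remaining *= 0.9  # 90% are still waiting after each interval
        intervals += 1
    print(intervals)  # 66, i.e. log_1.1111(1000) = 65.6 rounded up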

This doesn't account for the servers that are offline, but see (1) above for why it's probably safe to ignore them.

So a publisher can pick an M and x (or p) that is their best guess from the data they have and calculate:

safetyInterval ::=  log_x(M) * fastRetryInterval

Or simply make some worst-case assumptions (a .01 success rate per retry, 10M clients) and use a number from the table.
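As a sketch of what that works out to, here's the formula in Python with
an assumed (purely illustrative) fast retry interval of one hour:

    import math

    def safety_interval(M, p, fast_retry_interval):
        # safetyInterval = log_x(M) * fastRetryInterval, x = 1/(1-p);
        # M and p are the publisher's best-guess parameters.
        return math.log(M) / math.log(1.0 / (1.0 - p)) * fast_retry_interval

    # Worst-case assumptions from the table: p = .01, M = 10 million,
    # with a hypothetical 1-hour fast retry interval.
    print(safety_interval(10**7, 0.01, 1.0))  # ~1603.7 hours, about 67 days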


Mike


