On 17 March 2018 at 17:16, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
> Using the calculator above, you can see that the distribution is
> fairly normal-like, but with a noticeable positive skew. The 2-stddev
> interval is 0.6 to 9.4, and according to the calculator the
> probability of the value being less than or equal to 1 is in the
> ballpark of the 2.5% figure expected. So even with just 5 occurrences
> in the sample, it's fairly close to a normal distribution.
>
One thing this does illustrate is that the hypergeometric distribution
is a discrete distribution and there can be quite large jumps in the
probability from one value to the next, so care may be needed when
approximating it with a continuous distribution. The standard technique
used to handle this is to apply what is known as a continuity
correction factor.

Suppose that X is a random variable with a discrete hypergeometric
distribution, and Y is a random variable with a continuous normal
distribution with the same mean and variance, being used to approximate
X. Then P(X>i) for some integer i is the same as P(X>=i+1), because X
can only be integer-valued. The idea is then that you can use
P(Y>i+0.5) to get a fairly good approximation to P(X>i). That would
correspond to adding 0.5 to the right-hand side of the current test,
i.e.,

    if (mcv_counts[num_mcv - 1] > selec * samplerows + 2 * stddev + 0.5)
        => Common enough to be included in MCV-list

A lot of the time that won't make much difference, except when dealing
with the smaller counts at the tail end of the MCV list, where it might
help avoid the too-many-mcvs problem, so I think it's worth trying out.

Apparently, in textbooks, an interval like the mean +/- 2*stddev is
known as a Wald confidence interval, and the mean +/- (2*stddev+0.5) is
the continuity-corrected Wald interval, so it's probably worth
mentioning that in the comments. They are generally regarded as quite
crude approximations for hypergeometric distributions, and there's
quite a bit of research around establishing more accurate confidence
intervals for this kind of distribution, but the formulae involved tend
to be quite complicated, whereas the Wald interval has the advantage of
being very simple. In this context, I don't think we need to establish
a particularly accurate confidence interval.
We just want to be able to say that the value is probably more common
than the non-mcv values, without being too rigorous about what
"probably" means, as long as it works in practice to discount values
that just happen to be a bit more common in the sample, but aren't
actually more common in the table as a whole.

Regards,
Dean