(+Matei, for a question about the source of this bit of code.)

That's a good question; I remember wondering about this once upon a time.
First, GroupedSumEvaluator and GroupedMeanEvaluator look like dead code at this point; GroupedCountEvaluator is still used. MeanEvaluator is a better example, because it's straightforward: it computes a confidence interval on the true mean from the sample statistics using a t-distribution.

CountEvaluator, however, I don't quite get:

    val p = outputsMerged.toDouble / totalOutputs
    val mean = (sum + 1 - p) / p
    val variance = (sum + 1) * (1 - p) / (p * p)
    val stdev = math.sqrt(variance)
    val confFactor = new NormalDistribution()
      .inverseCumulativeProbability(1 - (1 - confidence) / 2)
    val low = mean - confFactor * stdev
    val high = mean + confFactor * stdev
    new BoundedDouble(mean, confidence, low, high)

Given the mean/variance formulas, this looks like it's modeling the part of the count not yet observed as negative binomial. However, I'd expect the mean to be something like sum * (1 - p) / p, and even that would be the mean of the remaining count, not of the total count. The interval also isn't truncated at 0 (a count can't be negative), which could likewise be addressed by not using a normal approximation. Of course, I could be totally wrong about the model.

This is pretty old code, from years ago. Matei, you at least merged it, though you may not have written it -- do you have any more info? Since the result isn't directly covered by the unit tests, I might start writing some tests for it to see how it holds up.

Then there's the question of SumEvaluator, which seems to treat the estimate as a product of the count and mean estimates -- the sumVar expression below looks like the standard formula for the variance of a product of independent variables, Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y) -- so related questions apply there. I've also put a small numeric sketch of the count model at the bottom of this mail.

Sean

On Sun, Oct 2, 2016 at 8:47 PM, philipghu <philguang...@gmail.com> wrote:
> Hi,
>
> I've been struggling to understand the statistical theory behind this
> piece of code (from
> /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala)
> below, especially with respect to estimating the size of the population
> (total tasks) and its variance. I'm also trying to understand how the
> variance of the sum is calculated that way. I'm struggling to find a
> source online too.
>
>     while (iter.hasNext) {
>       val entry = iter.next()
>       val counter = entry.getValue
>       val meanEstimate = counter.mean
>       val meanVar = counter.sampleVariance / counter.count
>       val countEstimate = (counter.count + 1 - p) / p
>       val countVar = (counter.count + 1) * (1 - p) / (p * p)
>       val sumEstimate = meanEstimate * countEstimate
>       val sumVar = (meanEstimate * meanEstimate * countVar) +
>         (countEstimate * countEstimate * meanVar) +
>         (meanVar * countVar)
>       val sumStdev = math.sqrt(sumVar)
>       val confFactor = studentTCacher.get(counter.count)
>       val low = sumEstimate - confFactor * sumStdev
>       val high = sumEstimate + confFactor * sumStdev
>       result.put(entry.getKey,
>         new BoundedDouble(sumEstimate, confidence, low, high))
>     }
>
> Thanks and best regards,
> Phil
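
P.S. Here's a minimal, runnable sketch of the comparison I have in mind. The numbers are made up, and the negative binomial reading is my assumption about the model, not anything stated in the Spark sources; it just contrasts what CountEvaluator computes with what a plain negative binomial model of the unseen remainder would give:

    import org.apache.commons.math3.distribution.NormalDistribution

    object CountModelSketch {
      def main(args: Array[String]): Unit = {
        val sum = 1000.0   // hypothetical count observed so far
        val p = 0.1        // hypothetical fraction of outputs merged
        val confidence = 0.95

        // What CountEvaluator computes:
        val sparkMean = (sum + 1 - p) / p
        val sparkVar = (sum + 1) * (1 - p) / (p * p)

        // If the unseen remainder is negative binomial with sum
        // "successes" and success probability p, then:
        //   E[remainder] = sum * (1 - p) / p, so total = sum / p
        //   Var[remainder] = sum * (1 - p) / (p * p)
        val nbMean = sum / p
        val nbVar = sum * (1 - p) / (p * p)

        val z = new NormalDistribution()
          .inverseCumulativeProbability(1 - (1 - confidence) / 2)

        // Normal-approximation interval; note neither variant is
        // truncated at 0, even though a count can't be negative.
        def interval(m: Double, v: Double): (Double, Double) = {
          val s = math.sqrt(v)
          (m - z * s, m + z * s)
        }

        println(s"CountEvaluator: mean=$sparkMean ci=${interval(sparkMean, sparkVar)}")
        println(s"Neg. binomial:  mean=$nbMean ci=${interval(nbMean, nbVar)}")
      }
    }

With these inputs the two means differ by exactly (1 - p) / p, i.e. 9 here, which is negligible at this scale -- so the +1 may just be a small-sample correction, though that's a guess on my part.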