FWIW, I think there are in any event several small problems with these classes; I'm tracking them here and have a change almost ready:
https://issues.apache.org/jira/browse/SPARK-17768

On Mon, Oct 3, 2016 at 9:39 AM, Sean Owen <so...@cloudera.com> wrote:
> +Matei for the question about the source of this bit of code.
>
> That's a good question; I remember wondering about this once upon a time.
>
> First, GroupedSumEvaluator and GroupedMeanEvaluator look like dead
> code at this point. GroupedCountEvaluator is still used.
>
> MeanEvaluator is a better example, because it's straightforward. It's
> getting a confidence interval on the true mean from the sample stats
> using a t-distribution.
>
> CountEvaluator, however, I don't quite get:
>
>   val p = outputsMerged.toDouble / totalOutputs
>   val mean = (sum + 1 - p) / p
>   val variance = (sum + 1) * (1 - p) / (p * p)
>   val stdev = math.sqrt(variance)
>   val confFactor = new NormalDistribution().
>     inverseCumulativeProbability(1 - (1 - confidence) / 2)
>   val low = mean - confFactor * stdev
>   val high = mean + confFactor * stdev
>   new BoundedDouble(mean, confidence, low, high)
>
> Given the mean/variance formulas, this looks like it's modeling the
> rest of the count not yet observed as negative binomial. However, I'd
> expect the mean to be something like sum * (1 - p) / p instead, and
> that's the mean not of the total count but of the rest of the count.
> This also doesn't truncate the interval at 0 (a count can't be
> negative), which could likewise be addressed by not using a normal
> approximation.
>
> Of course, I could be totally wrong about the model.
>
> This is pretty old code, from years ago. Matei, you at least merged it,
> though you may not have written it -- do you have any more info?
>
> I might start writing some tests for this, since the result isn't
> directly tested in the unit tests, to see how it holds up.
>
> Then there's the question of SumEvaluator, which seems to approach the
> estimation as an estimation of the product of count and mean together,
> so it raises some related questions.
>
> Sean
>
> On Sun, Oct 2, 2016 at 8:47 PM, philipghu <philguang...@gmail.com> wrote:
>> Hi,
>>
>> I've been struggling to understand the statistical theory behind this
>> piece of code (from
>> core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala)
>> below, especially with respect to estimating the size of the population
>> (total tasks) and its variance. I'm also trying to understand how the
>> variance of the sum is calculated this way. I'm struggling to find a
>> source for it online, too.
>>
>>   while (iter.hasNext) {
>>     val entry = iter.next()
>>     val counter = entry.getValue
>>     val meanEstimate = counter.mean
>>     val meanVar = counter.sampleVariance / counter.count
>>     val countEstimate = (counter.count + 1 - p) / p
>>     val countVar = (counter.count + 1) * (1 - p) / (p * p)
>>     val sumEstimate = meanEstimate * countEstimate
>>     val sumVar = (meanEstimate * meanEstimate * countVar) +
>>       (countEstimate * countEstimate * meanVar) +
>>       (meanVar * countVar)
>>     val sumStdev = math.sqrt(sumVar)
>>     val confFactor = studentTCacher.get(counter.count)
>>     val low = sumEstimate - confFactor * sumStdev
>>     val high = sumEstimate + confFactor * sumStdev
>>     result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence,
>>       low, high))
>>   }
>>
>> Thanks and best regards,
>> Phil
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
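
As a point of reference for the MeanEvaluator description above, here is a
minimal standalone sketch of a t-based interval on the true mean, built on
the same commons-math3 distributions the quoted code already uses. The
names here (meanInterval, and the mean/sampleVariance/count parameters
standing in for a StatCounter's fields) are illustrative, not Spark's
actual code:

    import org.apache.commons.math3.distribution.TDistribution

    // Confidence interval on the true mean from sample statistics,
    // using a t-distribution with count - 1 degrees of freedom.
    def meanInterval(mean: Double, sampleVariance: Double, count: Long,
                     confidence: Double): (Double, Double) = {
      val stdErr = math.sqrt(sampleVariance / count)  // std. error of the mean
      val t = new TDistribution((count - 1).toDouble)
        .inverseCumulativeProbability(1 - (1 - confidence) / 2)
      (mean - t * stdErr, mean + t * stdErr)
    }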
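
On the CountEvaluator formulas: one reading that makes the mean and
variance internally consistent -- offered only as a guess at the model,
not as what the original author intended -- is that the unseen remainder
of the count is negative binomial with r = sum + 1 and success
probability p, and the reported mean is the observed sum plus the
remainder's expectation:

    E[rest]    = (sum + 1)(1 - p) / p
    E[total]   = sum + (sum + 1)(1 - p) / p  =  (sum + 1 - p) / p
    Var[total] = Var[rest]  =  (sum + 1)(1 - p) / p^2

That matches the code's mean and variance exactly, so under that reading
the estimate does refer to the total count rather than the rest, though
the question about the un-truncated normal interval still stands.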
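
On the sumVar expression in the GroupedSumEvaluator loop quoted above: it
has the shape of the exact variance of a product of two independent random
variables, Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y), with X
the mean estimate and Y the count estimate. A quick Monte Carlo check of
that identity (a standalone sketch with arbitrary parameters, not Spark
code):

    import scala.util.Random

    val rng = new Random(42)
    val (muX, sdX, muY, sdY) = (5.0, 2.0, 10.0, 3.0)
    val n = 1000000
    // Sample the product of two independent normals and estimate its variance.
    val xs = Array.fill(n) {
      (muX + sdX * rng.nextGaussian()) * (muY + sdY * rng.nextGaussian())
    }
    val m = xs.sum / n
    val v = xs.map(x => (x - m) * (x - m)).sum / n
    // Exact formula for independent X, Y -- the same shape as sumVar above.
    val exact = muX * muX * sdY * sdY + muY * muY * sdX * sdX + sdX * sdX * sdY * sdY
    println(f"simulated = $v%.1f, exact = $exact%.1f")  // the two should agree closely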