Hi,

I've been struggling to understand the statistical theory behind the piece of code below (from /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala), in particular how the size of the population (the total number of tasks) and its variance are estimated, and why the variance of the sum is computed the way it is. I haven't been able to find a reference for it online either.

  while (iter.hasNext) {
    val entry = iter.next()
    val counter = entry.getValue
    val meanEstimate = counter.mean
    val meanVar = counter.sampleVariance / counter.count
    val countEstimate = (counter.count + 1 - p) / p
    val countVar = (counter.count + 1) * (1 - p) / (p * p)
    val sumEstimate = meanEstimate * countEstimate
    val sumVar = (meanEstimate * meanEstimate * countVar) +
      (countEstimate * countEstimate * meanVar) +
      (meanVar * countVar)
    val sumStdev = math.sqrt(sumVar)
    val confFactor = studentTCacher.get(counter.count)
    val low = sumEstimate - confFactor * sumStdev
    val high = sumEstimate + confFactor * sumStdev
    result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence, low, high))
  }

Thanks and best regards,
Phil

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
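P.S. For what it's worth, the sumVar line looks to me like the standard formula for the variance of a product of two independent random variables, Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y), applied with the mean estimate and the count estimate in the roles of X and Y. The small standalone Scala sketch below is my own code (not from Spark), and the distributions and names in it are just an illustration of that identity, not of what Spark actually samples.

  import scala.util.Random

  object ProductVarianceCheck {
    def main(args: Array[String]): Unit = {
      val rng = new Random(7)
      val n = 2000000
      // Two independent quantities, standing in for the mean estimate and the count estimate.
      val (muX, sdX) = (10.0, 2.0)  // E[X], sqrt(Var(X))
      val (muY, sdY) = (50.0, 5.0)  // E[Y], sqrt(Var(Y))
      val products = Array.fill(n) {
        (muX + sdX * rng.nextGaussian()) * (muY + sdY * rng.nextGaussian())
      }
      val mean = products.sum / n
      val empiricalVar = products.map(v => (v - mean) * (v - mean)).sum / (n - 1)
      // Same shape as sumVar in GroupedSumEvaluator:
      //   E[X]^2 * Var(Y) + E[Y]^2 * Var(X) + Var(X) * Var(Y)
      val formulaVar = muX * muX * sdY * sdY + muY * muY * sdX * sdX + sdX * sdX * sdY * sdY
      println(f"empirical Var(X*Y) = $empiricalVar%.0f, formula = $formulaVar%.0f")
    }
  }

I'd expect the empirical and formula values to agree up to sampling noise, so that part of the code I can more or less follow. What I still can't place is countEstimate and countVar: my best guess is that they match the posterior mean and variance of the total count when each item is observed with probability p (i.e. the unseen count treated as negative binomial), giving (count + 1 - p) / p and (count + 1)(1 - p) / p^2, but I haven't found a source that confirms this is the intended derivation.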
I've been struggling to understand the statistical theory behind this piece of code (from /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala) below, especially with respect to estimating the size of the population (total tasks) and its variance. Also I'm trying to understand how the variance of the sum is calculated like that. I'm struggling to find the source online too. while (iter.hasNext) { val entry = iter.next() val counter = entry.getValue val meanEstimate = counter.mean val meanVar = counter.sampleVariance / counter.count val countEstimate = (counter.count + 1 - p) / p val countVar = (counter.count + 1) * (1 - p) / (p * p) val sumEstimate = meanEstimate * countEstimate val sumVar = (meanEstimate * meanEstimate * countVar) + (countEstimate * countEstimate * meanVar) + (meanVar * countVar) val sumStdev = math.sqrt(sumVar) val confFactor = studentTCacher.get(counter.count) val low = sumEstimate - confFactor * sumStdev val high = sumEstimate + confFactor * sumStdev result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence, low, high)) } Thanks and best regards, Phil -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org