Hi,

I've been struggling to understand the statistical theory behind the piece of code below (from /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala), in particular how the size of the population (the total number of tasks) and its variance are estimated, and why the variance of the sum is computed the way it is. I haven't been able to find a reference for it online either.

  while (iter.hasNext) {
    val entry = iter.next()
    val counter = entry.getValue
    val meanEstimate = counter.mean
    val meanVar = counter.sampleVariance / counter.count
    val countEstimate = (counter.count + 1 - p) / p
    val countVar = (counter.count + 1) * (1 - p) / (p * p)
    val sumEstimate = meanEstimate * countEstimate
    val sumVar = (meanEstimate * meanEstimate * countVar) +
      (countEstimate * countEstimate * meanVar) +
      (meanVar * countVar)
    val sumStdev = math.sqrt(sumVar)
    val confFactor = studentTCacher.get(counter.count)
    val low = sumEstimate - confFactor * sumStdev
    val high = sumEstimate + confFactor * sumStdev
    result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence, low, high))
  }

Thanks and best regards,
Phil

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
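P.S. For what it's worth, the sumVar line looks to me like the standard formula for the variance of a product of two independent random variables, Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y), applied with the mean estimate and the count estimate in the roles of X and Y. The small standalone Scala sketch below is my own code (not from Spark), and the distributions and names in it are just an illustration of that identity, not of what Spark actually samples.

  import scala.util.Random

  object ProductVarianceCheck {
    def main(args: Array[String]): Unit = {
      val rng = new Random(7)
      val n = 2000000
      // Two independent quantities, standing in for the mean estimate and the count estimate.
      val (muX, sdX) = (10.0, 2.0)  // E[X], sqrt(Var(X))
      val (muY, sdY) = (50.0, 5.0)  // E[Y], sqrt(Var(Y))
      val products = Array.fill(n) {
        (muX + sdX * rng.nextGaussian()) * (muY + sdY * rng.nextGaussian())
      }
      val mean = products.sum / n
      val empiricalVar = products.map(v => (v - mean) * (v - mean)).sum / (n - 1)
      // Same shape as sumVar in GroupedSumEvaluator:
      //   E[X]^2 * Var(Y) + E[Y]^2 * Var(X) + Var(X) * Var(Y)
      val formulaVar = muX * muX * sdY * sdY + muY * muY * sdX * sdX + sdX * sdX * sdY * sdY
      println(f"empirical Var(X*Y) = $empiricalVar%.0f, formula = $formulaVar%.0f")
    }
  }

I'd expect the empirical and formula values to agree up to sampling noise, so that part of the code I can more or less follow. What I still can't place is countEstimate and countVar: my best guess is that they match the posterior mean and variance of the total count when each item is observed with probability p (i.e. the unseen count treated as negative binomial), giving (count + 1 - p) / p and (count + 1)(1 - p) / p^2, but I haven't found a source that confirms this is the intended derivation.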
I've been struggling to understand the statistical theory behind this piece of code (from /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala) below, especially with respect to estimating the size of the population (total tasks) and its variance. Also I'm trying to understand how the variance of the sum is calculated like that. I'm struggling to find the source online too. while (iter.hasNext) { val entry = iter.next() val counter = entry.getValue val meanEstimate = counter.mean val meanVar = counter.sampleVariance / counter.count val countEstimate = (counter.count + 1 - p) / p val countVar = (counter.count + 1) * (1 - p) / (p * p) val sumEstimate = meanEstimate * countEstimate val sumVar = (meanEstimate * meanEstimate * countVar) + (countEstimate * countEstimate * meanVar) + (meanVar * countVar) val sumStdev = math.sqrt(sumVar) val confFactor = studentTCacher.get(counter.count) val low = sumEstimate - confFactor * sumStdev val high = sumEstimate + confFactor * sumStdev result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence, low, high)) } Thanks and best regards, Phil -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org