FWIW, I think there are in any event several small problems with these classes; I'm tracking them here and have a change almost ready:
https://issues.apache.org/jira/browse/SPARK-17768

On Mon, Oct 3, 2016 at 9:39 AM, Sean Owen <so...@cloudera.com> wrote:
> +Matei for the question about the source of this bit of code.
>
> That's a good question; I remember wondering about this once upon a time.
>
> First, GroupedSumEvaluator and GroupedMeanEvaluator look like dead
> code at this point. GroupedCountEvaluator is still used.
>
> MeanEvaluator is a better example, because it's straightforward. It's
> getting a confidence interval on the true mean from the sample stats
> using a t-distribution.
>
> CountEvaluator, however, I don't quite get:
>
>   val p = outputsMerged.toDouble / totalOutputs
>   val mean = (sum + 1 - p) / p
>   val variance = (sum + 1) * (1 - p) / (p * p)
>   val stdev = math.sqrt(variance)
>   val confFactor = new NormalDistribution().
>     inverseCumulativeProbability(1 - (1 - confidence) / 2)
>   val low = mean - confFactor * stdev
>   val high = mean + confFactor * stdev
>   new BoundedDouble(mean, confidence, low, high)
>
> Given the mean/variance formulas, this looks like it's modeling the
> rest of the count not yet observed as negative binomial. However, I'd
> expect the mean to be something like sum * (1 - p) / p instead, and
> that's the mean not of the total count but of the rest of the count.
> This also doesn't truncate the interval at 0 (a count can't be
> negative), which could likewise be addressed by not using a normal
> approximation.
>
> Of course, I could be totally wrong about the model.
>
> This is pretty old code, from years ago. Matei, you at least merged it,
> though you may not have written it -- do you have any more info?
>
> I might start writing some tests for this, since the result isn't
> directly tested in the unit tests, to see how it holds up.
>
> Then there's the question of SumEvaluator, which seems to approach the
> estimation as an estimation of the product of count and mean together,
> so it raises some related questions.
>
> Sean
>
> On Sun, Oct 2, 2016 at 8:47 PM, philipghu <philguang...@gmail.com> wrote:
>> Hi,
>>
>> I've been struggling to understand the statistical theory behind this
>> piece of code (from
>> core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala)
>> below, especially with respect to estimating the size of the population
>> (total tasks) and its variance. I'm also trying to understand how the
>> variance of the sum is calculated this way. I'm struggling to find a
>> source for it online, too.
>>
>>   while (iter.hasNext) {
>>     val entry = iter.next()
>>     val counter = entry.getValue
>>     val meanEstimate = counter.mean
>>     val meanVar = counter.sampleVariance / counter.count
>>     val countEstimate = (counter.count + 1 - p) / p
>>     val countVar = (counter.count + 1) * (1 - p) / (p * p)
>>     val sumEstimate = meanEstimate * countEstimate
>>     val sumVar = (meanEstimate * meanEstimate * countVar) +
>>       (countEstimate * countEstimate * meanVar) +
>>       (meanVar * countVar)
>>     val sumStdev = math.sqrt(sumVar)
>>     val confFactor = studentTCacher.get(counter.count)
>>     val low = sumEstimate - confFactor * sumStdev
>>     val high = sumEstimate + confFactor * sumStdev
>>     result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence,
>>       low, high))
>>   }
>>
>> Thanks and best regards,
>> Phil
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
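
As a point of reference for the MeanEvaluator description above, here is a
minimal standalone sketch of a t-based interval on the true mean, built on
the same commons-math3 distributions the quoted code already uses. The
names here (meanInterval, and the mean/sampleVariance/count parameters
standing in for a StatCounter's fields) are illustrative, not Spark's
actual code:

    import org.apache.commons.math3.distribution.TDistribution

    // Confidence interval on the true mean from sample statistics,
    // using a t-distribution with count - 1 degrees of freedom.
    def meanInterval(mean: Double, sampleVariance: Double, count: Long,
                     confidence: Double): (Double, Double) = {
      val stdErr = math.sqrt(sampleVariance / count)  // std. error of the mean
      val t = new TDistribution((count - 1).toDouble)
        .inverseCumulativeProbability(1 - (1 - confidence) / 2)
      (mean - t * stdErr, mean + t * stdErr)
    }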
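
On the CountEvaluator formulas: one reading that makes the mean and
variance internally consistent -- offered only as a guess at the model,
not as what the original author intended -- is that the unseen remainder
of the count is negative binomial with r = sum + 1 and success
probability p, and the reported mean is the observed sum plus the
remainder's expectation:

    E[rest]    = (sum + 1)(1 - p) / p
    E[total]   = sum + (sum + 1)(1 - p) / p  =  (sum + 1 - p) / p
    Var[total] = Var[rest]  =  (sum + 1)(1 - p) / p^2

That matches the code's mean and variance exactly, so under that reading
the estimate does refer to the total count rather than the rest, though
the question about the un-truncated normal interval still stands.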
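
On the sumVar expression in the GroupedSumEvaluator loop quoted above: it
has the shape of the exact variance of a product of two independent random
variables, Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y), with X
the mean estimate and Y the count estimate. A quick Monte Carlo check of
that identity (a standalone sketch with arbitrary parameters, not Spark
code):

    import scala.util.Random

    val rng = new Random(42)
    val (muX, sdX, muY, sdY) = (5.0, 2.0, 10.0, 3.0)
    val n = 1000000
    // Sample the product of two independent normals and estimate its variance.
    val xs = Array.fill(n) {
      (muX + sdX * rng.nextGaussian()) * (muY + sdY * rng.nextGaussian())
    }
    val m = xs.sum / n
    val v = xs.map(x => (x - m) * (x - m)).sum / n
    // Exact formula for independent X, Y -- the same shape as sumVar above.
    val exact = muX * muX * sdY * sdY + muY * muY * sdX * sdX + sdX * sdX * sdY * sdY
    println(f"simulated = $v%.1f, exact = $exact%.1f")  // the two should agree closely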