On 02/29/2012 05:15 PM, John Clements wrote:
> Plot's new "density" function is awesome. I'd like to add something to it,
> though; independent control of the y axis.
>
> Here's the motivating scenario; I'm looking at server logs, to try to see
> which users are hammering the handin server hardest. Suppose I take a list
> of numbers representing the seconds on which a submission occurred. I can
> plot the density of these using (density …), but what I get is the relative
> density, rather than the absolute density. In this case, I want the y axis
> to have the units "elements per unit time". This is different from an
> application such as the one in the docs where the number of data points is
> irrelevant.
>
> This problem becomes much more acute when I'm trying to compare two
> different sets of server logs; the current behavior essentially normalizes
> w.r.t. the number of points.
>
> The easiest way to fix this is just to allow the user to have independent
> control over the y scaling, so that you can for instance write:
>
> (plot (density all-seconds 0.0625
>                 #:y-adjust (/ 1 (length all-seconds)))
>        #:width 800)
>
> to get a graph that shows density in hits per second.

If you're only plotting the density graph, you could currently do this:

(define scale (/ 1 (length all-seconds)))
(parameterize ([plot-y-ticks  (ticks-scale (plot-y-ticks)
                                           (linear-scale scale))])
  (plot (density all-seconds 0.0625)))


But you probably don't want to. First, some background.

A Kernel Density Estimator (KDE) like `density' constructs an estimate of the probability distribution that generated some samples. It does this by centering a "kernel" at every point, adding them up pointwise, and normalizing. Conceptually, anyway; `density' uses a specialized algorithm that is efficient even with hundreds of thousands of samples, but only works with Gaussian kernels.
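
For reference, here is the usual textbook form of the estimate. This is a sketch of the concept, not necessarily how `density' computes it; the symbols are my own notation: x_1 ... x_n are the samples, K is a kernel that integrates to 1, and h is the kernel width (bandwidth).

```latex
\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
\int \hat{f}(x)\,dx = 1 .
```

The division by n is the normalization step. Undoing it gives n \hat{f}(x), which integrates to n and has units of samples per unit of x.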

Using `density' to smear discrete points and accumulate them is a hack that will probably come back to haunt you sometime. You've already found one reason. There are two others, both of which come from the fact that KDEs are designed to converge to the correct density as the number of samples increases.

1. The kernel width has to be a function of the number of samples, and it has to approach zero as the number of samples increases. You've compensated for this, sort of, by multiplying the width by 0.0625. That won't always get the result you want.

2. The kernels are almost always symmetric, and probably not the shape you really want.
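
To make the first point concrete, here is a small sketch of a rule-of-thumb bandwidth. It assumes the rule is (* bw-adjust 1.06 sd (expt n -0.2)), which is how I understand the plot documentation for `density'; treat that formula, and the helper name `bandwidth', as assumptions rather than a statement of what the library does internally.

```racket
#lang racket

;; Rule-of-thumb kernel width (assumed, see the note above).
;; sd is the sample standard deviation, n is the number of samples,
;; bw-adjust is a user-supplied multiplier such as 0.0625.
(define (bandwidth sd n [bw-adjust 1])
  (* bw-adjust 1.06 sd (expt n -0.2)))

;; With the spread held fixed, the width shrinks as n grows.
(for ([n (in-list '(100 10000 1000000))])
  (printf "n = ~a, bandwidth = ~a\n" n (bandwidth 1.0 n)))
```

Under this rule, two logs with different numbers of entries get smeared by different amounts even when bw-adjust is the same, which makes them awkward to compare.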

If you want to smear points and accumulate them in a way that properly represents server load, you should add up your own kernels that represent the resources it takes to process an assignment. This might help you get started:

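(require plot)

;; ((kernel width) s) returns an unnormalized Gaussian bump centered at s
(define (((kernel width) s) x)
  (exp (* -1/2 (sqr (/ (- s x) width)))))

(define width 0.01)
;; one kernel per log entry
(define kernels (map (kernel width) all-seconds))
;; plot the pointwise sum of the kernels, padded by 4 widths on each side
(plot (function (λ (x) (apply + (map (λ (k) (k x)) kernels)))
                (- (apply min all-seconds) (* width 4))
                (+ (apply max all-seconds) (* width 4))))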


The kernel in this case is an unnormalized Gaussian centered on the time of the log entry. Using it means assuming that the log message is recorded in the exact middle of processing an assignment, that the middle of processing has the highest server load, and that the load is symmetric around that moment.
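
If the goal is a y axis in hits per second, one option is to normalize each kernel so that it integrates to 1; the sum of the kernels then integrates to the number of log entries. A minimal sketch follows. The sample data in `all-seconds', the width, and the names `normalized-kernel' and `hits-per-second' are all made up for illustration.

```racket
#lang racket
(require plot)

;; Hypothetical submission times, in seconds.
(define all-seconds '(1.0 1.2 1.25 3.0 3.1 7.5))

;; Gaussian centered at s, scaled so that it integrates to 1 over x.
(define ((normalized-kernel width s) x)
  (/ (exp (* -1/2 (sqr (/ (- x s) width))))
     (* width (sqrt (* 2 pi)))))

;; Arbitrary smoothing width, in seconds.
(define width 0.25)

;; Sum of one normalized kernel per log entry.
;; Integrates to (length all-seconds), so its units are hits per second.
(define (hits-per-second x)
  (for/sum ([s (in-list all-seconds)])
    ((normalized-kernel width s) x)))

(plot (function hits-per-second
                (- (apply min all-seconds) (* width 4))
                (+ (apply max all-seconds) (* width 4)))
      #:x-label "time (s)"
      #:y-label "hits per second")
```

Because the width is fixed rather than derived from the number of samples, curves built this way from two different logs are on the same scale and can be compared directly.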

Wow, that ended up way longer than I intended.

Neil ⊥
____________________
 Racket Users list:
 http://lists.racket-lang.org/users
