Re: [racket] plot request/patch: independent control of y axis in density plots

Neil Toronto Wed, 29 Feb 2012 17:57:28 -0800

On 02/29/2012 05:15 PM, John Clements wrote:

> Plot's new "density" function is awesome. I'd like to add something to it,
> though; independent control of the y axis.
>
> Here's the motivating scenario; I'm looking at server logs, to try to see
> which users are hammering the handin server hardest. Suppose I take a list
> of numbers representing the seconds on which a submission occurred. I can
> plot the density of these using (density …), but what I get is the relative
> density, rather than the absolute density. In this case, I want the y axis
> to have the units "elements per unit time". This is different from an
> application such as the one in the docs where the number of data points is
> irrelevant.
>
> This problem becomes much more acute when I'm trying to compare two
> different sets of server logs; the current behavior essentially normalizes
> w.r.t. the number of points.
>
> The easiest way to fix this is just to allow the user to have independent
> control over the y scaling, so that you can for instance write:
>
> (plot (density all-seconds 0.0625
>                 #:y-adjust (/ 1 (length all-seconds)))
>        #:width 800)
>
> to get a graph that shows density in hits per second.


If you're only plotting the density graph, you could currently do this:

(define scale (/ 1 (length all-seconds)))
(parameterize ([plot-y-ticks  (ticks-scale (plot-y-ticks)
                                           (linear-scale scale))])
  (plot (density all-seconds 0.0625)))


But you probably don't want to. First, some background.

A Kernel Density Estimator (KDE) like `density' constructs an estimate of the probability distribution that generated some samples. It does this by centering a "kernel" at every point, adding them up pointwise, and normalizing. Conceptually, anyway; `density' uses a specialized algorithm that is efficient even with hundreds of thousands of samples, but only works with Gaussian kernels.
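To make the "conceptually" part concrete, here is a minimal sketch of the naive version. The names `gaussian-kernel', `naive-kde' and `unnormalized-kde' are made up for illustration, the bandwidth h is simply passed in by hand, and this is not how `density' is actually implemented.

```racket
#lang racket/base
(require racket/math)  ; sqr and pi

;; Normalized Gaussian kernel with bandwidth h, centered at sample s.
;; Each kernel integrates to 1.
(define ((gaussian-kernel h) s)
  (λ (x)
    (/ (exp (* -1/2 (sqr (/ (- x s) h))))
       (* h (sqrt (* 2 pi))))))

;; Naive KDE: add the kernels up pointwise, then divide by the number of
;; samples so that the result still integrates to 1.
(define (naive-kde samples h)
  (define kernels (map (gaussian-kernel h) samples))
  (define n (length samples))
  (λ (x)
    (/ (for/sum ([k (in-list kernels)]) (k x)) n)))

;; Undoing the final normalization gives a curve whose area is the number
;; of samples, i.e. "samples per unit x" rather than a probability density.
(define (unnormalized-kde samples h)
  (define f (naive-kde samples h))
  (define n (length samples))
  (λ (x) (* n (f x))))
```

The division by n in `naive-kde' is the normalization that makes the y axis relative rather than absolute. This version does work proportional to the number of samples for every x it is evaluated at, so it is only suitable for small inputs.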

Using `density' to smear discrete points and accumulate them is a hack that will probably come back to haunt you sometime. You've already found one reason. There are two others, both of which come from the fact that KDEs are designed to converge to the correct density as the number of samples increases.

1. The kernel width has to be a function of the number of samples, which approaches zero as the number of samples increases. You've compensated for this, sort of, by multiplying the width by 0.0625. That won't always get the result you want.

2. The kernels are almost always symmetric, and probably not the shape you really want.
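Point 1 can be illustrated with a common rule-of-thumb bandwidth for Gaussian kernels (Silverman's), which shrinks like n^(-1/5). This is only a sketch: `stddev' and `rule-of-thumb-bandwidth' are names invented here, and it is an assumption that `density' picks its width in a similar way.

```racket
#lang racket/base
(require racket/math)  ; sqr

;; Sample standard deviation (needs at least two samples).
(define (stddev xs)
  (define n (length xs))
  (define mean (/ (apply + xs) n))
  (sqrt (/ (for/sum ([x (in-list xs)]) (sqr (- x mean)))
           (- n 1))))

;; Silverman's rule of thumb: h = 1.06 * sd * n^(-1/5).
;; `adjust' plays the role of a manual multiplier such as 0.0625.
(define (rule-of-thumb-bandwidth xs [adjust 1])
  (* adjust 1.06 (stddev xs) (expt (length xs) -0.2)))
```

Because the width depends on both the spread and the count of the samples, a fixed multiplier gives different effective kernel widths for two logs of different sizes, which makes the resulting curves hard to compare.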

If you want to smear points and accumulate them in a way that properly represents server load, you should add up your own kernels that represent the resources it takes to process an assignment. This might help you get started:


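(require plot)  ; also assumes #lang racket, which provides sqr

;; Curried constructor: ((kernel width) s) is an unnormalized Gaussian
;; bump of the given width, centered at sample s.
(define (((kernel width) s) x)
  (exp (* -1/2 (sqr (/ (- s x) width)))))

(define width 0.01)
;; One kernel per logged submission time.
(define kernels (map (kernel width) all-seconds))
;; Plot the pointwise sum of the kernels, padded by four widths per side.
(plot (function (λ (x) (apply + (map (λ (k) (k x)) kernels)))
                (- (apply min all-seconds) (* width 4))
                (+ (apply max all-seconds) (* width 4))))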

The kernel in this case is an unnormalized Gaussian centered on the log time. Using it means assuming that the log message is recorded in the exact middle of processing an assignment, that the middle of processing has the highest server load, and that the load is symmetric.
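If the symmetry assumption is the one that doesn't fit, a one-sided kernel is an easy swap. The sketch below is purely illustrative: `decay-kernel', `total-load', the time constant `tau' and the exponential shape are all hypothetical, not a measured model of what the handin server does.

```racket
#lang racket/base

;; Hypothetical one-sided kernel: no load before the log time s, full load
;; at s, then exponential decay with time constant tau.
(define (((decay-kernel tau) s) x)
  (if (< x s)
      0.0
      (exp (- (/ (- x s) tau)))))

;; Pointwise sum of a list of kernels, with no normalization.
(define (total-load kernels)
  (λ (x) (for/sum ([k (in-list kernels)]) (k x))))
```

With `all-seconds' bound as before, (total-load (map (decay-kernel 2.0) all-seconds)) could be passed to `function' in place of the summed Gaussians. This shape assumes the message is logged at the start of processing; a different assumption calls for a different kernel.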


Wow, that ended up way longer than I intended.

Neil ⊥
____________________
 Racket Users list:
 http://lists.racket-lang.org/users
