The probability density function is not unitless: it is the derivative of the [cumulative] probability distribution function, so it has units of delta-probability-mass over delta-x. It must integrate to 1 (over all possible x), so if the data are rescaled to be 1000 times smaller, the density has to become 1000 times larger for the area to stay at 1. hist(freq=FALSE,x) or hist(prob=TRUE,x) displays an estimate of the density function, and the following example shows how the scale matches what you get from the presumed population density function.
> f
function (n, sd) {
    x <- rnorm(n, sd = sd)
    hist(x, freq = FALSE) # estimated density
    s <- seq(min(x), max(x), len = 129)
    lines(s, dnorm(s, sd = sd), col = "red") # overlay expected density for this sample
}
> f(1e6, sd=1)
> f(100, sd=1)
> f(100, sd=0.0001)
> f(1e6, sd=0.0001)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of J Toll
> Sent: Tuesday, January 22, 2013 2:48 PM
> To: r-help
> Subject: [R] density of hist(freq = FALSE) inversely affected by data magnitude
>
> Hi,
>
> I have a couple of observations, a question or two, and perhaps a
> suggestion related to the plotting of density on the y-axis within the
> hist() function when freq=FALSE. I was using the function and trying
> to develop an intuitive understanding of what the density is telling
> me. After reading through this fairly helpful post:
>
> http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis
>
> I finally realized that in the case where freq = FALSE, the y-axis
> isn't really telling me the density. It's actually indicating the
> density multiplied by the bin size. I assume this is for the case
> where the bins may be of non-regular size.
>
> from hist.default:
>
> dens <- counts/(n * diff(breaks))
>
> So the count in each bin is divided by the total number of
> observations (n) multiplied by the size of the bin. The problem, as I
> see it, is that the density ends up being scaled by the size of the
> bins, which is inversely proportional to the magnitude of the data.
> Therefore the magnitude of the data is directly affecting the density,
> which seems problematic.
>
> For example*:
>
> set.seed(4444)
> x <- runif(100)
> y <- x / 1000
>
> par(mfrow = c(2, 1))
> hist(x, prob = TRUE)
> hist(y, prob = TRUE)
>
> From this example, you see that the density for the y histogram is
> 1000 times larger, simply because the y data is 1000 times smaller.
> Again, that seems problematic. It seems to me, that the density
> should be unit-less, but here it's affected by the magnitude of the
> data.
>
> So, my question is, why is density calculated this way?
>
> For the case where all the bins are of the same size, I would think
> density should simply be calculated as:
>
> dens <- counts / n
>
> Of course, that might be somewhat misleading for the case where the
> bin sizes vary. So then why not calculate density as:
>
> dens <- counts / (n * diff(breaks) / min(diff(breaks)))
>
> Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect
> of the magnitude of the data, and simply leaves the relative
> difference in bin size.
>
> For the case where all the bins are the same size, the calculation is
> equivalent to dens <- counts / n
>
> For all other cases, the density is scaled by the size of the bin, but
> unaffected by the magnitude of the data.
>
> So, what am I misunderstanding? Why is density calculated as it is,
> and what does it mean?
>
> Thanks,
>
>
> James
>
>
> *example from
> http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
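
As a rough numerical check of the point that a density must integrate to 1, the sketch below reuses the x and y from the quoted example and sums density times bin width for each histogram. It assumes the default breaks chosen by hist(); the names hx, hy, area_x and area_y are made up for this illustration and are not part of the original posts.

```r
# Minimal sketch: the bar heights returned by hist() are densities, so
# height * bin width is the probability mass in each bin and the
# masses should add up to 1 whatever the scale of the data.
set.seed(4444)
x <- runif(100)
y <- x / 1000   # same data, 1000 times smaller

# plot = FALSE returns the histogram object without drawing it
hx <- hist(x, plot = FALSE)
hy <- hist(y, plot = FALSE)

# Total area under each histogram: sum of density * bin width
area_x <- sum(hx$density * diff(hx$breaks))
area_y <- sum(hy$density * diff(hy$breaks))
print(c(area_x = area_x, area_y = area_y))

# The bar heights differ between the two histograms because the bin
# widths differ; compare the tallest bars and the bin widths
print(c(max_density_x = max(hx$density), max_density_y = max(hy$density)))
print(c(bin_width_x = diff(hx$breaks)[1], bin_width_y = diff(hy$breaks)[1]))
```

Both areas should come out as 1 (up to floating-point error), while the heights change to compensate for the narrower bins of y. That compensation is what the 1/diff(breaks) factor in hist.default provides.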