On Mon, Jul 18, 2011 at 2:08 PM, Paul Smith <phh...@gmail.com> wrote:
> On Mon, Jul 18, 2011 at 9:11 PM, Joshua Wiley <jwiley.ps...@gmail.com> wrote:
>>> [snip] I guess that I must have a data frame to plot a histogram.
>>
>> Not at all!
>>
>> ## a *vector* of 100 million observations
>> x <- rnorm(10^8)
>> ## a histogram for it (see attached for the result from my system)
>> hist(x)
>>
>> No data frame required.  I would not try this straight in anything but
>> traditional graphics for a 100 million observation vector, but if you
>> wanted it made in ggplot2 or something, you could prebin the data and
>> THEN plot bars corresponding to the bins.
>
> Thanks, Joshua, for your answer.
>
> True: A vector is enough to supply data for hist(). But my point is:
> Can a histogram be drawn without having all data in the computer's
> memory? You partially answer this question by suggesting to prebin
> the data. Can this prebinning process be done transparently but chunk
> by chunk of data underneath?
Sure, as long as you can figure out some basic details about the full
dataset.  Just define your breaks, and then for chunks of the data at
a time, count how many observations fall into each bin.  Once you are
done, add up all the counts for each bin, and voila.

## Get these values from the full data (using SQL)
x <- rnorm(1000)
n <- length(x)
minx <- min(x)
maxx <- max(x)

## Sturges style breaks
breaks <- pretty(c(minx, maxx), n = ceiling(log2(n) + 1))
nB <- length(breaks)
## Shift the breaks by a tiny amount (first one down, the rest up) so
## values sitting exactly on a break are binned consistently
fuzz <- rep(1e-07 * median(diff(breaks)), nB)
fuzz[1] <- fuzz[1] * -1
fuzzybreaks <- breaks + fuzz

chunks <- 10
## One row of bin counts per chunk
counts <- matrix(NA, nrow = chunks, ncol = nB - 1, dimnames =
  list(paste("Sec", 1:chunks, sep = ''), as.character(fuzzybreaks[-1])))

for(i in 1:chunks) {
  ## positions of the i-th chunk (assumes n is a multiple of chunks)
  index <- seq(1, n/chunks) + (n/chunks * (i - 1))
  counts[i, ] <- hist(x[index], breaks = fuzzybreaks)$counts
}

## The heights of your bars
colSums(counts)
## results using hist() on x all at once
hist(x)$counts

You would not even need to know beforehand the number of chunks you
were going to split your data into; I just did it for convenience and
to instantiate a full-sized matrix to hold the results.  If you are
selecting subsets of your data using SQL rather than R, it becomes
even simpler.  Once you have your fuzzybreaks, you just keep calling
hist() on each new piece of data using the predefined breaks and
saving the results.

Still, I do not go above about 4.5 GB of memory just to plot a
histogram of a 100 million observation vector, and it is difficult to
imagine the shape of the distribution changing appreciably using a
random sample of 100 million observations.  It also takes less than 10
seconds to calculate and draw the histogram on my computer.  The point
being, I suspect you will spend more time getting everything set up
and working than seems worth it, because you can already create a
histogram of such a large vector easily and quickly, and the
distribution is unlikely to vary anyway.  Whatever floats your boat,
though.
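For the case where the number of chunks is not known ahead of time,
here is a minimal sketch of the running-total idea.  It assumes the
breaks have already been computed from the full-data minimum, maximum
and n, and it simulates the chunk source by splitting an in-memory
vector into uneven pieces.  The names chunk_list and total are
hypothetical, chosen only for illustration; in practice each chunk
would come from a separate SQL query.

```r
## Simulated stand-in for the full data; in practice only min, max
## and n would be fetched, and x would never be held in memory.
set.seed(1)
x <- rnorm(1000)
n <- length(x)

## Sturges style breaks, computed once from the summary values
breaks <- pretty(range(x), n = ceiling(log2(n) + 1))

## Hypothetical chunk source: pieces of 137 values plus a remainder,
## so the loop below does not rely on a fixed number of chunks
chunk_list <- split(x, cumsum(seq_along(x) %% 137 == 1))

## Running total of bin counts, one entry per bin
total <- numeric(length(breaks) - 1)
for (chunk in chunk_list) {
  ## plot = FALSE returns the counts without drawing anything
  total <- total + hist(chunk, breaks = breaks, plot = FALSE)$counts
}

total
```

Because only a single vector of counts is kept, nothing needs to be
sized in advance.  Something like barplot(total, space = 0) would then
draw the bars from the accumulated counts.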

Cheers,

Josh

>
> Paul
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
https://joshuawiley.com/