On Wed, 4 Mar 2009, Vadlamani, Satish {FLNA} wrote:

Hi:
Sorry if this is a double post. I posted the same thing this morning and did 
not see it.

I just started using R and am asking the following questions so that I can plan 
for the future when I may have to analyze volume data.

1) What are the limitations of R when it comes to handling large datasets? Say, for example, a data frame with 200M rows and 15 columns (roughly 1.5 to 2 GB in size)? Will the limitation be based on the specifications of the hardware or on R itself?

It depends a lot on what you want to do.  The default situation in R is that 
all the data are loaded into memory, in which case the rule of thumb is that 
you want data sets no larger than 1/3 of memory. If you have, say, a system 
with 8 GB of memory and a 64-bit version of R you should be OK.
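
As a rough sketch of that rule of thumb (fits_in_memory() is a made-up helper 
name, not part of R, and the row count is an arbitrary illustration), you can 
check a known size against the available memory, or measure a small sample 
with object.size() and scale it up:

```r
# Rule of thumb: the data should take up no more than about 1/3 of memory.
# fits_in_memory() is a hypothetical helper, not an R function.
fits_in_memory <- function(data_gb, ram_gb) data_gb <= ram_gb / 3

fits_in_memory(2, ram_gb = 8)  # the 2 GB case from the question, 8 GB machine
fits_in_memory(2, ram_gb = 4)  # the same data on a 4 GB machine

# If you can load a sample, measure its footprint and scale by row count.
# This assumes the remaining rows look like the sample (all-numeric here).
d <- data.frame(matrix(rnorm(1000 * 15), ncol = 15))
bytes_per_row <- as.numeric(object.size(d)) / nrow(d)

n_rows <- 1e7                  # hypothetical size of the full data set
est_gb <- bytes_per_row * n_rows / 1024^3
```

The estimate only covers holding the data; many modelling functions make 
working copies, which is why the rule leaves so much headroom.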

It is often possible to work with much larger data sets than this; you just 
need to arrange for the whole thing not to be loaded simultaneously.  The right 
strategy depends on the problem.

For example, linear and generalized linear models on large data sets can be 
fitted with the biglm package.  The various database interface packages and the 
packages for netCDF and HDF5 allow subsets of a data set to be loaded easily. 
Packages such as bigmemory and ff allow at least some operations to be carried 
out on file-backed data objects.
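
To illustrate the idea of not loading everything at once, here is a minimal 
base R sketch that fits a linear model by reading a CSV file in chunks and 
accumulating the cross-products X'X and X'y. This is not biglm's algorithm, 
just the simplest version of the same idea; the file, the column names and the 
chunk size are invented for the example.

```r
# Simulate a "large" CSV file; in practice it would already be on disk.
set.seed(1)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(y, x1, x2), path, row.names = FALSE)

# Read through an open connection so each call continues where the last
# one stopped; only one chunk is held in memory at a time.
chunk_rows <- 1000
con <- file(path, open = "r")
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

xtx <- matrix(0, 3, 3)   # running sum of X'X
xty <- numeric(3)        # running sum of X'y
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_rows, col.names = col_names),
    error = function(e) NULL   # read.csv errors once no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  X <- cbind(1, chunk$x1, chunk$x2)   # intercept plus two predictors
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}
close(con)

# Solve the normal equations for the coefficients.
beta_chunked <- solve(xtx, xty)
```

The result should agree with coef(lm(y ~ x1 + x2)) on the full data up to 
rounding. As far as I know, biglm updates a QR decomposition rather than the 
normal equations, which is numerically safer, so for real work use the package 
rather than this sketch.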


2) Is R compiled as 32-bit or 64-bit (on, say, Windows and AIX)?

On AIX, 64-bit. On Windows, currently only 32-bit, although there is work 
towards a 64-bit version.


4) Should I also be looking at SAS for this reason alone (we do have SAS in-house, but the problem is that I am still not sure what we have a license for, etc.)?

I would guess that it would be cheaper to buy hardware on which the problem can 
be solved in R than to buy a SAS license (last time I looked, suitable 
rack-mount Linux boxes were under USD 3000). If you already have SAS available 
it would be worth looking at it. For some large-data problems it will be faster 
or easier to use, but not for all.


     -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
[email protected]        University of Washington, Seattle
