When you read your file into R, show the structure of the object with str(tab), and its size with object.size(tab). That will tell you what your data looks like and how much space it takes in R. Also, in read.table, use colClasses to define the format of each column; that may make the read faster. You might want to force a garbage collection with gc() to see if that frees up any memory.

If your input is about 2M lines and there are three columns (alpha, numeric, numeric), I would guess an object.size of about 50MB: the two numeric columns need 8 bytes per element (about 32MB), the factor's integer codes another 4 bytes per element (about 8MB), plus some per-object overhead. This information would help.
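For example, something along these lines (untested; the file name and column types are taken from your message below):

tab <- read.table("~/20090708.tab",
                  colClasses = c("factor", "numeric", "numeric"))
str(tab)                               # column types and a preview of the values
print(object.size(tab), units = "Mb")  # space used by 'tab' itself
gc()                                   # force a collection and report memory in use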
On Mon, Sep 14, 2009 at 11:11 PM, Evan Klitzke <e...@eklitzke.org> wrote:
> Hello all,
>
> To start with, these measurements are on Linux with R 2.9.2 (64-bit
> build) and Python 2.6 (also 64-bit).
>
> I've been investigating R for some log file analysis that I've been
> doing. I'm coming at this from the angle of a programmer who's
> primarily worked in Python. As I've been playing around with R, I've
> noticed that R seems to use a *lot* of memory, especially compared to
> Python. Here's an example of what I'm talking about. I have a sample
> data file whose characteristics are like this:
>
> [e...@t500 ~]$ ls -lh 20090708.tab
> -rw-rw-r-- 1 evan evan 63M 2009-07-08 20:56 20090708.tab
>
> [e...@t500 ~]$ head 20090708.tab
> spice 1247036405.04 0.0141088962555
> spice 1247036405.01 0.046797990799
> spice 1247036405.13 0.0137498378754
> spice 1247036404.87 0.0594480037689
> spice 1247036405.02 0.0170919895172
> topic 1247036404.74 0.512196063995
> user_details 1247036404.64 0.242133140564
> spice 1247036405.23 0.0408620834351
> biz_details 1247036405.04 0.40732884407
> spice 1247036405.35 0.0501029491425
>
> [e...@t500 ~]$ wc -l 20090708.tab
> 1797601 20090708.tab
>
> So it's basically a CSV file (actually, space-delimited) where every
> line has three columns: a low-cardinality string, a double, and a
> double. The file itself is 63M. Python can load all of the data from
> the file really compactly (source for the script at the bottom of the
> message):
>
> [e...@t500 ~]$ python code/scratch/pymem.py
> VIRT = 25230, RSS = 860
> VIRT = 81142, RSS = 55825
>
> So this shows that my Python process starts out at 860K of RSS memory
> before doing any processing, and ends at 55M of RSS memory. This is
> pretty good; it's actually better than the size of the file, since a
> double can be stored more compactly than the textual data stored in
> the data file.
>
> Since I'm new to R I didn't know how to read /proc and so forth, so
> instead I launched an R repl and used ps to record the RSS memory
> usage before and after running the following statement:
>
>> tab <- read.table("~/20090708.tab")
>
> The numbers I measured were:
> VIRT = 176820, RSS = 26180 (just after starting the repl)
> VIRT = 414284, RSS = 263708 (after executing the command)
>
> This kind of concerns me. I can understand why R uses more memory at
> startup, since it's launching a full repl, which my Python script
> wasn't doing. But I would have expected its memory usage after
> loading the data to have grown more like Python's did. In fact, R
> ought to be able to use less memory, since the first column is
> textual and has low cardinality (I think 7 distinct values), so
> storing it as a factor should be very memory-efficient.
>
> For the things that I want to use R for, I know I'll be processing
> much larger datasets, and at the rate that R is consuming memory it
> may not be possible to fully load the data into memory.
> I'm concerned that it may not be worth pursuing learning R if it's
> possible to load the data into memory using something like Python but
> not R. I don't want to overlook the possibility that I'm missing
> something, since I'm new to the language. Can anyone answer for me:
>
> * What is R doing with all of that memory?
> * Is there something I did wrong? Is there a more memory-efficient
>   way to load this data?
> * Are there R modules that can store large data-sets in a more
>   memory-efficient way? Can anyone relate their experiences with them?
>
> For reference, here's the Python script I used to measure Python's
> memory usage:
>
> import os
>
> def show_mem():
>     statm = open('/proc/%d/statm' % os.getpid()).read()
>     print 'VIRT = %s, RSS = %s' % tuple(statm.split(' ')[:2])
>
> def read_data(fname):
>     servlets = []
>     timestamps = []
>     elapsed = []
>
>     for line in open(fname, 'r'):
>         s, t, e = line.strip().split(' ')
>         servlets.append(s)
>         timestamps.append(float(t))
>         elapsed.append(float(e))
>
>     show_mem()
>
> if __name__ == '__main__':
>     show_mem()
>     read_data('/home/evan/20090708.tab')
>
> --
> Evan Klitzke <e...@eklitzke.org> :wq

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.