Hello all,

To start with, these measurements were taken on Linux with a 64-bit build of R 2.9.2 and 64-bit Python 2.6.
I've been investigating R for some log file analysis I've been doing. I'm coming at this from the angle of a programmer who has primarily worked in Python. As I've been playing around with R, I've noticed that it seems to use a *lot* of memory, especially compared to Python. Here's an example of what I'm talking about. I have a sample data file whose characteristics are like this:

[e...@t500 ~]$ ls -lh 20090708.tab
-rw-rw-r-- 1 evan evan 63M 2009-07-08 20:56 20090708.tab
[e...@t500 ~]$ head 20090708.tab
spice 1247036405.04 0.0141088962555
spice 1247036405.01 0.046797990799
spice 1247036405.13 0.0137498378754
spice 1247036404.87 0.0594480037689
spice 1247036405.02 0.0170919895172
topic 1247036404.74 0.512196063995
user_details 1247036404.64 0.242133140564
spice 1247036405.23 0.0408620834351
biz_details 1247036405.04 0.40732884407
spice 1247036405.35 0.0501029491425
[e...@t500 ~]$ wc -l 20090708.tab
1797601 20090708.tab

So it's essentially a CSV-style file (space delimited rather than comma delimited) in which every line has three columns: a low-cardinality string, a double, and a double. The file itself is 63M.

Python can load all of the data from the file quite compactly (the source for the script is at the bottom of this message):

[e...@t500 ~]$ python code/scratch/pymem.py
VIRT = 25230, RSS = 860
VIRT = 81142, RSS = 55825

This shows that my Python process starts out at 860K of RSS memory before doing any processing and ends at 55M of RSS. That's pretty good; in fact it's smaller than the file itself, since a double can be stored more compactly than its textual representation in the data file.

Since I'm new to R I didn't know how to read /proc and so forth from within it, so instead I launched an R repl and used ps to record the RSS memory usage before and after running the following statement:

> tab <- read.table("~/20090708.tab")

The numbers I measured were:

VIRT = 176820, RSS = 26180  (just after starting the repl)
VIRT = 414284, RSS = 263708 (after executing the command)

This concerns me. I can understand why R uses more memory at startup, since it's launching a full repl, which my Python script wasn't doing. But I would have expected its memory usage to grow by roughly the same amount as Python's did after loading the data. In fact, R ought to be able to use *less* memory, since the first column is textual and has low cardinality (I think 7 distinct values), so storing it as a factor should be very memory efficient.

For the things I want to use R for, I know I'll be processing much larger datasets, and at the rate R is consuming memory it may not be possible to load them fully into memory. I'm concerned that it may not be worth learning R if data that fits comfortably in memory with something like Python doesn't fit with R. Since I'm new to the language, though, I don't want to rule it out because of something I'm simply overlooking.

Can anyone answer the following for me:

 * What is R doing with all of that memory?
 * Is there something I did wrong? Is there a more memory-efficient way to load this data?
 * Are there R packages that can store large datasets in a more memory-efficient way? Can anyone relate their experiences with them?
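One thing I was planning to try next, after skimming ?read.table, is declaring the column types up front and then asking R itself how big the resulting object is. This is an untested sketch of what I have in mind, so the exact arguments may well be off:

## Untested sketch: tell read.table the column types so it doesn't have to
## guess them, then report the memory used by the resulting data frame.
tab <- read.table("~/20090708.tab",
                  colClasses = c("factor", "numeric", "numeric"),
                  nrows = 1797601, comment.char = "")
print(format(object.size(tab), units = "Mb"))  # size of the data frame alone
gc()  # summary of memory used by the whole R session

If that's the wrong approach, I'd appreciate being pointed at the right one.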
For reference, here's the Python script I used to measure Python's memory usage:

import os

def show_mem():
    statm = open('/proc/%d/statm' % os.getpid()).read()
    print 'VIRT = %s, RSS = %s' % tuple(statm.split(' ')[:2])

def read_data(fname):
    servlets = []
    timestamps = []
    elapsed = []
    for line in open(fname, 'r'):
        s, t, e = line.strip().split(' ')
        servlets.append(s)
        timestamps.append(float(t))
        elapsed.append(float(e))
    show_mem()

if __name__ == '__main__':
    show_mem()
    read_data('/home/evan/20090708.tab')

--
Evan Klitzke <e...@eklitzke.org> :wq