Thanks for the help, all! Good to know that there's an answer. Unfortunately, I don't have the rights to install programs, so I wasn't able to try R-devel. I've never heard of R-patched, but I'm guessing I can't install that either. I'll see if I can get someone to do that.
Much appreciated!

Sean

On Aug 9, 2012, at 10:05 PM, Prof Brian Ripley <rip...@stats.ox.ac.uk> wrote:

> As the posting guide asked you to before posting, try R-patched. That has
> the NEWS items
>
>     • duplicated(), unique() and similar now support vectors of
>       lengths above 2^29 on 64-bit platforms.
>
>     • unique() and similar would infinite-loop if called on a vector of
>       length > 2^29 (but reported that the vector was too long for 2^30
>       or more).
>
> If you want to work on such large datasets, you might want to consider using
> R-devel which has a number of enhancements already with more in the pipeline.
>
> On 10/08/2012 01:29, Sean Ruddy wrote:
>> Hi,
>>
>> First, thanks in advance. Some useful info:
>>
>> > version
>> platform       x86_64-unknown-linux-gnu
>> arch           x86_64
>> os             linux-gnu
>> system         x86_64, linux-gnu
>> version.string R version 2.15.1 (2012-06-22)
>>
>> I'm trying to use the table() function on a 2 column matrix that has 711
>> million rows (see below). However, it freezes. If I subset the matrix to be
>> less than or equal to 2^29 (500+ million) then the table() function
>> finishes in minutes. As soon as I go larger than that--beginning with
>> 2^29+1--it gets stuck, ie. nothing happens even after hours of running. I
>> assume it has something to do with memory since I believe that's the 32 bit
>> limit but I'm running on a 64 bit machine.
>>
>> Here's the matrix:
>>
>> > head(DRI.mtx)
>>
>> POSITION  BP
>> 38076904  C
>> 38076905  C
>> 38076906  A
>> 38076907  T
>> 38076908  C
>> 38076909  C
>>
>>
>> The result from table (if the matrix has less than 2^29 rows) is
>>
>> > head(table(DRI.mtx))
>>
>>            BP
>> POSITION    A  C  G  N  T
>> 115247036  17  0  0  0  0
>> 115247037  31  0  0  0  0
>> 115247038  46  0  0  0  0
>> 115247039   0  0 54  0  0
>> 115247040   0  0  1  0 66
>> 115247041   0  0  0  0 78
>>
>>
>> I've tracked the problem down to the C-file, "unique.c". table() calls
>> factor() which calls unique() which I believe calls "unique.c". Browsing
>> through the C file I found an if statement that checks if the size of the
>> vector is larger than 2^30-1. If TRUE it gives the error message "too large
>> for hashing". I do not get any error message when I run table() on the full
>> matrix but I wonder if maybe I should be and if the limit of 2^30 is too
>> high and should be lowered. Maybe it's just my set up or maybe it has
>> nothing to do with unique.c. I don't know.
>>
>> Here's the part of unique.c I was referring to:
>>
>> /*
>>    Choose M to be the smallest power of 2
>>    not less than 2*n and set K = log2(M).
>>    Need K >= 1 and hence M >= 2, and 2^M <= 2^31 -1, hence n <= 2^30.
>>
>>    Dec 2004: modified from 4*n to 2*n, since in the worst case we have
>>    a 50% full table, and that is still rather efficient -- see
>>    R. Sedgewick (1998) Algorithms in C++ 3rd edition p.606.
>> */
>> static void MKsetup(int n, HashData *d)
>> {
>>     int n2 = 2 * n;
>>     if(n < 0 || n > 1073741824) /* protect against overflow to -ve */
>>         error(_("length %d is too large for hashing"), n);
>>     d->M = 2;
>>     d->K = 1;
>>     while (d->M < n2) {
>>         d->M *= 2;
>>         d->K += 1;
>>     }
>> }
>>
>> "n" I presume is the number of rows of the matrix so I don't see why this
>> wouldn't run properly though I'm not sure what is causing the problem in
>> the unique.c file and I have no idea how to troubleshoot.
>>
>> I have a work around that reads in chunks at a time, but I'm very
>> interested in why there appears to be a limit at 2^29 when according to the
>> unique.c file it should be twice that.
>>
>> Thanks for the help.
>>
>> -Sean
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
> --
> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
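
For anyone following the thread later, here is a minimal sketch of why the practical limit in the quoted MKsetup() looks like 2^29 and not 2^30. It assumes `int` is 32 bits, as on the x86_64 Linux platform reported above. The loop doubles d->M until it reaches 2*n, so for any n above 2^29 the table size would have to be 2^31, which is one more than the largest 32-bit signed int, while the guard only rejects n above 2^30. The helper names below are hypothetical and are not part of R; the arithmetic is done in 64 bits so the sketch itself cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of R): the smallest power of two that is
   not less than 2*n, i.e. the value the doubling loop in the quoted
   MKsetup() is trying to reach for d->M. Computed in 64-bit arithmetic. */
int64_t required_table_size(int64_t n)
{
    assert(n >= 0);
    int64_t m = 2;
    while (m < 2 * n)
        m *= 2;
    return m;
}

/* Hypothetical helper: returns 1 if that table size fits in a 32-bit
   signed int (assumed here to be the width of the int field d->M),
   and 0 otherwise. */
int fits_in_int32(int64_t n)
{
    return required_table_size(n) <= INT32_MAX;
}
```

Called with n = 2^29 (536870912) the required size is 2^30, which fits. With 2^29 + 1, or with the 711 million rows from the original post, the required size is 2^31, which does not fit in a 32-bit signed int even though n passes the n > 1073741824 check. That would be consistent with the NEWS item about unique() looping forever above 2^29.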