Hi,

First, thanks in advance. Some useful info:

> version
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
version.string R version 2.15.1 (2012-06-22)

I'm trying to use the table() function on a 2-column matrix that has 711 million rows (see below). However, it freezes. If I subset the matrix to 2^29 rows or fewer (500+ million), table() finishes in minutes. As soon as I go larger than that, beginning with 2^29+1 rows, it gets stuck, i.e. nothing happens even after hours of running. I assume it has something to do with memory, since I believe that's the 32-bit limit, but I'm running on a 64-bit machine.

Here's the matrix:

> head(DRI.mtx)
POSITION BP
38076904 C
38076905 C
38076906 A
38076907 T
38076908 C
38076909 C

The result from table() (if the matrix has no more than 2^29 rows) is:

> head(table(DRI.mtx))
           BP
POSITION     A  C  G  N  T
  115247036 17  0  0  0  0
  115247037 31  0  0  0  0
  115247038 46  0  0  0  0
  115247039  0  0 54  0  0
  115247040  0  0  1  0 66
  115247041  0  0  0  0 78

I've tracked the problem down to the C file unique.c. table() calls factor(), which calls unique(), which I believe calls into unique.c. Browsing through the C file, I found an if statement that checks whether the size of the vector is larger than 2^30 (1073741824). If TRUE, it gives the error message "too large for hashing". I do not get any error message when I run table() on the full matrix, but I wonder whether I should, and whether the limit of 2^30 is too high and should be lowered. Maybe it's just my setup, or maybe it has nothing to do with unique.c. I don't know.

Here's the part of unique.c I was referring to:

/* Choose M to be the smallest power of 2
   not less than 2*n and set K = log2(M).
   Need K >= 1 and hence M >= 2, and 2^M <= 2^31 -1, hence n <= 2^30.

   Dec 2004: modified from 4*n to 2*n, since in the worst case we have
   a 50% full table, and that is still rather efficient -- see
   R. Sedgewick (1998) Algorithms in C++ 3rd edition p.606.
*/
static void MKsetup(int n, HashData *d)
{
    int n2 = 2 * n;

    if(n < 0 || n > 1073741824) /* protect against overflow to -ve */
        error(_("length %d is too large for hashing"), n);
    d->M = 2;
    d->K = 1;
    while (d->M < n2) {
        d->M *= 2;
        d->K += 1;
    }
}

I presume "n" is the number of rows of the matrix, so I don't see why this wouldn't run properly. I'm not sure what in unique.c is causing the problem, and I have no idea how to troubleshoot it. I have a workaround that reads the data in chunks, but I'm very interested in why there appears to be a limit at 2^29 when, according to unique.c, it should be twice that.

Thanks for the help.

-Sean

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
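
Follow-up note on the arithmetic in MKsetup. This is only a sketch, not a confirmed diagnosis. It assumes that int is 32 bits on this platform and that n really is the number of elements being hashed. The helper names required_M, M_fits_in_int and check_boundary are hypothetical and are not part of R. The code redoes the doubling loop in 64-bit arithmetic, so nothing can overflow, and then asks whether the resulting M would still fit in a signed 32-bit int:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of R): the smallest power of 2 that is
   >= 2*n, found with the same doubling loop as MKsetup but in 64-bit
   arithmetic so that neither 2*n nor M can overflow. */
int64_t required_M(int64_t n)
{
    int64_t n2 = 2 * n;
    int64_t M = 2;
    while (M < n2)
        M *= 2;
    return M;
}

/* Hypothetical helper: 1 if the M needed for n fits in a signed 32-bit
   int (the assumed width of d->M), 0 otherwise. */
int M_fits_in_int(int64_t n)
{
    return required_M(n) <= INT32_MAX;
}

/* Check the boundary described in the post. */
void check_boundary(void)
{
    assert(M_fits_in_int(536870912LL));   /* n = 2^29     needs M = 2^30 */
    assert(!M_fits_in_int(536870913LL));  /* n = 2^29 + 1 needs M = 2^31 */
    assert(!M_fits_in_int(711000000LL));  /* roughly the full matrix     */
}
```

Under those assumptions, n = 2^29 needs M = 2^30, which fits, while n = 2^29+1 already needs M = 2^31, which does not. If d->M is a 32-bit int, then d->M *= 2 would overflow before d->M < n2 could become false, and the n > 1073741824 check would not catch this case. That might be consistent with a hang rather than an error message, but I have not verified it beyond the excerpt quoted above.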