Hello all,

I have some genetic datasets (gzipped) that contain 6 columns and upwards of tens of billions of rows. The largest dataset is about 16 GB on disk, gzipped (!). I need to sort them by columns 1, 2, and 3. The setkey() function in the data.table package does this quickly, but of course we're limited by R not being able to index vectors with more than 2^31 elements, and reading in only the parts of the dataset we need is not an option here.
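For reference, here is a minimal sketch of the data.table approach that already works on pieces small enough for R (the file names are placeholders; with header = FALSE, fread() names the columns V1..V6):

    library(data.table)
    ## hypothetical chunk that fits in memory and stays under 2^31 rows
    dt <- fread("chunk.txt", header = FALSE)
    setkey(dt, V1, V2, V3)                   # sorts in place by columns 1, 2, 3
    fwrite(dt, "chunk_sorted.txt", sep = "\t")

This is exactly the behavior we want, just at a scale that R's vector indexing can't handle in one shot.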
I'm asking for practical advice from people who have done this or who have ideas. We'd like to be able to sort the biggest datasets in hours rather than days (or weeks!), and no single process can use more than 50 GB of RAM (we'd prefer less so we can parallelize). Relational databases seem too slow, but maybe I'm wrong. A quick look at the bigmemory package doesn't turn up an ability to sort like this, but again, maybe I'm wrong. Our programmer writes in C++, so ideas in C++ work too. Any help would be much appreciated. Thanks!

Matt

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com