2008/9/22 José E. Lozano <[EMAIL PROTECTED]>:
> Exactly, raw data, but a little more complex since all the 500000 variables
> are in text format, so the width is around 2,500,000.
> Thanks, I will check. Right now I am reading line by line the file. It's
> time consuming, but since I will do it only once, just to rearrange the data
> into smaller tables to query, it's ok.

A language like Python, Perl, or even awk might be able to help you slice
your data up.

> Is genetic DNA data (individuals genotyped), hence the large amount of
> columns to analyze.

So is each line just ACCGTATAT etc etc?

If you have fixed-width fields in a file, so that every line is the same
length, then you can use random access methods to get to a particular
value - just multiply the line length (including the line-end character)
by the number of rows to skip and add the column offset. In R you can do
this with seek() on a connection. This should be fast because it seeks by
bytes, instead of having to scan all the comma-separated stuff.

The only problem comes when your data doesn't quite conform, and you can
end up reading junk. When doing this, it's a good idea to test your
dataset first to make sure the lines and fields are right.

Example with dummy.dna:

aaaccctttgggaaa
gattacagattacaa
aaaaaaacccccggg
gggggtgggggtggg
aaaaaaaaaaccccc

Each line has 15 bases, and on my OS there's one additional invisible
character to mark the line end. Windows uses 2, but your data might not be
in Windows format... So anyway, my multiplier is 16. Hence to get a slice
of four columns starting at byte offset 7 within each line (seek() counts
from zero, so that is the 8th base) for some rows:

 > dna=file("dummy.dna")
 > open(dna,open="rb")
 > for(r in 2:4){seek(dna,7+(r-1)*16);print(readChar(dna,4))}
[1] "gatt"
[1] "cccc"
[1] "gggg"

The speed of this should be independent of the size of your data file.

Barry

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
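
The advice above to test the dataset before seeking into it could be sketched roughly as follows. This is only an illustration, not part of the original reply: the helper names check_fixed_width() and read_slice() are made up here, the file is a temporary copy of the dummy.dna example, and it assumes plain single-byte text with a one-byte "\n" line end (pass eol_bytes = 2 for Windows-style line ends). Note that first_col is 1-based in this sketch, so column 8 corresponds to the byte offset 7 used in the seek() call above.

```r
# Hypothetical helper: TRUE if every record is `width` data bytes followed
# by `eol_bytes` line-end bytes, the last of which is "\n".
check_fixed_width <- function(path, width, eol_bytes = 1) {
  record <- width + eol_bytes
  size <- file.info(path)$size
  # Equal-length records imply the file size is a multiple of the record length.
  if (size %% record != 0) return(FALSE)
  con <- file(path, open = "rb")
  on.exit(close(con))
  for (r in seq_len(size / record)) {
    rec <- readBin(con, "raw", record)
    # A newline inside the data part means some line is too short.
    if (any(rec[seq_len(width)] == as.raw(10))) return(FALSE)
    # The record must end with a newline byte.
    if (rec[record] != as.raw(10)) return(FALSE)
  }
  TRUE
}

# Hypothetical helper: read `n_cols` characters starting at 1-based column
# `first_col` from each requested row, seeking by byte offset.
read_slice <- function(path, rows, first_col, n_cols, width, eol_bytes = 1) {
  record <- width + eol_bytes
  con <- file(path, open = "rb")
  on.exit(close(con))
  vapply(rows, function(r) {
    seek(con, (r - 1) * record + (first_col - 1))
    readChar(con, n_cols, useBytes = TRUE)
  }, character(1))
}

# Build a temporary copy of the dummy.dna example, written in binary mode
# so the line end is exactly one byte on every platform.
path <- tempfile(fileext = ".dna")
out <- file(path, open = "wb")
writeLines(c("aaaccctttgggaaa",
             "gattacagattacaa",
             "aaaaaaacccccggg",
             "gggggtgggggtggg",
             "aaaaaaaaaaccccc"), out, sep = "\n")
close(out)

print(check_fixed_width(path, width = 15))
# [1] TRUE
print(read_slice(path, rows = 2:4, first_col = 8, n_cols = 4, width = 15))
# [1] "gatt" "cccc" "gggg"
```

The check reads the whole file once, so it is as slow as a line-by-line pass, but it only has to be done one time; after that each read_slice() call touches just the bytes it needs.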