I've been trying to figure out how to read in a large file for a few days now, and after extensive research I'm still not sure what to do.
I have a large comma-delimited text file that contains 59 fields in each record. There is also a header line every 121 records.

This function works well for smallish files:

getcsv <- function(fname) {
    ff <- file(description = fname)
    x <- readLines(ff)
    closeAllConnections()
    x <- x[x != ""]                # REMOVE BLANKS
    x <- x[grep("^[-0-9]", x)]     # REMOVE ALL TEXT
    spl <- strsplit(x, ",")        # THIS PART IS SLOW, BUT MANAGEABLE
    xx <- t(sapply(1:length(spl),
                   function(temp) as.vector(na.omit(as.numeric(spl[[temp]])))))
    return(xx)
}

It's not elegant, but it works:

  For 121,000 records it completes in 2.3 seconds.
  For 121,000*5 records it completes in 63 seconds.
  For 121,000*10 records it doesn't complete.

When I try other methods to read the file in chunks (using scan), the process breaks down because I have to start at the beginning of the file on every iteration. For example:

fnn <- function(n, col) {
    a <- 122*(n - 1) + 2
    xx <- scan(fname, skip = a - 1, nlines = 121, sep = ",",
               quiet = TRUE, what = character(0))
    xx <- xx[xx != ""]
    xx <- matrix(xx, ncol = 49, byrow = TRUE)
    xx[, col]
}

system.time(sapply(1:10,    fnn, col = 26))   # 0.31 seconds
system.time(sapply(91:100,  fnn, col = 26))   # 1.09 seconds
system.time(sapply(901:910, fnn, col = 26))   # 5.78 seconds

Even though I'm only extracting the 26th column for 10 sets of records in each case, the further into the file I go, the longer it takes. How can I tell scan to pick up where it left off, instead of starting from the beginning of the file every time? There must be a good example somewhere. I have done a lot of research (in fact, thank you to Michael J. Crawley and others for your help thus far).

Thanks,
Gene
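P.S. To make the question concrete, here is a minimal, untested sketch of what I mean by "picking up where it left off": open the connection once and let scan read from the connection's current position instead of passing it a file name. The file name "bigfile.csv", the 122-line block layout (one header line plus 121 data records), and column 26 are placeholders, not my real data.

con <- file("bigfile.csv", open = "r")         # open the connection once

read_block <- function(con, col) {
    readLines(con, n = 1)                      # discard this block's header line
    xx <- scan(con, nlines = 121, sep = ",",   # an open connection is read from its
               quiet = TRUE,                   # current position, not from the start
               what = character(0))
    xx <- xx[xx != ""]
    xx <- matrix(xx, ncol = 59, byrow = TRUE)  # assuming 59 fields per record
    xx[, col]
}

res <- sapply(1:10, function(i) read_block(con, col = 26))   # consecutive blocks
close(con)

Is something along these lines the right approach, or is there a better idiom for sequential chunked reads?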