I've found that opening a connection and scanning it line by line in a
loop is far faster than either read.table or read.fwf. E.g., here's a
file (temp2) that has 1500 rows and 550K columns:

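showConnections(all=TRUE)
con <- file("temp2", open='r')
# num.samp, num.cols and the matrix new.gen are assumed to exist already
system.time({
  for (i in seq_len(num.samp)) {
    # NB: what='integer' is a character string, so scan() returns
    # character values here; what=integer() would return integers
    new.gen[i,] <- scan(con, what='integer', nlines=1)
  }
})
close(con)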
#THIS TAKES 4.6 MINUTES




system.time({
  # con was closed above, so read.fwf is given the file name instead
  new.gen2 <- read.fwf("temp2", widths=rep(1,num.cols), buffersize=100,
                       header=FALSE, colClasses=rep('integer',num.cols))
})
#THIS TAKES OVER 20 MINUTES (I GOT BORED OF WAITING AND KILLED IT)
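
A possible alternative to read.fwf for this layout, sketched under the
assumption that every field is exactly one character wide with no
separators (which is what widths=rep(1,num.cols) implies): read the raw
lines and split them into single characters. The temporary file and the
name fw.gen are made up for illustration, and this is untimed on a file
of the size described above.

```r
# Hypothetical stand-in for a file of single-digit, fixed-width fields
tmp <- tempfile()
writeLines(c("0120", "2101", "0012"), tmp)

# Read all lines as character strings
lines <- readLines(tmp)

# Split every line into single characters, convert to integer,
# and fill the matrix row by row (one file line per matrix row)
fw.gen <- matrix(as.integer(unlist(strsplit(lines, ""))),
                 nrow = length(lines), byrow = TRUE)

unlink(tmp)
```

For a very large file the same idea could be applied to blocks of lines
read from an open connection rather than to the whole file at once.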


This seems surprising to me. Can anyone see some other way to speed
this type of thing up?
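
One thing that may be worth trying, as a sketch only: a single scan()
call over the whole file, reshaped into a matrix afterwards, instead of
one scan() call per line. This assumes the values are whitespace-separated
integers and that the whole file fits in memory. The small temporary file
and the object orig below are hypothetical stand-ins for temp2, and no
timing is claimed for the real data.

```r
# Hypothetical small stand-in for "temp2": whitespace-separated integers
num.samp <- 5L
num.cols <- 8L
set.seed(1)
orig <- matrix(sample(0:2, num.samp * num.cols, replace = TRUE),
               nrow = num.samp)
tmp <- tempfile()
write.table(orig, tmp, row.names = FALSE, col.names = FALSE)

# One scan() call for the whole file; what = integer() gives integers
vals <- scan(tmp, what = integer(), quiet = TRUE)

# scan() returns the values in file order, so fill the matrix by row
new.gen <- matrix(vals, nrow = num.samp, ncol = num.cols, byrow = TRUE)

unlink(tmp)
```

If num.samp and num.cols are known in advance, passing
n = num.samp * num.cols to scan() tells it how many values to read.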

Matt


On Sat, Jul 24, 2010 at 1:55 PM, Greg Snow <greg.s...@imail.org> wrote:
> You may want to look at the biglm package as another way to fit regression
> models on very large data sets.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-
>> project.org] On Behalf Of babyfoxlo...@sina.com
>> Sent: Friday, July 23, 2010 10:10 AM
>> To: r-help@r-project.org
>> Subject: [R] How to deal with more than 6GB dataset using R?
>>
>> Hi there,
>>
>> Sorry to bother those who are not interested in this problem.
>>
>> I'm dealing with a large data set, a file of more than 6 GB, and running
>> regression tests on those data. I was wondering whether there are any
>> efficient ways to read those data, instead of just using read.table()?
>> BTW, I'm using a 64-bit desktop and a 64-bit version of R, and the
>> desktop has enough memory for my use.
>> Thanks.
>>
>>
>> --Gin
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com

