Here is a test that I ran where the only difference was whether the data was in a single column or in 3700 columns. In a single column, 'scan' and 'read.table' were comparable; with 3700 columns, read.table took 3x longer, and using 'colClasses' did not make a difference:
> x.n <- as.character(runif(3700))
> x.f <- tempfile()
> # just write out a file of numbers in a single column
> # 3700 * 500 = 1.85M lines
> writeLines(rep(x.n, 500), con = x.f)
> file.info(x.f)
                                                              size isdir mode               mtime
C:\Users\Owner\AppData\Local\Temp\RtmpOWGkEu\file60a82064 35154500 FALSE  666 2011-12-07 06:13:56
                                                                         ctime               atime
C:\Users\Owner\AppData\Local\Temp\RtmpOWGkEu\file60a82064  2011-12-07 06:13:52 2011-12-07 06:13:52
                                                           exe
C:\Users\Owner\AppData\Local\Temp\RtmpOWGkEu\file60a82064   no
> system.time(x.n.read <- scan(x.f))
Read 1850000 items
   user  system elapsed
   4.04    0.05    4.10
> dim(x.n.read)
NULL
> object.size(x.n.read)
14800040 bytes
> system.time(x.n.read <- read.table(x.f))  # comparable to 'scan'
   user  system elapsed
   4.68    0.06    4.74
> object.size(x.n.read)
14800672 bytes
>
> # now create data with 3700 columns
> # and 500 rows (1.85M numbers)
> x.long <- paste(x.n, collapse = ',')
> writeLines(rep(x.long, 500), con = x.f)
> file.info(x.f)
                                                              size isdir mode               mtime
C:\Users\Owner\AppData\Local\Temp\RtmpOWGkEu\file60a82064 33305000 FALSE  666 2011-12-07 06:14:11
                                                                         ctime               atime
C:\Users\Owner\AppData\Local\Temp\RtmpOWGkEu\file60a82064  2011-12-07 06:13:52 2011-12-07 06:13:52
                                                           exe
C:\Users\Owner\AppData\Local\Temp\RtmpOWGkEu\file60a82064   no
> system.time(x.long.read <- scan(x.f, sep = ','))
Read 1850000 items
   user  system elapsed
   4.21    0.02    4.23
> dim(x.long.read)
NULL
> object.size(x.long.read)
14800040 bytes
> # takes 3 times as long as 'scan'
> system.time(x.long.read <- read.table(x.f, sep = ','))
   user  system elapsed
  13.24    0.06   13.33
> dim(x.long.read)
[1]  500 3700
> object.size(x.long.read)
15185368 bytes
>
> # using colClasses
> system.time(x.long.read <- read.table(x.f, sep = ','
+     , colClasses = rep('numeric', 3700)
+     )
+ )
   user  system elapsed
  12.39    0.06   12.48
>
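Since scan() is fast in both layouts, one workaround when you know the layout ahead of time is to let scan() do the reading and reshape the result yourself. A sketch, not run here, assuming (as above) 3700 all-numeric, comma-separated columns and no header row:

x <- scan(x.f, sep = ',')                      # fast path: one numeric vector
x.mat <- matrix(x, ncol = 3700, byrow = TRUE)  # the values arrive row by row
x.df <- as.data.frame(x.mat)                   # only if a data frame is really needed

This keeps the data in one numeric block and skips the per-column work that read.table does after its internal call to scan().
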
On Tue, Dec 6, 2011 at 4:33 PM, Gene Leynes <gley...@gmail.com> wrote:

> Mark,
>
> Thanks for your suggestions.
>
> That's a good idea about the NULL columns; I didn't think of that.
> Surprisingly, it didn't have any effect on the time.
>
> This problem was just a curiosity; I already did the import using Excel
> and VBA. I was going to illustrate the power and simplicity of R, but
> ironically it's been much slower and harder in R...
> The VBA was painful and messy and took me over an hour to write, but at
> least it worked quickly and reliably.
> The R code was clean and only took me about 5 minutes to write, but the
> run time was prohibitively slow!
>
> I profiled the code, but that offers little insight to me.
>
> Profile results with 10 line file:
>
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>              self.time self.pct total.time total.pct
> scan             12.24    53.50      12.24     53.50
> read.table       10.58    46.24      22.88    100.00
> type.convert      0.04     0.17       0.04      0.17
> make.names        0.02     0.09       0.02      0.09
>
> $by.total
>              total.time total.pct self.time self.pct
> read.table        22.88    100.00     10.58    46.24
> scan              12.24     53.50     12.24    53.50
> type.convert       0.04      0.17      0.04     0.17
> make.names         0.02      0.09      0.02     0.09
>
> $sample.interval
> [1] 0.02
>
> $sampling.time
> [1] 22.88
>
>
> Profile results with 250 line file:
>
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>                self.time self.pct total.time total.pct
> scan               23.88    68.15      23.88     68.15
> read.table         10.78    30.76      35.04    100.00
> type.convert        0.30     0.86       0.32      0.91
> character           0.02     0.06       0.02      0.06
> file                0.02     0.06       0.02      0.06
> lapply              0.02     0.06       0.02      0.06
> unlist              0.02     0.06       0.02      0.06
>
> $by.total
>                total.time total.pct self.time self.pct
> read.table          35.04    100.00     10.78    30.76
> scan                23.88     68.15     23.88    68.15
> type.convert         0.32      0.91      0.30     0.86
> sapply               0.04      0.11      0.00     0.00
> character            0.02      0.06      0.02     0.06
> file                 0.02      0.06      0.02     0.06
> lapply               0.02      0.06      0.02     0.06
> unlist               0.02      0.06      0.02     0.06
> simplify2array       0.02      0.06      0.00     0.00
>
> $sample.interval
> [1] 0.02
>
> $sampling.time
> [1] 35.04
>
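For what it's worth, the profile above does carry a hint: read.table's own self-time (the per-column post-processing it does after its internal scan() call) is 31-46% of the total in the two runs. For anyone who wants to generate output like this, the usual pattern is roughly the following sketch; the output file name is arbitrary:

Rprof("test.out")         # start the sampling profiler
dat <- read.table('C:/test.txt', sep = '\t', header = TRUE)
Rprof(NULL)               # stop profiling
summaryRprof("test.out")  # tabulate by self and total time
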
> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <marklee...@gmail.com> wrote:
>
>> hi gene: maybe someone else will reply with some subtleties that I'm not
>> aware of. one other thing that might help: if you know which columns you
>> want, you can set the others to "NULL" through colClasses, and this
>> should speed things up also. For example, say you knew you only wanted
>> the first four columns and they were character. then you could do:
>>
>> read.table(whatever, as.is = TRUE,
>>            colClasses = c(rep("character", 4), rep("NULL", 3696)))
>>
>> hopefully someone else will say something that does the trick. the
>> difference in timings seems odd to me, though. good luck.
>>
>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gley...@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> Thank you for the reply.
>>>
>>> I neglected to mention that I had already set
>>> options(stringsAsFactors = FALSE)
>>>
>>> I agree, skipping the factor determination can help performance.
>>>
>>> The main reason that I wanted to use read.table is that it will
>>> correctly determine the column classes for me. I don't really want to
>>> specify 3700 column classes! (I'm not sure what they are anyway.)
>>>
>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <marklee...@gmail.com> wrote:
>>>
>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>>>> If you know ahead of time what your variables are and what you want
>>>> them to be, specifying character, numeric, etc. often makes things
>>>> faster. others will have more to say.
>>>>
>>>> also, if most of your variables are characters, R will try to convert
>>>> them into factors by default. If you use as.is = TRUE it won't do
>>>> this, and that might speed things up also.
>>>>
>>>> Rejoinder: the above tidbits are just from experience. I don't know if
>>>> they're set in stone or a hard-and-fast rule.
>>>>
>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gley...@gmail.com> wrote:
>>>>
>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>> I'm sorry, but I can't send out the file I'm using, so there is no
>>>>> reproducible example.
>>>>>
>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>>>>> file. The strange thing is that it takes roughly the same amount of
>>>>> time if the file is 100 times larger.
>>>>>
>>>>> After re-reviewing the Data Import/Export manual, I think the best
>>>>> approach would be to use Python, or perhaps the readLines function,
>>>>> but I was hoping to understand why the simple read.table approach
>>>>> wasn't working as expected.
>>>>>
>>>>> Some relevant facts:
>>>>>
>>>>> 1. There are about 3700 columns. Maybe this is the problem? Still,
>>>>>    the file size is not very large.
>>>>> 2. The file encoding is ANSI, but I'm not specifying that in the
>>>>>    function. Setting fileEncoding="ANSI" produces an "unsupported
>>>>>    conversion" error.
>>>>> 3. readLines imports the lines quickly.
>>>>> 4. scan imports the file quickly also.
>>>>>
>>>>> Obviously, scan and readLines would require more coding to identify
>>>>> columns, etc.
>>>>>
>>>>> My code:
>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>>                               header=TRUE))
>>>>>
>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>>>>>
>>>>> Thanks
>>>>>
>>>>> Gene

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.