On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:

> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
> verbatim: system.time(read.table("test2.txt"))
About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8. Gene, are you by any chance storing the file in a heavily virus-scanned system directory?

-pd

> Michael
>
> 2011/12/7 Gene Leynes <gley...@gmail.com>:
>> Peter,
>>
>> You're quite right; it's nearly impossible to make progress without a
>> working example.
>>
>> I created an ** extremely simplified ** example for distribution. The real
>> data has numeric, character, and boolean classes.
>>
>> The file still takes 25.08 seconds to read, despite its small size.
>>
>> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
>> machine (not that it should particularly matter with this type of data /
>> functions).
>>
>> ## The code:
>> options(stringsAsFactors=FALSE)
>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
>> str(dat, 0)
>>
>> Thanks again!
>>
>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pda...@gmail.com> wrote:
>>
>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
>>>
>>>> Mark,
>>>>
>>>> Thanks for your suggestions.
>>>>
>>>> That's a good idea about the NULL columns; I didn't think of that.
>>>> Surprisingly, it didn't have any effect on the time.
>>>
>>> Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
>>> fix both?
>>>
>>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>>>>> rep(NULL,3696)).
>>>
>>> As a general matter, if you want people to dig into this, they need some
>>> paraphrase of the file to play with. Would it be possible to set up a small
>>> R program that generates a data file which displays the issue? Everything I
>>> try seems to take about a second to read in.
>>>
>>> -pd
>>>
>>>> This problem was just a curiosity, I already did the import using Excel and
>>>> VBA. I was just going to illustrate the power and simplicity of R, but
>>>> ironically it's been much slower and harder in R...
>>>> The VBA was painful and messy, and took me over an hour to write; but at
>>>> least it worked quickly and reliably.
>>>> The R code was clean and only took me about 5 minutes to write, but the
>>>> run time was prohibitively slow!
>>>>
>>>> I profiled the code, but that offers little insight to me.
>>>>
>>>> Profile results with 10 line file:
>>>>
>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>> $by.self
>>>>              self.time self.pct total.time total.pct
>>>> scan             12.24    53.50      12.24     53.50
>>>> read.table       10.58    46.24      22.88    100.00
>>>> type.convert      0.04     0.17       0.04      0.17
>>>> make.names        0.02     0.09       0.02      0.09
>>>>
>>>> $by.total
>>>>              total.time total.pct self.time self.pct
>>>> read.table        22.88    100.00     10.58    46.24
>>>> scan              12.24     53.50     12.24    53.50
>>>> type.convert       0.04      0.17      0.04     0.17
>>>> make.names         0.02      0.09      0.02     0.09
>>>>
>>>> $sample.interval
>>>> [1] 0.02
>>>>
>>>> $sampling.time
>>>> [1] 22.88
>>>>
>>>>
>>>> Profile results with 250 line file:
>>>>
>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>> $by.self
>>>>              self.time self.pct total.time total.pct
>>>> scan             23.88    68.15      23.88     68.15
>>>> read.table       10.78    30.76      35.04    100.00
>>>> type.convert      0.30     0.86       0.32      0.91
>>>> character         0.02     0.06       0.02      0.06
>>>> file              0.02     0.06       0.02      0.06
>>>> lapply            0.02     0.06       0.02      0.06
>>>> unlist            0.02     0.06       0.02      0.06
>>>>
>>>> $by.total
>>>>                total.time total.pct self.time self.pct
>>>> read.table          35.04    100.00     10.78    30.76
>>>> scan                23.88     68.15     23.88    68.15
>>>> type.convert         0.32      0.91      0.30     0.86
>>>> sapply               0.04      0.11      0.00     0.00
>>>> character            0.02      0.06      0.02     0.06
>>>> file                 0.02      0.06      0.02     0.06
>>>> lapply               0.02      0.06      0.02     0.06
>>>> unlist               0.02      0.06      0.02     0.06
>>>> simplify2array       0.02      0.06      0.00     0.00
>>>>
>>>> $sample.interval
>>>> [1] 0.02
>>>>
>>>> $sampling.time
>>>> [1] 35.04
>>>>
>>>>
>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <marklee...@gmail.com> wrote:
>>>>
>>>>> hi gene: maybe someone else will reply with some subtleties that I'm not aware
>>>>> of. one other thing
>>>>> that might help: if you know which columns you want, you can set the
>>>>> others to NULL through colClasses and this should speed things up also.
>>>>> For example, say you knew you only wanted the first four columns and
>>>>> they were character. then you could do,
>>>>>
>>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>>>>> rep(NULL,3696)).
>>>>>
>>>>> hopefully someone else will say something that does the trick. it seems
>>>>> odd to me as far as the difference in timings ? good luck.
>>>>>
>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gley...@gmail.com> wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> Thank you for the reply
>>>>>>
>>>>>> I neglected to mention that I had already set
>>>>>> options(stringsAsFactors=FALSE)
>>>>>>
>>>>>> I agree, skipping the factor determination can help performance.
>>>>>>
>>>>>> The main reason that I wanted to use read.table is because it will
>>>>>> correctly determine the column classes for me. I don't really want to
>>>>>> specify 3700 column classes! (I'm not sure what they are anyway).
>>>>>>
>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <marklee...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>>>>>>> If you know what your variables are ahead of time and what you want
>>>>>>> them to be, this allows you to be specific by specifying character or
>>>>>>> numeric, etc., and often it makes things faster. others will have more
>>>>>>> to say.
>>>>>>>
>>>>>>> also, if most of your variables are characters, R will try to
>>>>>>> convert them into factors by default. If you use as.is = TRUE it won't
>>>>>>> do this and that might speed things up also.
>>>>>>>
>>>>>>> Rejoinder: above tidbits are just from experience. I don't know if
>>>>>>> it's in stone or a hard and fast rule.
>>>>>>>
>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gley...@gmail.com> wrote:
>>>>>>>
>>>>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>>>>> I'm sorry, but can't send out the file I'm using, so there is no
>>>>>>>> reproducible example.
>>>>>>>>
>>>>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>>>>>>>> file.
>>>>>>>> The strange thing is that it takes roughly the same amount of time if
>>>>>>>> the file is 100 times larger.
>>>>>>>>
>>>>>>>> After re-reviewing the data Import / Export manual I think the best
>>>>>>>> approach would be to use Python, or perhaps the readLines function, but
>>>>>>>> I was hoping to understand why the simple read.table approach wasn't
>>>>>>>> working as expected.
>>>>>>>>
>>>>>>>> Some relevant facts:
>>>>>>>>
>>>>>>>> 1. There are about 3700 columns. Maybe this is the problem? Still the
>>>>>>>>    file size is not very large.
>>>>>>>> 2. The file encoding is ANSI, but I'm not specifying that in the
>>>>>>>>    function. Setting fileEncoding="ANSI" produces an "unsupported
>>>>>>>>    conversion" error
>>>>>>>> 3. readLines imports the lines quickly
>>>>>>>> 4. scan imports the file quickly also
>>>>>>>>
>>>>>>>> Obviously, scan and readLines would require more coding to identify
>>>>>>>> columns, etc.
>>>>>>>>
>>>>>>>> my code:
>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>>>>> header=TRUE))
>>>>>>>>
>>>>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Gene
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help@r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide
>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> --
>>> Peter Dalgaard, Professor,
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Email: pd....@cbs.dk Priv: pda...@gmail.com
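
A minimal sketch of the kind of generated test file asked for in the thread, assuming the slowness is tied to the roughly 3700 columns rather than to the file's contents. The dimensions, the temporary file path, the column names and the random data are all assumptions made for illustration; they are not Gene's data, and timings will differ by machine and R version. The second call uses the quoted class names "character" and "NULL" that Peter pointed out.

```r
# Hypothetical reproduction: write a short, very wide tab-delimited file
# and time read.table() on it. All sizes and names below are assumptions.
set.seed(1)
n_rows <- 10
n_cols <- 3700

# Four character columns followed by numeric columns
chr <- as.data.frame(
  matrix(sample(letters, n_rows * 4, replace = TRUE), nrow = n_rows),
  stringsAsFactors = FALSE
)
num <- as.data.frame(
  matrix(round(rnorm(n_rows * (n_cols - 4)), 3), nrow = n_rows)
)
dat_out <- cbind(chr, num)
names(dat_out) <- paste0("V", seq_len(n_cols))

path <- tempfile(fileext = ".txt")
write.table(dat_out, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Default call: read.table() has to guess the class of every column
t_all <- system.time(
  dat_all <- read.table(path, sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE)
)

# Quoted class names: read the first four columns as character and
# skip the remaining columns entirely
cc <- c(rep("character", 4), rep("NULL", n_cols - 4))
t_sub <- system.time(
  dat_sub <- read.table(path, sep = "\t", header = TRUE, colClasses = cc)
)

print(dim(dat_all))
print(dim(dat_sub))
print(c(all = t_all[["elapsed"]], subset = t_sub[["elapsed"]]))
```

If both calls come back quickly on a given machine, that would suggest the delay lies outside read.table itself, for example in where the file is stored, as asked above.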