By the way, here's my original session information. (I can never remember the name of that command when I want it.) It's strange that Petr is having the problem with 2.14; it's relatively fast on my machine with R 2.14.

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

On Thu, Dec 8, 2011 at 3:06 AM, Rainer M Krug <r.m.k...@gmail.com> wrote:

> On 08/12/11 09:32, Petr PIKAL wrote:
> > Hi
> >
> > > system.time(dat<-read.table("test2.txt"))
> >    user  system elapsed
> >   32.38    0.00   32.40
> >
> > > system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> > header=TRUE))
> >    user  system elapsed
> >   32.30    0.03   32.36
> >
> > Couldn't it be a Windows issue?
>
> Likely - here on Linux I get:
>
> > system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t',
> header=TRUE))
>    user  system elapsed
>   1.560   0.000   1.579
> > sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> > version
>                _
> platform       i686-pc-linux-gnu
> arch           i686
> os             linux-gnu
> system         i686, linux-gnu
> status
> major          2
> minor          14.0
> year           2011
> month          10
> day            31
> svn rev        57496
> language       R
> version.string R version 2.14.0 (2011-10-31)
>
> Cheers,
>
> Rainer
>
> >                _
> > platform       i386-pc-mingw32
> > arch           i386
> > os             mingw32
> > system         i386, mingw32
> > status         Under development (unstable)
> > major          2
> > minor          14.0
> > year           2011
> > month          04
> > day            27
> > svn rev        55657
> > language       R
> > version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)
> >
> > > dim(dat)
> > [1]    7 3765
> >
> > But from the dat file it seems to me that its structure is somehow
> > weird.
> >
> > > head(names(dat))
> > [1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron"
> > [6] "Carbon"
> > > tail(names(dat))
> > [1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
> > [6] "Scandium.32"
> >
> > There is a row of names with repeating values. Maybe most of the
> > time is spent checking the validity of the names.
> >
> > Regards
> > Petr
> >
> > r-help-boun...@r-project.org wrote on 07.12.2011 23:11:10:
> >
> >> peter dalgaard <pda...@gmail.com>
> >> Sent by: r-help-boun...@r-project.org
> >> 07.12.2011 23:11
> >>
> >> To: "R. Michael Weylandt" <michael.weyla...@gmail.com>
> >> Cc: r-help@r-project.org, Gene Leynes <gley...@gmail.com>
> >> Subject: Re: [R] read.table performance
> >>
> >> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
> >>
> >>> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
> >>> verbatim: system.time(read.table("test2.txt"))
> >>
> >> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
> >>
> >> Gene, are you by any chance storing the file in a heavily
> >> virus-scanned system directory?
> >>
> >> -pd
> >>
> >>> Michael
> >>>
> >>> 2011/12/7 Gene Leynes <gley...@gmail.com>:
> >>>> Peter,
> >>>>
> >>>> You're quite right; it's nearly impossible to make progress
> >>>> without a working example.
> >>>>
> >>>> I created an ** extremely simplified ** example for
> >>>> distribution. The real data has numeric, character, and
> >>>> boolean classes.
> >>>>
> >>>> The file still takes 25.08 seconds to read, despite its
> >>>> small size.
> >>>>
> >>>> I neglected to mention that I'm using R 2.13.0 and I'm on a
> >>>> Windows 7 machine (not that it should particularly matter
> >>>> with this type of data / functions).
> >>>>
> >>>> ## The code:
> >>>> options(stringsAsFactors=FALSE)
> >>>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> >>>> header=TRUE))
> >>>> str(dat, 0)
> >>>>
> >>>> Thanks again!
> >>>>
> >>>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard
> >>>> <pda...@gmail.com> wrote:
> >>>>
> >>>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
> >>>>>
> >>>>>> Mark,
> >>>>>>
> >>>>>> Thanks for your suggestions.
> >>>>>>
> >>>>>> That's a good idea about the NULL columns; I didn't think
> >>>>>> of that. Surprisingly, it didn't have any effect on the
> >>>>>> time.
> >>>>>
> >>>>> Hmm, I think you want "character" and "NULL" there (i.e.,
> >>>>> quoted). Did you fix both?
> >>>>>
> >>>>>>> read.table(whatever, as.is=TRUE, colClasses =
> >>>>>>> c(rep(character,4), rep(NULL,3696)).
> >>>>>
> >>>>> As a general matter, if you want people to dig into this,
> >>>>> they need some paraphrase of the file to play with. Would it
> >>>>> be possible to set up a small R program that generates a data
> >>>>> file which displays the issue? Everything I try seems to take
> >>>>> about a second to read in.
> >>>>>
> >>>>> -pd
> >>>>>
> >>>>>> This problem was just a curiosity; I already did the
> >>>>>> import using Excel and VBA. I was just going to illustrate
> >>>>>> the power and simplicity of R, but ironically it's been
> >>>>>> much slower and harder in R...
> >>>>>> The VBA was painful and messy, and took me over an hour to
> >>>>>> write; but at least it worked quickly and reliably.
> >>>>>> The R code was clean and only took me about 5 minutes to
> >>>>>> write, but the run time was prohibitively slow!
> >>>>>>
> >>>>>> I profiled the code, but that offers little insight to
> >>>>>> me.
> >>>>>>
> >>>>>> Profile results with 10 line file:
> >>>>>>
> >>>>>> > summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>>>> $by.self
> >>>>>>              self.time self.pct total.time total.pct
> >>>>>> scan             12.24    53.50      12.24     53.50
> >>>>>> read.table       10.58    46.24      22.88    100.00
> >>>>>> type.convert      0.04     0.17       0.04      0.17
> >>>>>> make.names        0.02     0.09       0.02      0.09
> >>>>>>
> >>>>>> $by.total
> >>>>>>              total.time total.pct self.time self.pct
> >>>>>> read.table        22.88    100.00     10.58    46.24
> >>>>>> scan              12.24     53.50     12.24    53.50
> >>>>>> type.convert       0.04      0.17      0.04     0.17
> >>>>>> make.names         0.02      0.09      0.02     0.09
> >>>>>>
> >>>>>> $sample.interval
> >>>>>> [1] 0.02
> >>>>>>
> >>>>>> $sampling.time
> >>>>>> [1] 22.88
> >>>>>>
> >>>>>> Profile results with 250 line file:
> >>>>>>
> >>>>>> > summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>>>> $by.self
> >>>>>>              self.time self.pct total.time total.pct
> >>>>>> scan             23.88    68.15      23.88     68.15
> >>>>>> read.table       10.78    30.76      35.04    100.00
> >>>>>> type.convert      0.30     0.86       0.32      0.91
> >>>>>> character         0.02     0.06       0.02      0.06
> >>>>>> file              0.02     0.06       0.02      0.06
> >>>>>> lapply            0.02     0.06       0.02      0.06
> >>>>>> unlist            0.02     0.06       0.02      0.06
> >>>>>>
> >>>>>> $by.total
> >>>>>>                total.time total.pct self.time self.pct
> >>>>>> read.table          35.04    100.00     10.78    30.76
> >>>>>> scan                23.88     68.15     23.88    68.15
> >>>>>> type.convert         0.32      0.91      0.30     0.86
> >>>>>> sapply               0.04      0.11      0.00     0.00
> >>>>>> character            0.02      0.06      0.02     0.06
> >>>>>> file                 0.02      0.06      0.02     0.06
> >>>>>> lapply               0.02      0.06      0.02     0.06
> >>>>>> unlist               0.02      0.06      0.02     0.06
> >>>>>> simplify2array       0.02      0.06      0.00     0.00
> >>>>>>
> >>>>>> $sample.interval
> >>>>>> [1] 0.02
> >>>>>>
> >>>>>> $sampling.time
> >>>>>> [1] 35.04
> >>>>>>
> >>>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds
> >>>>>> <marklee...@gmail.com> wrote:
> >>>>>>
> >>>>>>> hi gene: maybe someone else will reply with some
> >>>>>>> subtleties that I'm not aware of.
> >>>>>>> one other thing that might help: if you know which
> >>>>>>> columns you want, you can set the others to NULL through
> >>>>>>> colClasses and this should speed things up also. For
> >>>>>>> example, say you knew you only wanted the first four
> >>>>>>> columns and they were character. then you could do,
> >>>>>>>
> >>>>>>> read.table(whatever, as.is=TRUE, colClasses =
> >>>>>>> c(rep(character,4), rep(NULL,3696)).
> >>>>>>>
> >>>>>>> hopefully someone else will say something that does the
> >>>>>>> trick. it seems odd to me as far as the difference in
> >>>>>>> timings ? good luck.
> >>>>>>>
> >>>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes
> >>>>>>> <gley...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Mark,
> >>>>>>>>
> >>>>>>>> Thank you for the reply
> >>>>>>>>
> >>>>>>>> I neglected to mention that I had already set
> >>>>>>>> options(stringsAsFactors=FALSE)
> >>>>>>>>
> >>>>>>>> I agree, skipping the factor determination can help
> >>>>>>>> performance.
> >>>>>>>>
> >>>>>>>> The main reason that I wanted to use read.table is
> >>>>>>>> because it will correctly determine the column classes
> >>>>>>>> for me. I don't really want to specify 3700 column
> >>>>>>>> classes! (I'm not sure what they are anyway).
> >>>>>>>>
> >>>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds
> >>>>>>>> <marklee...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Gene: Sometimes using colClasses in read.table
> >>>>>>>>> can speed things up. If you know what your variables
> >>>>>>>>> are ahead of time and what you want them to be, this
> >>>>>>>>> allows you to be specific by specifying, character or
> >>>>>>>>> numeric, etc and often it makes things faster. others
> >>>>>>>>> will have more to say.
> >>>>>>>>>
> >>>>>>>>> also, if most of your variables are characters, R
> >>>>>>>>> will try to convert them into factors by default. If
> >>>>>>>>> you use as.is = TRUE it won't do this and that might
> >>>>>>>>> speed things up also.
> >>>>>>>>>
> >>>>>>>>> Rejoinder: above tidbits are just from experience. I
> >>>>>>>>> don't know if it's in stone or a hard and fast rule.
> >>>>>>>>>
> >>>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes
> >>>>>>>>> <gley...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> ** Disclaimer: I'm looking for general suggestions **
> >>>>>>>>>> I'm sorry, but can't send out the file I'm using, so
> >>>>>>>>>> there is no reproducible example.
> >>>>>>>>>>
> >>>>>>>>>> I'm using read.table and it's taking over 30
> >>>>>>>>>> seconds to read a tiny file.
> >>>>>>>>>> The strange thing is that it takes roughly the same
> >>>>>>>>>> amount of time if the file is 100 times larger.
> >>>>>>>>>>
> >>>>>>>>>> After re-reviewing the data Import / Export manual I
> >>>>>>>>>> think the best approach would be to use Python, or
> >>>>>>>>>> perhaps the readLines function, but I was hoping to
> >>>>>>>>>> understand why the simple read.table approach wasn't
> >>>>>>>>>> working as expected.
> >>>>>>>>>>
> >>>>>>>>>> Some relevant facts:
> >>>>>>>>>>
> >>>>>>>>>> 1. There are about 3700 columns. Maybe this is the
> >>>>>>>>>>    problem? Still the file size is not very large.
> >>>>>>>>>> 2. The file encoding is ANSI, but I'm not specifying
> >>>>>>>>>>    that in the function. Setting fileEncoding="ANSI"
> >>>>>>>>>>    produces an "unsupported conversion" error
> >>>>>>>>>> 3. readLines imports the lines quickly
> >>>>>>>>>> 4. scan imports the file quickly also
> >>>>>>>>>>
> >>>>>>>>>> Obviously, scan and readLines would require more
> >>>>>>>>>> coding to identify columns, etc.
> >>>>>>>>>>
> >>>>>>>>>> my code:
> >>>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1,
> >>>>>>>>>> sep='\t', header=TRUE))
> >>>>>>>>>>
> >>>>>>>>>> It's taking 33.4 seconds and the file size is
> >>>>>>>>>> only 315 kb!
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> Gene
> >>>>>>>>>>
> >>>>>>>>>> [[alternative HTML version deleted]]
> >>>>>>>>>>
> >>>>>>>>>> ______________________________________________
> >>>>>>>>>> R-help@r-project.org mailing list
> >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>> PLEASE do read the posting guide
> >>>>>>>>>> http://www.R-project.org/posting-guide.html and
> >>>>>>>>>> provide commented, minimal, self-contained,
> >>>>>>>>>> reproducible code.
> >>>>>
> >>>>> --
> >>>>> Peter Dalgaard, Professor,
> >>>>> Center for Statistics, Copenhagen Business School
> >>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> >>>>> Phone: (+45)38153501
> >>>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
>
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
> Biology, UCT), Dipl. Phys. (Germany)
>
> Centre of Excellence for Invasion Biology
> Stellenbosch University
> South Africa
>
> Tel :       +33 - (0)9 53 10 27 44
> Cell:       +33 - (0)6 85 62 59 98
> Fax :       +33 - (0)9 58 10 27 44
>
> Fax (D):    +49 - (0)3 21 21 25 22 44
>
> email:      rai...@krugs.de
>
> Skype:      RMkrug
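
For anyone who wants to try the timings from this thread without the original file, here is a minimal sketch along the lines of what Peter asked for: a small R program that generates a wide tab-delimited file and times read.table on it. The file contents, the column names (V1, V2, ...) and the temporary path are all made up for illustration; only the shape (7 rows by 3765 columns) comes from Petr's dim(dat). It also shows the colClasses call with "character" and "NULL" quoted, as Peter suggested. There is no guarantee this reproduces the slow read, since the cause was never pinned down in the thread.

```r
# Hypothetical stand-in for test2.txt: same shape as reported by dim(dat),
# but the values and column names are invented.
set.seed(1)
n_row <- 7
n_col <- 3765
wide <- as.data.frame(matrix(round(runif(n_row * n_col), 3), nrow = n_row))
path <- tempfile(fileext = ".txt")
write.table(wide, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Baseline: let read.table work out every column class itself.
t_guess <- system.time(
  dat <- read.table(path, sep = "\t", header = TRUE)
)

# Keep the first four columns as character and skip all the others.
# The class names are quoted strings: "character" and "NULL".
classes <- c(rep("character", 4), rep("NULL", n_col - 4))
t_skip <- system.time(
  dat4 <- read.table(path, sep = "\t", header = TRUE, colClasses = classes)
)

dim(dat)
dim(dat4)

# Elapsed seconds for the two calls; the values depend on machine and R version.
c(guess = t_guess[["elapsed"]], skip = t_skip[["elapsed"]])
```

If the two elapsed times come out about the same on a machine that shows the slow read, that would match Gene's observation that skipping columns made no difference, and would point away from type conversion as the bottleneck.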