Once you've read the data with readLines() head(raw) gets you the first 6 lines or head(nchar(raw)) the lengths of those lines
Once you know which row has the headers (here it is 2), just use names(dta) <- read.table(text=raw[2], stringsAsFactors=FALSE) to add column names to the data frame. ------- David > -----Original Message----- > From: Fisher Dennis [mailto:fis...@plessthan.com] > Sent: Wednesday, November 28, 2012 1:42 PM > To: dcarl...@tamu.edu > Cc: r-help@r-project.org > Subject: Re: [R] Speeding reading of large file > > An interesting approach -- I lose the column names (which I need) but I > could get them with something cute such as: > 1. read the first few lines only with readLines(FILENAME, n=10) > 2. use your approach to read.table -- this will grab the column > names > 3. replace the headers in the full version with the correct > column names > > Dennis Fisher MD > P < (The "P Less Than" Company) > Phone: 1-866-PLessThan (1-866-753-7784) > Fax: 1-866-PLessThan (1-866-753-7784) > www.PLessThan.com > > On Nov 28, 2012, at 11:32 AM, David L Carlson wrote: > > > Using your first approach, this should be faster > > > > raw <- readLines(con=filename) > > dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE) > > > > ---------------------------------------------- > > David L Carlson > > Associate Professor of Anthropology > > Texas A&M University > > College Station, TX 77843-4352 > > > >> -----Original Message----- > >> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r- > >> project.org] On Behalf Of Fisher Dennis > >> Sent: Wednesday, November 28, 2012 11:43 AM > >> To: r-help@r-project.org > >> Subject: [R] Speeding reading of large file > >> > >> R 2.15.1 > >> OS X and Windows > >> > >> Colleagues, > >> > >> I have a file that looks that this: > >> TABLE NO. 1 > >> PTID TIME AMT FORM PERIOD IPRED > >> CWRES EVID CP PRED RES WRES > >> 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 > >> 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 > 0.0000E+00 > >> 0.0000E+00 > >> 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 > >> 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 > 0.0000E+00 > >> 0.0000E+00 > >> 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 > 0.0000E+00 > >> 0.0000E+00 > >> 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 > 0.0000E+00 > >> 0.0000E+00 > >> 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 > 0.0000E+00 > >> 0.0000E+00 > >> 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 > 0.0000E+00 > >> 0.0000E+00 > >> 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 > 0.0000E+00 > >> 0.0000E+00 > >> 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 > 0.0000E+00 > >> 0.0000E+00 > >> > >> The file is reasonably large (> 10^6 lines) and the two line header > is > >> repeated periodically in the file. > >> I need to read this file in as a data frame. Note that the number > of > >> columns, the column headers, and the number of replicates of the > >> headers are not known in advance. > >> > >> I have tried two approaches to this: > >> First Approach: > >> 1. readLines(FILENAME) to read in the file > >> 2. use grep to find the repeat headers; strip out the > >> repeat headers > >> 3. write() the object to tempfile, read in that temporary > >> file using read.table(tempfile, header=TRUE, skip=1) [an alternative > is > >> to use textConnection but that does not appear to speed things] > >> > >> Second Approach: > >> 1. TEMP <- read.table(FILENAME, header=TRUE, skip=1, > >> fill=TRUE, as.is=TRUE) > >> 2. get rid of the errant entries with: > >> TEMP[!is.na(as.numeric(TEMP[,1])),] > >> 3. reading of the character entries forced all columns to > >> character mode. Therefore, I convert each column to numeric: > >> for (COL in 1:ncol(TEMP)) TEMP[,COL] <- > >> as.numeric(TEMP[,COL]) > >> The second approach is ~ 20% faster than the first. With the second > >> approach, the conversion to numeric occupies 50% of the elapsed > time. > >> > >> Is there some approach that would be much faster? For example, > would a > >> vectorized approach to conversion to numeric improve throughput? > Or, > >> is there some means to ensure that all data are read as numeric (I > >> tried to use colClasses but that triggered an error when the text > >> string was encountered). > >> > >> ############################ > >> A dput version of the data is: > >> c("TABLE NO. 1", " PTID TIME AMT FORM > >> PERIOD IPRED CWRES EVID CP PRED > >> RES WRES", > >> " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 > >> 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 > >> 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 > 0.0000E+00 > >> 0.0000E+00" > >> ) > >> > >> This can be assembled into a large dataset and written to a file > named > >> FILENAME with the following code: > >> cat(c("TABLE NO. 1", " PTID TIME AMT FORM > >> PERIOD IPRED CWRES EVID CP PRED > >> RES WRES", > >> " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 > >> 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 > >> 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 > 0.0000E+00 > >> 0.0000E+00", > >> " 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 > >> 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 > 0.0000E+00 > >> 0.0000E+00" > >> )[rep(1:10, 1000)], file="FILENAME", sep="\n") > >> > >> > >> Dennis > >> > >> > >> Dennis Fisher MD > >> P < (The "P Less Than" Company) > >> Phone: 1-866-PLessThan (1-866-753-7784) > >> Fax: 1-866-PLessThan (1-866-753-7784) > >> www.PLessThan.com > >> > >> ______________________________________________ > >> R-help@r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide http://www.R-project.org/posting- > >> guide.html > >> and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.