Hi again,

Changing my code by defining vectors outside the loop and combining them afterwards helped a lot, so the code no longer slows down and I was able to parse the file in less than 2 hours. Not fantastic, but it works.
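The revised code itself is not shown in this thread. The following is only a minimal sketch of the approach described above (fill plain vectors inside the loop, build the data frame once afterwards). The sample lines, the upper bound `n_max` and the choice of `Var1` as a second column are assumptions made for illustration.

```r
# Hypothetical sample input; the real file is not part of this thread.
sample_lines <- c("Id: 1", "Var1: false", "Var2: 6", "",
                  "Id: 2", "missing", "",
                  "Id: 3", "Var1: true")
input <- textConnection(sample_lines)

n_max <- 10L                        # assumed upper bound on the number of cases
Id    <- rep(NA_real_, n_max)       # one plain vector per column, preallocated
Var1  <- rep(NA_character_, n_max)
case  <- 0L

repeat {
  line <- readLines(input, n = 1)
  if (length(line) == 0) break      # end of input
  if (grepl("^Id:", line)) {
    # a new observation starts here
    case <- case + 1L
    Id[case] <- as.numeric(sub("^Id:[[:space:]]*", "", line))
  } else if (case > 0L && grepl("^Var1:", line)) {
    # variables that never appear for a case simply stay NA
    Var1[case] <- sub("^Var1:[[:space:]]*", "", line)
  }
}
close(input)

# Combine the vectors once, after the loop, dropping the unused tail.
data <- data.frame(Id = Id[seq_len(case)],
                   Var1 = Var1[seq_len(case)],
                   stringsAsFactors = FALSE)
```

The loop only ever assigns into atomic vectors, so no data.frame subscripting happens per line.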
I will try William's last suggestion of how to parse it without looping the next time I have to parse a large file.

Many thanks for your help!

Frederik

On Thu, Apr 14, 2011 at 4:58 PM, William Dunlap <wdun...@tibco.com> wrote:

> [see below]
>
> From: Frederik Lang [mailto:frederikl...@gmail.com]
> Sent: Thursday, April 14, 2011 12:56 PM
> To: William Dunlap
> Cc: r-help@r-project.org
> Subject: Re: [R] Incremental ReadLines
>
> Hi Bill,
>
> Thank you so much for your suggestions. I will try and alter my code.
>
> Regarding the even shorter solution outside the loop it looks
> good but my problem is that not all observations have the same variables
> so that three different observations might look like this:
>
> Id: 1
> Var1: false
> Var2: 6
> Var3: 8
>
> Id: 2
> missing
>
> Id: 3
> Var1: true
> 3 4 5
> Var2: 7
> Var3: 3
>
> Doing it without looping through I thought my data had to be quite
> systematic, which it is not. I might be wrong though.
>
> Doing the simple preallocation that I describe should speed it up
> a lot with very little effort. It is more work to manipulate the
> columns one at a time instead of using data.frame subscripting and
> it may not be worth it if you have lots of columns.
>
> If you have a lot of this sort of file and feel that it will be worth
> the programming time to do something fancier, here is some code that
> reads lines of the form
>
> > cat(lines, sep="\n")
> Id: First
> Var1: false
> Var2: 6
> Var3: 8
>
> Id: Second
> Id: Last
> Var1: true
> Var3: 8
>
> and produces a matrix with the Id's along the rows and the Var's
> along the columns:
>
> > f(lines)
>        Var1    Var2 Var3
> First  "false" "6"  "8"
> Second NA      NA   NA
> Last   "true"  NA   "8"
>
> The function f is:
>
> f <- function (lines)
> {
>     # keep only lines with colons
>     lines <- grep(value = TRUE, "^.+:", lines)
>     lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
>     isIdLine <- grepl("^Id:", lines)
>     group <- cumsum(isIdLine)
>     rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
>     lines <- lines[!isIdLine]
>     group <- group[!isIdLine]
>     varname <- sub("[[:space:]]*:.*$", "", lines)
>     value <- sub(".*:[[:space:]]*", "", lines)
>     colnames <- unique(varname)
>     col <- match(varname, colnames)
>     retval <- array(NA_character_, c(length(rownames), length(colnames)),
>         dimnames = list(rownames, colnames))
>     retval[cbind(group, col)] <- value
>     retval
> }
>
> The main trick is the matrix subscript given to retval on the
> penultimate line.
>
> Thanks again,
>
> Frederik
>
> On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap <wdun...@tibco.com> wrote:
>
> I have two suggestions to speed up your code, if you must use a loop.
>
> First, don't grow your output dataset at each iteration.
> Instead of
>     cases <- 0
>     output <- numeric(cases)
>     while(length(line <- readLines(input, n=1))==1) {
>         cases <- cases + 1
>         output[cases] <- as.numeric(line)
>     }
> preallocate the output vector to be about the size of
> its eventual length (slightly bigger is better), replacing
>     output <- numeric(0)
> with the likes of
>     output <- numeric(500000)
> and when you are done with the loop trim down the length
> if it is too big
>     if (cases < length(output)) length(output) <- cases
> Growing your dataset in a loop can cause quadratic or worse
> growth in time with problem size and the above sort of
> code should make the time grow linearly with problem size.
>
> Second, don't do data.frame subscripting inside your loop.
> Instead of
>     data <- data.frame(Id=numeric(cases))
>     while(...) {
>         data[cases, 1] <- newValue
>     }
> do
>     Id <- numeric(cases)
>     while(...) {
>         Id[cases] <- newValue
>     }
>     data <- data.frame(Id = Id)
> This is just the general principle that you don't want to
> repeat the same operation over and over in a loop.
> dataFrame[i,j] first extracts column j then extracts element
> i from that column. Since the column is the same every iteration
> you may as well extract the column outside of the loop.
>
> Avoiding the loop altogether is the fastest. E.g., the code
> you showed does the same thing as
>     idLines <- grep(value=TRUE, "Id:", readLines(file))
>     data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
> You can also use an external process (perl or grep) to filter
> out the lines that are not of interest.
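If the file is too large to hand to a single readLines() call, the preallocation advice and the vectorised grep/sub step can be combined by reading in chunks. The following is only a rough sketch under assumptions: the temporary file stands in for the real data, and `chunk_size` and the preallocated length are arbitrary illustrative values, not figures from this thread.

```r
# Hypothetical stand-in for the large input file.
path <- tempfile(fileext = ".txt")
writeLines(c("Id: 1", "Var1: false", "",
             "Id: 2", "missing", "",
             "Id: 3", "Var1: true"), path)

input      <- file(path, "rt")
output     <- numeric(4)   # preallocated guess, slightly bigger than needed
cases      <- 0L
chunk_size <- 3L           # assumed; a real run would use a much larger chunk

while (length(chunk <- readLines(input, n = chunk_size)) > 0) {
  # vectorised extraction within the chunk, no per-line work
  id_lines <- grep("^Id:", chunk, value = TRUE)
  ids <- as.numeric(sub("^Id:[[:space:]]*", "", id_lines))
  if (length(ids) > 0) {
    output[cases + seq_along(ids)] <- ids
    cases <- cases + length(ids)
  }
}
close(input)

# Trim the unused tail of the preallocated vector.
if (cases < length(output)) length(output) <- cases

data <- data.frame(Id = output)
```

With a large chunk size the loop runs only a handful of times, so most of the work happens inside the vectorised grep() and sub() calls.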
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org
> > [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
> > Sent: Wednesday, April 13, 2011 10:58 AM
> > To: r-help@r-project.org
> > Subject: Re: [R] Incremental ReadLines
> >
> > Hi there,
> >
> > I am having a similar problem with reading in a large text
> > file with around 550.000 observations with each 10 to 100 lines of
> > description. I am trying to parse it in R but I have troubles with
> > the size of the file. It seems like it is slowing down dramatically
> > at some point. I would be happy for any suggestions. Here is my code,
> > which works fine when I am doing a subsample of my dataset.
> >
> > #Defining datasource
> > file <- "filename.txt"
> >
> > #Creating placeholder for data and assigning column names
> > data <- data.frame(Id=NA)
> >
> > #Starting by case = 0
> > case <- 0
> >
> > #Opening a connection to data
> > input <- file(file, "rt")
> >
> > #Going through cases
> > repeat {
> >   line <- readLines(input, n=1)
> >   if (length(line)==0) break
> >   if (length(grep("Id:",line)) != 0) {
> >     case <- case + 1 ; data[case,] <-NA
> >     split_line <- strsplit(line,"Id:")
> >     data[case,1] <- as.numeric(split_line[[1]][2])
> >   }
> > }
> >
> > #Closing connection
> > close(input)
> >
> > #Saving dataframe
> > write.csv(data,'data.csv')
> >
> > Kind regards,
> >
> > Frederik
> >
> > --
> > View this message in context:
> > http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.