Re: [R] Incremental ReadLines

Frederik Lang Sun, 17 Apr 2011 17:29:52 -0700

Hi again,

Changing my code by defining vectors outside the loop and combining them
afterwards helped a lot so now the code does not slow down anymore and I was
able to parse the file in less than 2 hours. Not fantastic but it works.
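
A minimal sketch of that change, for the archive: the Id values go into a
preallocated plain vector inside the loop, and the data frame is built once
at the end. The name read_ids and the capacity argument are made up for
illustration, and the parsing assumes lines of the form "Id: <number>" as in
the code quoted further down.

```r
# Hypothetical helper: collect the numeric Id from every "Id:" line.
read_ids <- function(path, capacity = 500000) {
  ids <- numeric(capacity)            # preallocated once, outside the loop
  cases <- 0
  input <- file(path, "rt")
  on.exit(close(input))               # close the connection even on error
  while (length(line <- readLines(input, n = 1)) == 1) {
    if (grepl("Id:", line, fixed = TRUE)) {
      cases <- cases + 1
      # if cases exceeds capacity, R still extends the vector for us
      ids[cases] <- as.numeric(sub("^.*Id:[[:space:]]*", "", line))
    }
  }
  length(ids) <- cases                # trim the unused tail
  data.frame(Id = ids)                # combine once, after the loop
}

# Usage on a small made-up file
tmp <- tempfile()
writeLines(c("Id: 1", "Var1: false", "", "Id: 2", "missing", "", "Id: 3"), tmp)
ids <- read_ids(tmp)
```

The capacity only needs to be a rough guess at the number of records; a
guess that is too small costs speed, not correctness.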


I will try William's last suggestion of how to parse it without looping
the next time I have to parse a large file.

Many thanks for your help!


Frederik

On Thu, Apr 14, 2011 at 4:58 PM, William Dunlap <wdun...@tibco.com> wrote:

> [see below]
>
> From: Frederik Lang [mailto:frederikl...@gmail.com]
> Sent: Thursday, April 14, 2011 12:56 PM
> To: William Dunlap
> Cc: r-help@r-project.org
> Subject: Re: [R] Incremental ReadLines
>
>
>
>         Hi Bill,
>
>        Thank you so much for your suggestions. I will try to alter my
>        code.
>
>
>        Regarding the even shorter solution outside the loop, it looks
>        good, but my problem is that not all observations have the same
>        variables, so three different observations might look like this:
>
>
>        Id: 1
>        Var1: false
>        Var2: 6
>        Var3: 8
>
>        Id: 2
>        missing
>
>        Id: 3
>        Var1: true
>        3 4 5
>        Var2: 7
>        Var3: 3
>
>
>        I thought that to do it without looping, my data had to be quite
>        systematic, which it is not. I might be wrong though.
>
> Doing the simple preallocation that I describe should speed it up
> a lot with very little effort.  It is more work to manipulate the
> columns one at a time instead of using data.frame subscripting and
> it may not be worth it if you have lots of columns.
>
> If you have a lot of this sort of file and feel that it will be worth
> the programming time to do something fancier, here is some code that
> reads lines of the form
>
> > cat(lines, sep="\n")
> Id: First
>   Var1: false
>  Var2: 6
>  Var3: 8
>
> Id: Second
> Id: Last
>  Var1: true
>  Var3: 8
>
> and produces a matrix with the Id's along the rows and the Var's
> along the columns:
>
> > f(lines)
>       Var1    Var2 Var3
> First  "false" "6"  "8"
> Second NA      NA   NA
> Last   "true"  NA   "8"
>
> The function f is:
>
> f <- function (lines)
> {
>    # keep only lines with colons
>    lines <- grep(value = TRUE, "^.+:", lines)
>    lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
>    isIdLine <- grepl("^Id:", lines)
>    group <- cumsum(isIdLine)
>    rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
>    lines <- lines[!isIdLine]
>    group <- group[!isIdLine]
>    varname <- sub("[[:space:]]*:.*$", "", lines)
>    value <- sub(".*:[[:space:]]*", "", lines)
>    colnames <- unique(varname)
>    col <- match(varname, colnames)
>    retval <- array(NA_character_,
>        c(length(rownames), length(colnames)),
>        dimnames = list(rownames, colnames))
>    retval[cbind(group, col)] <- value
>    retval
> }
>
> The main trick is the matrix subscript given to retval on the
> penultimate line.
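
To illustrate the matrix subscript on a toy case (the names m and idx are
invented here and are not part of f): when a matrix is indexed by a
two-column matrix, each row of the index is read as one (row, column) pair,
so a single assignment fills scattered cells without a loop.

```r
# 3 x 3 character matrix of NAs, shaped like retval in f
m <- array(NA_character_, c(3, 3),
           dimnames = list(c("First", "Second", "Last"),
                           c("Var1", "Var2", "Var3")))

# Each row of idx is one (row, column) position to fill
idx <- cbind(c(1, 1, 1, 3, 3),
             c(1, 2, 3, 1, 3))
m[idx] <- c("false", "6", "8", "true", "8")

print(m)
```

Cells that never appear in idx, such as the whole "Second" row, keep their
initial NA, which is how observations with missing variables are handled.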
>
>        Thanks again,
>
>
>        Frederik
>
>
>
>        On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap <wdun...@tibco.com> wrote:
>
>
>                I have two suggestions to speed up your code, if you
>                must use a loop.
>
>                First, don't grow your output dataset at each iteration.
>                Instead of
>                    cases <- 0
>                    output <- numeric(cases)
>                    while(length(line <- readLines(input, n=1))==1) {
>                       cases <- cases + 1
>                       output[cases] <- as.numeric(line)
>                    }
>                preallocate the output vector to be about the size of
>                its eventual length (slightly bigger is better), replacing
>                    output <- numeric(cases)
>                with the likes of
>                    output <- numeric(500000)
>                and when you are done with the loop trim down the length
>                if it is too big
>                    if (cases < length(output)) length(output) <- cases
>                Growing your dataset in a loop can cause quadratic or worse
>                growth in time with problem size and the above sort of
>                code should make the time grow linearly with problem size.
>
>                Second, don't do data.frame subscripting inside your loop.
>                Instead of
>                    data <- data.frame(Id=numeric(cases))
>                    while(...) {
>                        data[cases, 1] <- newValue
>                    }
>                do
>                    Id <- numeric(cases)
>                    while(...) {
>                        Id[cases] <- newValue
>                    }
>                    data <- data.frame(Id = Id)
>                This is just the general principle that you don't want to
>                repeat the same operation over and over in a loop.
>                dataFrame[i,j] first extracts column j then extracts element
>                i from that column.  Since the column is the same every
>                iteration you may as well extract the column outside of the
>                loop.
>
>                Avoiding the loop altogether is the fastest.  E.g., the code
>                you showed does the same thing as
>                  idLines <- grep(value=TRUE, "Id:", readLines(file))
>                  data.frame(Id = as.numeric(
>                      sub("^.*Id:[[:space:]]*", "", idLines)))
>                You can also use an external process (perl or grep) to filter
>                out the lines that are not of interest.
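
A sketch of the external-filter idea, assuming a Unix-like system with grep
on the PATH; the temporary file and its contents are invented for the
example. pipe() runs the command and readLines() then sees only the lines
that grep lets through.

```r
# Made-up sample file standing in for the real data
path <- tempfile(fileext = ".txt")
writeLines(c("Id: 1", "Var1: false", "Id: 2", "missing", "Id: 3"), path)

# grep does the filtering, so R only ever reads the "Id:" lines
cmd <- paste("grep", shQuote("Id:"), shQuote(path))
idLines <- readLines(pipe(cmd))

# Same extraction step as in the all-R version
ids <- data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
```

Whether this beats grep() inside R will depend on the file and the machine,
so it is worth timing both on a subsample first.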
>
>
>                Bill Dunlap
>                Spotfire, TIBCO Software
>                wdunlap tibco.com
>
>
>                > -----Original Message-----
>                > From: r-help-boun...@r-project.org
>                > [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
>                > Sent: Wednesday, April 13, 2011 10:58 AM
>                > To: r-help@r-project.org
>                > Subject: Re: [R] Incremental ReadLines
>                >
>
>                > Hi there,
>                >
>                > I am having a similar problem with reading in a large text
>                > file with around 550,000 observations, each with 10 to 100
>                > lines of description. I am trying to parse it in R but I am
>                > having trouble with the size of the file. It seems to slow
>                > down dramatically at some point. I would be happy for any
>                > suggestions. Here is my code, which works fine when I run
>                > it on a subsample of my dataset.
>                >
>                > #Defining datasource
>                > file <- "filename.txt"
>                >
>                > #Creating placeholder for data and assigning column names
>                > data <- data.frame(Id=NA)
>                >
>                > #Starting by case = 0
>                > case <- 0
>                >
>                > #Opening a connection to data
>                > input <- file(file, "rt")
>                >
>                > #Going through cases
>                > repeat {
>                >   line <- readLines(input, n=1)
>                >   if (length(line)==0) break
>                >   if (length(grep("Id:",line)) != 0) {
>                >     case <- case + 1 ; data[case,] <-NA
>                >     split_line <- strsplit(line,"Id:")
>                >     data[case,1] <- as.numeric(split_line[[1]][2])
>                >     }
>                > }
>                >
>                > #Closing connection
>                > close(input)
>                >
>                > #Saving dataframe
>                > write.csv(data,'data.csv')
>                >
>                >
>                > Kind regards,
>                >
>                >
>                > Frederik
>                >
>                >
>                > --
>                > View this message in context:
>                > http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
>                > Sent from the R help mailing list archive at Nabble.com.
>                >
>                > ______________________________________________
>                > R-help@r-project.org mailing list
>                > https://stat.ethz.ch/mailman/listinfo/r-help
>                > PLEASE do read the posting guide
>                > http://www.R-project.org/posting-guide.html
>                > and provide commented, minimal, self-contained, reproducible code.
>                >
>
>
>
>
