Re: [R] Incremental ReadLines

William Dunlap Thu, 14 Apr 2011 14:01:16 -0700

[see below]

From: Frederik Lang [mailto:frederikl...@gmail.com] 
Sent: Thursday, April 14, 2011 12:56 PM
To: William Dunlap
Cc: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines




        Hi Bill,
        
        Thank you so much for your suggestions. I will try and alter my
        code.
        
        
        Regarding the even shorter solution outside the loop, it looks
        good, but my problem is that not all observations have the same
        variables, so three different observations might look like this:
        
        
        Id: 1
        Var1: false
        Var2: 6
        Var3: 8
        
        Id: 2
        missing
        
        Id: 3
        Var1: true
        3 4 5
        Var2: 7
        Var3: 3
        
        
        I thought that doing it without looping would require my data to
        be quite systematic, which it is not. I might be wrong though.

Doing the simple preallocation that I describe should speed it up
a lot with very little effort.  It is more work to manipulate the
columns one at a time instead of using data.frame subscripting and
it may not be worth it if you have lots of columns.
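
As a rough sketch of the preallocation idea (not the original poster's
code), the loop below fills a plain numeric vector and builds the data
frame only once at the end. The textConnection input and the initial
size of 2 are assumptions made so the example is self-contained; the
size is deliberately too small so that the doubling branch is exercised.

```r
# Stand-in for a file connection (hypothetical data for illustration).
input <- textConnection(c("3", "1", "4", "1", "5"))

output <- numeric(2)   # initial guess at the final length (too small on purpose)
cases <- 0
while (length(line <- readLines(input, n = 1)) == 1) {
    cases <- cases + 1
    if (cases > length(output)) {
        # Guess was too small: double the capacity instead of adding one slot.
        length(output) <- 2 * length(output)
    }
    output[cases] <- as.numeric(line)
}
close(input)

length(output) <- cases            # trim the unused tail
data <- data.frame(Id = output)    # build the data frame once, after the loop
```

Doubling the length when the guess runs out keeps the number of
reallocations small, so a poor initial guess costs little.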

If you have a lot of this sort of file and feel that it will be worth
the programming time to do something fancier, here is some code that
reads lines of the form

> cat(lines, sep="\n")
Id: First
  Var1: false
  Var2: 6
  Var3: 8

Id: Second
Id: Last
  Var1: true
  Var3: 8

and produces a matrix with the Id's along the rows and the Var's
along the columns:

> f(lines)
       Var1    Var2 Var3
First  "false" "6"  "8"
Second NA      NA   NA
Last   "true"  NA   "8"

The function f is:

f <- function (lines)
{
    # keep only lines with colons
    lines <- grep(value = TRUE, "^.+:", lines)
    lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
    isIdLine <- grepl("^Id:", lines)
    group <- cumsum(isIdLine)
    rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
    lines <- lines[!isIdLine]
    group <- group[!isIdLine]
    varname <- sub("[[:space:]]*:.*$", "", lines)
    value <- sub(".*:[[:space:]]*", "", lines)
    colnames <- unique(varname)
    col <- match(varname, colnames)
    retval <- array(NA_character_,
        c(length(rownames), length(colnames)),
        dimnames = list(rownames, colnames))
    retval[cbind(group, col)] <- value
    retval
}

The main trick is the matrix subscript given to retval on the
penultimate line.
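
To see the matrix subscript in isolation, here is a small sketch with
made-up values (the names m, row, col and value are for illustration
only). A two-column matrix used as a subscript addresses one cell per
row: the first column gives the row number and the second the column
number.

```r
# A 3 x 2 character matrix filled with NA, like retval in f.
m <- array(NA_character_, c(3, 2),
           dimnames = list(c("First", "Second", "Last"), c("Var1", "Var2")))

row <- c(1, 1, 3)                  # row of each value (plays the role of group)
col <- c(1, 2, 1)                  # column of each value (plays the role of col)
value <- c("false", "6", "true")

# Each row of cbind(row, col) names one (row, column) cell to assign.
m[cbind(row, col)] <- value
m
```

Cells that are never named in the index matrix, such as everything in
the "Second" row, stay NA.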

        Thanks again,
        
        
        Frederik
        
        
        
        On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap
        <wdun...@tibco.com> wrote:
        

                I have two suggestions to speed up your code, if you
                must use a loop.
                
                First, don't grow your output dataset at each iteration.
                Instead of
                    cases <- 0
                    output <- numeric(cases)
                    while(length(line <- readLines(input, n=1))==1) {
                       cases <- cases + 1
                       output[cases] <- as.numeric(line)
                    }
                preallocate the output vector to be about the size of
                its eventual length (slightly bigger is better),
                replacing
                    output <- numeric(0)
                with the likes of
                    output <- numeric(500000)
                and when you are done with the loop trim down the length
                if it is too big
                    if (cases < length(output)) length(output) <- cases
                Growing your dataset in a loop can cause quadratic or
                worse growth in time with problem size and the above
                sort of code should make the time grow linearly with
                problem size.
                
                Second, don't do data.frame subscripting inside your
                loop.  Instead of
                    data <- data.frame(Id=numeric(cases))
                    while(...) {
                        data[cases, 1] <- newValue
                    }
                do
                    Id <- numeric(cases)
                    while(...) {
                        Id[cases] <- newValue
                    }
                    data <- data.frame(Id = Id)
                This is just the general principle that you don't want
                to repeat the same operation over and over in a loop.
                dataFrame[i,j] first extracts column j then extracts
                element i from that column.  Since the column is the
                same every iteration you may as well extract the column
                outside of the loop.
                Avoiding the loop altogether is the fastest.  E.g., the
                code you showed does the same thing as
                  idLines <- grep(value=TRUE, "Id:", readLines(file))
                  data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*",
                                                 "", idLines)))
                You can also use an external process (perl or grep) to
                filter out the lines that are not of interest.
                
                
                Bill Dunlap
                Spotfire, TIBCO Software
                wdunlap tibco.com
                

                > -----Original Message-----
                > From: r-help-boun...@r-project.org
                > [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
                > Sent: Wednesday, April 13, 2011 10:58 AM
                > To: r-help@r-project.org
                > Subject: Re: [R] Incremental ReadLines
                >
                
                > Hi there,
                >
                > I am having a similar problem with reading in a large
                > text file with around 550,000 observations, each with
                > 10 to 100 lines of description. I am trying to parse it
                > in R but I have trouble with the size of the file. It
                > seems to slow down dramatically at some point. I would
                > be happy for any suggestions. Here is my code, which
                > works fine when I run it on a subsample of my dataset.
                >
                > #Defining datasource
                > file <- "filename.txt"
                >
                > #Creating placeholder for data and assigning column names
                > data <- data.frame(Id=NA)
                >
                > #Starting by case = 0
                > case <- 0
                >
                > #Opening a connection to data
                > input <- file(file, "rt")
                >
                > #Going through cases
                > repeat {
                >   line <- readLines(input, n=1)
                >   if (length(line)==0) break
                >   if (length(grep("Id:",line)) != 0) {
                >     case <- case + 1 ; data[case,] <-NA
                >     split_line <- strsplit(line,"Id:")
                >     data[case,1] <- as.numeric(split_line[[1]][2])
                >     }
                > }
                >
                > #Closing connection
                > close(input)
                >
                > #Saving dataframe
                > write.csv(data,'data.csv')
                >
                >
                > Kind regards,
                >
                >
                > Frederik
                >
                >
                > --
                > View this message in context:
                > http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
                > Sent from the R help mailing list archive at Nabble.com.
                >
                > ______________________________________________
                > R-help@r-project.org mailing list
                > https://stat.ethz.ch/mailman/listinfo/r-help
                > PLEASE do read the posting guide
                > http://www.R-project.org/posting-guide.html
                > and provide commented, minimal, self-contained,
                > reproducible code.
                >
                
