James,

I think those are Unix commands?  I'm on Windows, so that's not an option
(for now).

Also, the suggestions from Duncan and Phil seem to be working.  Thank you
so much; such a simple fix, just adding "r" or "rt" to the file connection.
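
For the archives, the fix looks roughly like this (a sketch -- the file name
is a placeholder, and the chunk size matches my 1 header + 121 records):

    con <- file("thefile.csv", open = "rt")  # open the connection once
    repeat {
        chunk <- readLines(con, n = 122)     # each call continues where the last one stopped
        if (length(chunk) == 0) break        # stop at end of file
        ## ... process the header + 121 records in 'chunk' ...
    }
    close(con)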


I read about blocking, but I didn't realize that it meant reading in
"chunks".  I was thinking of something more like "blocking out", or
guarding (perhaps for security).

On Mon, Nov 2, 2009 at 1:47 PM, James W. MacDonald <jmac...@med.umich.edu> wrote:

> Hi Gene,
>
> Rather than using R to parse this file, have you considered using either
> grep or sed to pre-process the file and then read it in?
>
> It looks like you just want lines starting with numbers, so something like
>
> grep '^[0-9]\+' thefile.csv > otherfile.csv
>
> should be much faster, and then you can just read in otherfile.csv using
> read.csv().
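>
> Untested, but with the header lines stripped out, that step should just be
>
>     dat <- read.csv("otherfile.csv", header = FALSE)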
>
> Best,
>
> Jim
>
>
>
> Gene Leynes wrote:
>
>> I've been trying to figure out how to read in a large file for a few days
>> now, and after extensive research I'm still not sure what to do.
>>
>> I have a large comma-delimited text file with 59 fields in each record,
>> and a header every 121 records.
>>
>> This function works well for smallish files:
>>
>> getcsv <- function(fname){
>>     ff <- file(description = fname)
>>     x <- readLines(ff)
>>     close(ff)
>>     x <- x[x != ""]             # remove blank lines
>>     x <- x[grep("^[-0-9]", x)]  # keep only lines that start with a number
>>     spl <- strsplit(x, ",")     # this part is slow, but manageable
>>     # coerce each line to numeric; non-numeric fields become NA and are dropped
>>     xx <- t(sapply(spl, function(s) as.vector(na.omit(as.numeric(s)))))
>>     return(xx)
>> }
>>
>> It's not elegant, but it works.
>> For 121,000 records it completes in 2.3 seconds
>> For 121,000*5 records it completes in 63 seconds
>> For 121,000*10 records it doesn't complete
>>
>> When I try other methods to read the file in chunks (using scan), the
>> process breaks down because I have to start at the beginning of the file
>> on every iteration.  For example:
>> fnn <- function(n, col){
>>     a <- 122*(n-1) + 2    # first data line of chunk n (each chunk = 1 header + 121 records)
>>     xx <- scan(fname, skip = a-1, nlines = 121, sep = ",",
>>                quiet = TRUE, what = character(0))
>>     xx <- xx[xx != '']                       # drop empty fields
>>     xx <- matrix(xx, ncol = 49, byrow = TRUE)
>>     xx[, col]
>> }
>> system.time(sapply(1:10, fnn, col = 26))     # 0.31 seconds
>> system.time(sapply(91:100, fnn, col = 26))   # 1.09 seconds
>> system.time(sapply(901:910, fnn, col = 26))  # 5.78 seconds
>>
>> Even though I'm only getting the 26th column for 10 sets of records, it
>> takes a lot longer the further into the file I go.
>>
>> How can I tell scan to pick up where it left off, without starting at the
>> beginning of the file every time?  There must be a good example somewhere.
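>>
>> Something like this is what I'm imagining (untested -- I'm guessing that
>> scan() reads on from an open connection's current position):
>>
>>    con <- file(fname, open = "rt")   # open once, keep it open
>>    for (n in 1:10) {
>>        readLines(con, n = 1)         # skip this chunk's header line
>>        xx <- scan(con, nlines = 121, sep = ",",
>>                   what = character(0), quiet = TRUE)  # no skip needed
>>        ## ... same per-chunk processing as in fnn() ...
>>    }
>>    close(con)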
>>
>> I have done a lot of research (in fact, thank you to Michael J. Crawley
>> and others for your help thus far).
>>
>> Thanks,
>>
>> Gene
>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
