----------------------------------------
> From: ggrothendi...@gmail.com
> Date: Fri, 22 Oct 2010 18:28:14 -0400
> To: dimitri.liakhovit...@gmail.com
> CC: r-help@r-project.org
> Subject: Re: [R] How long does skipping in read.table take
>
> On Fri, Oct 22, 2010 at 5:17 PM, Dimitri Liakhovitski
>  wrote:
> > I know I could figure it out empirically - but maybe based on your
> > experience you can tell me if it's doable in a reasonable amount of
> > time:
> > I have a table (in .txt) with 17,000,000 rows (and 30 columns).
> > I can't read it all in (there are many strings). So I thought I could
> > read it in in parts (e.g., 1 million rows at a time) using nrows= and skip=.
> > I was able to read in the first 1,000,000 rows no problem in 45 sec.
> > But then I tried to skip 16,999,999 rows and then read in things. Then
> > R crashed. Should I try again - or is it too many rows to skip for R?
> >
>
> You could try read.csv.sql in sqldf.
>
> library(sqldf)
> read.csv.sql("myfile.csv", skip = 1000, header = FALSE)
> or
> read.csv.sql("myfile.csv", sql = "select * from file limit 2000 offset 1000")
>
> The first skips the first 1000 lines including the header and the
> second one skips 1000 rows (but still reads in the header) and then
> reads 2000 rows. You may or may not need to specify other arguments
> as well. For example, you may need to specify eol = "\n" or another
> value, depending on your line endings.
>
> Unlike read.csv, read.csv.sql reads the data directly into an sqlite
> database (which it creates on the fly for you). The data does not go
> through R during this operation. From there it reads only the data
> you ask for into R so R never sees the skipped over data. After all
> that it automatically deletes the database.


The first time I saw this suggested I thought I would wait to
reply, because it seemed a bit of an odd suggestion and I assumed
I was missing some R-speak and a reply would waste everyone's time. However,
I still don't see what I'm missing here. A database is generally a big table
of data with various indices and locks that facilitate concurrent updates and
responses to arbitrary queries. That is fine for hotel reservation systems
where you need ACID guarantees, but it makes little sense for constant
data that will be accessed sequentially. A fast DB could take milliseconds to
respond; an anticipatory streaming system could always have the next data
ready in nanoseconds.
Is this thing really acting as a "DB", or is there something more to it?
Is there no well-buffered streaming facility for data you will use in order?

It sounds like you are just building indices and then deleting them
without ever really using random access. Is there not a better way?
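For what it's worth, base R can do sequential chunked reading without a
database: hold a connection open so each read.table call resumes where the
previous one stopped, instead of paying for skip= from the top each time.
This is only a minimal sketch, not anything from the sqldf package; the
temp file, column names, and chunk size below are made up for illustration.

```r
## Build a small throwaway file so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
write.table(data.frame(x = 1:25, y = letters[1:25]),
            path, row.names = FALSE, col.names = TRUE)

con <- file(path, open = "r")
hdr <- scan(con, what = "", nlines = 1, quiet = TRUE)  # consume the header row

chunk_size <- 10   # hypothetical; use ~1e6 for a 17M-row file
total <- 0
repeat {
  chunk <- tryCatch(
    read.table(con, nrows = chunk_size, stringsAsFactors = FALSE),
    error = function(e) NULL)          # read.table errors once input runs out
  if (is.null(chunk)) break
  total <- total + nrow(chunk)         # ...process the chunk here...
  if (nrow(chunk) < chunk_size) break  # short chunk means end of file
}
close(con)
total
```

Each pass costs only the I/O for that chunk, so the whole file is read in
one sequential sweep rather than O(n^2) re-skipping.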

Thanks



>

> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>

Mike Marchywka | V.P. Technology

415-264-8477
marchy...@phluant.com

Online Advertising and Analytics for Mobile
http://www.phluant.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
