On Fri, Oct 22, 2010 at 5:17 PM, Dimitri Liakhovitski
<dimitri.liakhovit...@gmail.com> wrote:
> I know I could figure it out empirically - but maybe based on your
> experience you can tell me if it's doable in a reasonable amount of
> time:
> I have a table (in .txt) with 17,000,000 rows (and 30 columns).
> I can't read it all in (there are many strings). So I thought I could
> read it in parts (e.g., 1 million rows at a time) using nrows= and skip=.
> I was able to read in the first 1,000,000 rows no problem in 45 sec.
> But then I tried to skip 16,999,999 rows and then read in things. Then
> R crashed. Should I try again - or is it too many rows to skip for R?
>

You could try read.csv.sql in sqldf.

library(sqldf)
read.csv.sql("myfile.csv", skip = 1000, header = FALSE)
or
read.csv.sql("myfile.csv", sql = "select * from file limit 2000 offset 1000")

The first skips the first 1000 lines, including the header. The
second skips 1000 data rows (but still reads the header) and then
reads the next 2000 rows.  You may or may not need to specify other
arguments as well. For example, you may need to specify eol = "\n" or
some other value, depending on your line endings.
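
As a rough sketch of reading a big file in parts this way: the helper
names chunk_sql() and read_chunk(), the file name "myfile.txt" and the
tab separator below are all assumptions for illustration, not anything
taken from your actual file.

```r
# Sketch only: chunk_sql() and read_chunk() are hypothetical helpers,
# and "myfile.txt" with a tab separator is an assumed layout.

# Build a query that skips `offset` data rows and returns the next `n`.
chunk_sql <- function(offset, n) {
  sprintf("select * from file limit %d offset %d",
          as.integer(n), as.integer(offset))
}

# Read one chunk; the header line is still used for the column names.
read_chunk <- function(path, offset, n, sep = "\t") {
  sqldf::read.csv.sql(path, sql = chunk_sql(offset, n),
                      header = TRUE, sep = sep)
}

infile <- "myfile.txt"
if (file.exists(infile)) {
  # Skip the first 16,000,000 data rows and read the next 1,000,000.
  last_part <- read_chunk(infile, offset = 16000000, n = 1000000)
  str(last_part)
}
```

As far as I know each call imports the whole file into a fresh
temporary database, so it is probably worth getting what you need in
as few calls as possible rather than looping over many small chunks.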

Unlike read.csv, read.csv.sql reads the data directly into an SQLite
database (which it creates on the fly for you).  The data does not go
through R during this operation.  From there it reads only the data
you ask for into R, so R never sees the skipped-over data.  After all
that it automatically deletes the database.
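
If it helps to see the idea spelled out, here is a hand-rolled version
of roughly the same steps using DBI and RSQLite directly.  This is only
an illustration of the approach, not sqldf's actual internals; the
function name query_file() is made up, and it assumes a comma-separated
file with a header line.

```r
library(DBI)

# Hypothetical helper: import a delimited file into a throwaway SQLite
# database, run one query against it, then remove the database again.
query_file <- function(path, sql, sep = ",") {
  dbfile <- tempfile(fileext = ".sqlite")
  con <- dbConnect(RSQLite::SQLite(), dbfile)
  on.exit({
    dbDisconnect(con)
    unlink(dbfile)
  })
  # RSQLite accepts a file path as the value and imports that file
  # into the table named "file".
  dbWriteTable(con, "file", path, header = TRUE, sep = sep)
  # Only the rows selected by the query come back as a data frame.
  dbGetQuery(con, sql)
}
```

For example, query_file("myfile.csv", "select * from file limit 2000
offset 1000") would return 2000 rows after skipping the first 1000.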

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
