[R] Big Data reading subsample csv

Tudor Medallion Thu, 16 Aug 2012 07:55:35 -0700

Hello,

Thank you for taking the time to read this.


I have a very large 30GB file of 6 million records and 3000 columns (mostly
categorical data) in csv format. I want to bootstrap subsamples for
multinomial regression, but it's proving difficult: even with 64GB of RAM
in my machine and a swap file twice that size, the process becomes extremely
slow and halts.

I'm thinking about generating subsample indices in R and feeding them into
a system command using sed or awk, but I don't know how to do this. If
someone knows of a clean way to do this using just R commands, I would be
really grateful.
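
For illustration, here is a minimal sketch of one R-only way to pull sampled
rows out of a large csv without loading the whole file. It streams the file
through a connection in chunks and keeps only the lines whose row numbers were
sampled. The function name read_sampled_rows and its arguments are
hypothetical, and the sketch assumes a single header line and no line breaks
embedded inside quoted fields.

```r
# Hypothetical helper (not from any package): read only the data rows whose
# 1-based indices appear in `keep`, streaming the file in chunks.
# Assumes one header line and no embedded newlines inside quoted fields.
read_sampled_rows <- function(path, keep, chunk_size = 10000L) {
  stopifnot(length(keep) > 0L)
  wanted <- sort(unique(keep))          # each line only needs to be read once
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  kept <- list()
  offset <- 0                           # data rows consumed so far
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # indices of wanted rows that fall inside the current chunk
    idx <- wanted[wanted > offset & wanted <= offset + length(lines)] - offset
    if (length(idx) > 0L) kept[[length(kept) + 1L]] <- lines[idx]
    offset <- offset + length(lines)
    if (offset >= wanted[length(wanted)]) break   # nothing left to find
  }
  dat <- read.csv(text = c(header, unlist(kept)), stringsAsFactors = FALSE)
  # restore the requested order, repeating rows drawn more than once
  dat[match(keep, wanted), , drop = FALSE]
}
```

A call might look like read_sampled_rows("big.csv", sample(6e6, 1e5, replace =
TRUE)), where "big.csv" is a placeholder file name. The chunk size trades
memory for the number of read calls; the value above is a guess rather than a
tuned setting.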

One problem is that I need to pick complete observations for the subsamples;
that is, I need all the rows of a particular multinomial observation, and
the number of rows differs from observation to observation. I plan to
use glmnet and then some fancy transforms to get an approximation to the
multinomial case. Another point is that I don't know how to choose a sample
size that fits within memory limits.
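
As a rough sketch of those two points, suppose there is a vector obs_id giving
the observation identifier of every data row in file order (an assumption; it
could perhaps be obtained by reading just that one column, for example with
colClasses set to "NULL" for all the others). Whole observations can then be
resampled by drawing identifiers rather than rows, and a small pilot read can
give a crude bytes-per-row figure for sizing. Both function names are
hypothetical.

```r
# Hypothetical helper: bootstrap whole observations.
# `obs_id` holds one identifier per data row, in file order.
# Returns the row indices of n_obs observations drawn with replacement,
# always including every row of each drawn observation.
sample_observation_rows <- function(obs_id, n_obs) {
  rows_by_obs <- split(seq_along(obs_id), obs_id)
  picked <- sample(names(rows_by_obs), size = n_obs, replace = TRUE)
  unlist(rows_by_obs[picked], use.names = FALSE)
}

# Hypothetical helper: crude estimate of how many rows fit in a memory budget,
# extrapolating from the in-memory size of a small pilot data frame.
rows_within_budget <- function(pilot, budget_bytes) {
  bytes_per_row <- as.numeric(object.size(pilot)) / nrow(pilot)
  floor(budget_bytes / bytes_per_row)
}
```

The size estimate is only a starting point: model fitting typically needs
several times the memory of the data itself (for instance when a design matrix
is built from categorical columns), so a budget well below the installed RAM
seems prudent.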

Appreciate your thoughts greatly.


> R.version

platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          15.1
year           2012
month          06
day            22
svn rev        59600
language       R
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows


tags: read.csv(), system(), awk, sed, sample(), glmnet, multinomial, MASS.

Yoda


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
