Re: [R] Importing random subsets of a data file

Khurram Nadeem Wed, 23 Jul 2014 11:47:50 -0700

It is great to see so many nice resources available. Thanks for the
suggestions and for directing me to useful solutions. Using the 'awk' code
within R seems very promising for my problem. Also, I am looking into
reading the random samples from an SQLite database, as suggested by Greg.
As my algorithm runs independently on each random sample, implementing the
whole experiment in parallel (e.g. using the snow package) would further
speed up computations.
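
A rough sketch of the two database schemes Greg describes in the quoted
message below, in case it helps others reading the thread. It is written in
Python with the standard-library sqlite3 module only so that it runs without
extra packages; the same SQL could be sent from R through an SQLite package.
The table name big, the columns row_id, x and rand_key, the sample size of
50, and both helper functions are made up for illustration and do not come
from the thread.

```python
import random
import sqlite3

# Hypothetical stand-in for the big file: a small in-memory table with a
# sequential row_id column (the "row numbers" in Greg's message) and a
# rand_key column holding one random number per row.
N_ROWS = 1000      # the real file has 1 million rows; kept small here
SAMPLE_SIZE = 50   # assumed; the thread does not state a sample size

rng = random.Random(2014)
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE big (row_id INTEGER PRIMARY KEY, x REAL, rand_key REAL)"
)
con.executemany(
    "INSERT INTO big (row_id, x, rand_key) VALUES (?, ?, ?)",
    ((i, rng.gauss(0, 1), rng.random()) for i in range(1, N_ROWS + 1)),
)


def sample_by_row_ids(con, n, n_rows, rng):
    """Scheme 1: rows may recur in later samples.

    Choose n distinct row numbers, then ask the database for those rows.
    """
    ids = rng.sample(range(1, n_rows + 1), n)
    placeholders = ",".join("?" * n)
    query = f"SELECT row_id, x FROM big WHERE row_id IN ({placeholders})"
    return con.execute(query, ids).fetchall()


# Scheme 2 needs an index on the random column so that reading rows in
# rand_key order does not require sorting the table on every query.
con.execute("CREATE INDEX idx_rand_key ON big (rand_key)")


def sample_block(con, k, n):
    """Scheme 2: no row is ever reused.

    The k-th sample (counting from 0) is the k-th block of n rows in
    rand_key order; row_id breaks any ties so the order is stable.
    """
    query = (
        "SELECT row_id, x FROM big "
        "ORDER BY rand_key, row_id LIMIT ? OFFSET ?"
    )
    return con.execute(query, (n, k * n)).fetchall()
```

Calling sample_block(con, k, SAMPLE_SIZE) for k = 0, 1, 2, ... gives
non-overlapping samples. With 1 million rows and 10000 such samples, each
sample could hold at most 100 rows, so the first scheme is the one to use
when larger samples are needed.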


--
Khurram


On Wed, Jul 23, 2014 at 1:56 PM, Greg Snow <538...@gmail.com> wrote:

> For speed your best choice is probably to load your data into a
> database, then pull your samples from the database.  A simple database
> is SQLite and there are R packages that work directly with that
> database.
>
> Can the later samples contain some of the same rows as previous
> samples?  Or once a row is used in a sample, it can never be used
> again in a later sample?  If the former you could use R to choose a
> sample of "row numbers" then ask the database for those rows (some
> databases have the concept of rows built in, others would need a
> sequential column of "row numbers" added), then repeat for each
> sample.  If the latter then you could add a column to the database
> based on randomly generated numbers and create an index (sort) by that
> column, then select the 1st n observations as the 1st sample, the next
> n observations as the 2nd sample, etc.
>
> On Wed, Jul 23, 2014 at 9:33 AM, Khurram Nadeem <khurram.na...@gmail.com>
> wrote:
> > Hi R folks,
> >
> > Here is my problem.
> >
> > *1.* I have a large data file (say, in .csv or .txt format) containing 1
> > million rows and 500 variables (columns).
> >
> > *2.* My statistical algorithm does not require the entire dataset but just
> > a small random sample from the original 1 million rows.
> >
> > *3.* This algorithm needs to be applied 10000 times, each time generating a
> > different random sample from the 'big' file as described in (2) above.
> >
> > Is there a way to 'import' only a (random) subset of rows from the .csv
> > file without importing the entire dataset? A quick search on various R
> > forums suggests that read.table() does not have this functionality.
> > Obviously, I want to avoid importing the whole file because of memory
> > issues. Looking forward to your help.
> >
> > Thanks,
> > Khurram
> > ------------------------
> >  Khurram Nadeem
> >  Postdoctoral Research Fellow
> >  Department of Mathematics & Statistics
> >  Acadia University, NS, Canada.
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Gregory (Greg) L. Snow Ph.D.
> 538...@gmail.com
>

