Re: [R] Importing random subsets of a data file

2014-07-23 Thread James White
Here's a Stack Overflow question addressing the same issue: http://stackoverflow.com/a/22261345 Hopefully it will help. Thanks > Date: Wed, 23 Jul 2014 12:33:11 -0300 > From: khurram.na...@gmail.com > To: r-help@r-project.org > Subject: [R] Importing random subsets of a data fil

Re: [R] Importing random subsets of a data file

2014-07-23 Thread Khurram Nadeem
It is great to see so many nice resources available. Thanks for the suggestions and for directing me to useful solutions. Using the 'awk' code within R seems very promising for my problem. Also, I am looking into reading the random samples from an SQLite database, as indicated by Greg. As my algorithm runs

Re: [R] Importing random subsets of a data file

2014-07-23 Thread Greg Snow
For speed, your best choice is probably to load your data into a database, then pull your samples from the database. A simple database is SQLite, and there are R packages that work directly with that database. Can the later samples contain some of the same rows as previous samples? Or once a row i
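
A minimal sketch of the database approach, assuming the DBI and RSQLite packages are installed. The table name `big`, the helper `draw_sample`, and the small stand-in data frame are hypothetical and not from the thread; a real run would load the 1-million-row file into an on-disk database once.

```r
# Assumes the DBI and RSQLite packages are installed.
library(DBI)

# Small stand-in for the large data set described in the thread.
big <- data.frame(id = 1:1000, x = rnorm(1000))

# ":memory:" keeps the sketch self-contained; use a file path for real data.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "big", big)   # one-time load into the database

# Hypothetical helper: draw k rows at random from the table `big`.
draw_sample <- function(con, k) {
  sql <- sprintf("SELECT * FROM big ORDER BY RANDOM() LIMIT %d", as.integer(k))
  dbGetQuery(con, sql)
}

s1 <- draw_sample(con, 10)
s2 <- draw_sample(con, 10)   # independent draw; may share rows with s1

dbDisconnect(con)
```

Each call returns rows that are distinct within a sample, but separate calls are independent, so later samples can repeat earlier rows. `ORDER BY RANDOM()` sorts the whole table on every call, which may be slow on very large tables.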

Re: [R] Importing random subsets of a data file

2014-07-23 Thread David Winsemius
I think an external program like awk (or gawk) would be better. You can call it with the R system() function if needed. http://stackoverflow.com/questions/7514896/select-random-3000-lines-from-a-file-with-awk-codes You might want to sample once and then break into sequential subsets rather than
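
The "sample once, then break into sequential subsets" idea can be sketched in plain R as follows. The sample size `k` and the number of samples `n_samples` are hypothetical values chosen for illustration; only the row count comes from the original question.

```r
n_rows    <- 1000000  # rows in the data file (from the original question)
k         <- 1000     # hypothetical size of each sample
n_samples <- 5        # hypothetical number of samples the algorithm needs

set.seed(42)  # only to make the sketch reproducible

# One draw without replacement, large enough to cover every sample.
idx <- sample(n_rows, k * n_samples)

# Cut the single draw into sequential chunks of k row numbers each.
batches <- split(idx, rep(seq_len(n_samples), each = k))
```

Each element of `batches` is a vector of row numbers, and no row appears in more than one batch. The row numbers still have to be turned into data by some reader, such as awk or a database query.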

Re: [R] Importing random subsets of a data file

2014-07-23 Thread Sarah Goslee
Hi, You can use scan() with the nlines and skip arguments to read in a single line from anywhere in a file. Sarah On Wed, Jul 23, 2014 at 11:33 AM, Khurram Nadeem wrote: > Hi R folks, > > Here is my problem. > > *1.* I have a large data file (say, in .csv or .txt format) containing 1 > million
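
A minimal sketch of the scan() suggestion, assuming a comma-separated file with one header line, simple unquoted fields, and a known number of data rows. The helper name `read_random_rows` and its arguments are hypothetical, not from the thread.

```r
# Hypothetical helper: read k randomly chosen data rows from a delimited
# file without loading the whole file into memory.
read_random_rows <- function(path, n_rows, k, sep = ",") {
  # The first line holds the column names.
  header <- scan(path, what = character(), sep = sep, nlines = 1, quiet = TRUE)

  # Row numbers to read, drawn without replacement.
  rows <- sort(sample(n_rows, k))

  fields <- lapply(rows, function(i) {
    # skip = i skips the header plus the first i - 1 data rows.
    scan(path, what = character(), sep = sep, skip = i, nlines = 1,
         quiet = TRUE)
  })

  out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(out) <- header
  # Fields were read as text; convert numeric-looking columns back.
  out[] <- lapply(out, type.convert, as.is = TRUE)
  out
}
```

A call might look like `read_random_rows("data.csv", n_rows = 1000000, k = 100)`, where the file name is a placeholder. Every scan() call starts reading from the top of the file, so this suits small samples.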

[R] Importing random subsets of a data file

2014-07-23 Thread Khurram Nadeem
Hi R folks, Here is my problem. *1.* I have a large data file (say, in .csv or .txt format) containing 1 million rows and 500 variables (columns). *2.* My statistical algorithm does not require the entire dataset but just a small random sample from the original 1 million rows. *3.* This algorit