R Community:

At the risk of getting my hands slapped by posting "too much" on the forum, I've described the strategy for reading only certain portions of huge .csv files below.

I think this may well be of interest to others... I'm sure that I'm not alone in needing to read only certain variables (i.e., columns) from VERY huge .csv files.

It has been suggested by Charles Berry, Ted Harding, and Brian Ripley to use the unix "cut" command along with the R pipe() function. Their advice has been invaluable.

As I've written the code so far, I'm finding that the "cut" command is not reading the file properly... or at least not in the manner that I'm expecting.

Here was my strategy:
*STEP 1. read the whole huge file --- (almost impossible! even with a very good computer!)*
*STEP 2. use the pipe and cut commands to read only the desired columns of the file*
*STEP 3. compare results by tabulating a variable from the whole file with the file obtained in (2)*

I found that the comparison gave different tabulations!  :-(

I've provided my code below. I'd be quite grateful for suggestions on how to fix this.

My sincere thanks to all who have or will provide guidance on this problem.

Phil Smith
Duluth, GA

*## STEP 1: read the whole huge file*
##
## read the whole file
##
   your.file    <-    c("/home/philipsmith/mydata.csv")
   dat        <-    read.csv( file = your.file )

##
## read the names from the 1st line of the whole file
## that line contains all of the variable names
##
col.namz <- c( scan( your.file , what=character(0), nlines=1 , sep=",") )

##
## check to see whether  all of the column names from the whole file
## are the same as in col.namz
##
    all( col.namz == names(dat))

##
## they are!! :-)
##

*## STEP 2: use the pipe and cut commands to read only the desired columns of the file*
##
## designate which variable names are to be read
## using the unix command "cut" and the function pipe()
##
   colz    <-    c("ESTIAP07" )

##
## find the column numbers in the whole file that correspond to
## the variables designated to be read by the unix command
## and specified in the colz vector
##

   col.pos     <-     match( colz , col.namz , nomatch=0 )
   ##
   ## the following line is commented out,
   ## since for this example the number of designated variables
   ## by colz is only 1 variable
   ##
   ## col.pos        <-    paste( col.pos , collapse=',' )

##
## character string of file name for unix read with cut function
##
   fn        <-    c("/home/philipsmith/mydata.csv")

##
## create a character vector of the unix command
##
unix.cmd <- paste( "cut -d, -f" , col.pos , " " , fn , sep = '' )
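For more than one variable, the commented-out paste( col.pos , collapse=',' ) line is what builds the field list. A minimal sketch of that case follows. The column names other than ESTIAP07 are hypothetical, and the use of shQuote() and of an explicit stop() on unmatched names is my own assumption about what is wanted. With nomatch=0 an unmatched name would end up as field 0 in the command, which cut rejects because its fields are numbered from 1.

```r
## Hypothetical inputs, for illustration only
col.namz <- c("ID", "ESTIAP07", "STATE", "AGE")
colz     <- c("ESTIAP07", "AGE")
fn       <- "/home/philipsmith/mydata.csv"

## positions of the wanted columns; NA marks a name that was not found
col.pos <- match(colz, col.namz)
if (anyNA(col.pos)) {
  stop("column(s) not found: ", paste(colz[is.na(col.pos)], collapse = ", "))
}

## cut writes fields in file order, whatever order is given to -f,
## so sort the positions to keep the command easy to read
field.list <- paste(sort(col.pos), collapse = ",")

## shQuote() protects file names that contain spaces
unix.cmd <- paste0("cut -d, -f", field.list, " ", shQuote(fn))
unix.cmd
```

The resulting string can be passed to pipe() exactly as in the single-variable case.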

##
## read the designated columns, only, from the whole file
## using pipe() and the unix command cut
##
   gnu.dat        <-    read.csv( pipe ( description=unix.cmd ) )
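As a cross-check that does not depend on cut, the same column can be selected inside R with the colClasses argument of read.csv(), where "NULL" skips a column. This is a different technique from the pipe() approach above, not a fix for it. The sketch below uses a small made-up file, and the names tmp, hdr, keep, cls and r.dat are only for illustration. read.csv() still parses every line, so it may be slower than cut on a huge file, but it does honor quoted fields.

```r
## Small hypothetical file standing in for the real one
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,NAME,ESTIAP07",
             "1,\"Smith, Phil\",13",
             "2,Jones,27"), tmp)

## read one row only, to learn the column names
hdr <- names(read.csv(tmp, nrows = 1))

## "NULL" skips a column; NA lets read.csv guess the type
keep <- c("ESTIAP07")
cls  <- ifelse(hdr %in% keep, NA, "NULL")

## only the kept column is stored in the result
r.dat <- read.csv(tmp, colClasses = cls)
r.dat$ESTIAP07
```

Tabulating r.dat$ESTIAP07 from the real file and comparing it with table( dat$ESTIAP07 ) would show whether the difference comes from cut or from somewhere else.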



*## STEP 3. compare results by tabulating a variable from the whole file with the file obtained in (2)*
##
## tabulate the designated variable from the whole file
##
   table( dat$ESTIAP07 )

##
## tabulate the designated variable from the file
## that has the designated columns, only
##
   table( gnu.dat$ESTIAP07 )

> table( dat$ESTIAP07 )

  1   2   4   5   6   7   8  10  11  12  13  14  16  17  18  19  20  22  24  25
340 278 304 319 334 295 405 342 519 474 413 476 511 322 517 393 364 377 447 425
 27  28  29  30  31  34  35  36  37  38  40  41  44  46  47  49  50  51  52  53
462 382 368 502 385 494 454 497 484 385 360 419 355 466 461 369 372 431 384 331
 54  55  56  57  58  59  60  61  62  63  64  65  66  68  69  72  73  74  75  76
478 468 348 323 363 287 322 364 317 363 423 337 409 312 370 360 348 309 244 300
77  79  80 773
307 454 445 340
>
> ##
> ## tabulate the designated variable from the file
> ## that has the designated columns, only
> ##
> table( gnu.dat$ESTIAP07 )

  1   2   3   4   5   6   7   8  10  11  12  13  14  16  17  18  19  20  22  24
342 291   1 308 319 334 295 405 341 518 471 413 476 511 322 517 393 363 377 446
 25  27  28  29  30  31  34  35  36  37  38  40  41  44  46  47  49  50  51  52
425 461 382 368 502 385 494 454 496 483 385 360 419 354 466 461 369 371 431 384
 53  54  55  56  57  58  59  60  61  62  63  64  65  66  68  69  72  73  74  75
331 478 467 348 322 363 287 320 364 317 363 423 337 408 312 368 360 347 309 243
76  77  79  80 157 773
300 307 454 445   1 340
> ?pipe
>
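One possible explanation worth checking, which is an assumption and has not been verified against this file: cut treats every comma as a separator, while read.csv() ignores commas inside double-quoted fields. On a line with a quoted comma, cut's field numbers shift by one and a value from a neighbouring column lands in ESTIAP07. That would be consistent with a few stray values (such as 3 and 157) and slightly lower counts elsewhere. A minimal sketch for locating such lines with count.fields(), using a small made-up file:

```r
## Small hypothetical file; line 3 has a comma inside a quoted field
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,NAME,ESTIAP07",
             "1,Jones,13",
             "2,\"Smith, Phil\",27",
             "3,Brown,5"), tmp)

## fields per line as read.csv sees them (quotes respected)
n.quoted <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")

## fields per line as cut sees them (every comma is a separator)
n.raw    <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

## line numbers (header included) where the two counts disagree
bad.lines <- which(n.quoted != n.raw)
bad.lines
```

If bad.lines is empty for the real file, quoted commas are not the cause and something else (for example, line breaks inside quoted fields) would need to be looked at.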

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
