Thanks, all. Modifying the code to use rbind instead of merge worked like a charm, though I can only speculate as to why. Best,
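[A likely explanation, sketched on small invented frames: merge(x, y, all = TRUE) joins on every column name the frames share, so repeatedly merging many frames keeps re-matching rows, while rbind simply appends them. The frames `a` and `b` below are hypothetical stand-ins for two of the per-plot csvs.]

```r
# Two stand-in frames mimicking the csvs: four shared columns plus a
# fifth column whose name differs per file.
a <- data.frame(x = 1:3, y = 1:3, z = 1:3, depth = 1:3, amp1 = runif(3))
b <- data.frame(x = 4:6, y = 4:6, z = 4:6, depth = 4:6, amp2 = runif(3))

# merge(all = TRUE) joins on the shared columns x, y, z, depth --
# every pair of frames gets re-matched, which grows expensive fast.
m <- merge(a, b, all = TRUE)

# rbind just stacks rows once the column names are made to agree:
names(a)[5] <- "amplitude"
names(b)[5] <- "amplitude"
r <- rbind(a, b)
nrow(r)   # 6 rows, no matching work done
```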
*Ben Caldwell*
PhD Candidate
University of California, Berkeley
130 Mulford Hall #3114
Berkeley, CA 94720
Office 223 Mulford Hall
(510) 859-3358

On Sat, Nov 3, 2012 at 2:19 PM, jim holtman <jholt...@gmail.com> wrote:
> It is easier than that. I forgot I can do it entirely within R:
>
> setwd("/temp/csv")
> files <- Sys.glob("daily*csv")
> output <- file('Rcombined.csv', 'w')
> for (i in files){
>     cat(i, '\n')             # write out file being processed
>     input <- readLines(i)
>     input <- input[-1L]      # delete header
>     writeLines(input, output)
> }
> close(output)
>
> On Sat, Nov 3, 2012 at 4:56 PM, jim holtman <jholt...@gmail.com> wrote:
> > These are not commands, but programs you can use. Here is a file copy
> > program in "perl" (I spelt it wrong in the earlier email). It will copy
> > all the files that have "daily" in their names. It also skips the first
> > line of each file, assuming that it is the header.
> >
> > perl can be found on most systems. www.activestate.com has a version
> > that runs under Windows, which is what I am using.
> >
> > chdir "/temp/csv";             # my directory with files
> > @files = glob "daily*csv";     # get files to copy (daily......csv)
> > open OUTPUT, ">combined.csv";  # output file
> > # loop for each file
> > foreach $file (@files) {
> >     print $file, "\n";         # print file being processed
> >     open INPUT, "<" . $file;
> >     # assume that the first line is a header, so skip it
> >     $header = <INPUT>;
> >     @all = <INPUT>;            # read rest of the file
> >     close INPUT;
> >     print OUTPUT @all;         # append to the output
> > }
> > close OUTPUT;
> >
> > Here is what was printed on the console:
> >
> > C:\Users\Owner>perl copyFiles.pl
> > daily.BO.csv
> > daily.C.csv
> > daily.CL.csv
> > daily.CT.csv
> > daily.GC.csv
> > daily.HO.csv
> > daily.KC.csv
> > daily.LA.csv
> > daily.LN.csv
> > daily.LP.csv
> > daily.LX.csv
> > daily.NG.csv
> > daily.S.csv
> > daily.SB.csv
> > daily.SI.csv
> > daily.SM.csv
> >
> > which is the list of all the files copied.
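[Jim's pure-R loop above streams raw lines, which keeps memory use flat. If the result is wanted as a data frame in the same session rather than a combined file on disk, the same idea can be sketched with read.csv and rbind. The two small csvs written to a temporary directory here are invented stand-ins for Jim's daily*.csv files.]

```r
# Stand-in files playing the role of the daily*.csv inputs.
dir <- tempdir()
write.csv(data.frame(a = 1:2, b = 3:4),
          file.path(dir, "daily.AA.csv"), row.names = FALSE)
write.csv(data.frame(a = 5:6, b = 7:8),
          file.path(dir, "daily.BB.csv"), row.names = FALSE)

files <- Sys.glob(file.path(dir, "daily*csv"))

# read.csv consumes the header itself, so there is no need to drop
# the first line by hand as in the readLines version.
combined <- do.call(rbind, lapply(files, read.csv))
nrow(combined)   # 4: both two-row files stacked
```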
> > On Sat, Nov 3, 2012 at 4:08 PM, Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >> Jim,
> >>
> >> Where can I find documentation of the commands you mention?
> >> Thanks
> >>
> >> On Sat, Nov 3, 2012 at 12:15 PM, jim holtman <jholt...@gmail.com> wrote:
> >>> A faster way would be to use something like 'per', 'awk' or 'sed'.
> >>> You can strip off the header line of each CSV (if it has one) and then
> >>> concatenate the files together. This is a very efficient use of memory,
> >>> since you are just reading one file at a time and then writing it out.
> >>> It will probably be a lot faster, since no conversions have to be done.
> >>> Once you have the one large file, you can play with it (load it if you
> >>> have enough memory, or load it into a database).
> >>>
> >>> On Sat, Nov 3, 2012 at 11:37 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >>> > In the absence of any data examples from you per the posting guidelines,
> >>> > I will refer you to the help files for the melt function in the reshape2
> >>> > package. Note that there can be various mixtures of wide versus long,
> >>> > such as a wide file with one date column and columns representing all
> >>> > stock prices and all trade volumes. The longest format would be what
> >>> > melt gives (date, column name, and value), but an in-between format
> >>> > would have one distinct column each for dollar values and volume values,
> >>> > with a column indicating ticker label and, of course, another for date.
> >>> >
> >>> > If your csv files can be grouped according to those with similar column
> >>> > "types", then as you read them in you can use cbind(csvlabel="somelabel",
> >>> > csvdf) to distinguish each one, and then rbind those data frames together
> >>> > to create an intermediate-width data frame.
> >>> > When dealing with large amounts of data you will want to minimize the
> >>> > amount of reshaping you do, but it would require knowledge of your data
> >>> > and algorithms to say any more.
> >>> >
> >>> > ---------------------------------------------------------------------------
> >>> > Jeff Newmiller <jdnew...@dcn.davis.ca.us>
> >>> > Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> >>> > ---------------------------------------------------------------------------
> >>> > Sent from my phone. Please excuse my brevity.
> >>> >
> >>> > Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >>> >
> >>> >> Jeff,
> >>> >> If you're willing to educate, I'd be happy to learn what wide vs long
> >>> >> format means. I'll give rbind a shot in the meantime.
> >>> >> Ben
> >>> >>
> >>> >> On Nov 2, 2012 4:31 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:
> >>> >>
> >>> >>> I would first confirm that you need the data in wide format... many
> >>> >>> algorithms are more efficient in long format anyway, and rbind is way
> >>> >>> more efficient than merge.
> >>> >>>
> >>> >>> If you feel this is not negotiable, you may want to consider sqldf.
> >>> >>> Yes, you need to learn a bit of SQL, but it is very well integrated
> >>> >>> into R.
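[Jeff's wide-versus-long distinction can be made concrete with a toy frame. The column names and values below are invented for illustration; melt is from the reshape2 package he mentions, which is assumed to be installed.]

```r
library(reshape2)

# Wide format: one date column, one column per ticker.
wide <- data.frame(date = as.Date("2012-11-01") + 0:2,
                   AAPL = c(596, 585, 577),
                   MSFT = c(29.5, 29.9, 29.5))

# Long format: melt gives (date, column name, value) -- the longest shape.
long <- melt(wide, id.vars = "date",
             variable.name = "ticker", value.name = "price")

# Jeff's cbind-a-label idea, for distinguishing groups of similar files
# before rbind-ing them into an intermediate-width frame:
labelled <- cbind(csvlabel = "somelabel", wide)
```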
> >>> >>> Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >>> >>>
> >>> >>> > Dear R help,
> >>> >>> >
> >>> >>> > I'm currently trying to combine a large number (about 30 x 30) of
> >>> >>> > large .csvs together (each at least 10000 records). They are
> >>> >>> > organized by plots, hence 30 x 30, with each group of csvs in a
> >>> >>> > folder which corresponds to the plot. The unmerged csvs all have
> >>> >>> > the same number of columns (5). The fifth column has a different
> >>> >>> > name for each csv. The number of rows is different.
> >>> >>> >
> >>> >>> > The combined csvs are of course quite large, and the code I'm
> >>> >>> > running is quite slow. I'm currently running it on a computer with
> >>> >>> > 10 GB RAM, an SSD, and a quad-core 2.3 GHz processor; it's taken
> >>> >>> > 8 hours and it's only 75% of the way through (it's been hung up on
> >>> >>> > one of the largest data groupings for an hour now, using 3.5 GB
> >>> >>> > of RAM).
> >>> >>> >
> >>> >>> > I know that R isn't the most efficient way of doing this, but I'm
> >>> >>> > not familiar with SQL or C. I wonder if anyone has suggestions for
> >>> >>> > a different way to do this in the R environment. For instance, the
> >>> >>> > key function now is merge, but I haven't tried join from the plyr
> >>> >>> > package or rbind from base. I'm willing to provide a dropbox link
> >>> >>> > to a couple of these files if you'd like to see the data.
> >>> >>> > My code is as follows:
> >>> >>> >
> >>> >>> > # multmerge is based on code by Tony Cookson,
> >>> >>> > # http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
> >>> >>> > # The function takes a path. This path should be the name of a
> >>> >>> > # folder that contains all of the files you would like to read and
> >>> >>> > # merge together, and only those files you would like to merge.
> >>> >>> >
> >>> >>> > multmerge = function(mypath){
> >>> >>> >   filenames = list.files(path = mypath, full.names = TRUE)
> >>> >>> >   datalist = try(lapply(filenames,
> >>> >>> >                         function(x){read.csv(file = x, header = TRUE)}))
> >>> >>> >   try(Reduce(function(x, y) {merge(x, y, all = TRUE)}, datalist))
> >>> >>> > }
> >>> >>> >
> >>> >>> > # this function renames columns using a fixed list and outputs a .csv
> >>> >>> >
> >>> >>> > merepk <- function(path, nf.name) {
> >>> >>> >   output <- multmerge(mypath = path)
> >>> >>> >   name <- list("x", "y", "z", "depth", "amplitude")
> >>> >>> >   try(names(output) <- name)
> >>> >>> >   write.csv(output, nf.name)
> >>> >>> > }
> >>> >>> >
> >>> >>> > # assumes all folders are in the same directory, with nothing else there
> >>> >>> >
> >>> >>> > merge.by.folder <- function(folderpath){
> >>> >>> >   foldernames <- list.files(path = folderpath)
> >>> >>> >   n <- length(foldernames)
> >>> >>> >   setwd(folderpath)
> >>> >>> >   for (i in 1:n){
> >>> >>> >     path <- paste(folderpath, foldernames[i], sep = "\\")
> >>> >>> >     nf.name <- as.character(paste(foldernames[i], ".csv", sep = ""))
> >>> >>> >     merepk(path, nf.name)
> >>> >>> >   }
> >>> >>> > }
> >>> >>> >
> >>> >>> > folderpath <- "yourpath"
> >>> >>> > merge.by.folder(folderpath)
> >>> >>> >
> >>> >>> > Thanks for looking, and happy Friday!
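[The fix that ultimately worked, per the reply at the top of the thread, was replacing the Reduce/merge step with rbind. A sketch of that change, keeping Ben's fixed column names and function shape; `multbind` is an invented name, and it assumes every csv in the folder has exactly five columns.]

```r
multbind <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  datalist <- lapply(filenames, function(x) {
    d <- read.csv(file = x, header = TRUE)
    # The fifth column name differs per csv, so normalize the names
    # before stacking; rbind requires the columns to agree.
    names(d) <- c("x", "y", "z", "depth", "amplitude")
    d
  })
  # rbind stacks rows; merge would join on every shared column instead.
  do.call(rbind, datalist)
}
```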
> >>> >>> >
> >>> >>> > *Ben Caldwell*
> >>> >>> > PhD Candidate
> >>> >>> > University of California, Berkeley
> >>>
> >>> --
> >>> Jim Holtman
> >>> Data Munger Guru
> >>>
> >>> What is the problem that you are trying to solve?
> >>> Tell me what you want to do, not how you want to do it.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.