Thanks, all. Modifying the code to use rbind instead of merge worked like a charm, though I can only speculate as to why. Best,
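[A likely explanation, sketched on small invented frames: merge(x, y, all = TRUE) joins on every column name the frames share, so repeatedly merging many frames keeps re-matching rows, while rbind simply appends them. The frames `a` and `b` below are hypothetical stand-ins for two of the per-plot csvs.]

```r
# Two stand-in frames mimicking the csvs: four shared columns plus a
# fifth column whose name differs per file.
a <- data.frame(x = 1:3, y = 1:3, z = 1:3, depth = 1:3, amp1 = runif(3))
b <- data.frame(x = 4:6, y = 4:6, z = 4:6, depth = 4:6, amp2 = runif(3))

# merge(all = TRUE) joins on the shared columns x, y, z, depth --
# every pair of frames gets re-matched, which grows expensive fast.
m <- merge(a, b, all = TRUE)

# rbind just stacks rows once the column names are made to agree:
names(a)[5] <- "amplitude"
names(b)[5] <- "amplitude"
r <- rbind(a, b)
nrow(r)   # 6 rows, no matching work done
```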
*Ben Caldwell*
PhD Candidate
University of California, Berkeley
130 Mulford Hall #3114
Berkeley, CA 94720
Office 223 Mulford Hall
(510) 859-3358

On Sat, Nov 3, 2012 at 2:19 PM, jim holtman <jholt...@gmail.com> wrote:
> It is easier than that. I forgot I can do it entirely within R:
>
> setwd("/temp/csv")
> files <- Sys.glob("daily*csv")
> output <- file('Rcombined.csv', 'w')
> for (i in files){
>     cat(i, '\n')             # write out file being processed
>     input <- readLines(i)
>     input <- input[-1L]      # delete header
>     writeLines(input, output)
> }
> close(output)
>
> On Sat, Nov 3, 2012 at 4:56 PM, jim holtman <jholt...@gmail.com> wrote:
> > These are not commands, but programs you can use. Here is a file copy
> > program in "perl" (I spelt it wrong in the earlier email). It will copy
> > all the files that have "daily" in their names. It also skips the first
> > line of each file, assuming that it is the header.
> >
> > perl can be found on most systems. www.activestate.com has a version
> > that runs under Windows, which is what I am using.
> >
> > chdir "/temp/csv";             # my directory with files
> > @files = glob "daily*csv";     # get files to copy (daily......csv)
> > open OUTPUT, ">combined.csv";  # output file
> > # loop for each file
> > foreach $file (@files) {
> >     print $file, "\n";         # print file being processed
> >     open INPUT, "<" . $file;
> >     # assume that the first line is a header, so skip it
> >     $header = <INPUT>;
> >     @all = <INPUT>;            # read rest of the file
> >     close INPUT;
> >     print OUTPUT @all;         # append to the output
> > }
> > close OUTPUT;
> >
> > Here is what was printed on the console:
> >
> > C:\Users\Owner>perl copyFiles.pl
> > daily.BO.csv
> > daily.C.csv
> > daily.CL.csv
> > daily.CT.csv
> > daily.GC.csv
> > daily.HO.csv
> > daily.KC.csv
> > daily.LA.csv
> > daily.LN.csv
> > daily.LP.csv
> > daily.LX.csv
> > daily.NG.csv
> > daily.S.csv
> > daily.SB.csv
> > daily.SI.csv
> > daily.SM.csv
> >
> > which is the list of all the files copied.
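[Jim's pure-R loop above streams raw lines, which keeps memory use flat. If the result is wanted as a data frame in the same session rather than a combined file on disk, the same idea can be sketched with read.csv and rbind. The two small csvs written to a temporary directory here are invented stand-ins for Jim's daily*.csv files.]

```r
# Stand-in files playing the role of the daily*.csv inputs.
dir <- tempdir()
write.csv(data.frame(a = 1:2, b = 3:4),
          file.path(dir, "daily.AA.csv"), row.names = FALSE)
write.csv(data.frame(a = 5:6, b = 7:8),
          file.path(dir, "daily.BB.csv"), row.names = FALSE)

files <- Sys.glob(file.path(dir, "daily*csv"))

# read.csv consumes the header itself, so there is no need to drop
# the first line by hand as in the readLines version.
combined <- do.call(rbind, lapply(files, read.csv))
nrow(combined)   # 4: both two-row files stacked
```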
> > On Sat, Nov 3, 2012 at 4:08 PM, Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >> Jim,
> >>
> >> Where can I find documentation of the commands you mention?
> >> Thanks
> >>
> >> On Sat, Nov 3, 2012 at 12:15 PM, jim holtman <jholt...@gmail.com> wrote:
> >>> A faster way would be to use something like 'per', 'awk' or 'sed'.
> >>> You can strip off the header line of each CSV (if it has one) and then
> >>> concatenate the files together. This is a very efficient use of memory,
> >>> since you are just reading one file at a time and then writing it out.
> >>> It will probably be a lot faster, since no conversions have to be done.
> >>> Once you have the one large file, you can play with it (load it if you
> >>> have enough memory, or load it into a database).
> >>>
> >>> On Sat, Nov 3, 2012 at 11:37 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >>> > In the absence of any data examples from you per the posting guidelines,
> >>> > I will refer you to the help files for the melt function in the reshape2
> >>> > package. Note that there can be various mixtures of wide versus long,
> >>> > such as a wide file with one date column and columns representing all
> >>> > stock prices and all trade volumes. The longest format would be what
> >>> > melt gives (date, column name, and value), but an in-between format
> >>> > would have one distinct column each for dollar values and volume values,
> >>> > with a column indicating ticker label and, of course, another for date.
> >>> >
> >>> > If your csv files can be grouped according to those with similar column
> >>> > "types", then as you read them in you can use cbind(csvlabel="somelabel",
> >>> > csvdf) to distinguish each one, and then rbind those data frames together
> >>> > to create an intermediate-width data frame.
> >>> > When dealing with large amounts of data you will want to minimize the
> >>> > amount of reshaping you do, but it would require knowledge of your data
> >>> > and algorithms to say any more.
> >>> >
> >>> > ---------------------------------------------------------------------------
> >>> > Jeff Newmiller <jdnew...@dcn.davis.ca.us>
> >>> > Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> >>> > ---------------------------------------------------------------------------
> >>> > Sent from my phone. Please excuse my brevity.
> >>> >
> >>> > Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >>> >
> >>> >> Jeff,
> >>> >> If you're willing to educate, I'd be happy to learn what wide vs long
> >>> >> format means. I'll give rbind a shot in the meantime.
> >>> >> Ben
> >>> >>
> >>> >> On Nov 2, 2012 4:31 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us> wrote:
> >>> >>
> >>> >>> I would first confirm that you need the data in wide format... many
> >>> >>> algorithms are more efficient in long format anyway, and rbind is way
> >>> >>> more efficient than merge.
> >>> >>>
> >>> >>> If you feel this is not negotiable, you may want to consider sqldf.
> >>> >>> Yes, you need to learn a bit of SQL, but it is very well integrated
> >>> >>> into R.
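[Jeff's wide-versus-long distinction can be made concrete with a toy frame. The column names and values below are invented for illustration; melt is from the reshape2 package he mentions, which is assumed to be installed.]

```r
library(reshape2)

# Wide format: one date column, one column per ticker.
wide <- data.frame(date = as.Date("2012-11-01") + 0:2,
                   AAPL = c(596, 585, 577),
                   MSFT = c(29.5, 29.9, 29.5))

# Long format: melt gives (date, column name, value) -- the longest shape.
long <- melt(wide, id.vars = "date",
             variable.name = "ticker", value.name = "price")

# Jeff's cbind-a-label idea, for distinguishing groups of similar files
# before rbind-ing them into an intermediate-width frame:
labelled <- cbind(csvlabel = "somelabel", wide)
```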
> >>> >>> Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >>> >>>
> >>> >>> > Dear R help,
> >>> >>> >
> >>> >>> > I'm currently trying to combine a large number (about 30 x 30) of
> >>> >>> > large .csvs together (each at least 10000 records). They are
> >>> >>> > organized by plots, hence 30 x 30, with each group of csvs in a
> >>> >>> > folder which corresponds to the plot. The unmerged csvs all have
> >>> >>> > the same number of columns (5). The fifth column has a different
> >>> >>> > name for each csv. The number of rows is different.
> >>> >>> >
> >>> >>> > The combined csvs are of course quite large, and the code I'm
> >>> >>> > running is quite slow. I'm currently running it on a computer with
> >>> >>> > 10 GB RAM, an SSD, and a quad-core 2.3 GHz processor; it's taken
> >>> >>> > 8 hours and it's only 75% of the way through (it's been hung up on
> >>> >>> > one of the largest data groupings for an hour now, using 3.5 GB
> >>> >>> > of RAM).
> >>> >>> >
> >>> >>> > I know that R isn't the most efficient way of doing this, but I'm
> >>> >>> > not familiar with SQL or C. I wonder if anyone has suggestions for
> >>> >>> > a different way to do this in the R environment. For instance, the
> >>> >>> > key function now is merge, but I haven't tried join from the plyr
> >>> >>> > package or rbind from base. I'm willing to provide a dropbox link
> >>> >>> > to a couple of these files if you'd like to see the data.
> >>> >>> > My code is as follows:
> >>> >>> >
> >>> >>> > # multmerge is based on code by Tony Cookson,
> >>> >>> > # http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
> >>> >>> > # The function takes a path. This path should be the name of a
> >>> >>> > # folder that contains all of the files you would like to read and
> >>> >>> > # merge together, and only those files you would like to merge.
> >>> >>> >
> >>> >>> > multmerge = function(mypath){
> >>> >>> >   filenames = list.files(path = mypath, full.names = TRUE)
> >>> >>> >   datalist = try(lapply(filenames,
> >>> >>> >                         function(x){read.csv(file = x, header = TRUE)}))
> >>> >>> >   try(Reduce(function(x, y) {merge(x, y, all = TRUE)}, datalist))
> >>> >>> > }
> >>> >>> >
> >>> >>> > # this function renames columns using a fixed list and outputs a .csv
> >>> >>> >
> >>> >>> > merepk <- function(path, nf.name) {
> >>> >>> >   output <- multmerge(mypath = path)
> >>> >>> >   name <- list("x", "y", "z", "depth", "amplitude")
> >>> >>> >   try(names(output) <- name)
> >>> >>> >   write.csv(output, nf.name)
> >>> >>> > }
> >>> >>> >
> >>> >>> > # assumes all folders are in the same directory, with nothing else there
> >>> >>> >
> >>> >>> > merge.by.folder <- function(folderpath){
> >>> >>> >   foldernames <- list.files(path = folderpath)
> >>> >>> >   n <- length(foldernames)
> >>> >>> >   setwd(folderpath)
> >>> >>> >   for (i in 1:n){
> >>> >>> >     path <- paste(folderpath, foldernames[i], sep = "\\")
> >>> >>> >     nf.name <- as.character(paste(foldernames[i], ".csv", sep = ""))
> >>> >>> >     merepk(path, nf.name)
> >>> >>> >   }
> >>> >>> > }
> >>> >>> >
> >>> >>> > folderpath <- "yourpath"
> >>> >>> > merge.by.folder(folderpath)
> >>> >>> >
> >>> >>> > Thanks for looking, and happy Friday!
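[The fix that ultimately worked, per the reply at the top of the thread, was replacing the Reduce/merge step with rbind. A sketch of that change, keeping Ben's fixed column names and function shape; `multbind` is an invented name, and it assumes every csv in the folder has exactly five columns.]

```r
multbind <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  datalist <- lapply(filenames, function(x) {
    d <- read.csv(file = x, header = TRUE)
    # The fifth column name differs per csv, so normalize the names
    # before stacking; rbind requires the columns to agree.
    names(d) <- c("x", "y", "z", "depth", "amplitude")
    d
  })
  # rbind stacks rows; merge would join on every shared column instead.
  do.call(rbind, datalist)
}
```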
> >>> >>> >
> >>> >>> > *Ben Caldwell*
> >>> >>> > PhD Candidate
> >>> >>> > University of California, Berkeley
> >>>
> >>> --
> >>> Jim Holtman
> >>> Data Munger Guru
> >>>
> >>> What is the problem that you are trying to solve?
> >>> Tell me what you want to do, not how you want to do it.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.