Good suggestion - I'll look into data.table.

On 4/8/24 12:14, CALUM POLWART wrote:
> data.table's fread is also fast. Not sure about error handling. But I
> can merge 300 csvs with a total of 0.5m lines and 50 columns in a
> couple of minutes versus a lifetime with read.csv or readr::read_csv
>
> On Mon, 8 Apr 2024, 16:19 Stevie Pederson,
> <stephen.pederson...@gmail.com> wrote:
>
> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom)
> to be helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like
> that it can take a bit of work resolving things. The command line
> tools awk & sed might even be a good plan for finding lines that have
> errors & figuring out a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon <ddi...@swcp.com> wrote:
>
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I know
> > that some of the records have errors - unmatched quotes,
> > specifically. Reading the file with readLines and parsing the lines
> > with read.csv(text = ...) is really slow. I know that the first
> > 2459465 records are good. So I try this:
> >
> > > startTime <- Sys.time()
> > > first_records <- read.csv(file_name, nrows = 2459465)
> > > endTime <- Sys.time()
> > > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time = 24.12598
> >
> > > startTime <- Sys.time()
> > > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> > > endTime <- Sys.time()
> > > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of magnitude
> > longer than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
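
For reference, here is a minimal base-R sketch of the readLines route mentioned in the PS: skip the known-good lines through an open connection, then hand only the few lines of interest to read.csv(text = ...). The helper name read_csv_slice and its chunk argument are made up for illustration. It assumes the file has a header row and that every record sits on one physical line (no embedded newlines inside quoted fields). It has not been timed against the 4-million-record file discussed in this thread.

```r
# Hypothetical helper (not from any package): return `n` records that follow
# the first `skip` data records of a CSV file with a header row.
# Assumption: one record per physical line.
read_csv_slice <- function(path, skip, n, chunk = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)            # keep the column names
  remaining <- skip
  while (remaining > 0) {                     # discard `skip` lines in chunks
    got <- length(readLines(con, n = min(chunk, remaining)))
    if (got == 0L) break                      # hit end of file early
    remaining <- remaining - got
  }
  body <- readLines(con, n = n)               # the lines we actually want
  read.csv(text = c(header, body))            # parse only this small slice
}

# Small self-contained demonstration on a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = letters[1:10]), tmp, row.names = FALSE)
slice <- read_csv_slice(tmp, skip = 7, n = 2)
print(slice)
```

If the unmatched quotes get in the way of parsing the slice, passing quote = "" to read.csv turns off quote handling, which may make the offending lines easier to inspect.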
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.