> On Nov 6, 2016, at 5:36 AM, Lucas Ferreira Mation <lucasmat...@gmail.com> wrote:
>
> I have some large .txt files, about 100 GB, containing a dataset in fixed-width format. The data contain some errors:
> - character data in columns that are supposed to be numeric,
> - invalid characters,
> - rows with too many characters, possibly due to invalid characters or a missing end-of-line character (so two rows in the original data become one row in the .txt file).
>
> The errors are not very frequent, but they stop me from importing with readr::read_fwf().
>
> Is there some package, or workflow, in R to pre-process the files, separating the valid and invalid rows into different files? This can be done with point-and-click ETL tools such as Pentaho PDI. Is there some equivalent code in R to do this?
>
> I googled it and could not find a solution. I also asked this on StackOverflow and got no answer (here:
> <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>).
Had I seen it there, I would have voted to close that SO question (and just did) as too broad. It is also too vague, given the lack of a definition of "corrupted characters", and furthermore it is basically a request for a package recommendation, which is likewise off-topic on SO.

For the csv part of the task on a smaller file (which you didn't repeat here), I would have pointed you to this answer:

http://stackoverflow.com/questions/19082490/how-can-i-use-r-to-find-malformed-rows-and-fields-in-a-file-too-big-to-read-into/19083665#19083665

For the fwf part (on a file that fits into RAM), I would have suggested wrapping table(nchar( . )) around readLines(filename) to see the distribution of line widths, and then drilling down with which( nchar( . ) == <chosen_line_length> ) to pick out the rows of the expected width (or != to pull out the offending rows). A rough sketch of that approach, adapted to read in chunks, appears at the end of this message.

I believe searching Rhelp will bring up examples of how to handle file input in chunks, which should allow you to cobble together a strategy if you insist on using R ... the wrong tool. If you need to narrow your Rhelp archive search, I suggest adding the name "Jim Holtman", "William Dunlap", or "Gabor Grothendieck", since they frequently have the most elegant strategies, in my opinion. Here's a search strategy implemented via MarkMail:

http://markmail.org/search/?q=list%3Aorg.r-project.r-help+file+chunks+readlines

But for files of the size you contemplate, I would suggest using databases, awk, or other editing software designed for streaming processing from disk. R is not so designed.

-- David.

> regards
> Lucas Mation
> IPEA - Brasil

David Winsemius
Alameda, CA, USA
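In code, a chunked version of that idea might look something like the sketch below. It is untested, and split_fwf, the file names, the expected width, and the chunk size are all placeholders, not an existing function or anything from the thread; it simply streams the file through a connection (so the 100 GB never has to fit in RAM) and routes each row by its line width.

split_fwf <- function(infile, goodfile, badfile, width, chunk = 1e6L) {
  ## open a read connection so successive readLines() calls resume
  ## where the previous chunk ended
  con  <- file(infile, open = "r")
  good <- file(goodfile, open = "w")
  bad  <- file(badfile, open = "w")
  on.exit({ close(con); close(good); close(bad) })

  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    ## count bytes rather than characters so invalid multibyte
    ## sequences don't make nchar() fail
    ok <- nchar(lines, type = "bytes") == width
    writeLines(lines[ok],  con = good)
    writeLines(lines[!ok], con = bad)
  }
  invisible(NULL)
}

## First inspect the distribution of line widths on a sample, e.g.
##   table(nchar(readLines("yourfile.txt", n = 1e5), type = "bytes"))
## and then, if the dominant width turns out to be, say, 120:
##   split_fwf("yourfile.txt", "good.txt", "bad.txt", width = 120)

The good file should then load cleanly with read_fwf(), and the bad file is small enough to inspect and repair by hand or with further scripting.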