Thank you Bert, Jeff and David for the great answers. Let me provide more context to clarify the question:
- I am running this on a large server (512 GB of RAM), so the data still fits into memory (and I also know how to process it in chunks if necessary).

- I agree that a DBMS or other software would be better suited, but our team (part of a government body) is mostly composed of statisticians and economists who know R but not much of other languages, and who do not have admin permissions to set up a DBMS. IT support is available but much scarcer. So a pure R solution may be useful, and I want to understand how much worse or more impractical it is. Performance is a secondary concern.

- I haven't provided more concrete examples of the problems because I did not even know how to find the lines that contain errors. With the suggestions given, I think I will be able to do that.

The mockup of the ideal function I dream of having is something like this (a rough sketch along these lines is appended at the end of this message):

clean_my_fwf <- function(infile, outfile, col_positions, col_types){
  - import the data from infile
  - in each row, check whether the content of each character position corresponds
    to the expected content of that variable given col_positions and col_types;
    this must be true for all columns
  - write all lines that pass this test to outfile
  - return(vector_of_error_row_numbers): the row numbers (in the original dataset)
    of all rows that fail the test
}

where col_positions and col_types follow the syntax of readr::read_fwf. Ideally it could be parallelized and C++ based.

2016-11-06 14:16 GMT-02:00 David Winsemius <dwinsem...@comcast.net>:

>
>> On Nov 6, 2016, at 5:36 AM, Lucas Ferreira Mation <lucasmat...@gmail.com> wrote:
>>
>> I have some large .txt files, about ~100 GB, containing a dataset in fixed-width format. These contain some errors:
>> - character values in columns that are supposed to be numeric,
>> - invalid characters,
>> - rows with too many characters, possibly due to invalid characters or some missing end-of-line character (so two rows in the original data become one row in the .txt file).
>>
>> The errors are not very frequent, but they stop me from importing with readr::read_fwf().
>>
>> Is there some package, or workflow, in R to pre-process the files, separating the valid from the invalid rows into different files? This can be done by ETL point-and-click tools, such as Pentaho PDI. Is there some equivalent code in R to do this?
>>
>> I googled it and could not find a solution. I also asked this on StackOverflow and got no answer (here:
>> <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>).
>
> Had I seen it there I would have voted to close (and just did) that SO question as too broad, although it is also too vague because of the lack of a definition of "corrupted characters", and furthermore basically a request for a package recommendation (which is also off-topic on SO).
>
> For the csv part, on a smaller-file task (which you didn't repeat here), I would have pointed you to this answer:
>
> http://stackoverflow.com/questions/19082490/how-can-i-use-r-to-find-malformed-rows-and-fields-in-a-file-too-big-to-read-into/19083665#19083665
>
> For the fwf part (in a file that fits into RAM), I would have suggested wrapping table(nchar( . )) around readLines(file=filename), and then drilling down with which( nchar( . ) == <chosen_line_length> ).
>
> I believe searching Rhelp will bring up examples of how to handle file input in chunks, which should allow you to cobble together a strategy if you insist on using R ... the wrong tool.
> If you need to narrow your Rhelp archive search, I suggest using the name "Jim Holtman" or "William Dunlap", or "Gabor Grothendieck", since they frequently have the most elegant strategies in my opinion.
>
> Here's a search strategy implemented via MarkMail:
> http://markmail.org/search/?q=list%3Aorg.r-project.r-help+file+chunks+readlines
>
> But for files of the size you contemplate I would suggest using databases, awk, or other editing software that is designed for streaming processing from disk. R is not so designed.
>
> --
> David.
>
>
>>
>> regards
>> Lucas Mation
>> IPEA - Brasil
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
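
A rough sketch of the clean_my_fwf() idea above, in plain single-threaded R (not parallelized or C++ based). It assumes the file fits in memory (everything is read at once with readLines()), takes column positions as plain start/end vectors and column types as "n"/"c" codes rather than readr's col_positions/col_types objects, uses the nchar() line-length test David suggests for malformed rows, and a deliberately simple regex as the numeric check; all of these choices are illustrative, not a finished implementation:

clean_my_fwf <- function(infile, outfile,
                         col_start, col_end,   # integer vectors, one entry per column
                         col_types,            # "n" = numeric, "c" = character
                         expected_width = max(col_end)) {

  lines <- readLines(infile)                   # raw lines, no parsing yet

  ## rows with the wrong number of characters
  ## (missing end-of-line, stray bytes, ...)
  bad_width <- nchar(lines) != expected_width

  ## check the slice of each row belonging to each column against its type
  bad_content <- rep(FALSE, length(lines))
  for (j in seq_along(col_start)) {
    field <- substr(lines, col_start[j], col_end[j])
    if (col_types[j] == "n") {
      ## numeric column: allow only digits, blanks, signs and decimal points
      bad_content <- bad_content | !grepl("^[0-9 .+-]*$", field)
    } else {
      ## character column: reject control (non-printable) characters
      bad_content <- bad_content | grepl("[[:cntrl:]]", field)
    }
  }

  bad <- bad_width | bad_content
  writeLines(lines[!bad], outfile)             # rows that pass every test
  which(bad)                                   # row numbers of the rejected rows
}

## e.g. two columns, positions 1-8 (numeric) and 9-20 (character):
## bad_rows <- clean_my_fwf("raw.txt", "clean.txt",
##                          col_start = c(1, 9), col_end = c(8, 20),
##                          col_types = c("n", "c"))

For a file that does not fit into memory, the same checks could be applied chunk by chunk by calling readLines(con, n = ...) repeatedly on an open connection, along the lines David suggests above.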