On Aug 5, 2013, at 4:11 AM, Asis Hallab wrote:

> Dear R experts,
>
> I have a large table saved in a file called "plant_genome.gff". The
> file has 481848 lines in nine columns, which are TAB delimited, and is
> 53 MegaBytes large.
> For anyone who might know the GFF3 format: The table holds a plant
> genome's annotation.
>
> If I read in the table with
> read.table( "plant_genome.gff" )
> I get the following error
> "line 2 did not have 12 elements".
>
> If I read in the table with
> read.table( "plant_genome.gff", sep="\t" )
> no error or warning is given, but my resulting table has only 193547
> instead of the expected 481848 rows! 60% of the lines are omitted.
>
> Also passing in the arguments
> as.is = TRUE
> or setting the columns' classes with
> colClasses = c( "character", …, "integer", "integer", "numeric",
> "character", … )
> # columns 4, and 5 are integers, column 6 is numeric, all others
> are characters
> does not resolve the problem.
>
> If I read in the file with readLines and then manually split them using
> strsplit(…)

That doesn't unambiguously define the process.

> and combine them into a data.frame with
> as.data.frame( do.call( "rbind", splitted.lines ), colClasses=…)
> I get the expected and correct data.frame, representing my GFF3 data.
>
> My questions are:
> 1) Am I using read.table wrong, or did I miss something in the documentation?
> 2) Or is this a known problem with large TAB delimited tables, whose
> columns contain white-spaces and are not surrounded by quotes?

I would think this is not "a known problem" but rather "entirely expected and documented behavior". The read.table function uses white-space as its default separation rule. The size of the file has nothing to do with it; you would get the same problem with a very small example. If you want tab-separation then use read.delim, which has sep="\t" as its default.

-- 
David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
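
A very small example of the kind mentioned in the reply might look like the sketch below. The four-line file is made up for illustration (it is not the poster's data), and the temporary file name is arbitrary. Besides the separator, read.table defaults to quote = "\"'" and comment.char = "#", whereas read.delim defaults to quote = "\"" and comment.char = "". Apostrophes or '#' characters inside unquoted fields are therefore one plausible reason why rows could go missing even with sep = "\t"; that is an assumption about the original file, not something confirmed in this thread.

```r
# Hypothetical miniature of a GFF3-like file: nine TAB-delimited columns,
# no header, with an apostrophe and a '#' inside the last column.
gff_lines <- c(
  "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1;Note=5' UTR region",
  "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Note=see #2",
  "chr1\tsrc\texon\t1\t50\t0.9\t+\t.\tID=e1;Note=3' end",
  "chr1\tsrc\texon\t60\t100\t.\t+\t.\tID=e2;Parent=m1"
)
tmp <- tempfile(fileext = ".gff")
writeLines(gff_lines, tmp)

# read.table with only sep changed: quote and comment characters are still
# active, so this may return fewer rows than the file has lines, or fail.
naive <- tryCatch(
  suppressWarnings(
    read.table(tmp, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
  ),
  error = function(e) NULL
)
if (is.null(naive)) {
  message("read.table(sep = \"\\t\") failed on this file")
} else {
  message("read.table(sep = \"\\t\") returned ", nrow(naive), " rows")
}

# read.table with quoting and comment handling switched off explicitly.
strict <- read.table(tmp, sep = "\t", quote = "", comment.char = "",
                     header = FALSE, stringsAsFactors = FALSE)

# read.delim: TAB separator and no comment character by default;
# header = FALSE because this file has no header line.
delim <- read.delim(tmp, header = FALSE, stringsAsFactors = FALSE)

message("file lines: ", length(readLines(tmp)),
        ", strict rows: ", nrow(strict),
        ", read.delim rows: ", nrow(delim))
```

Comparing nrow() of the result against length(readLines(...)) is a cheap check that no lines were merged or dropped. Note that read.delim still treats the double quote as a quoting character, so quote = "" may be the safer choice if the file can contain stray double quotes.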