Following private correspondence with Martin Tomko, I think the reason for the problem has been found.
The numbers of ";"-separated fields in the 82 successive lines of his
file are as follows (line:count):

  01:26 02:26 03:33 04:33 05:12 06:12 07:12 08:12
  09:19 10:19 11:17 12:17 13:23 14:23 15:23 16:23
  17:23 18:23 19:23 20:23 21:23 22:23 23:23 24:23
  25:23 26:23 27:23 28:23 29:23 30:23 31:23 32:23
  33:23 34:23 35:23 36:23 37:23 38:23 39:23 40:23
  41:23 42:23 43:23 44:23 45:23 46:23 47:23 48:23
  49:23 50:23 51:23 52:23 53:23 54:23 55:23 56:23
  57:23 58:23 59:23 60:23 61:34 62:34 63:34 64:34
  65:13 66:13 67:38 68:38 69:20 70:20 71:44 72:20
  73:19 74:19 75:20 76:44 77:20 78:19 79:19 80:20
  81:25 82:25

So in the first 5 lines there is a maximum of 33 fields. Hence, since
there is no header line, read.csv() decides to allocate 33 columns
(see ?read.csv).

The distinct numbers of fields occurring in the lines are

  12 13 17 19 20 23 25 26 33 34 38 44

so there are lines with 34, 38 and 44 fields. All lines in the CSV file
end with ";", hence there is an implicit blank field at the end of each
line. The lines with 34 fields have that 34th field blank, so after the
break there is presumably a "quasi-blank input line" onto which the 34th
(blank) field has spilled over. Such input is ignored under the default
blank.lines.skip = TRUE option of read.csv().

The longer lines (2 with 38 fields, 2 with 44) will be split after the
33rd field, the remainder being taken as an additional input line. As a
result, there are 86 (= 82 + 4) rows in the resulting dataframe. This
explanation is compatible with what Martin has observed. The underlying
forensic details were sniffed out with a couple of passes through 'awk'
scripts.

One solution is to call read.csv() with the option col.names = Xnn,
where Xnn is a constructed character vector with elements
"X01" "X02" ... "X44" (once one has determined, as above, that there is
a maximum of 44 fields per line in the file).
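For what it is worth, both the counting and the construction of Xnn can
be done without leaving R. The following is only a sketch (count.fields()
here stands in for the awk passes, and inputFile is whatever file Martin
is actually reading); I have not run it against his files:

  ## Sketch only: count the ";"-separated fields on every line, then
  ## re-read with a full set of 44 column names so that read.csv() no
  ## longer guesses the width from the first 5 lines.
  nf <- count.fields(inputFile, sep = ";", quote = "",
                     blank.lines.skip = FALSE)
  table(nf)                    # the distinct field counts listed above
  maxf <- max(nf)              # 44 in Martin's file
  dataPoints <- read.csv(inputFile, header = FALSE, sep = ";", fill = TRUE,
                         col.names = sprintf("X%02d", 1:maxf))

With all 44 names supplied in advance the over-long lines stay on their
own rows, the short ones are padded by fill = TRUE, and nothing wraps.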
Ted.

On 30-May-09 19:43:47, jim holtman wrote:
> It is still not clear to me exactly how you want to read the lines in.
> If the lines have a variable number of fields, and some of the lines
> might be wrapped, is there some way to determine where the start of
> each line is?
>
> If you are reading them in with read.csv, then the system is assuming
> that each line starts a new row. If this is not the case, then you
> will have to state the rules that determine where the lines start. You
> can always read the data in with 'scan' to separate each line and then
> do whatever processing is required to put together the rows in a data
> frame that you want.
>
> In one of your examples, you indicated that the line was split starting
> at the word "kempten"; if this is in the middle of the line, then you
> would have to create the break after reading the line in with 'scan'
> and then create the rows in the dataframe. All of this can be done in
> R if you can state what the criteria are.
>
> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
> <martin.to...@geo.uzh.ch> wrote:
>
>> Jim,
>> the two lines I put in are the actual problematic input lines.
>> In these examples, there are no quotes nor # signs, although I have
>> no means to make sure they do not occur in the inputs (any hints how
>> I could deal with that?).
>> I am trying to avoid as much pre-processing outside R as possible,
>> and I have to process about 500 files with up to 3000 records each,
>> so I need a more or less automated/batch solution - so any string
>> substitution will have to occur in R.
>> But for the moment, I do not see a reason for substitution, and the
>> wrapping still occurs.
>>
>> Cheers
>> Martin
>>
>> jim holtman wrote:
>>
>>> You need to supply the actual input line so we can see what is
>>> happening. Are you sure you do not have unbalanced quotes in your
>>> input (try quote='') or do you have comment characters ("#") in
>>> your input?
>>>
>>> On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>> <martin.to...@geo.uzh.ch> wrote:
>>>
>>> Dear All,
>>> I am observing a strange behavior, and searching the archives and
>>> help pages didn't help much.
>>> I have a csv with a variable number of fields in each line.
>>>
>>> I use
>>> dataPoints <- read.csv(inputFile, head=FALSE, sep=";", fill=TRUE);
>>>
>>> to read it in, and it works. But - some lines are long and 'wrap',
>>> or split and continue on the next line. So when I check the dim of
>>> the frame, it is not correct, and I can see in a printout that the
>>> line is split into two in the frame. I checked the input file and
>>> all is good.
>>>
>>> An example of the input is:
>>> 37;2175168475;13;8.522729;47.19537;16366...@n00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
>>>
>>> where the last value occurs on the next line in the data frame.
>>>
>>> It does not have to be the last value; in the following example,
>>> the word "kempten" starts the next line:
>>> 39;167757703;12;10.309295;47.724545;21903...@n00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
>>>
>>> What could be the reason?
>>>
>>> I was thinking about solving the issue by using a different
>>> separator for the first 7 fields and concatenating all of the
>>> remaining values into a single string value, but could not figure
>>> out how to do such a substitution in R. Unfortunately, on my system
>>> I cannot specify a range for sed...
>>>
>>> Thanks for any help/pointers
>>> Martin
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
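PS: For completeness, the scan() route that Jim describes might look
roughly like the following. It is only a sketch, not tested on Martin's
actual files: read whole physical lines, split them on ";" directly, and
pad short records before binding them into one data frame, so that
read.csv()'s column guessing never enters the picture.

  ## Sketch only: one physical line per record, split by hand.
  lines  <- scan(inputFile, what = "", sep = "\n", quote = "")
  fields <- strsplit(lines, ";", fixed = TRUE)   # drops the empty field
                                                 # after each trailing ";"
  ncols  <- max(sapply(fields, length))          # widest record
  padded <- lapply(fields, function(x) c(x, rep(NA, ncols - length(x))))
  dataPoints <- as.data.frame(do.call(rbind, padded),
                              stringsAsFactors = FALSE)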
--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-May-09   Time: 21:15:13
------------------------------ XFMail ------------------------------

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.