On Feb 5, 7:16 pm, Jim Lemon <j...@bitwrit.com.au> wrote:
> On 02/06/2010 09:05 AM, analys...@hotmail.com wrote:
> > On Feb 5, 8:57 am, Barry Rowlingson <b.rowling...@lancaster.ac.uk>
> > wrote:
> >> On Fri, Feb 5, 2010 at 10:23 AM, analys...@hotmail.com
> >> <analys...@hotmail.com> wrote:
> >>> the csv files are downloaded from a database and it looks like some
> >>> character fields contain the CR-LF sequence within them.
> >>>
> >>> This causes R to see a new record/row and the number of rows it sees
> >>> is different (usually higher) from the number of rows actually
> >>> extracted.
> >>
> >> Hard to tell without an example, but I just tried this in a file:
> >>
> >> 1,2,"this
> >> is a test",99
> >> 2,3,"oneliner",45
> >>
> >> and:
> >>
> >> > read.table("test.csv",sep=",")
> >>
> >>   V1 V2              V3 V4
> >> 1  1  2 this\nis a test 99
> >> 2  2  3        oneliner 45
> >>
> >> seemed to work. But if your strings aren't "quoted" (hard to tell
> >> without an example) then you might have to find another way. Hard to
> >> tell without an example.
> >>
> >> Barry
> >>
> >> ______________________________________________
> >> r-h...@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > Here is a hex dump (please ignore the '>' at the start of each line)
> > of the file that results from extracting two rows.
> >
> >> EF BB BF 64 65 73 63 72-69 70 74 69 6F 6E 0D 0A  ...description..
> >> 22 3C 73 74 72 6F 6E 67-3E 55 6E 6B 6E 6F 77 6E  "<strong>Unknown
> >> 20 41 6E 79 74 69 6D 65-2C 20 41 6E 79 77 68 65   Anytime, Anywhe
> >> 72 65 20 4C 65 61 72 6E-69 6E 67 3C 62 72 20 2F  re Learning<br /
> >> 3E 0D 0A 3C 2F 73 74 72-6F 6E 67 3E 20 54 68 65  >..</strong> The
> >> 20 61 6E 73 77 65 72 20-69 73 20 55 6E 6B 6E 6F   answer is Unkno
> >> 77 6E 2E 20 3C 73 74 72-6F 6E 67 3E 20 79 6F 75  wn. <strong> you
> >> 20 63 61 6E 20 73 74 61-72 74 20 61 6E 64 20 66   can start and f
> >> 69 6E 69 73 68 20 69 6E-20 6C 65 73 73 20 74 68  inish in less th
> >> 65 6E 20 31 37 20 6D 6F-6E 74 68 73 2E 3C 2F 73  en 17 months.</s
> >> 74 72 6F 6E 67 3E 20 3C-62 72 20 2F 3E 0D 0A 3C  trong> <br />..<
> >> 62 72 20 2F 3E 0D 0A 55-6E 6B 6E 6F 77 6E 20 61  br />..Unknown a
> >> 62 6F 75 74 20 65 6E 73-75 72 69 6E 67 20 79 6F  bout ensuring yo
> >> 75 20 6C 65 61 72 6E 20-2E 22 0D 0A 03 D8 26 8A  u learn ."....&.
> >
> > R, Fortran and Excel see five lines, but the database has only two
> > lines.
>
> Okay, you have five CR-LF pairs with two being EORs. It looks like the
> <br />CR-LF is the EOR sequence, so it should be possible to preserve
> those while changing the others to something like "~" or deleting them.
> As I said previously, the regexperts can work out a way to distinguish
> the CR-LF pairs that are _not_ in an EOR sequence.
>
> You might want to think about dumping the control characters as well.
>
> Jim

I am sure other sequences cause false EORs as well. The false EORs are
CR-LF sequences that sit inside a field, between the commas that delimit
it. I don't know whether R can read a fixed number of bytes regardless of
EOR markers. If it can, it should be possible to assemble the true
database rows from the bytes read in.
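
On the question of reading bytes regardless of EOR markers: base R's readBin() returns the file as a raw vector and pays no attention to line endings. What follows is only a rough sketch of how the true rows might be reassembled from those bytes, not a tested solution for the real extract. It assumes that every field containing an embedded CR-LF is wrapped in double quotes (as in the hex dump), that literal quotes inside a field are doubled, and that the file contains no NUL bytes. The function names strip_embedded_newlines() and read_csv_with_embedded_breaks() are made up for this illustration.

```r
# Hypothetical helper: overwrite CR and LF bytes that fall inside a
# double-quoted field, leaving the true end-of-record markers alone.
strip_embedded_newlines <- function(bytes, replacement = as.raw(0x20)) {
  is_quote <- bytes == as.raw(0x22)
  # An odd running count of quote characters means "inside a quoted field".
  # Doubled quotes ("") inside a field flip the parity twice, so they cancel.
  in_quotes <- cumsum(is_quote) %% 2 == 1
  is_break <- bytes == as.raw(0x0D) | bytes == as.raw(0x0A)
  bytes[in_quotes & is_break] <- replacement
  bytes
}

# Hypothetical wrapper: read the whole file as bytes, clean it, then parse it.
read_csv_with_embedded_breaks <- function(path, ...) {
  # readBin() ignores line endings; n is the file size in bytes.
  bytes <- readBin(path, what = "raw", n = file.size(path))

  # Drop a UTF-8 byte-order mark (EF BB BF) if present, as in the hex dump.
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }

  cleaned <- strip_embedded_newlines(bytes)
  # Normalise the remaining (true) record terminators from CR-LF to LF.
  cleaned <- cleaned[cleaned != as.raw(0x0D)]

  # rawToChar() stops with an error if the data contain NUL bytes.
  read.csv(text = rawToChar(cleaned), stringsAsFactors = FALSE, ...)
}

# Usage ("extract.csv" is a placeholder path, not a file from this thread):
# df <- read_csv_with_embedded_breaks("extract.csv")
# nrow(df)
```

Each embedded CR or LF becomes a single space here; pass a different byte as `replacement` (for example as.raw(0x7E) for "~", as Jim suggested) to keep the breaks visible. If the fields are not quoted in the real extract, this approach will not help and the records would have to be delimited some other way.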