Andrew McLean wrote:
> I have a bunch of csv files that have the following characteristics:
>
> - field delimiter is a comma
> - all fields quoted with double quotes
> - lines terminated by a *space* followed by a newline
>
> What surprised me was that the csv reader included the trailing space in
> the final field value returned, even though it is outside of the quotes.
>
> I've produced a test program (see below) that demonstrates this. There
> is a workaround, which is to not pass the csv reader the file iterator,
> but rather a generator that returns lines from the file with the
> trailing space stripped.
>
> Interestingly, the same behaviour is seen if there are spaces before the
> field separator. They are also included in the preceding field value,
> even if they are outside the quotations. My workaround wouldn't help here.

A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]

>
> Anyway is this a bug or a feature? If it is a feature then I'm curious
> as to why it is considered desirable behaviour.

IMHO, a bug. In that state, it should be expecting another quotechar, a
delimiter, or a lineterminator. A case could be made for either
(a) ignore space characters
(b) raise an exception
(c) a or b depending on an arg ..., ignore_trailing_space=False.

But it gets even more bizarre; see output from revised test script:

DOS_prompt>cat amclean2.py
import csv

filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "w")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.write('"g"xxx,"h" yyy,"i"zzz \n')
fout.write('Fred "Supercoder" Nerk,p,q\n')
fout.write('Fred "Supercoder\' Nerk,p,q\n')
fout.write('Fred \'Supercoder" Nerk,p,q\n')
fout.write('"Fred "Supercoder" Nerk",p,q\n')
fout.write('"Fred "Supercoder\' Nerk",p,q\n')
fout.write('"Fred \'Supercoder" Nerk",p,q\n')
fout.write('"Emoh Ruo", 123 Smith St, Sometown,p,q\n')
fout.write('""Emoh Ruo", 123 Smith St, Sometown","p","q"\n')
fout.close()

# Function to test a reader
def read_and_print(reader):
    for line in reader:
        # print ",".join(['"%s"' % field for field in line]) # sheesh
        print repr(line)

# Read the test file - and print the output
reader = csv.reader(open("test_data.csv", "rb"))
read_and_print(reader)

DOS_prompt>\python25\python amclean2.py
['Field1', 'Field2', 'Field3 ']
['a', 'b', 'c ']
['d ', 'e', 'f ']
['gxxx', 'h yyy', 'izzz ']
['Fred "Supercoder" Nerk', 'p', 'q']
['Fred "Supercoder\' Nerk', 'p', 'q']
['Fred \'Supercoder" Nerk', 'p', 'q']
['Fred Supercoder" Nerk"', 'p', 'q']
['Fred Supercoder\' Nerk"', 'p', 'q']
['Fred \'Supercoder Nerk"', 'p', 'q']
['Emoh Ruo', ' 123 Smith St', ' Sometown', 'p', 'q']
['Emoh Ruo"', ' 123 Smith St', ' Sometown"', 'p', 'q']

Input like the 4th line (and subsequent lines) in the test file cannot
have been produced by code which was following the usual algorithm for
quoting CSV fields. Either it is *concatenating* properly-quoted
segments (unlikely) or it is not doing CSV quoting at all or it is
blindly wrapping quotes around the field without doubling internal
quotes. IMHO such problems should not be silently ignored.

> # Try using lineterminator instead - it doesn't work
> reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")

lineterminator is silently ignored by the reader.

Cheers,
John
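
For reference, the field-stripping workaround suggested above can be wrapped in a small generator so it is applied to every row as it comes out of the reader. This is only a minimal sketch: the name normalized_rows and the sample lines are made up for illustration, and it is written in Python 3 syntax rather than the Python 2.5 used in the test script. It also assumes, as argued above, that leading, trailing and repeated whitespace inside a field carries no meaning in your data.

```python
import csv


def normalized_rows(reader):
    """Yield each row with whitespace stripped and collapsed in every field.

    Hypothetical helper, not part of the csv module.
    """
    for row in reader:
        # str.split() with no arguments splits on runs of whitespace and
        # discards leading/trailing whitespace; ' '.join() then puts the
        # pieces back together separated by single spaces.
        yield [' '.join(field.split()) for field in row]


# Illustrative sample mirroring the first lines of the test file:
# fully quoted fields with stray spaces outside the quotes.
lines = [
    '"Field1","Field2","Field3" \n',
    '"a","b","c" \n',
    '"d" ,"e","f" \n',
]

# csv.reader accepts any iterable of strings, so a list of lines is enough.
raw_rows = list(csv.reader(lines))
clean_rows = list(normalized_rows(csv.reader(lines)))

for row in clean_rows:
    print(row)
```

Unlike stripping whole lines before they reach the reader, this also deals with spaces that sit between a closing quote and the next delimiter. It does nothing for the malformed-quoting cases later in the test file.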