On Oct 13, 4:01 pm, Neil Cerutti <ne...@norwich.edu> wrote:
> On 2010-10-13, pstatham <pstat...@sefas.com> wrote:
>
> > Hopefully this will interest some, I have a csv file (can be
> > downloaded from http://www.paulstathamphotography.co.uk/45.txt) which
> > has five fields separated by ~ delimiters. To read this I've been
> > using a csv.DictReader which works in 99% of the cases. Occasionally
> > however the description field has errant \r\n characters in the middle
> > of the record. This causes the reader to assume it's a new record and
> > try to read it.
>
> Here's an alternative idea. Working with csv module for this job
> is too difficult for me. ;)
>
> import re
>
> record_re = "(?P<PROGTITLE>.*?)~(?P<SUBTITLE>.*?)~(?P<EPISODE>.*?)~(?P<DESCRIPTION>.*?)~(?P<DATE>.*?)\n(.*)"
>
> def parse_file(fname):
>     with open(fname) as f:
>         data = f.read()
>     m = re.match(record_re, data, flags=re.M | re.S)
>     while m:
>         yield m.groupdict()
>         m = re.match(record_re, m.group(6), flags=re.M | re.S)
>
> for record in parse_file('45.txt'):
>     print(record)
>
> --
> Neil Cerutti

Thanks guys. I can't alter the source data. I wouldn't have considered regex, but it's a good idea, as I can then define my own record structure instead of the reader dictating to me what a record is.
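
For anyone following along, here is a rough sketch of what "defining my own record structure" could look like. It is a variation on Neil's idea, not his code: it uses re.finditer and treats a record as exactly five fields separated by four ~ characters. It assumes that no field ever contains a ~ and that the date field never contains a line break. The names RECORD_RE and iter_records and the sample rows are made up for illustration; they are not taken from the real 45.txt.

```python
import re

# Hypothetical pattern: a record is five "~"-separated fields. The first
# four may contain line breaks; the last one (DATE) ends at the line end.
# Assumption: no field ever contains the "~" delimiter itself.
RECORD_RE = re.compile(
    r"(?P<PROGTITLE>[^~]*)~"
    r"(?P<SUBTITLE>[^~]*)~"
    r"(?P<EPISODE>[^~]*)~"
    r"(?P<DESCRIPTION>[^~]*)~"
    r"(?P<DATE>[^~\r\n]*)"
    r"(?:\r?\n|\Z)"
)


def iter_records(text):
    """Yield one dict per record found in text."""
    for match in RECORD_RE.finditer(text):
        record = match.groupdict()
        # Collapse a stray \r\n (and any other whitespace run) in the
        # description into a single space.
        record["DESCRIPTION"] = " ".join(record["DESCRIPTION"].split())
        yield record


# Made-up sample data; the first record has a line break in its description.
SAMPLE = (
    "Title A~Sub A~1~A description\r\nsplit over two lines~2010-10-13\n"
    "Title B~Sub B~2~Plain description~2010-10-14\n"
)

for record in iter_records(SAMPLE):
    print(record)
```

To run it on a real file, read the whole file into a string first, e.g. with open('45.txt', newline='') as f: text = f.read(), and pass text to iter_records. Because the pattern is scanned once over the text, it avoids re-matching the remainder of the file for every record.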