On Dec 21, 7:21 am, jwwest <[EMAIL PROTECTED]> wrote:
> On Dec 20, 2:13 pm, John Machin <[EMAIL PROTECTED]> wrote:
>
> > On Dec 21, 6:50 am, jwwest <[EMAIL PROTECTED]> wrote:
> >
> > > Anyone have any trouble pattern matching on lines returned by
> > > readline? Here's an example:
> > >
> > > string = "Accounting - General"
> > > pat = ".+\s-"
> > >
> > > Should match on "Accounting -". However, if I read that string in from
> > > a file it will not match. In fact, I can't get anything to match
> > > except ".*".
> > >
> > > I'm almost certain that it has something to do with the characters
> > > that python returns from readline(). If I have this in a file:
> > >
> > > Accounting - General
> > >
> > > And do a:
> > >
> > > line = f.readline()
> > > print line
> > >
> > > I get:
> > >
> > > A c c o u n t i n g - G e n e r a l
> > >
> > > Not sure why, I'm a nub at Python so any help is appreciated. They
> > > look like spaces to me, but aren't (I've tried matching on spacs too)
> > >
> > > - james
> >
> > To find out what the pseudo-spaces are, do this:
> >
> > print repr(open("the_file", "rb").read()[:100])
> >
> > and show us (copy/paste) what you get.
> >
> > Also, tell us what platform you are running Python on, and how the
> > file was created (by what software, on what platform).
>
> Here's my output:
> 'A\x00c\x00c\x00o\x00u\x00n\x00t\x00i\x00n\x00g\x00 \x00-\x00 \x00G
> \x00e\x00n\x00e\x00r\x00a\x00l\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
> \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
> \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00'
>
> I'm running Python on Windows. The file was initially created as
> output from SQL Management Studio. I've re-saved it using TextPad
> which tells me it's Unicode and PC formatted.

When a Windows editor says "Unicode", it almost always means UTF-16.
Your dump shows every character followed by \x00 and no byte-order mark
at the start, which is what little-endian UTF-16 looks like. Try this:

    import codecs
    f = codecs.open("the_file", "r", encoding="utf16le")
    for uline in f:
        line = uline.encode('cp1252')  # or some other encoding if my guess isn't correct
        # proceed as usual

Cheers,
John
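
For anyone who wants to see why the pattern fails on the undecoded data, here is a minimal sketch. It assumes Python 3 (where decoded text is already `str`, so the `encode('cp1252')` step is not needed), and it uses a hypothetical in-memory sample, `raw`, copied from the start of the repr() output quoted above rather than a real file.

```python
import re

# Hypothetical sample: the first 40 bytes from the repr() output in the thread.
raw = (b"A\x00c\x00c\x00o\x00u\x00n\x00t\x00i\x00n\x00g\x00"
       b" \x00-\x00 \x00"
       b"G\x00e\x00n\x00e\x00r\x00a\x00l\x00")

pat = r".+\s-"

# Read as single-byte text, every character is followed by a NUL, so the
# space and the hyphen are never adjacent and "\s-" cannot match.
undecoded = raw.decode("latin-1")
match_undecoded = re.match(pat, undecoded)

# Decoded as little-endian UTF-16, the NULs disappear and the text is intact.
decoded = raw.decode("utf-16-le")
match_decoded = re.match(pat, decoded)

print(repr(decoded))
print(match_undecoded)
print(match_decoded.group(0))
```

If the file had started with a byte-order mark, the plain "utf-16" codec would be the safer guess, since it consumes the mark instead of leaving it in the first line.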