New submission from Paul: I hope I am writing in the right place.
When using codecs.open with UTF-8 encoding, it seems the characters \x1c, \x1d, and \x1e are interpreted as end-of-line.

Example code:

>>> with open('unicodetest.txt', 'w') as f:
...     f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
...
>>> with open('unicodetest.txt', 'r') as f:
...     for i,l in enumerate(f):
...         print i, repr(l)
...
0 'a\x1cb\x1dc\x1ed\x1fe'

The point here is that it reads the file as one line, as I would expect. But using codecs.open with UTF-8 encoding, it reads the file as several lines:

>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
...     for i,l in enumerate(f):
...         print i, repr(l)
...
0 u'a\x1c'
1 u'b\x1d'
2 u'c\x1e'
3 u'd\x1fe'

The characters \x1c through \x1f (chr(28) through chr(31)) are described as "Information Separator Four" through "One" (in that order). As far as I can see, they never mark line ends. Interestingly, \x1f isn't interpreted as a line end.

As a side note, I tested and verified that io.open is correct (but when reading large amounts of data it appears to be about 5 times slower than codecs):

>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
...     for i,l in enumerate(f):
...         print i, repr(l)
...
0 u'a\x1cb\x1dc\x1ed\x1fe'

----------
components: IO, Unicode
messages: 191758
nosy: ezio.melotti, wpk
priority: normal
severity: normal
status: open
title: codecs.open interprets space as line ends
type: behavior
versions: Python 2.6, Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18291>
_______________________________________
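
The difference described in this report can also be reproduced in memory, without a temporary file. The sketch below is not part of the original submission. It assumes that the reader class returned by codecs.getreader('utf-8') behaves like the one codecs.open wraps, and that the extra line breaks come from the Unicode-aware splitlines() method, which treats \x1c, \x1d, and \x1e as line boundaries but not \x1f. The variable names are illustrative only.

```python
import codecs
import io

# Same payload as in the report: 'a', FS, 'b', GS, 'c', RS, 'd', US, 'e'
# (chr(28) through chr(31), i.e. \x1c through \x1f).
payload = u'a\x1cb\x1dc\x1ed\x1fe'
raw = payload.encode('utf-8')

# Assumption: this StreamReader stands in for what codecs.open returns.
codecs_lines = list(codecs.getreader('utf-8')(io.BytesIO(raw)))

# io.TextIOWrapper is the text layer behind io.open; by default it
# only treats '\n', '\r', and '\r\n' as line ends.
io_lines = list(io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8'))

# Unicode-aware splitlines(), keeping the separators on each piece.
split_lines = payload.splitlines(True)

print(len(codecs_lines))            # 4
print(len(io_lines))                # 1
print(split_lines == codecs_lines)  # True
```

If the assumption holds, the codecs reader yields the same four pieces as splitlines(True), while the io layer returns the payload as a single line.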