[issue18291] codecs.open interprets space as line ends

Paul Mon, 24 Jun 2013 06:12:29 -0700

New submission from Paul:

I hope I am writing in the right place.


When using codecs.open with UTF-8 encoding, it seems the characters \x1c, \x1d, 
and \x1e (decimal 28, 29, and 30) are interpreted as end-of-line.

Example code:

>>> with open('unicodetest.txt', 'w') as f:
...   f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
>>> with open('unicodetest.txt', 'r') as f:
...   for i,l in enumerate(f):
...     print i, repr(l)
0 'a\x1cb\x1dc\x1ed\x1fe'

The point here is that it reads it as one line, as I would expect. But using 
codecs.open with UTF-8 encoding it reads it as many lines:

>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, repr(l)
0 u'a\x1c'
1 u'b\x1d'
2 u'c\x1e'
3 u'd\x1fe'

The characters \x1c through \x1f are described as "Information Separator Four" 
through "One" (in that order). As far as I can see they never mark line ends. 
Also interestingly, \x1f is not treated as a line end, while the other three are.
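
One possible explanation, stated as an assumption rather than something 
confirmed in this report: the codecs reader may split decoded text with the 
unicode splitlines method, which treats \x1c, \x1d, and \x1e (but not \x1f) as 
line boundaries, while the byte-string splitlines only splits on \n and \r. 
The sketch below compares the two directly. The variable names are only 
illustrative, and the u'' prefix is there so it runs on both Python 2.7 and 
Python 3.

```python
# Same data as in the report: 'a', IS4, 'b', IS3, 'c', IS2, 'd', IS1, 'e'.
text = u'a\x1cb\x1dc\x1ed\x1fe'

# Unicode splitlines: \x1c, \x1d and \x1e count as line boundaries, \x1f does not.
unicode_pieces = text.splitlines(True)

# Byte-string splitlines: only \n, \r and \r\n count as line boundaries.
byte_pieces = text.encode('utf-8').splitlines(True)

print(len(unicode_pieces))  # four pieces, matching the codecs.open output
print(len(byte_pieces))     # one piece, matching the plain open() output
```

If that assumption holds, it would also explain why \x1f is left alone.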

As a side note, I tested and verified that io.open reads the file as one line 
(but when reading large amounts of data it appears to be about 5 times slower 
than codecs.open):

>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, repr(l)
0 u'a\x1cb\x1dc\x1ed\x1fe'
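
As a self-contained version of the io.open check, here is a minimal sketch 
that writes the test data plus one ordinary second line to a temporary file 
and reads it back. The helper name read_lines and the temporary path are 
hypothetical, chosen only for illustration. It should run on Python 2.6+ and 
Python 3.

```python
import io
import os
import shutil
import tempfile


def read_lines(path, encoding='utf-8'):
    """Return the decoded lines of a file, split only on ordinary newlines."""
    with io.open(path, 'r', encoding=encoding) as f:
        return list(f)


tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, 'unicodetest.txt')
    # The separators \x1c..\x1f sit inside the first line; '\n' ends each line.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'a\x1cb\x1dc\x1ed\x1fe\nsecond line\n')
    lines = read_lines(path)
finally:
    shutil.rmtree(tmp_dir)  # remove the temporary directory again

print(len(lines))
```

Here the information separators stay inside the first line, so only the two 
real newlines produce line breaks.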

----------
components: IO, Unicode
messages: 191758
nosy: ezio.melotti, wpk
priority: normal
severity: normal
status: open
title: codecs.open interprets space as line ends
type: behavior
versions: Python 2.6, Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18291>
_______________________________________