[issue34801] codecs.getreader() splits lines containing control characters

Karthikeyan Singaravelan Thu, 04 Oct 2018 11:08:59 -0700


Karthikeyan Singaravelan <[email protected]> added the comment:


During iteration, codecs.getreader('utf-8')(open('test.txt', 'rb')) calls 
str.splitlines on the decoded data. As specified in [0], str.splitlines treats 
'\x0b' as a line boundary, since the boundaries it recognizes are a superset 
of universal newlines. So the reader splits on '\x0b', which is the documented 
behavior for str.

./python.exe
Python 3.8.0a0 (heads/master:6f85b826b5, Oct  4 2018, 22:44:36)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> # the text as decoded by codecs.getreader()
>>> a = 'first line\x0b\x0bblah blah\nsecond line\n'
>>> a.splitlines(keepends=True)
['first line\x0b', '\x0b', 'blah blah\n', 'second line\n']

# bytes.splitlines splits only on universal newlines, so it does not
# split on b'\x0b' [1]
>>> b = b'first line\x0b\x0bblah blah\nsecond line\n' 
>>> b.splitlines(keepends=True)
[b'first line\x0b\x0bblah blah\n', b'second line\n']
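
The two sessions above work on an already-decoded string and on raw bytes. As a 
minimal sketch of the reader-level difference, the snippet below feeds the same 
bytes to a codecs stream reader and to io.TextIOWrapper. It uses io.BytesIO as 
a stand-in for the binary file test.txt, which is an assumption made only to 
keep the example self-contained.

```python
import codecs
import io

data = b'first line\x0b\x0bblah blah\nsecond line\n'

# codecs stream reader: decodes the bytes, then splits with str.splitlines,
# so '\x0b' counts as a line boundary.
codec_lines = list(codecs.getreader('utf-8')(io.BytesIO(data)))

# io.TextIOWrapper with the default newline=None: only universal newlines
# ('\n', '\r', '\r\n') end a line, so '\x0b' stays inside the line.
wrapper_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding='utf-8'))

print(codec_lines)
print(wrapper_lines)
```

If the analysis in this comment holds, the first list should match 
a.splitlines(keepends=True) and the second should contain two lines.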


io.TextIOWrapper, however, accepts only None, '', '\n', '\r' and '\r\n' as the 
newline argument in text mode. Binary files differ: as noted under readline 
[2], their line terminator is always b'\n'.

> The line terminator is always b'\n' for binary files; for text
> files, the newlines argument to open can be used to select the line
> terminator(s) recognized.

Thus reading 'first line\x0b\x0bblah blah\nsecond line\n' through 
TextIOWrapper gives ['first line\x0b\x0bblah blah\n', 'second line\n']. Trying 
to pass '\x0b' as newline results in an "illegal newline" error in 
TextIOWrapper.
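
A small sketch of that last point: the helper below (try_newline is a 
hypothetical name, not part of any library) constructs a TextIOWrapper over an 
empty io.BytesIO with a given newline value and reports whether the 
constructor rejected it.

```python
import io

def try_newline(newline):
    """Return the error message if TextIOWrapper rejects `newline`, else None."""
    try:
        io.TextIOWrapper(io.BytesIO(b''), encoding='utf-8', newline=newline)
    except ValueError as exc:
        return str(exc)
    return None

# The five documented values are accepted; '\x0b' is not.
for value in (None, '', '\n', '\r', '\r\n', '\x0b'):
    print(repr(value), '->', try_newline(value))
```

Only the '\x0b' entry should print an error message; the other five values 
should print None.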

Hope I am correct on the above analysis.

[0] https://docs.python.org/3.8/library/stdtypes.html#str.splitlines
[1] https://docs.python.org/3.8/library/stdtypes.html#bytes.splitlines
[2] https://docs.python.org/3/library/io.html#io.TextIOBase.readline

----------
nosy: +xtreak

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34801>
_______________________________________
