Bugs item #1175396, was opened at 2005-04-02 17:14
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1175396&group_id=5470
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Irmen de Jong (irmen)
Assigned to: Walter Dörwald (doerwalter)
Summary: codecs.readline sometimes removes newline chars

Initial Comment:
In Python 2.4.1 I observed a new bug in codecs.readline: with certain
inputs it removes newline characters from the end of the line. It is
probably related to bug #1076985 (Incorrect behaviour of
StreamReader.readline leads to crash) and bug #1098990 (codec readline()
splits lines apart), both with status closed, so I'm assigning this to
Walter.

See the attached files that demonstrate the problem. Reproduced with
Python 2.4.1 on Windows XP and on Linux. The problem does not occur with
Python 2.4.

(By the way, it seems bug #1076985 was fixed in Python 2.4.1, but the
other one (#1098990) was not?)

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-21 23:55

Message:
Logged In: YES
user_id=89016

OK, I've checked in the following:

Lib/codecs.py 1.44
Lib/test/test_codecs.py 1.23
Lib/codecs.py 1.35.2.7
Lib/test/test_codecs.py 1.15.2.5

with the following changes, as suggested by glchapman: if a chunk read
has a trailing "\r", read one more character even if the user has passed
a size parameter, and remove the atcr logic. This should fix most of the
problems.

There are three open issues:

1) How do we handle the problem of a truncated line, if the data comes
from the charbuffer instead of being read from the stream?

2) How do we handle error reporting? The buffering code might read more
data than the current line. If this data has a decoding error, the
caller might report a wrong line number. (See bug #1178484)

3) Fixing the tokenizer.
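As a minimal sketch of the strategy described above (my own
illustration, not the code that was checked in as Lib/codecs.py 1.44),
assuming only a reader object whose read(size) returns decoded unicode
data; the function name and the default chunk size of 72 are invented
for this illustration:

    # Sketch of the "trailing '\r' -> read one more character" idea; not
    # the actual codecs.py implementation.
    def readline_sketch(reader, size=None):
        line = u""
        readsize = size or 72
        while True:
            data = reader.read(readsize)
            if data.endswith(u"\r"):
                # Read one extra character so a '\r\n' pair is never split
                # across two calls; no atcr flag has to be remembered.
                data += reader.read(1)
            line += data
            pos = line.find(u"\n")
            if pos >= 0:
                return line[:pos + 1]        # complete line, line end kept
            if not data or size is not None:
                # EOF, or the caller limited the read: return what we have,
                # possibly an incomplete line the caller must cope with.
                return line

With this shape the only line that can still end in a bare '\r' is one
whose '\r' is the very last character of the stream, which is the
residual case Walter mentions in his 2005-04-15 comment below.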
----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-15 20:16

Message:
Logged In: YES
user_id=89016

The current readline() is implemented so that even if the complete line
ending (i.e. '\r\n') can't be read within size bytes, at least something
that might work as a line end (i.e. '\r') is available at the end of the
string, so that the user knows that the line is complete. The atcr
member makes sure that the '\n' that is read next isn't misinterpreted
as another line.

Unfortunately the import mechanism doesn't work that way: it demands a
'\n' as line terminator and will continue reading if it only finds the
'\r'. This means that the '\n' will be skipped and the import mechanism
treats those two lines as one.

IMHO the best solution would be to read the extra character even if a
size parameter was passed, as glchapman suggested, so the above
situation would really only happen if the last character from the stream
is '\r'. I think the tokenizer should be able to survive this. What it
didn't survive in 2.4 was that readline() tried to get its hands on a
complete line even if the length of the line was way bigger than the
size passed in.

If we remove the "size is None" restriction from the current code, then
I think we should remove the atcr logic too, because that situation
could only happen under exceptional circumstances, and the old handling
might be better for those users that don't recognize '\r' as a line end.
But in any case the tokenizer should be fixed to be able to handle any
line length returned from readline(). I'd really like to get a review by
Martin v. Löwis of glchapman's patch #1101726.

Of course the simplest solution would be: "If you want a complete line,
don't pass a size parameter to readline()." I don't know if it would
make sense to change the PEP 263 machinery to support this.

The other problem is if readline() returns data from the charbuffer. I
don't really see how this can be fixed. We might call read() with a size
parameter even if there are characters in the charbuffer, but what
happens if the user calls readline(size=100000) first and then
readline(size=10)? The first readline() call might put a very long
string into the charbuffer, and then the second call will return a
unicode string whose length is in no way correlated to the size
parameter. (This happens even with the current code, of course, so the
message should be: don't trust the size parameter; it only tells the
underlying stream how many bytes to read (approximately), it doesn't
tell you *anything* about the amount of data returned.) This was
different in 2.3.x, because the readline() method of the underlying
stream was used, which handled this properly, but it only worked if an
encoded linefeed was recognizable as a normal 8-bit '\n' linefeed by the
underlying readline() (i.e. the UTF-16 encodings didn't work).

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2005-04-15 01:25

Message:
Logged In: YES
user_id=86307

I think the foo2.py from #1163244 is probably the same bug; at any rate,
the reason for it is that a \r is at the beginning of the last line when
read in by decoding_fgets. I have a simpler test file which shows the
bug, which I'll email to Walter (you basically just have to get a \r as
the last character in the block read by StreamReader, so that atcr will
be true). The problem is caused by StreamReader.readline doing:

    if self.atcr and data.startswith(u"\n"):
        data = data[1:]

since the tokenizer relies on '\n' as the line break character, but it
will never see the '\n' removed by the above.

FWIW (not much), I think the 2.4 StreamReader.readline actually made
more sense than the current code, although a few changes would seem
useful (see below). I don't think it is particularly useful to treat the
size parameter as a fixed maximum number of bytes to read, since the
number of bytes read has no fixed relationship to the number of decoded
unicode characters (and also, in the case of the tokenizer, no fixed
relationship to the number of bytes of encoded UTF-8).

Also, given the current code, the size parameter is effectively ignored
if there is a charbuffer: if you have 5 characters sitting in the
charbuffer and use a size of 0x1FF, you only get back the 5 characters,
even if they do not end in a linebreak. For the tokenizer, this means an
unnecessary PyMem_RESIZE and an extra call to decoding_readline roughly
every BUFSIZ bytes in the file (since the tokenizer assumes failure to
fetch a complete line means its buffer is too small, whereas in fact it
was caused by an incomplete line being stored in the StreamReader's
charbuffer).

As to changes from 2.4: if the unicode object were to add a
findlinebreak method which returns the index of the first character for
which Py_UNICODE_ISLINEBREAK is true, readline could use that instead of
find("\n"). If it used such a method, readline would also need to
explicitly handle a "\r\n" sequence, including a potential read(1) if a
'\r' appears at the end of the data (in the case where size is not
None). Of course, one problem with that idea is that it requires a new
method, which may not be allowed until 2.5, and the 2.4.1 behavior
definitely needs to be fixed some way.

(Interestingly, it looks to me like sre has everything necessary for
searching for unicode linebreaks except syntax with which to express the
idea in a pattern; maybe I'm missing something, but I can't find a way
to get a compiled pattern to emit CATEGORY_UNI_LINEBREAK.)
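For illustration only, a pure-Python stand-in for the proposed
findlinebreak method could look like the sketch below; the helper name
and the set of line-break characters are my assumptions, not taken from
patch #1101726 or from CPython's headers:

    # Hypothetical helper: index of the first character that would count
    # as a line break, or -1 if there is none.  The character set is an
    # assumption about what Py_UNICODE_ISLINEBREAK accepts.
    _LINEBREAKS = u"\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"

    def findlinebreak(s, start=0):
        for i in xrange(start, len(s)):
            if s[i] in _LINEBREAKS:
                return i
        return -1

A readline() built on such a helper would still need the extra read(1)
described above when the match is a '\r' at the very end of the data, to
see whether it is the first half of a '\r\n' pair.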
----------------------------------------------------------------------

Comment By: Michal Rydlo (mmm)
Date: 2005-04-15 00:04

Message:
Logged In: YES
user_id=65460

foo2.py from #1163244 fails to import. Not being an expert in Python
internals, I hope it is due to this bug.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-11 23:42

Message:
Logged In: YES
user_id=89016

OK, I'm reopening the bug report. I didn't manage to install pythondoc:
cElementTree complains about "No such file or directory: './pyconfig.h'".
Can you provide a simple Python file that fails when imported?

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2005-04-10 00:47

Message:
Logged In: YES
user_id=86307

Sorry to comment on a closed report, but perhaps this fix should not be
limited only to cases where size is None. Today, I ran into a spurious
syntax error when trying to import pythondoc (from
http://effbot.org/downloads/pythondoc-2.1b3-20050325.zip). It turned out
a \r was ending up in what looked to the parser like the middle of a
line, presumably because a \n was dropped.

Anyway, I applied the referenced patch to 2.4.1, except I left out the
"size is None" condition, since I knew the tokenizer passes in a size
param. With that change pythondoc imported successfully. (Also, I just
ran the test suite and nothing broke.) Since the size parameter is
already documented as being passed to StreamReader.read (in codecs.py;
the HTML documentation needs to be updated), and since StreamReader.read
says size is an approximate maximum, perhaps it's OK to read one extra
byte.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-05 00:01

Message:
Logged In: YES
user_id=89016

Checked in a fix as:

Lib/codecs.py 1.42/1.43
Lib/test/test_codecs.py 1.22
Lib/codecs.py 1.35.2.6
Lib/test/test_codecs.py 1.15.2.4

Are you really sure that the fix for #1098990 is not in 2.4.1? According
to the tracker for #1098990, the fix was in Lib/codecs.py revision
1.35.2.2, and revision 1.35.2.3 is the one that got the r241c1 tag, so
the fix should be in 2.4.1.

----------------------------------------------------------------------
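For reference, a 2.4-era sketch of the failure mode discussed in this
thread, driving codecs.StreamReader directly instead of going through
the import machinery. The file contents and the size value are invented
so that the first chunk ends exactly on the '\r'; whether the '\n' is
really dropped depends on which codecs.py revision is installed.

    import codecs
    from StringIO import StringIO

    # "line one\r" is exactly 9 bytes, so readline(9) stops right after
    # the '\r'; with the 2.4.1 atcr logic quoted above, the matching
    # '\n' is then stripped from the start of the next chunk.
    raw = StringIO("line one\r\nline two\n")
    reader = codecs.getreader("ascii")(raw)

    first = reader.readline(9)     # "line one\r"
    second = reader.readline()     # the leading '\n' may be silently
                                   # dropped here, so code that only
                                   # splits on '\n' (like the tokenizer)
                                   # sees both lines as one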