Bugs item #1175396, was opened at 2005-04-02 06:14
Message generated for change (Comment added) made by glchapman
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1175396&group_id=5470
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Irmen de Jong (irmen)
Assigned to: Walter Dörwald (doerwalter)
Summary: codecs.readline sometimes removes newline chars

Initial Comment:
In Python 2.4.1 I observed a new bug in codecs.readline: with certain inputs it removes newline characters from the end of the line.

Probably related to bug #1076985 (Incorrect behaviour of StreamReader.readline leads to crash) and bug #1098990 (codec readline() splits lines apart), both with status closed, so I'm assigning this to Walter.

See the attached files that demonstrate the problem. Reproduced with Python 2.4.1 on Windows XP and on Linux. The problem does not occur with Python 2.4.

(By the way, it seems bug #1076985 was fixed in Python 2.4.1, but the other one (#1098990) was not?)

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2005-04-14 15:25

Message:
Logged In: YES
user_id=86307

I think the foo2.py from 1163244 is probably the same bug; at any rate, the reason for it is that a \r is at the beginning of the last line when read in by decoding_fgets. I have a simpler test file which shows the bug, which I'll email to Walter (you basically just have to get a \r as the last character in the block read by StreamReader, so that atcr will be true).

The problem is caused by StreamReader.readline doing:

    if self.atcr and data.startswith(u"\n"):
        data = data[1:]

since the tokenizer relies on '\n' as the line break character, but it will never see the '\n' removed by the above.

FWIW (not much), I think the 2.4 StreamReader.readline actually made more sense than the current code, although a few changes would seem useful (see below).
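To illustrate the failure mode described above, here is a minimal sketch in modern Python. ToyReader and lines_as_seen_by_consumer are hypothetical names, not part of codecs. The sketch assumes each readline call reads exactly one fixed-size block and reproduces only the two quoted lines, not the real StreamReader.readline.

```python
class ToyReader:
    """Hypothetical, heavily simplified stand-in for a stream reader."""

    def __init__(self, text, blocksize):
        self.text = text
        self.pos = 0
        self.blocksize = blocksize  # assumption: one block per call
        self.atcr = False           # True if the previous block ended in "\r"

    def readline(self):
        """Return at most one block, applying the check quoted in the report."""
        data = self.text[self.pos:self.pos + self.blocksize]
        self.pos += len(data)
        if self.atcr and data.startswith("\n"):
            data = data[1:]  # the "\n" of a split "\r\n" pair is dropped here
        if data:
            self.atcr = data.endswith("\r")
        return data


def lines_as_seen_by_consumer(reader):
    """Join pieces until one ends in a newline, like a newline-only consumer."""
    lines, current = [], ""
    while True:
        piece = reader.readline()
        if not piece:
            break
        current += piece
        if current.endswith("\n"):
            lines.append(current)
            current = ""
    if current:
        lines.append(current)
    return lines


text = "abc\r\ndef\n"
# "\r" is the 4th character, so with blocksize=4 it ends the first block.
split_result = lines_as_seen_by_consumer(ToyReader(text, blocksize=4))
# With a block large enough for the whole text, the pair is never split.
whole_result = lines_as_seen_by_consumer(ToyReader(text, blocksize=16))
print(split_result)
print(whole_result)
```

With blocksize=4 the consumer ends up with a single line that has a \r in the middle and no \n after it, which is what the tokenizer would see. With the larger block size the text comes back unchanged, consistent with the bug showing up only when a \r is the last character of a block.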
I don't think it is particularly useful to treat the size parameter as a fixed maximum number of bytes to read, since the number of bytes read has no fixed relationship to the number of decoded unicode characters (and also, in the case of the tokenizer, no fixed relationship to the number of bytes of encoded utf8). Also, given the current code, the size parameter is effectively ignored if there is a charbuffer: if you have 5 characters sitting in the charbuffer and use a size of 0x1FF, you only get back the 5 characters, even if they do not end in a linebreak. For the tokenizer, this means an unnecessary PyMem_RESIZE and an extra call to decoding_readline roughly every BUFSIZ bytes in the file (since the tokenizer assumes failure to fetch a complete line means its buffer is too small, whereas in fact it was caused by an incomplete line being stored in the StreamReader's charbuffer).

As to changes from 2.4, if the unicode object were to add a findlinebreak method which returns the index of the first character for which Py_UNICODE_ISLINEBREAK is true, readline could use that instead of find("\n"). If it used such a method, readline would also need to explicitly handle a "\r\n" sequence, including a potential read(1) if a '\r' appears at the end of the data (in the case where size is not None). Of course, one problem with that idea is it requires a new method, which may not be allowed until 2.5, and the 2.4.1 behavior definitely needs to be fixed some way. (Interestingly, it looks to me like sre has everything necessary for searching for unicode linebreaks except syntax with which to express the idea in a pattern (maybe I'm missing something, but I can't find a way to get a compiled pattern to emit CATEGORY_UNI_LINEBREAK).)

----------------------------------------------------------------------

Comment By: Michal Rydlo (mmm)
Date: 2005-04-14 14:04

Message:
Logged In: YES
user_id=65460

foo2.py from #1163244 fails to import.
Not being an expert in Python internals, I hope it is due to this bug.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-11 13:42

Message:
Logged In: YES
user_id=89016

OK, I'm reopening the bug report. I didn't manage to install pythondoc. cElementTree complains about: No such file or directory: './pyconfig.h'. Can you provide a simple Python file that fails when imported?

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2005-04-09 14:47

Message:
Logged In: YES
user_id=86307

Sorry to comment on a closed report, but perhaps this fix should not be limited only to cases where size is None. Today, I ran into a spurious syntax error when trying to import pythondoc (from http://effbot.org/downloads/pythondoc-2.1b3-20050325.zip). It turned out a \r was ending up in what looked to the parser like the middle of a line, presumably because a \n was dropped. Anyway, I applied the referenced patch to 2.4.1, except I left out the "size is None" condition, since I knew the tokenizer passes in a size param. With that change pythondoc imports successfully. (Also, I just ran the test suite and nothing broke.)

Since the size parameter is already documented as being passed to StreamReader.read (in codecs.py -- the HTML documentation needs to be updated), and since StreamReader.read says size is an approximate maximum, perhaps it's OK to read one extra byte.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-04 14:01

Message:
Logged In: YES
user_id=89016

Checked in a fix as:
Lib/codecs.py 1.42/1.43
Lib/test/test_codecs.py 1.22
Lib/codecs.py 1.35.2.6
Lib/test/test_codecs.py 1.15.2.4

Are you really sure that the fix for #1098990 is not in 2.4.1?
According to the tracker for #1098990 the fix was in lib/codecs.py revision 1.35.2.2 and revision 1.35.2.3 is the one that got the r241c1 tag, so the fix should be in 2.4.1.

----------------------------------------------------------------------