Bugs item #1178484, was opened at 2005-04-07 14:33
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1178484&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Parser/Compiler
Group: Python 2.4
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Timo Linna (tilinna)
Assigned to: Martin v. Löwis (loewis)
Summary: Erroneous line number error in Py2.4.1

Initial Comment:
For some reason Python 2.3.5 reports the error in the following program correctly:

  File "C:\Temp\problem.py", line 7
SyntaxError: unknown decode error

..whereas Python 2.4.1 reports an invalid line number:

  File "C:\Temp\problem.py", line 2
SyntaxError: unknown decode error

----- problem.py starts -----
# -*- coding: ascii -*-
"""
Foo bar
"""
# Ä is not allowed in ascii coding
----- problem.py ends -----

Without the encoding declaration both Python versions report the usual deprecation warning (just like they should be doing).

My environment: Windows 2000 + SP3.

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-11 16:04

Message:
Logged In: YES
user_id=89016

Somehow I forgot to upload the patch. Here it is (diff2.txt). I'd like this patch to go into 2.4.2.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-19 21:06

Message:
Logged In: YES
user_id=89016

> Walter, as I've said before: I know that you need buffering
> for the UTF-x readline support, but I don't see a
> requirement for it in most of the other codecs

The *charbuffer* is required for readline support, but the *bytebuffer* is required for any non-charmap codec.

To have different buffering modes we'd either need a flag in the StreamReader or use different classes, i.e.
a class hierarchy like the following:

StreamReader
    UnbufferedStreamReader
        CharmapStreamReader
            ascii.StreamReader
            iso_8859_1.StreamReader
    BufferedStreamReader
        utf_8.StreamReader

I don't think that we should introduce such a big change in 2.4.x. Furthermore there is another problem: The 2.4 buffering code automatically gives us universal newline support. If you have a file foo.txt containing "a\rb", with Python 2.4 you get:

>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\r', u'b']

But with Python 2.3 you get:

>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\rb']

If we switched to the old StreamReader for the charmap codecs, suddenly the stream readers for e.g. latin-1 and UTF-8 would behave differently. Of course we could change the buffering stream reader to only split lines on "\n", but this would change functionality again.

> Your argument about applications making implications on the
> file position after having used .readline() is true, but
> still many applications rely on this behavior which is not
> as far fetched as it may seem given that they normally only
> expect 8-bit data.

If an application doesn't mix calls to read() with calls to readline() (or different size values in these calls), the change in behaviour from 2.3 to 2.4 shouldn't be this big.

No matter what we decide for the codecs, the tokenizer is broken and should be fixed.

> Wouldn't it make things a lot safer if we only use buffering
> per default in the UTF-x codecs and revert back to the old
> non-buffered behavior for the other codecs which has worked
> well in the past ?!

Only if we'd drop the additional functionality added in 2.4 (universal newline support, the chars argument for read() and the keepends argument for readline()), which I think could only be done for 2.5.

> About your patch:
>
> * Please explain what firstline is supposed to do
> (preferably in the doc-string).

OK, I've added an explanation in the docstring.

> * Why is firstline always set in .readline() ?
firstline is only supposed to be used by readline(). We could rename the argument to _firstline to make it clear that this is a private parameter, or introduce a new method _read() that has a firstline parameter. Then read() calls _read() with firstline==False and readline() calls _read() with firstline==True.

The purpose of firstline is to make sure that if an input stream has its first decoding error in line n, the UnicodeDecodeError will only be raised by the n'th call to readline().

> * Please remove the print repr()

OK, done.

> * You cannot always be sure that exc has a .start attribute,
> so you need to accommodate for this situation as well

I don't understand that. A UnicodeDecodeError is created by PyUnicodeDecodeError_Create() in exceptions.c, so any UnicodeDecodeError instance without a start attribute would be severely broken.

Thanks for reviewing the patch.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-05-18 11:31

Message:
Logged In: YES
user_id=38388

Walter, as I've said before: I know that you need buffering for the UTF-x readline support, but I don't see a requirement for it in most of the other codecs. E.g. an ascii codec or latin-1 codec will only ever see standard line ends (not Unicode ones), so the stream's .readline() method can be used directly - just like we did before the buffering code was added.

Your argument about applications making implications on the file position after having used .readline() is true, but still many applications rely on this behavior which is not as far fetched as it may seem given that they normally only expect 8-bit data.

Wouldn't it make things a lot safer if we only use buffering per default in the UTF-x codecs and revert back to the old non-buffered behavior for the other codecs which has worked well in the past ?!

About your patch:

* Please explain what firstline is supposed to do (preferably in the doc-string).
* Why is firstline always set in .readline() ?
* Please remove the print repr()
* You cannot always be sure that exc has a .start attribute, so you need to accommodate for this situation as well

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-17 18:50

Message:
Logged In: YES
user_id=89016

It isn't the buffering support per se that breaks the tokenizer. This problem exists even in Python 2.3.x (simply try the test scripts from http://www.python.org/sf/1089395 with Python 2.3.5 and you'll get a segfault).

Applications that rely on len(readline(x)) == x or anything similar are broken anyway.

Supporting buffered and unbuffered reading would mean keeping the 2.3 mode of doing things around indefinitely, and we'd lose readline() support for UTF-16 again.

BTW, applying Greg Chapman's patch (http://www.python.org/sf/1101726, which fixes the tokenizer) together with this one seems to fix the problem from my previous post. So if you could give http://www.python.org/sf/1101726 a third look, so we can get it into 2.4.2, this would be great.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-05-17 11:13

Message:
Logged In: YES
user_id=38388

Walter, I think that instead of trying to get the tokenizer to work with the buffer support in the codecs, you should add a flag that allows switching off the buffer support in the codecs altogether and then use the unbuffered mode codecs in the tokenizer.

I expect that other applications will run into the same kind of problem, so it should be possible to switch off buffering if needed (maybe we should make this the default ?!).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-16 10:35

Message:
Logged In: YES
user_id=89016

OK, here is a patch. It adds an additional argument firstline to read(). If this argument is true (i.e.
if called from readline()) and a decoding error happens, this error will only be reported if it is in the first line. Otherwise read() will decode up to the error position and put the rest in the bytebuffer.

Unfortunately with this patch, I get a segfault with the following stacktrace if I run the test. I don't know if this is related to bug #1089395/patch #1101726.

Martin, can you take a look?

#0  0x08057ad1 in tok_nextc (tok=0x81ca7b0) at tokenizer.c:719
#1  0x08058558 in tok_get (tok=0x81ca7b0, p_start=0xbffff3d4, p_end=0xbffff3d0) at tokenizer.c:1075
#2  0x08059331 in PyTokenizer_Get (tok=0x81ca7b0, p_start=0xbffff3d4, p_end=0xbffff3d0) at tokenizer.c:1466
#3  0x080561b1 in parsetok (tok=0x81ca7b0, g=0x8167980, start=257, err_ret=0xbffff440, flags=0) at parsetok.c:125
#4  0x0805613c in PyParser_ParseFileFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", g=0x8167980, start=257, ps1=0x0, ps2=0x0, err_ret=0xbffff440, flags=0) at parsetok.c:90
#5  0x080f3926 in PyParser_SimpleParseFileFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", start=257, flags=0) at pythonrun.c:1345
#6  0x080f352b in PyRun_FileExFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", start=257, globals=0xb7d62e94, locals=0xb7d62e94, closeit=1, flags=0xbffff544) at pythonrun.c:1239
#7  0x080f22f2 in PyRun_SimpleFileExFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", closeit=1, flags=0xbffff544) at pythonrun.c:860
#8  0x080f1b16 in PyRun_AnyFileExFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", closeit=1, flags=0xbffff544) at pythonrun.c:664
#9  0x08055e45 in Py_Main (argc=2, argv=0xbffff5f4) at main.c:484
#10 0x08055366 in main (argc=2, argv=0xbffff5f4) at python.c:23

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-07 16:28

Message:
Logged In: YES
user_id=89016

The reason for this is the new codec buffering code in 2.4: The codec might read and decode more data from the byte stream than is necessary for decoding one
line. I.e. when reading line n, the codec might decode bytes that belong to line n+1, n+2 etc. too. If there's a decoding error in this data, line n gets reported. I don't think there's a simple fix for this.

----------------------------------------------------------------------

_______________________________________________
Python-bugs-list mailing list
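
The read-ahead effect described in the 2005-04-07 comment, and the idea behind the firstline patch from the 2005-05-16 comment, can be illustrated with a small sketch. This is not the patch itself (diff2.txt is not reproduced in this thread) and not the real codecs.StreamReader: the class names EagerLineReader and DeferringLineReader, the helper failing_line, the chunk size and the exact line layout of the test data (non-ASCII byte placed on line 7) are all assumptions made for illustration. It is written for Python 3 so that it can be run today, and it only handles single-byte encodings such as ascii.

```python
import io

# Assumed layout: the non-ASCII byte (0xC4) sits on line 7.
SOURCE = (
    b"# -*- coding: ascii -*-\n"
    b"\n"
    b'"""\n'
    b"Foo bar\n"
    b'"""\n'
    b"\n"
    b"# \xc4 is not allowed in ascii coding\n"
)


class EagerLineReader:
    """Hypothetical reader that decodes a whole chunk before splitting lines."""

    def __init__(self, stream, encoding, chunk_size=64):
        self.stream = stream
        self.encoding = encoding
        self.chunk_size = chunk_size
        self.charbuffer = ""

    def readline(self):
        while "\n" not in self.charbuffer:
            data = self.stream.read(self.chunk_size)
            if not data:
                break
            # May raise for a bad byte that belongs to a later line.
            self.charbuffer += data.decode(self.encoding)
        line, sep, self.charbuffer = self.charbuffer.partition("\n")
        return line + sep


class DeferringLineReader(EagerLineReader):
    """Hypothetical variant: keep undecodable input in a byte buffer until
    the line that actually contains the bad byte is requested."""

    def __init__(self, stream, encoding, chunk_size=64):
        super().__init__(stream, encoding, chunk_size)
        self.bytebuffer = b""

    def readline(self):
        while "\n" not in self.charbuffer:
            data = self.bytebuffer + self.stream.read(self.chunk_size)
            self.bytebuffer = b""
            if not data:
                break
            try:
                self.charbuffer += data.decode(self.encoding)
            except UnicodeDecodeError as exc:
                # Decode only up to the error position.
                good = data[:exc.start].decode(self.encoding)
                if "\n" not in self.charbuffer + good:
                    # The error is in the line being read: report it now.
                    self.bytebuffer = data
                    raise
                # The error is in a later line: defer it.
                self.charbuffer += good
                self.bytebuffer = data[exc.start:]
        line, sep, self.charbuffer = self.charbuffer.partition("\n")
        return line + sep


def failing_line(reader):
    """Return the number of the readline() call that raises, or None."""
    lineno = 0
    while True:
        lineno += 1
        try:
            line = reader.readline()
        except UnicodeDecodeError:
            return lineno
        if not line:
            return None


if __name__ == "__main__":
    print(failing_line(EagerLineReader(io.BytesIO(SOURCE), "ascii")))      # 1
    print(failing_line(DeferringLineReader(io.BytesIO(SOURCE), "ascii")))  # 7
```

With the eager reader the error surfaces on the first readline() call, because the first 64-byte chunk already contains the bad byte. The deferring reader raises only on the call for the line that contains it, which is the behaviour the firstline argument is meant to guarantee.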