[ python-Bugs-1711800 ] SequenceMatcher bug with insert/delete block after "replace"
Bugs item #1711800, was opened at 2007-05-03 03:24
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1711800&group_id=5470

Please note that this message contains a full copy of the comment thread for this request, including the initial issue submission, not just the latest update.

Category: Python Library
Group: Python 2.6
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Christian Hammond (chipx86)
Assigned to: Nobody/Anonymous (nobody)
Summary: SequenceMatcher bug with insert/delete block after "replace"

Initial Comment:

difflib.SequenceMatcher fails to distinguish between a "replace" block and an "insert" or "delete" block when the insert/delete immediately follows a "replace". It lumps both changes together as one big "replace" block.

This happens because of how get_opcodes() works. get_opcodes() loops through the matching blocks, grouping them into tags and ranges. However, if a block of text is changed and new text is added immediately after it, get_opcodes() cannot see the boundary; all it knows is that the next matching block comes after the added text.

As an example, consider diffing this text:

    ABC

against this one:

    ABCD
    EFG.

Any diffing program will show that the first line was replaced and the second line was inserted. SequenceMatcher, however, reports just one "replace" and includes both lines in its range.

I've attached a testcase that reproduces this for both replace>insert and replace>delete blocks.
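The attached testcase is not included in this message, but a minimal sketch (Python 2.x, matching the interpreters discussed in this digest) reproduces the reported behaviour with the example lines above:

    import difflib

    a = ["ABC"]
    b = ["ABCD", "EFG."]

    sm = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        print tag, a[i1:i2], b[j1:j2]

    # A line-based diff tool would report a one-line replace followed by a
    # one-line insert; SequenceMatcher instead emits the single opcode
    #   replace ['ABC'] ['ABCD', 'EFG.']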
[ python-Bugs-1701389 ] utf-16 codec problems with multiple file append
Bugs item #1701389, was opened at 2007-04-16 18:05
Message generated for change (Comment added) made by iceberg4ever
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1701389&group_id=5470

Please note that this message contains a full copy of the comment thread for this request, including the initial issue submission, not just the latest update.

Category: Unicode
Group: Python 2.5
Status: Closed
Resolution: Remind
Priority: 5
Private: No
Submitted By: Iceberg Luo (iceberg4ever)
Assigned to: M.-A. Lemburg (lemburg)
Summary: utf-16 codec problems with multiple file append

Initial Comment:

This bug is similar to, but not exactly the same as, bug #215974
(http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail).

In my test, even multiple write() calls within one open()~close() lifespan do not cause the multiple-BOM phenomenon mentioned in bug #215974. Maybe that is because bug #215974 was somehow fixed during the past 7 years, although Lemburg classified it as WontFix.

However, if a file is appended to more than once via codecs.open('file.txt', 'a', 'utf-16'), the multiple BOMs do appear. At the same time, the claim in bug #215974 that "(extra unnecessary) BOM marks are removed from the input stream by the Python UTF-16 codec" is still untrue today, on Python 2.4.4 and Python 2.5.1c1 on Windows XP.

Iceberg

PS: I did not find the "File Upload" checkbox mentioned on this web page, so I think I'd better paste the code right here:

    import codecs, os

    filename = "test.utf-16"
    if os.path.exists(filename):
        os.unlink(filename)  # reset

    def myOpen():
        return codecs.open(filename, "a", 'UTF-16')

    def readThemBack():
        return list(codecs.open(filename, "r", 'UTF-16'))

    def clumsyPatch(raw):  # you can read it after your first run of this program
        for line in raw:
            if line[0] in (u'\ufffe', u'\ufeff'):  # get rid of the BOMs
                yield line[1:]
            else:
                yield line

    fout = myOpen()
    fout.write(u"ab\n")  # to simplify the problem, I only use ASCII chars here
    fout.write(u"cd\n")
    fout.close()
    print readThemBack()
    assert readThemBack() == [u'ab\n', u'cd\n']
    assert os.stat(filename).st_size == 14  # only one BOM in the file

    fout = myOpen()
    fout.write(u"ef\n")
    fout.write(u"gh\n")
    fout.close()
    print readThemBack()
    #print list(clumsyPatch(readThemBack()))  # later you can enable this fix
    assert readThemBack() == [u'ab\n', u'cd\n', u'ef\n', u'gh\n']  # fails here
    assert os.stat(filename).st_size == 26  # not to mention here: multi BOM appears

--

>Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-05-03 22:08

Logged In: YES
user_id=1770538
Originator: YES

The long-disputed ZWNBSP is deprecated nowadays (http://www.unicode.org/unicode/faq/utf_bom.html#24 suggests "U+2060 WORD JOINER" instead of ZWNBSP). However, I can understand that backwards compatibility is always a good concern, and that's why StreamReader seems reluctant to change.

In practice, a ZWNBSP inside a file is rarely intended (please also refer to the topic "Q: What should I do with U+FEFF in the middle of a file?" at the same URL above). IMHO, it is most likely caused by something like the multi-append file operation above.

Well, at the very least, the asymmetric "what you write is NOT what you get/read" effect between codecs.open(filename, 'a', 'UTF-16') and codecs.open(filename, 'r', 'UTF-16') is not elegant. Aiming at that asymmetry, I finally came up with a wrapper function for codecs.open(), which solves (or, you may say, bypasses) the problem well in my case. I'll post the code as an attachment.

BTW, even the official Python 2.4 documentation, chapter "7.3.2.1 Built-in Codecs", mentions that

    PyObject* PyUnicode_DecodeUTF16(const char *s, int size, const char *errors, int *byteorder)

"switches according to all byte order marks (BOM) it finds in the input data. BOMs are not copied into the resulting Unicode string". I don't know whether that is the BOM-less decoder we have talked about for a long time. //shrug

I hope the information above can serve as some kind of recipe for those who encounter the same problem. That's it. Thanks for your patience.

Best regards,
Iceberg

File Added: _codecs.py

--

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-23 18:56

Logged In: YES
user_id=89016
Originator: NO

But BOMs *may* appear in normal content: there their meaning is that of ZERO WIDTH NO-BREAK SPACE (see http://docs.python.org/lib/encodings-overview.html for more info).

--

Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-04-20 11:39

Logged In: YES
user_id=1770538
Originator: YES

If such a bug is to be fixed, either StreamWriter or StreamReader should do something. I can understand Doerwalter that it is somewhat
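The _codecs.py attachment itself is not reproduced in this archive. A minimal sketch of the kind of wrapper the reporter describes might look like the following; the function name open_utf16_append is hypothetical, and this is a guess at the approach, not the actual attachment:

    import codecs, os

    def open_utf16_append(filename):
        # Sketch only: when the file already starts with a BOM, append using
        # the BOM-less variant codec for that byte order, so that no second
        # BOM is ever written into the middle of the file.
        if os.path.exists(filename) and os.path.getsize(filename) >= 2:
            f = open(filename, 'rb')
            bom = f.read(2)
            f.close()
            if bom == codecs.BOM_UTF16_LE:
                return codecs.open(filename, 'a', 'utf-16-le')
            if bom == codecs.BOM_UTF16_BE:
                return codecs.open(filename, 'a', 'utf-16-be')
        # New or empty file: let the UTF-16 codec write the initial BOM.
        return codecs.open(filename, 'a', 'utf-16')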
[ python-Bugs-1701389 ] utf-16 codec problems with multiple file append
Bugs item #1701389, was opened at 2007-04-16 12:05
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1701389&group_id=5470

Category: Unicode
Group: Python 2.5
Status: Closed
Resolution: Remind
Priority: 5
Private: No
Submitted By: Iceberg Luo (iceberg4ever)
Assigned to: M.-A. Lemburg (lemburg)
Summary: utf-16 codec problems with multiple file append

>Comment By: Walter Dörwald (doerwalter)
Date: 2007-05-03 17:03

Logged In: YES
user_id=89016
Originator: NO

> BTW, even the official Python 2.4 documentation, chapter "7.3.2.1
> Built-in Codecs", mentions that PyUnicode_DecodeUTF16(const char *s,
> int size, const char *errors, int *byteorder) "switches according to
> all byte order marks (BOM) it finds in the input data. BOMs are not
> copied into the resulting Unicode string". I don't know whether that
> is the BOM-less decoder we have talked about for a long time.

This seems to be wrong. Looking at the source code (Objects/unicodeobject.c) reveals that only the first BOM is skipped.
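Walter's observation can be confirmed from Python itself, without reading the C source. A small sketch in the same Python 2.x idiom as the reproduction script above:

    # The decoder strips only the leading BOM; a second BOM inside the byte
    # stream survives as U+FEFF (ZERO WIDTH NO-BREAK SPACE) in the result.
    data = u'ab\n'.encode('utf-16') + u'cd\n'.encode('utf-16')
    print repr(data.decode('utf-16'))
    # prints u'ab\n\ufeffcd\n', not the BOM-less u'ab\ncd\n'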
[ python-Bugs-1701389 ] utf-16 codec problems with multiple file append
Bugs item #1701389, was opened at 2007-04-16 12:05
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1701389&group_id=5470

Category: Unicode
Group: Python 2.5
Status: Closed
Resolution: Remind
Priority: 5
Private: No
Submitted By: Iceberg Luo (iceberg4ever)
Assigned to: M.-A. Lemburg (lemburg)
Summary: utf-16 codec problems with multiple file append

>Comment By: Walter Dörwald (doerwalter)
Date: 2007-05-03 19:12

Logged In: YES
user_id=89016
Originator: NO

OK, I've updated the documentation (r55094, r55095).
[ python-Bugs-1712236 ] __getslice__ changes integer arguments
Bugs item #1712236, was opened at 2007-05-03 21:20
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712236&group_id=5470

Please note that this message contains a full copy of the comment thread for this request, including the initial issue submission, not just the latest update.

Category: Python Interpreter Core
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Imri Goldberg (lorgandon)
Assigned to: Nobody/Anonymous (nobody)
Summary: __getslice__ changes integer arguments

Initial Comment:

When slicing syntax is used on a sequence object with a user-defined __getslice__ method, the arguments passed to __getslice__ are changed. This does not happen when __getslice__ is called directly. Attached is some code that demonstrates the problem.

I checked this on various versions, including my "Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)" on my Ubuntu machine. Although __getslice__ is deprecated, the method is still in use, and a fix would be useful.
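The attached demonstration code is not included in this message. A minimal sketch of the reported behaviour (Python 2.x only, since __getslice__ does not exist in Python 3; the Seq class here is illustrative, not the attached testcase):

    import sys

    class Seq(object):
        def __len__(self):
            return 10
        def __getslice__(self, i, j):
            print '__getslice__(%r, %r)' % (i, j)
            return []

    s = Seq()
    s[-1:]                          # slice syntax: the interpreter shifts the
                                    # negative index by len(s) and fills in a
                                    # huge default stop, printing something
                                    # like __getslice__(9, 2147483647)
    s.__getslice__(-1, sys.maxint)  # direct call: arguments arrive unchanged,
                                    # printing __getslice__(-1, sys.maxint)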
[ python-Bugs-1712236 ] __getslice__ changes integer arguments
Bugs item #1712236, was opened at 2007-05-03 21:20
Message generated for change (Comment added) made by lorgandon
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712236&group_id=5470

Category: Python Interpreter Core
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Imri Goldberg (lorgandon)
Assigned to: Nobody/Anonymous (nobody)
Summary: __getslice__ changes integer arguments

>Comment By: Imri Goldberg (lorgandon)
Date: 2007-05-03 21:23

Logged In: YES
user_id=1715564
Originator: YES

This also seems to be the cause of bug "[ 908441 ] default index for __getslice__ is not sys.maxint".
[ python-Bugs-1712419 ] Cannot use dict with unicode keys as keyword arguments
Bugs item #1712419, was opened at 2007-05-04 00:49
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712419&group_id=5470

Please note that this message contains a full copy of the comment thread for this request, including the initial issue submission, not just the latest update.

Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Viktor Ferenczi (complex)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Cannot use dict with unicode keys as keyword arguments

Initial Comment:

Unicode strings cannot be used as keys in dictionaries passed to a function as keyword arguments. For example:

    Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> def fn(**kw):
    ...     print repr(kw)
    ...
    >>> fn(**{u'x':1})
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: fn() keywords must be strings
    >>>

Unicode strings should be converted to str automatically, using the site-specific default encoding or some similar solution.

This bug causes problems when decoding dictionaries from data formats such as XML or JSON, since modules such as simplejson usually return unicode strings. Manually encoding each key from unicode to str costs performance if Python cannot do it automatically.
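Until such an automatic conversion exists, a caller-side workaround sketch is to re-key the dict with byte-string keys before ** unpacking. The helper below is illustrative only (not part of any library) and assumes the keys are plain ASCII identifiers:

    def fn(**kw):
        print repr(kw)

    def as_str_keys(d):
        # Convert each unicode key to str so ** unpacking accepts it.
        return dict((str(k), v) for k, v in d.iteritems())

    fn(**as_str_keys({u'x': 1}))   # prints {'x': 1}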
[ python-Feature Requests-1712419 ] Cannot use dict with unicode keys as keyword arguments
Feature Requests item #1712419, was opened at 2007-05-03 22:49
Message generated for change (Comment added) made by gbrandl
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1712419&group_id=5470

>Category: Unicode
>Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Viktor Ferenczi (complex)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Cannot use dict with unicode keys as keyword arguments

>Comment By: Georg Brandl (gbrandl)
Date: 2007-05-04 04:10

Logged In: YES
user_id=849994
Originator: NO

In any case, this is a feature request.
[ python-Bugs-1712522 ] urllib.quote throws exception on Unicode URL
Bugs item #1712522, was opened at 2007-05-04 06:11
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712522&group_id=5470

Please note that this message contains a full copy of the comment thread for this request, including the initial issue submission, not just the latest update.

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.quote throws exception on Unicode URL

Initial Comment:

The code in urllib.quote fails on Unicode input when called by robotparser with a Unicode URL:

    Traceback (most recent call last):
      File "./sitetruth/InfoSitePage.py", line 415, in run
        pagetree = self.httpfetch()  # fetch page
      File "./sitetruth/InfoSitePage.py", line 368, in httpfetch
        if not self.owner().checkrobotaccess(self.requestedurl):  # if access disallowed by robots.txt file
      File "./sitetruth/InfoSiteContent.py", line 446, in checkrobotaccess
        return(self.robotcheck.can_fetch(config.kuseragent, url))  # return can fetch
      File "/usr/local/lib/python2.5/robotparser.py", line 159, in can_fetch
        url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
      File "/usr/local/lib/python2.5/urllib.py", line 1197, in quote
        res = map(safe_map.__getitem__, s)
    KeyError: u'\xe2'

That bit of code needs some attention:

- It still assumes character values go up only to 255, which hasn't been true in Python for a while now.
- The initialization may not be thread-safe; a table is being initialized on first use.

"robotparser" was trying to check whether a URL containing a Unicode character was allowed. Note the "KeyError: u'\xe2'".
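A common caller-side workaround sketch (a usage suggestion, not a fix to urllib or robotparser themselves) is to percent-encode the UTF-8 bytes of the path rather than passing a unicode object, since quote's internal safe_map is keyed by 8-bit characters only:

    import urllib

    path = u'/caf\xe9'                        # path with a non-ASCII character
    print urllib.quote(path.encode('utf-8'))  # prints /caf%C3%A9 instead of
                                              # raising KeyError: u'\xe9'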