[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread STINNER Victor
STINNER Victor added the comment: "This fix is part of Python 2.7.2, but not of 2.7.2." ... but not of 2.7.1. -- ___ Python tracker ___

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread STINNER Victor
STINNER Victor added the comment: "This new data does not crash Python 2.7.2, so I assume the issue has been fixed." Yes, the bug was already fixed in branch 2.7 by the SVN commit r87541: changeset: 67185:54f1d5651555 branch: 2.7 parent: 67159:2d09af4c137c user:Alexander B

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: This new data does not crash Python 2.7.2, so I assume the issue has been fixed. Re-closing. -- status: open -> closed

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Alexander Belopolsky
Changes by Alexander Belopolsky: -- status: closed -> open

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Victor Ruiz
Victor Ruiz added the comment: Hi, I think I've come across what seems to be another flavor of this issue. The following string will cause a crash in some interpreters. text = u"""\u062d\u064e\u064a\u0651\u064b\u0627\u060c\u0648\u064e\u064a\u064e\u062d\u0650\u0642\u0651\u064e \u0627\u0644\

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-28 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: Committed backports: r87540 (3.1) r87541 (2.7) r87546 (2.6) -- resolution: -> fixed stage: commit review -> committed/rejected status: open -> closed versions: +Python 3.2

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-22 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: Committed to py3k in revision 87442. -- versions: -Python 3.2

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-21 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: In the new patch, issue10254b.diff, I've added a test that would crash unpatched code: >>> unicodedata.normalize('NFC', 'C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸Ç') Segmentation fault Martin, I still feel uneasy about the fixed size of the skipped buffe
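The crash input quoted above can be written with escapes so it survives copying. This is a minimal sketch, assuming the pasted string is twenty copies of 'C' + U+0338 followed by 'C' + U+0327; `crash_candidate` is a hypothetical helper and this is not necessarily the exact test in issue10254b.diff.

```python
import unicodedata

def crash_candidate(repeats=20):
    # Assumption: each "C with stroke" in the message is 'C' + U+0338
    # (COMBINING LONG SOLIDUS OVERLAY), and the final character is the
    # decomposed form 'C' + U+0327 (COMBINING CEDILLA).
    return 'C\u0338' * repeats + 'C\u0327'

text = crash_candidate()
result = unicodedata.normalize('NFC', text)

# On a patched interpreter only the last pair should compose (to U+00C7);
# 'C' + U+0338 has no precomposed form and stays as two code points.
expected = 'C\u0338' * 20 + '\u00c7'
print(result == expected)
```

On an unpatched build the call to `normalize` may not return at all, so it is safer to run this in a throwaway interpreter.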

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-20 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: On Mon, Dec 20, 2010 at 2:50 PM, Alexander Belopolsky wrote: .. > Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so > before > I commit, I would like to figure out the test case that would properly > exercise this code. > Aft

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-20 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: Attached patch, issue10254a.diff, adds the OP's cases to test_unicodedata and changes the code as I suggested in msg124173 because ISTM that comb >= comb1 matches the pr-29 definition: """ D2'. In any character sequence beginning with a starter S, a cha
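The comb >= comb1 comparison can be illustrated in pure Python. This is a sketch under the assumption that D2' reads as I understand it from PR-29: a character C is blocked from the starter S when some character B between them is a starter or has a combining class at least as high as C's. `is_blocked` is a hypothetical helper, not code from the patch.

```python
import unicodedata

def is_blocked(sequence, c_index):
    """Return True if sequence[c_index] is blocked from the starter sequence[0].

    Assumes sequence[0] is a starter (combining class 0).
    """
    ccc_c = unicodedata.combining(sequence[c_index])
    for between in sequence[1:c_index]:
        ccc_b = unicodedata.combining(between)
        # B blocks C if B is a starter, or if B's class is the same or higher.
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False

# U+030D and U+0301 both have class 230, so the acute is blocked.
print(is_blocked('i\u030d\u0301', 2))
# U+0324 has class 220, lower than 230, so the acute is not blocked.
print(is_blocked('u\u0324\u0301', 2))
```

With an `==` comparison instead of `>=`, a mark of higher class sitting between S and C would not count as blocking, which is the difference the message is probing.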

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Alexander Belopolsky
Alexander Belopolsky added the comment:
On Fri, Dec 17, 2010 at 2:08 PM, Martin v. Löwis wrote:
..
>> As far as I (and a two-line script) can tell the maximum length of a canonical decomposition of a character is 4.
> Even better - so allowing for 20 characters should be safe.
I don't dis
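The "two-line script" itself is not shown in the thread. A sketch of one way to measure the same thing, assuming the full NFD form of each code point is what counts as its canonical decomposition; the function name is hypothetical.

```python
import sys
import unicodedata

def max_canonical_decomposition_length():
    longest = 0
    for code_point in range(sys.maxunicode + 1):
        # Skip surrogates; they are not characters and never decompose.
        if 0xD800 <= code_point <= 0xDFFF:
            continue
        decomposed = unicodedata.normalize('NFD', chr(code_point))
        longest = max(longest, len(decomposed))
    return longest

longest = max_canonical_decomposition_length()
print(longest)

# U+1F82 is one character with a long canonical decomposition.
print(len(unicodedata.normalize('NFD', '\u1f82')))
```

The result depends on the Unicode database bundled with the interpreter, so it is worth re-running after an upgrade instead of hard-coding the number.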

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis added the comment:
> The C forms (NFC and NFKC) do canonical composition and U+FDFA is a compatibility composite. (BTW, makeunicodedata.py checks that maximum decomposed length of a character is < 19, but it would be better if it would compute and define a named constant, s
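The point about U+FDFA can be checked directly. This is a small sketch using only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

ligature = '\ufdfa'

# A decomposition mapping that starts with a <tag> is a compatibility
# mapping, so it is applied by NFKD/NFKC but not by NFD/NFC.
mapping = unicodedata.decomposition(ligature)
print(mapping.split()[0])

compat = unicodedata.normalize('NFKD', ligature)
canonical = unicodedata.normalize('NFD', ligature)
print(len(compat), len(canonical))
```

Because the canonical forms leave the ligature alone, its long expansion bounds decomposition length under NFKD, not the number of characters NFC can fold into one starter.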

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: On Fri, Dec 17, 2010 at 3:47 AM, Martin v. Löwis wrote: .. > The worst case (wrt. cskipped) is the maximum number of characters that > can get combined into a single base character. It used to be (and I > hope still is) 20 (decomposition of U+FDFA). > Th

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis added the comment:
> Passing Part3 tests and not crashing on crash.py is probably good enough for a commit, but I don't have a proof that length 20 skipped buffer is always enough.
I would agree with that. I still didn't have time to fully review the patch, but assuming it f

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis added the comment:
> The logic suggested by Martin in msg120018 looks right to me, but the whole code seems to be unnecessarily complex. (And comb1==comb may need to be changed to comb1>=comb.) I don't understand why linear search through "skipped" array is needed. At the

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis added the comment:
> So lacking a new patch, I think we should revert the existing change for now.
Oops, I missed that Alexander has proposed a patch.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis added the comment:
On 17.12.2010 01:56, STINNER Victor wrote:
> STINNER Victor added the comment:
> "Ooops", sorry. I just applied the patch suggested by Marc-Andre Lemburg in msg22885 (#1054943). As the patch worked for the examples given in Unicode PRI 29 and the t

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: Attached patch, issue10254.diff, is essentially Martin's code from msg120018 and Part3 tests from NormalizationTest.txt. Since this bug exposes a buffer overflow condition, I think it qualifies as a security issue, so I am adding 2.6 to versions. Passi

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: The logic suggested by Martin in msg120018 looks right to me, but the whole code seems to be unnecessarily complex. (And comb1==comb may need to be changed to comb1>=comb.) I don't understand why linear search through "skipped" array is needed. At the

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread STINNER Victor
STINNER Victor added the comment: "Ooops", sorry. I just applied the patch suggested by Marc-Andre Lemburg in msg22885 (#1054943). As the patch worked for the examples given in Unicode PRI 29 and the test suite passed, it was enough for me. I don't understand the normalization code, so I don'

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky
Alexander Belopolsky added the comment: Adding an assert as shown in the diff below, makes it easy to reproduce the crash in py3k branch: $ ./python.exe crash.py Assertion failed: (cskipped < 20), function nfc_nfkc, file Modules/unicodedata.c, line 714. Abort trap I am attaching jhalcrow's

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Antoine Pitrou
Antoine Pitrou added the comment: After a bit of debugging, the crash is due to the "skipped" array being overflowed in nfc_nfkc() in unicodedata.c. "cskipped" goes up to 21 while the array only has 20 entries. This happens in all branches (but only crashes in 2.7 right now for probably unimp

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Antoine Pitrou
Antoine Pitrou added the comment: I can reproduce the crash under 2.7, but not 2.6 or 3.x here. So it might be a separate issue.
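When comparing several interpreters like this, it can help to run the normalization in a child process so a segfault does not take down the session. This is a sketch only; `probe_normalize` is a hypothetical helper, and the sample string is an assumed reconstruction of the 'C' + U+0338 crash case, not the string from the report.

```python
import subprocess
import sys

def probe_normalize(text, python=sys.executable):
    # ascii() yields an ASCII-only literal; the u prefix keeps it a
    # unicode literal on 2.x and is accepted again from 3.3 onwards.
    code = "import unicodedata; unicodedata.normalize('NFC', u%s)" % ascii(text)
    proc = subprocess.run([python, '-c', code])
    # 0 means the child exited normally; on POSIX a negative value
    # means it was killed by a signal (for example SIGSEGV).
    return proc.returncode

status = probe_normalize('C\u0338' * 20 + 'C\u0327')
print(status)
```

Passing the path of another build as `python` lets the same input be tried against 2.6, 2.7 and 3.x in turn.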

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Jonathan Halcrow
Jonathan Halcrow added the comment: I think I've come across a related problem. I am experiencing a segfault when NFC-normalizing a certain string [1]. The crash occurs with 2.7.1 in OS X (built from source with homebrew). Here is the backtrace: #0 0x0025a96e in _PyUnicode_Resize () #1 0

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-31 Thread Ezio Melotti
Changes by Ezio Melotti: -- nosy: +ezio.melotti

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-31 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis: -- nosy: +Arfrever

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread R. David Murray
Changes by R. David Murray: -- nosy: +barry

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Martin v . Löwis
Martin v. Löwis added the comment:
>> It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore.
> Why not? It looks a lot like a security fix.
Indeed, you could argue that. It's up to the 2.6 release manager, I guess.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment:
Martin v. Löwis wrote:
> It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore.
Why not? It looks a lot like a security fix. -- nosy: +lemburg

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Martin v . Löwis
Martin v. Löwis added the comment: The change from issue1054943 is indeed bogus. As written, the code will happily run over starters, even though a blocked start means that subsequent characters can't possibly be combinable. That way, the code manages to combine, in 'Li\u030dt-s\u1e73\u0301',
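The combining classes involved in that example can be listed with the standard `unicodedata` module. This is a sketch for inspection only; it does not reproduce the faulty output, which is elided above.

```python
import unicodedata

text = 'Li\u030dt-s\u1e73\u0301'

# Print each code point of the decomposed form with its canonical
# combining class; class 0 marks a starter.
for ch in unicodedata.normalize('NFD', text):
    print('U+%04X ccc=%d' % (ord(ch), unicodedata.combining(ch)))

# On a fixed interpreter the string is already in NFC, so
# normalization should return it unchanged.
unchanged = unicodedata.normalize('NFC', text) == text
print(unchanged)
```

The starters 't', '-' and 's' sit between the 'i' and the trailing U+0301, which is why nothing after them should be considered for composition with the 'i'.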

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Antoine Pitrou
Antoine Pitrou added the comment: Confirmed on Python 3.2. -- nosy: +haypo, loewis, pitrou versions: +Python 3.2

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Merlijn van Deen
Merlijn van Deen added the comment: Please note: The bug might very well be present in python 3.2 and 3.3. However, I do not have these versions installed, so I cannot confirm this.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Merlijn van Deen
New submission from Merlijn van Deen: Summary: Somewhere between 2.6.5 r79063 and 3.1 r79147 a regression in the Unicode NFC normalization has been introduced. This regression leads to bot edit wars on Wikipedia [1]. It is reproducible with a simple script [2]. Mediawiki/PHP [3] and C# [4] te
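A regression of this kind can be caught with a stability check on strings that are already in NFC. This is a sketch and not the script linked at [2]; `nfc_is_stable` is a hypothetical helper and the two samples are assumed to be the sequences used as examples in Unicode PR-29.

```python
import unicodedata

def nfc_is_stable(text):
    once = unicodedata.normalize('NFC', text)
    twice = unicodedata.normalize('NFC', once)
    via_nfd = unicodedata.normalize('NFC', unicodedata.normalize('NFD', text))
    # NFC must be idempotent and must agree with decompose-then-compose.
    return once == twice == via_nfd

samples = [
    '\u0b47\u0300\u0b3e',  # assumed PR-29 example: Oriya vowel signs split by U+0300
    '\u1100\u0300\u1161',  # assumed PR-29 example: Hangul jamo split by U+0300
]
results = {sample: nfc_is_stable(sample) for sample in samples}
print(all(results.values()))
```

Two tools that disagree about NFC will keep rewriting each other's output, which is how a bug like this can turn into an edit war between bots.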