Bugs item #1693050, was opened at 2007-04-02 17:27
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1693050&group_id=5470
Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

>Category: Regular Expressions
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: nlmiles (nathanlmiles)
>Assigned to: Nobody/Anonymous (nobody)
Summary: \w not helpful for non-Roman scripts

Initial Comment:
When I try to use r"\w+(?u)" to find words in a unicode Devanagari text,
bad things happen. Words get chopped into small pieces. I think this is
likely because vowel signs such as 093e are not considered to match \w.

I think that if you wish \w to be useful for Indic scripts, \w will need
to be expanded to include unicode character categories Mc, Mn, Me.

I am using Python 2.4.4 on Windows XP SP2.

I ran the following script to see the characters which I think ought to
match \w but don't:

import re
import unicodedata

text = ""
for i in range(0x901,0x939): text += unichr(i)
for i in range(0x93c,0x93d): text += unichr(i)
for i in range(0x93e,0x94d): text += unichr(i)
for i in range(0x950,0x954): text += unichr(i)
for i in range(0x958,0x963): text += unichr(i)

parts = re.findall("\W(?u)", text)
for ch in parts:
    print "%04x" % ord(ch), unicodedata.category(ch)

The odd character here is 0904. Its categorization seems to imply that
you are using the Unicode 3.0 database, but perhaps later versions of
Python are using the current 5.0 database.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2007-04-02 17:38

Message:
Logged In: YES
user_id=38388
Originator: NO

Python 2.4 is using Unicode 3.2. Python 2.5 ships with Unicode 4.1. We're
likely to ship Unicode 5.x with Python 2.6 or a later release.

Regarding the char classes: I don't think Mc, Mn and Me should be
considered parts of a word. Those are marks which usually separate words.
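
For anyone repeating the reporter's check on a newer interpreter, here is
a minimal Python 3 sketch. It is not part of the original report. The
helper names unmatched_marks and find_words are hypothetical, and the
Devanagari range U+0900..U+097F is an assumption taken from the Unicode
block chart. The first helper lists the Devanagari marks that a pattern
fails to match. The second is only one possible workaround: it splits
text by hand and counts Mn, Mc and Me as word characters, as the report
proposes.

```python
import re
import unicodedata

# Assumption: the Devanagari block is U+0900..U+097F (Unicode block chart).
DEVANAGARI = range(0x0900, 0x0980)
# The three mark categories named in the report.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def unmatched_marks(pattern=r"\w"):
    """List (code point, category) for Devanagari marks `pattern` does not match."""
    word_char = re.compile(pattern)
    missing = []
    for cp in DEVANAGARI:
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category in MARK_CATEGORIES and not word_char.fullmatch(ch):
            missing.append((cp, category))
    return missing


def find_words(text):
    """Hypothetical workaround: split text, keeping marks attached to words."""
    words, current = [], []
    for ch in text:
        is_mark = unicodedata.category(ch) in MARK_CATEGORIES
        if ch == "_" or ch.isalnum() or is_mark:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    # Same output format as the reporter's script: code point, then category.
    for cp, category in unmatched_marks():
        print("%04x %s" % (cp, category))
    print(find_words("हिन्दी भाषा"))
```

Which code points the first loop prints depends on the Unicode database
shipped with the interpreter, so the list is best read as a diagnostic
rather than a fixed result.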
----------------------------------------------------------------------