[ python-Bugs-1693050 ] \w not helpful for non-Roman scripts

SourceForge.net Mon, 02 Apr 2007 08:27:18 -0700

Bugs item #1693050, was opened at 2007-04-02 09:27
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1693050&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: nlmiles (nathanlmiles)
Assigned to: M.-A. Lemburg (lemburg)
Summary: \w not helpful for non-Roman scripts

Initial Comment:
When I use r"\w+(?u)" to find words in Unicode Devanagari text, the words
get chopped into small pieces. I think this is likely because vowel signs
such as U+093E are not considered to match \w.

I think that for \w to be useful for Indic scripts, it will need to be
expanded to include the Unicode character categories Mc, Mn, and Me.
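
As a rough sketch of that proposal, the word splitter below treats the mark
categories Mn, Mc, and Me as word characters alongside letters, numbers, and
the underscore. It is written for Python 3 rather than the Python 2.4 used in
this report, the names is_word_char and find_words are hypothetical, and the
category test only approximates what \w does inside the re module.

```python
import unicodedata

def is_word_char(ch):
    """Return True for letters, numbers, underscore, and combining marks."""
    cat = unicodedata.category(ch)
    # Assumption: "L*" and "N*" approximate what \w already accepts;
    # Mn, Mc, and Me are the categories this report proposes to add.
    return ch == "_" or cat[0] in ("L", "N") or cat in ("Mn", "Mc", "Me")

def find_words(text):
    """Split text into maximal runs of word characters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Two Devanagari words separated by a space; both contain vowel signs.
sample = "\u0939\u093f\u0928\u094d\u0926\u0940 \u092d\u093e\u0937\u093e"
print(find_words(sample))
```

With the marks included, each word stays in one piece instead of breaking at
every vowel sign or virama.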

I am using Python 2.4.4 on Windows XP SP2.

I ran the following script to list the characters which I think ought to
match \w but don't:

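import re
import unicodedata

# Build a unicode string of Devanagari code points.
# Note that range() excludes its upper bound.
text = u""
for i in range(0x901, 0x939): text += unichr(i)
for i in range(0x93c, 0x93d): text += unichr(i)
for i in range(0x93e, 0x94d): text += unichr(i)
for i in range(0x950, 0x954): text += unichr(i)
for i in range(0x958, 0x963): text += unichr(i)

# Print every character that \W matches in Unicode mode (i.e. that \w rejects),
# along with its Unicode general category.
parts = re.findall(ur"\W", text, re.UNICODE)
for ch in parts:
    print "%04x" % ord(ch), unicodedata.category(ch)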

The odd character here is 0904. Its categorization seems to imply that you are
using the Unicode 3.0 database, but perhaps later versions of Python are using
the current 5.0 database.
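
One way to check which database a given interpreter ships is to read
unicodedata.unidata_version and look up the category of 0904 directly. The
snippet below is a minimal sketch in Python 3 syntax; the helper name
category_of is hypothetical. A database that predates the assignment of this
code point would be expected to report it as unassigned (Cn).

```python
import unicodedata

def category_of(code_point):
    """Return the Unicode general category for an integer code point."""
    return unicodedata.category(chr(code_point))

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

# Category of U+0904 according to that database.
print("%04x" % 0x0904, category_of(0x0904))
```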

----------------------------------------------------------------------
