Re: Unicode regex and Hindi language

2008-11-29 Thread Martin v. Löwis
John Machin wrote: > On Nov 30, 4:33 am, Terry Reedy <[EMAIL PROTECTED]> wrote: >> Martin v. Löwis wrote: >>> To be fair to Python (and SRE), > > I was being unfair? No - sorry if I gave that impression. Regards, Martin

Re: Unicode regex and Hindi language

2008-11-29 Thread Terry Reedy
John Machin wrote: John, nothing I wrote was directed at you. If you feel insulted, you have my apology. My intention was and is to get future movement on an issue that was reported 20 months ago but which has lain dead since, until re-reported (a bit more clearly) a week ago, because of a

Re: Unicode regex and Hindi language

2008-11-29 Thread John Machin
On Nov 30, 4:33 am, Terry Reedy <[EMAIL PROTECTED]> wrote: > Martin v. Löwis wrote: > > To be fair to Python (and SRE), I was being unfair? In the context, "bug" == "needs to be changed"; see below. > SRE predates TR#18 (IIRC) - at least > > annex C was added somewhere between revision 6 and 9, i.

Re: Unicode regex and Hindi language

2008-11-29 Thread Terry Reedy
MRAB wrote: Terry Reedy wrote: I notice from the manual "All identifiers are converted into the normal form NFC while parsing; comparison of identifiers is based on NFC." If NFC used accented letters, then the issue is finessed away for European words simply because Unicode includes include

Re: Unicode regex and Hindi language

2008-11-29 Thread MRAB
Terry Reedy wrote: Martin v. Löwis wrote: To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least annex C was added somewhere between revision 6 and 9, i.e. in early 2004. Python's current definition of \w is a straightforward extension of the historical \w definition (of Perl, I b

Re: Unicode regex and Hindi language

2008-11-29 Thread Terry Reedy
Martin v. Löwis wrote: To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least annex C was added somewhere between revision 6 and 9, i.e. in early 2004. Python's current definition of \w is a straightforward extension of the historical \w definition (of Perl, I believe), which, unfo

Re: Unicode regex and Hindi language

2008-11-29 Thread Martin v. Löwis
> Huh? I thought it was settled. Read Terry Reedy's latest message. Read > the bug report it points to (http://bugs.python.org/issue1693050), > especially the contribution from MvL. To paraphrase a remark by the > timbot, Martin reads Unicode tech reports so that we don't have to. > However if you

Re: Unicode regex and Hindi language

2008-11-29 Thread John Machin
On Nov 29, 10:51 am, MRAB <[EMAIL PROTECTED]> wrote: > John Machin wrote: > > On Nov 29, 2:47 am, Shiao <[EMAIL PROTECTED]> wrote: > >> The regex below identifies words in all languages I tested, but not in > >> Hindi: > > >> pat = re.compile('^(\w+)$', re.U) > >> ... > >>    m = pat.search(l.decod

Re: Unicode regex and Hindi language

2008-11-28 Thread MRAB
John Machin wrote: On Nov 29, 2:47 am, Shiao <[EMAIL PROTECTED]> wrote: The regex below identifies words in all languages I tested, but not in Hindi: pat = re.compile('^(\w+)$', re.U) ... m = pat.search(l.decode('utf-8')) [example snipped] From this is assumed that the Hindi text contain

Re: Unicode regex and Hindi language

2008-11-28 Thread John Machin
On Nov 29, 2:47 am, Shiao <[EMAIL PROTECTED]> wrote: > The regex below identifies words in all languages I tested, but not in > Hindi: > pat = re.compile('^(\w+)$', re.U) > ... > m = pat.search(l.decode('utf-8')) [example snipped] > > From this is assumed that the Hindi text contains punctuatio

Re: Unicode regex and Hindi language

2008-11-28 Thread Terry Reedy
MRAB wrote: Should the Mc and Mn codepoints match \w in the re module even though u'हिन्दी'.isalpha() returns False (in Python 2.x, haven't tried Python 3.x)? Same. And to me, that is wrong. The condensation of vowel characters (which Hindi, etc, also have for words that begin with vowels)
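
To make the Mc/Mn point concrete, here is a small sketch that lists the Unicode general category of each code point in the Hindi word from the thread. It assumes Python 3 (the thread itself uses Python 2.x) and uses only the standard unicodedata module; the variable names are illustrative, not taken from any message.

```python
import unicodedata

# The Hindi word quoted in the thread, written with escapes so that
# each code point is visible (assumption: Python 3 str literals).
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

# General category per code point: L* means letter, M* means combining mark.
categories = [unicodedata.category(ch) for ch in hindi]
for ch, cat in zip(hindi, categories):
    print("U+%04X" % ord(ch), cat, unicodedata.name(ch))

# str.isalpha() requires every character to be a letter, so any
# combining mark in the word makes the whole test fail.
print(hindi.isalpha())
```

Every other code point in the word is a vowel sign or the virama rather than a letter, which is why a per-character "is this a letter" test rejects the word as a whole.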

Re: Unicode regex and Hindi language

2008-11-28 Thread MRAB
Terry Reedy wrote: Jerry Hill wrote: On Fri, Nov 28, 2008 at 10:47 AM, Shiao <[EMAIL PROTECTED]> wrote: The regex below identifies words in all languages I tested, but not in Hindi: # -*- coding: utf-8 -*- import re pat = re.compile('^(\w+)$', re.U) langs = ('English', '中文', 'हिन्दी') I thi

Re: Unicode regex and Hindi language

2008-11-28 Thread Terry Reedy
Jerry Hill wrote: On Fri, Nov 28, 2008 at 10:47 AM, Shiao <[EMAIL PROTECTED]> wrote: The regex below identifies words in all languages I tested, but not in Hindi: # -*- coding: utf-8 -*- import re pat = re.compile('^(\w+)$', re.U) langs = ('English', '中文', 'हिन्दी') I think the problem is th

Re: Unicode regex and Hindi language

2008-11-28 Thread Jerry Hill
On Fri, Nov 28, 2008 at 10:47 AM, Shiao <[EMAIL PROTECTED]> wrote: > The regex below identifies words in all languages I tested, but not in > Hindi: > > # -*- coding: utf-8 -*- > > import re > pat = re.compile('^(\w+)$', re.U) > langs = ('English', '中文', 'हिन्दी') I think the problem is that the H

Re: Unicode regex and Hindi language

2008-11-28 Thread Peter Otten
Shiao wrote: > The regex below identifies words in all languages I tested, but not in > Hindi: > > # -*- coding: utf-8 -*- > > import re > pat = re.compile('^(\w+)$', re.U) > langs = ('English', '中文', 'हिन्दी') > > for l in langs: > m = pat.search(l.decode('utf-8')) > print l, m and m.g

Unicode regex and Hindi language

2008-11-28 Thread Shiao
The regex below identifies words in all languages I tested, but not in Hindi: # -*- coding: utf-8 -*- import re pat = re.compile('^(\w+)$', re.U) langs = ('English', '中文', 'हिन्दी') for l in langs: m = pat.search(l.decode('utf-8')) print l, m and m.group(1) Output: English English 中文 中
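
One possible workaround, sketched under assumptions: rather than relying on \w, test each code point's general category and accept letters, combining marks, and numbers. The helper name is_word is hypothetical, the code assumes Python 3, and it is not the change discussed in the bug report; it accepts a somewhat broader set of characters than \w does.

```python
import unicodedata

def is_word(text):
    """Hypothetical helper: True if text is non-empty and every code point
    is a letter (L*), combining mark (M*), number (N*), or underscore."""
    return bool(text) and all(
        ch == "_" or unicodedata.category(ch)[0] in "LMN" for ch in text
    )

# The three sample words from the original post; the non-ASCII ones are
# written with escapes so the code points are explicit.
langs = ("English", "\u4e2d\u6587", "\u0939\u093f\u0928\u094d\u0926\u0940")
for lang in langs:
    print(lang, is_word(lang))
```

Treating marks as word characters is what lets the Devanagari vowel signs and virama pass, at the cost of also accepting a string made only of combining marks.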