Shiao wrote:

> The regex below identifies words in all languages I tested, but not in
> Hindi:
>
> # -*- coding: utf-8 -*-
>
> import re
> pat = re.compile('^(\w+)$', re.U)
> langs = ('English', '中文', 'हिन्दी')
>
> for l in langs:
>     m = pat.search(l.decode('utf-8'))
>     print l, m and m.group(1)
>
> Output:
>
> English English
> 中文 中文
> हिन्दी None
>
> From this is assumed that the Hindi text contains punctuation or other
> characters that prevent the word match. Now, even more alienating is
> this:
>
> pat = re.compile('^(\W+)$', re.U) # note: now \W
>
> for l in langs:
>     m = pat.search(l.decode('utf-8'))
>     print l, m and m.group(1)
>
> Output:
>
> English None
> 中文 None
> हिन्दी None
>
> How can the Hindi be both not a word and "not not a word"??
>
> Any clue would be much appreciated!

It's not a word, but that doesn't mean that it consists entirely of
non-alpha characters either. Here's what Python gets to see:

>>> langs[2]
u'\u0939\u093f\u0928\u094d\u0926\u0940'
>>> from unicodedata import name
>>> for c in langs[2]:
...     print repr(c), name(c), ["non-alpha", "ALPHA"][c.isalpha()]
...
u'\u0939' DEVANAGARI LETTER HA ALPHA
u'\u093f' DEVANAGARI VOWEL SIGN I non-alpha
u'\u0928' DEVANAGARI LETTER NA ALPHA
u'\u094d' DEVANAGARI SIGN VIRAMA non-alpha
u'\u0926' DEVANAGARI LETTER DA ALPHA
u'\u0940' DEVANAGARI VOWEL SIGN II non-alpha

Peter
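
A possible workaround, following from the character breakdown above: the string mixes letters with vowel signs and a virama, so neither ^\w+$ nor ^\W+$ can match all of it. One option is to skip the regex and test each character directly, accepting combining marks (Unicode general categories starting with "M") alongside letters, digits and the underscore. The sketch below is only an illustration. The helper name is_word is made up for this example, and it is written in Python 3 syntax, whereas the code in this thread is Python 2.

```python
import unicodedata


def is_word(text):
    """Hypothetical helper: True if text is non-empty and every character
    is alphanumeric, an underscore, or a Unicode combining mark."""
    if not text:
        return False
    for ch in text:
        # Letters, digits and "_" are what \w is meant to cover.
        if ch.isalnum() or ch == "_":
            continue
        # Categories Mn, Mc and Me are combining marks, e.g. the
        # Devanagari vowel signs and the virama listed above.
        if unicodedata.category(ch).startswith("M"):
            continue
        return False
    return True


# The three samples from the original question, written with escapes
# so the example does not depend on the source file encoding.
samples = ("English", "\u4e2d\u6587", "\u0939\u093f\u0928\u094d\u0926\u0940")
for sample in samples:
    print(sample, is_word(sample))
```

Note that this check is deliberately loose: it would also accept a string that starts with a combining mark, which is not a well-formed word. If that matters, the first character can be checked separately.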