The regex below identifies words in all languages I tested, but not in Hindi:

    # -*- coding: utf-8 -*-
    import re

    pat = re.compile(r'^(\w+)$', re.U)
    langs = ('English', '中文', 'हिन्दी')
    for l in langs:
        m = pat.search(l.decode('utf-8'))
        print l, m and m.group(1)

Output:

    English English
    中文 中文
    हिन्दी None

From this I assumed that the Hindi text contains punctuation or other characters that prevent the word match. Even more puzzling is this:

    pat = re.compile(r'^(\W+)$', re.U)  # note: now \W
    for l in langs:
        m = pat.search(l.decode('utf-8'))
        print l, m and m.group(1)

Output:

    English None
    中文 None
    हिन्दी None

How can the Hindi be both not a word and "not not a word"? Any clue would be much appreciated!

Best.
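
One way to narrow this down is to list the Unicode general category of every character in the Hindi string. The sketch below is only a diagnostic, not a fix. The helper name `char_categories` is made up for illustration, and it rests on the assumption that the regex engine's `\w` accepts letters (category `Lo`) but not combining marks (`Mn`, `Mc`). If that holds, a string that mixes both kinds would match neither `^\w+$` nor `^\W+$`.

```python
# -*- coding: utf-8 -*-
import unicodedata


def char_categories(text):
    """Return a (code point, general category, name) tuple per character."""
    return [
        (u'U+%04X' % ord(ch), unicodedata.category(ch), unicodedata.name(ch, u'?'))
        for ch in text
    ]


# The Hindi word from the post, spelled out as escapes so that the
# individual code points are visible.
word = u'\u0939\u093f\u0928\u094d\u0926\u0940'

for code_point, category, name in char_categories(word):
    print('%s  %s  %s' % (code_point, category, name))
```

The categories alternate between letters (`Lo`) and combining marks (`Mc`, `Mn`), so the word is a mix of the two kinds rather than purely one or the other. Whether `\w` covers the marks depends on the Python version and regex engine in use, so it is worth checking against the interpreter at hand.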