On Nov 29, 2:47 am, Shiao <[EMAIL PROTECTED]> wrote:
> The regex below identifies words in all languages I tested, but not in
> Hindi:
> pat = re.compile('^(\w+)$', re.U)
> ...
> m = pat.search(l.decode('utf-8'))
[example snipped]
>
> From this is assumed that the Hindi text contains punctuation or other
> characters that prevent the word match.

This appears to be a bug in Python, as others have pointed out. Two
points not covered so far:

(1) Instead of search() with pattern ^blahblah, use match() with pattern
blahblah. Unless it has been fixed fairly recently, search() doesn't
notice that the ^ means it can give up as soon as the attempt at the
first position fails; it keeps on trying futilely at the 2nd, 3rd, ...
positions.

(2) "identifies words": \w+ (when fixed) matches a sequence of one or
more characters that could appear *anywhere* in a word in any language
(including computer languages). So it not only matches words, it also
matches non-words like '123' and '0x000' and '0123_' and 10 viramas. In
other words, you may need to filter out false positives. Also, in some
languages (e.g. Chinese) a "word" consists of one or more characters and
there is typically no spacing between "words"; \w+ will identify whole
clauses or sentences.

Cheers,
John
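
P.S. A minimal sketch of points (1) and (2), assuming Python 3 (where
str is already Unicode, so there is no decode() step). The names
is_word_like and looks_like_word are made up for illustration, and the
filter rule (at least one letter, no leading digit) is only one possible
assumption about what counts as a "word"; adjust to taste.

```python
import re

# Point (1): use match() with an unanchored pattern instead of
# search() with a leading '^'. The '$' is kept so that the whole
# line must be a single run of \w characters.
word_pat = re.compile(r'(\w+)$', re.UNICODE)


def is_word_like(line):
    # True if the entire line is one run of \w characters.
    return word_pat.match(line) is not None


def looks_like_word(token):
    # Point (2): hypothetical filter for false positives.
    # Assumption: a "word" has at least one letter and does not
    # start with a digit. This is illustrative, not a standard rule.
    return (is_word_like(token)
            and any(ch.isalpha() for ch in token)
            and not token[0].isdigit())


samples = ['hello', '123', '0x000', '0123_', '你好世界']
for s in samples:
    print(repr(s), is_word_like(s), looks_like_word(s))
```

With these samples, is_word_like() accepts all five strings, including
the unspaced run of Chinese characters, while looks_like_word() keeps
only 'hello' and the Chinese string. Splitting that last one into actual
words needs a segmenter, which \w+ cannot provide.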