Wijaya Edward wrote:

> Since there are separator I need to include as delimiter
> Especially for the case like this:
>
> >>> str = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
> >>> field = list(str)
> >>> print field
> ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', '-', '-', 'F', 'O', 'O', '-', '-',
> 'B', 'A', 'R']
>
> What we want as the output is this instead:
> ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc','FOO','BAR']

    >>> import re
    >>> s = '\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
    >>> re.findall("(?i)[a-z]+|[\xA0-\xFF]", s)
    ['\xc5', '\xeb', '\xc7', '\xd5', '\xbc', 'FOO', 'BAR']

the RE matches either a run of ASCII letters (upper or lower case, thanks
to the (?i) flag), *or* a single byte in the range 0xA0-0xFF. anything
else, such as the "--" separators, matches neither alternative and is
skipped. you may want to adjust the character ranges to match the encoding
you're using, and your definition of non-chinese words.

</F>
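
for anyone trying this on Python 3, here is a minimal sketch of the same idea using a bytes pattern. the helper name split_tokens and the constant TOKEN_RE are made up for illustration; the character ranges are the ones from the example above and are an assumption about your encoding, so adjust them as needed.

```python
import re

# Same pattern as in the reply, written as a bytes pattern for Python 3.
# TOKEN_RE and split_tokens are hypothetical names used only in this sketch.
# Assumption: "non-ASCII character" means a single byte in 0xA0-0xFF.
TOKEN_RE = re.compile(rb"(?i)[a-z]+|[\xA0-\xFF]")


def split_tokens(data):
    """Return runs of ASCII letters and single high bytes, in order.

    Bytes that match neither alternative (such as the '-' separators)
    are simply skipped by findall.
    """
    return TOKEN_RE.findall(data)


s = b'\xc5\xeb\xc7\xd5\xbc--FOO--BAR'
tokens = split_tokens(s)
print(tokens)
```

note that this works on raw bytes, one byte at a time. if the data is in a multi-byte encoding, decoding it to text first and matching on characters may be a better fit, but that depends on the encoding in use.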