> Just to be clear, TAB *only* appears in utf-8 as the encoding for the actual
> TAB character, not as a part of any other character's encoding. The only
> bytes that can appear in the utf-8 encoding of non-ascii characters are
> starting with 0xC2 through 0xF4, followed by one or more of 0x80 through 0xBF.

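The byte ranges quoted above can be checked by brute force. This is only a minimal sketch, assuming Python 3 (where `chr` covers all of Unicode and iterating over `bytes` yields integers); the helper name `bytes_used_by_non_ascii` and the sample line are made up for illustration. It collects every byte value that occurs in the utf-8 encoding of a non-ascii code point, then compares splitting on the raw bytes with splitting after decoding.

```python
import sys

TAB = 0x09  # byte value of the TAB character


def bytes_used_by_non_ascii():
    """Return the set of byte values used when encoding non-ascii code points."""
    seen = set()
    for cp in range(0x80, sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in utf-8
        seen.update(chr(cp).encode('utf-8'))
    return seen


seen = bytes_used_by_non_ascii()
print(TAB in seen)  # False: TAB never occurs inside a multi-byte sequence

# Hypothetical sample line with non-ascii text in every field.
raw = 'na\u00efve\tcaf\u00e9\t\u65e5\u672c\u8a9e\n'.encode('utf-8')

# Split the raw bytes first, decode each field afterwards.
fields_from_bytes = [f.decode('utf-8') for f in raw.rstrip(b'\n').split(b'\t')]

# Decode the whole line first, split the text afterwards.
fields_from_text = raw.decode('utf-8').rstrip('\n').split('\t')

print(fields_from_bytes == fields_from_text)
```

The loop visits roughly 1.1 million code points, so it takes a moment to run, but it only needs to be run once.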
So for utf-8 encoded input, do I only need this code to split each line into fields?

    import sys
    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        print fields

Or is there a need to decode first, as in this code?

    import sys
    for line in sys.stdin:
        fields = line.rstrip('\n').decode('utf-8').split('\t')
        print [x.encode('utf-8') for x in fields]

-- 
Regards,
Peng
-- 
https://mail.python.org/mailman/listinfo/python-list