> Just to be clear, TAB *only* appears in utf-8 as the encoding for the actual
> TAB character, not as a part of any other character's encoding. In the utf-8
> encoding of a non-ascii character, the first byte is in the range 0xC2 through
> 0xF4, and it is followed by one to three bytes in the range 0x80 through 0xBF.
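
The byte ranges quoted above can be checked by brute force. This is only a
rough sketch, assuming Python 3 (where chr() covers every code point and
iterating over bytes yields integers); the function name is made up for
illustration:

```python
def bytes_used_by_non_ascii():
    """Collect every byte value that occurs in the utf-8 encoding
    of any non-ascii code point."""
    used = set()
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no utf-8 encoding
        used.update(chr(cp).encode('utf-8'))
    return used

used = bytes_used_by_non_ascii()
print(0x09 in used)                      # is TAB ever part of a multi-byte sequence?
print(sorted(hex(b) for b in used)[:3])  # a few of the byte values that do occur
```

If TAB never shows up in that set, splitting the raw bytes on TAB cannot cut
a multi-byte character in half.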

So for utf-8 encoded input, is this code enough to split each line into
fields?

import sys
for line in sys.stdin:
    # split the raw (still utf-8 encoded) bytes on TAB
    fields = line.rstrip('\n').split('\t')
    print fields

Or do I need to decode each line first and re-encode the fields afterwards,
like this?

import sys
for line in sys.stdin:
    # decode to unicode, split on TAB, then encode each field back to utf-8
    fields = line.rstrip('\n').decode('utf-8').split('\t')
    print [x.encode('utf-8') for x in fields]
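
One way to compare the two approaches is to run both on the same bytes and
see whether the fields differ. A minimal sketch, assuming valid utf-8 input
and written against bytes objects so that it also runs on Python 3; the
function names and the sample line are hypothetical, not from the code above:

```python
def split_bytes(line):
    """Split a utf-8 encoded line (bytes) on TAB without decoding."""
    return line.rstrip(b'\n').split(b'\t')

def split_decoded(line):
    """Decode first, split on TAB, then re-encode each field."""
    fields = line.rstrip(b'\n').decode('utf-8').split(u'\t')
    return [x.encode('utf-8') for x in fields]

# Sample line with 2-, 3- and 4-byte characters in its three fields.
sample = u'caf\u00e9\t\u4e2d\u6587\t\U0001F600 end\n'.encode('utf-8')

print(split_bytes(sample) == split_decoded(sample))
```

One difference that remains even for well-formed input: the decoding version
raises UnicodeDecodeError on invalid utf-8, while the byte-level split passes
such bytes through unchanged.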

-- 
Regards,
Peng
-- 
https://mail.python.org/mailman/listinfo/python-list
