Peng Yu wrote:

> Can utf-8 encoded character contain a byte of TAB?

Yes; ASCII is a subset of UTF-8, so TAB itself is encoded as the single
byte 0x09. That byte never occurs inside a multi-byte UTF-8 sequence,
because those sequences use only bytes >= 0x80.

Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> ascii = "".join(map(chr, range(128)))
>>> uni = ascii.decode("utf-8")
>>> len(uni)
128
>>> assert map(ord, uni) == range(128)

If you want to allow fields containing TABs in a file where TAB is also the
field separator, you need a convention to escape the TABs occurring in the
values. Nothing I see in your post can cope with that, but the csv module
can, by quoting fields that contain the delimiter:

>>> import csv, sys
>>> csv.writer(sys.stdout, delimiter="\t").writerow(["foo", "bar\tbaz"])
foo	"bar	baz"
>>> next(csv.reader(['foo\t"bar\tbaz"\n'], delimiter="\t"))
['foo', 'bar\tbaz']

> Hi,
>
> I use the following code to process TSV input.
>
> $ printf '%s\t%s\n' {1..10} | ./main.py
> ['1', '2']
> ['3', '4']
> ['5', '6']
> ['7', '8']
> ['9', '10']
> $ cat main.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1
> # fileencoding=utf-8:
>
> import sys
> for line in sys.stdin:
>     fields=line.rstrip('\n').split('\t')
>     print fields
>
> But I am not sure it will process utf-8 input correctly. Thus, I come
> up with this code. However, I am not sure if this is really necessary
> as my impression is that utf-8 character should not contain the ascii
> code for TAB. Is it so? Thanks.
>
> $ cat main1.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1
> # fileencoding=utf-8:
>
> import sys
> for line in sys.stdin:
>     #fields=line.rstrip('\n').split('\t')
>     fields=line.rstrip('\n').decode('utf-8').split('\t')
>     print [x.encode('utf-8') for x in fields]
>
> $ printf '%s\t%s\n' {1..10} | ./main1.py
> ['1', '2']
> ['3', '4']
> ['5', '6']
> ['7', '8']
> ['9', '10']
>
> --
> https://mail.python.org/mailman/listinfo/python-list
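
For anyone trying this on a current interpreter, here is a minimal sketch of
both points from the reply, assuming Python 3 (the session above is Python
2.7). The helper names split_then_decode, decode_then_split and tsv_roundtrip
are made up for illustration and are not part of any library. It checks that
splitting the raw UTF-8 bytes on TAB gives the same fields as decoding first,
and that the csv module round-trips a value containing a TAB.

```python
import csv
import io


def split_then_decode(raw):
    """Split a UTF-8 byte string on TAB, then decode each field."""
    return [field.decode("utf-8") for field in raw.rstrip(b"\n").split(b"\t")]


def decode_then_split(raw):
    """Decode a UTF-8 byte string first, then split on TAB."""
    return raw.decode("utf-8").rstrip("\n").split("\t")


def tsv_roundtrip(row):
    """Write one row with csv (TAB delimiter) and read it back."""
    buf = io.StringIO(newline="")  # let csv control the line endings
    csv.writer(buf, delimiter="\t").writerow(row)
    buf.seek(0)
    return next(csv.reader(buf, delimiter="\t"))


raw = "naïve\t日本語\t€5\n".encode("utf-8")

# Both orders agree: the byte 0x09 only ever encodes TAB itself, and the
# bytes of multi-byte UTF-8 sequences are all >= 0x80.
assert split_then_decode(raw) == decode_then_split(raw)

# A value containing a TAB survives because csv quotes it on output.
assert tsv_roundtrip(["foo", "bar\tbaz"]) == ["foo", "bar\tbaz"]
```

So the plain split('\t') is fine as far as the encoding goes; csv is only
needed if the values themselves may contain TABs.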