Re: Can utf-8 encoded character contain a byte of TAB?

Peter Otten Mon, 15 Jan 2018 06:39:06 -0800

Peng Yu wrote:

> Can utf-8 encoded character contain a byte of TAB?


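Yes; ASCII is a subset of UTF-8, so a TAB character is encoded as the single
byte 0x09. That byte never appears inside a multi-byte sequence, though: every
byte of a multi-byte UTF-8 character lies in the range 0x80-0xFF, so splitting
the raw bytes on TAB cannot cut a character in half.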

Python 2.7.6 (default, Nov 23 2017, 15:49:48) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> ascii = "".join(map(chr, range(128)))
>>> uni = ascii.decode("utf-8")
>>> len(uni)
128
>>> assert map(ord, uni) == range(128)
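
For what it's worth, the same point can be checked by brute force. The following
is only a sketch for Python 3 (the session above is Python 2); the helper name
multibyte_chars_containing and the sample line are made up for illustration.
It looks for the TAB byte inside every multi-byte UTF-8 encoding, then compares
"split the bytes, then decode" with "decode, then split":

```python
# Python 3 sketch; the helper name and sample data are hypothetical.
def multibyte_chars_containing(byte_value):
    """Return code points >= 0x80 whose UTF-8 encoding contains byte_value."""
    hits = []
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded as UTF-8
        if byte_value in chr(cp).encode("utf-8"):
            hits.append(cp)
    return hits

# No multi-byte character contains the TAB byte (0x09).
tab_hits = multibyte_chars_containing(0x09)

# So splitting the raw bytes on TAB gives the same fields as decoding first.
line = "naïve\t日本語\n".encode("utf-8")
split_then_decode = [f.decode("utf-8") for f in line.rstrip(b"\n").split(b"\t")]
decode_then_split = line.decode("utf-8").rstrip("\n").split("\t")
```

Here tab_hits comes out empty and both lists hold the same two fields, which
is why decoding before the split is not needed just to find the separators.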

If you want to allow fields containing TABs in a file where TAB is also the 
field separator, you need a convention to escape the TABs occurring in the 
values. Nothing I see in your post can cope with that, but the csv module 
can, by quoting fields that contain the delimiter:

>>> import csv, sys
>>> csv.writer(sys.stdout, delimiter="\t").writerow(["foo", "bar\tbaz"])
foo     "bar    baz"
>>> next(csv.reader(['foo\t"bar\tbaz"\n'], delimiter="\t"))
['foo', 'bar\tbaz']
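
A rough Python 3 equivalent of that round trip, as a sketch only: the sample
rows and variable names are assumptions for illustration, and io.StringIO
stands in for a real file. It also shows what a plain split('\t') does to a
quoted field:

```python
# Python 3 sketch; rows and variable names are illustrative assumptions.
import csv
import io

rows = [["foo", "bar\tbaz"], ["naïve", "日本語"]]

# Write the rows as TSV; csv quotes any field that contains the delimiter.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter="\t").writerows(rows)
data = buf.getvalue()

# Read them back; the quoted TAB stays inside its field.
parsed = list(csv.reader(io.StringIO(data, newline=""), delimiter="\t"))

# A naive split sees the embedded TAB as a separator and keeps the quotes.
naive = data.splitlines()[0].split("\t")
```

With these rows, parsed equals rows again, while naive has three pieces
instead of two. In Python 3 the text is already decoded, so the UTF-8 handling
happens when the file is opened (for example with encoding="utf-8").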


> Hi,
> 
> I use the following code to process TSV input.
> 
> $ printf '%s\t%s\n' {1..10} | ./main.py
> ['1', '2']
> ['3', '4']
> ['5', '6']
> ['7', '8']
> ['9', '10']
> $ cat main.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1
> # fileencoding=utf-8:
> 
> import sys
> for line in sys.stdin:
>     fields=line.rstrip('\n').split('\t')
>     print fields
> 
> But I am not sure it will process utf-8 input correctly. Thus, I come
> up with this code. However, I am not sure if this is really necessary
> as my impression is that utf-8 character should not contain the ascii
> code for TAB. Is it so? Thanks.
> 
> $ cat main1.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1
> # fileencoding=utf-8:
> 
> import sys
> for line in sys.stdin:
>     #fields=line.rstrip('\n').split('\t')
>     fields=line.rstrip('\n').decode('utf-8').split('\t')
>     print [x.encode('utf-8') for x in fields]
> 
> $ printf '%s\t%s\n' {1..10} | ./main1.py
> ['1', '2']
> ['3', '4']
> ['5', '6']
> ['7', '8']
> ['9', '10']
> 
> 


-- 
https://mail.python.org/mailman/listinfo/python-list
