On Wednesday, August 28, 2013 12:23:12 PM UTC+2, Dave Angel wrote:
> On 28/8/2013 04:01, Kurt Mueller wrote:
> > Because I cannot switch to Python 3 for now my life is not so easy:-)
> > For some text manipulation tasks I need a template to split lines
> > from stdin into a list of strings the way shlex.split() does it.
> > The encoding of the input can vary.
> > For further processing in Python I need the list of strings to be in
> > unicode.
>
> According to:
> http://docs.python.org/2/library/shlex.html
> """Prior to Python 2.7.3, this module did not support Unicode
> input"""
> I take that to mean that if you upgrade to Python 2.7.3, 2.7.4, or
> 2.7.5, you'll have Unicode support.
I have Python 2.7.3

> Presumably that would mean you could decode the string before calling
> shlex.split().

Yes, see new template.py:

###############################################################
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# decode before shlex
# Muk 2013-08-28
# Python 2.7.3

from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True  # shlex: skip comments
bool_posx = True  # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
    print( 'enco_type=' + repr( enco_type ) )
    strg_unic = inpt_line.decode( enco_type )                       # decode the input line into unicode
    print( 'strg_unic=' + repr( strg_unic ) )                       # unicode input line
    try:
        strg_inpt = shlex.split( strg_unic, bool_cmnt, bool_posx, ) # check if shlex works on unicode
    except Exception as errr:                                       # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
###############################################################

$ python -V
Python 2.7.3
$ echo -e "a b c d e\na Ö u 1 2" | template.py
inpt_line='a b c d e\n'
enco_type='ascii'
strg_unic=u'a b c d e\n'
strg_inpt=['a', 'b', 'c', 'd', 'e']
inpt_line='a \xc3\x96 u 1 2\n'
enco_type='utf-8'
strg_unic=u'a \xd6 u 1 2\n'
error=''ascii' codec can't encode character u'\xd6' in position 2: ordinal not in range(128)' on inpt_line='a Ö u 1 2'
$ echo -e "a b c d e\na Ö u 1 2" | recode utf8..latin9 | ./split_shlex_unicode.py
inpt_line='a b c d e\n'
enco_type='ascii'
strg_unic=u'a b c d e\n'
strg_inpt=['a', 'b', 'c', 'd', 'e']
inpt_line='a \xd6 u 1 2\n'
enco_type='windows-1252'
strg_unic=u'a \xd6 u 1 2\n'
error=''ascii' codec can't encode character u'\xd6' in position 2: ordinal not in range(128)' on inpt_line='a � u 1 2'
$

As can be seen, shlex works only with unicode strings decoded from
'ascii' strings. As soon as the decoded line contains a non-ASCII
character, shlex.split() fails with a UnicodeEncodeError, whatever the
original input encoding was. (Python 2.7.3)

--
Kurt Müller
--
http://mail.python.org/mailman/listinfo/python-list
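One possible workaround for the failure shown above, offered as a minimal sketch and not as a confirmed fix from this thread: encode the unicode line to UTF-8 bytes before calling shlex.split() and decode each resulting token afterwards. The helper name split_unicode is hypothetical. The sketch assumes that Python 2's shlex tokenizes UTF-8 byte strings without trouble, since it only splits on ASCII whitespace, quotes and comment characters. On Python 3, shlex.split() accepts str directly, so the text is passed through unchanged there.

```python
import shlex
import sys

# True on Python 2, where shlex.split() raised the error on non-ASCII unicode
PY2 = sys.version_info[0] == 2


def split_unicode(text, comments=True, posix=True):
    """Split a unicode string shell-style and return a list of unicode tokens.

    split_unicode is a hypothetical helper, not part of shlex.
    """
    if PY2:
        # Assumption: shlex on Python 2 copes with UTF-8 byte strings,
        # so encode first and decode every token afterwards.
        tokens = shlex.split(text.encode('utf-8'), comments, posix)
        return [token.decode('utf-8') for token in tokens]
    # Python 3: str is already unicode and shlex handles it natively.
    return shlex.split(text, comments, posix)
```

In the template this would replace the direct shlex.split( strg_unic, ... ) call, for example strg_inpt = split_unicode( strg_unic, bool_cmnt, bool_posx ). The chardet detection and the decode step stay as they are, because the UTF-8 round trip is independent of the original input encoding.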