"Gabriel Genellina" <gagsl-...@yahoo.com.ar> writes: >> I'm playing with os.popen function. >> a = os.popen("somecmd").read() >> >> If one of the lines contains characters like "è", "æ"or any other it loks >> line this "velja\xe8a 2009" with that "\xe8". It prints fine if i go: >> >> for i in a: >> print i: > > '\xe8' is a *single* byte (not four). It is the 'LATIN SMALL LETTER E > WITH GRAVE' Unicode code point u'è' encoded in the Windows-1252 > encoding (and latin-1, and others too).

Note that it is also 'LATIN SMALL LETTER C WITH CARON' (U+010D or u'č'), encoded in Windows-1250, which is what the OP is likely using. The rest of your message stands regardless: there is no problem, at least as long as the OP only prints the characters received from somecmd to something else that also expects Windows-1250.

The problem would arise if the OP wanted to store the string in a PyGTK label (which expects UTF-8) or send it to a web browser (which expects an explicit encoding, probably defaulting to UTF-8), in which case he'd have to disambiguate whether '\xe8' refers to U+010D, to U+00E8, or to something else entirely. That is the problem Python 3 solves by requiring (or strongly suggesting) that such disambiguation be performed as early in the program as possible, preferably while the characters are being read from the outside source.

A similar approach is possible in Python 2 using its unicode type, but since the OP never specified exactly which problem he had (except for the repr/str confusion), it's hard to tell whether using the unicode type would help.

-- 
http://mail.python.org/mailman/listinfo/python-list
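
To make the ambiguity concrete, here is a minimal sketch in Python 3 syntax. It assumes the bytes are hard-coded rather than read from somecmd, and the variable names are invented for illustration. It shows that the same single byte decodes to two different characters depending on the code page chosen, and that only the decoded text can be re-encoded reliably for a consumer that expects UTF-8.

```python
import unicodedata

# Bytes as they might arrive from the external command (hard-coded here;
# this is an assumption, the real program would read them from a pipe).
raw = b"velja\xe8a 2009"

# '\xe8' is one byte, even though its repr takes four characters to print.
byte_count = len(b"\xe8")

# The same byte means different characters under different code pages.
as_cp1252 = raw.decode("cp1252")  # Western European reading
as_cp1250 = raw.decode("cp1250")  # Central European reading

# Look up the Unicode name of the sixth character in each reading.
name_1252 = unicodedata.name(as_cp1252[5])
name_1250 = unicodedata.name(as_cp1250[5])

# Once decoded, the text can be re-encoded for a UTF-8 consumer.
utf8_bytes = as_cp1250.encode("utf-8")

print(byte_count)
print(name_1252)
print(name_1250)
print(utf8_bytes)
```

Which of the two decodings is the right one cannot be determined from the bytes alone; it has to come from knowing what encoding somecmd actually emits.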