On 2010-03-09 11:12 AM, Stef Mientki wrote:
On 09-03-2010 18:02, Alf P. Steinbach wrote:
* C. Benson Manica:
Hours of Googling has not helped me resolve a seemingly simple
question - Given a string s, how can I tell whether it's ascii (and
thus 1 byte per character) or UTF-8 (and two bytes per character)?
This is python 2.4.3, so I don't have getsizeof available to me.

Generally, if you need 100% certainty then you can't tell the encoding
from a sequence of byte values.

However, if you know that it's EITHER ascii or utf-8 then the presence
of any value above 127 (or, for signed byte values, any negative
values), tells you that it can't be ascii,
AFAIK it's completely impossible.
UTF-8 characters have 1 to 4 bytes / byte.
I can create ASCII strings containing byte values between 127 and 255.

No, you can't. ASCII strings only have characters in the range 0..127. You could create Latin-1 (or any number of the 8-bit encodings out there) strings with characters 0..255, yes, but not ASCII.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco

--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to