On 09/03/2010 16:54, C. Benson Manica wrote:
> Hours of Googling has not helped me resolve a seemingly simple
> question - Given a string s, how can I tell whether it's ascii (and
> thus 1 byte per character) or UTF-8 (and two bytes per character)?
> This is python 2.4.3, so I don't have getsizeof available to me.
You can't. You can apply one or more heuristics, depending on exactly
what your requirement is. But any valid ASCII text is also valid
UTF-8-encoded text, since UTF-8 isn't "two bytes per char" but a
variable number of bytes per char.

Obviously, you can test whether all the bytes are less than 128, which
suggests that the text is legal ASCII. But then it's also legal UTF-8.

Or you can just attempt to decode and catch the exception:

    try:
        unicode(text, "ascii")
    except UnicodeDecodeError:
        print "Not ASCII"

TJG
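
To illustrate the two checks described above, here is a minimal sketch. It assumes Python 3, where raw data is a bytes object, rather than the Python 2.4.3 in the question; the helper names all_bytes_below_128 and decodes_as are made up for this example and are not part of any library.

```python
# Sketch only: assumes Python 3, where undecoded text is a bytes object.
# The helper names are hypothetical, chosen for this example.

def all_bytes_below_128(data):
    """Heuristic: True if every byte is in the 7-bit ASCII range."""
    return all(b < 128 for b in data)


def decodes_as(data, encoding):
    """True if data decodes without error using the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


plain = b"hello"
accented = "h\u00e9llo".encode("utf-8")  # e-acute takes two bytes in UTF-8
euro = "\u20ac".encode("utf-8")          # the euro sign takes three bytes

# Pure ASCII data passes both checks, so the two encodings
# cannot be told apart for it.
print(decodes_as(plain, "ascii"), decodes_as(plain, "utf-8"))

# Data with a non-ASCII character fails the ASCII checks only.
print(all_bytes_below_128(accented), decodes_as(accented, "utf-8"))

# UTF-8 is variable width: byte counts differ per character.
print(len(plain), len(accented), len(euro))
```

On Python 2, where iterating over a str yields one-character strings, the byte test would need ord() around each element; the decode-and-catch approach carries over as in the snippet in the reply.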