On Apr 12, 9:48 am, Christian Heimes <[EMAIL PROTECTED]> wrote:
> Peter Robinson wrote:
>
> > Dear list
> > I am at my wits end on what seemed a very simple task:
> > I have some greek text, nicely encoded in utf8, going in and out of a
> > xml database, being passed over and beautifully displayed on the web.
> > For example: the most common greek word of all 'kai' (or και if your
> > mailer can see utf8)
> > So all I want to do is:
> > step through this string a character at a time, and do something for
> > each character (actually set a width attribute somewhere else for each
> > character)
>
> As John already said: UTF-8 ain't unicode. UTF-8 is an encoding similar
> to ASCII or Latin-1 but different in its inner workings. A single
> character may be encoded by up to 6 bytes.

Up to 4 bytes in the latest versions of the standard. (The largest code
point is U+10FFFF, which is encoded as 0xF4 0x8F 0xBF 0xBF.)

I believe the proper way to return the number of characters for Greek
would require a normalization first:

    from unicodedata import normalize

    def greek_text_length(utf8_string):
        # Python 2: decode the UTF-8 byte string into a unicode object
        u = unicode(utf8_string, 'utf-8')
        # Compose base letter + combining accent into one code point
        u = normalize('NFC', u)
        return len(u)

If there are pairs of code points that still count as one character
after normalization (a base letter plus a combining mark with no
precomposed form), things may be worse.

> I highly recommend Joel's article on unicode:
>
> The Absolute Minimum Every Software Developer Absolutely, Positively
> Must Know About Unicode and Character Sets (No Excuses!)
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Christian

-- 
http://mail.python.org/mailman/listinfo/python-list
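
For what it's worth, here is a rough sketch of the whole round trip: decode, normalize, then step through the text one character at a time. It assumes Python 3, where bytes.decode() takes the place of the unicode() call. The names decode_characters, utf8_width and visible_length are made up for this example, and skipping combining marks is only a crude stand-in for real grapheme cluster handling, not something the standard library does for you.

```python
import unicodedata


def decode_characters(utf8_bytes, form="NFC"):
    """Decode UTF-8 bytes and normalize; the result is a str of code points."""
    text = utf8_bytes.decode("utf-8")
    return unicodedata.normalize(form, text)


def utf8_width(ch):
    """Number of bytes a single code point occupies in UTF-8 (1 to 4)."""
    return len(ch.encode("utf-8"))


def visible_length(text):
    """Crude character count: skip combining marks (an approximation only)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def main():
    # 'kai' in Greek, written with escapes so the source stays ASCII.
    kai = "\u03ba\u03b1\u03b9".encode("utf-8")

    # Iterate over characters, not bytes: do something for each one.
    for ch in decode_characters(kai):
        print("U+%04X" % ord(ch), unicodedata.name(ch), utf8_width(ch))

    # alpha + combining acute accent: two code points before NFC, one after.
    accented = "\u03b1\u0301".encode("utf-8")
    print(len(decode_characters(accented, "NFD")),
          len(decode_characters(accented, "NFC")))

    # alpha + combining dot below: counted as one character by the
    # combining-mark filter, whatever the normalization form.
    print(visible_length("\u03b1\u0323"))


if __name__ == "__main__":
    main()
```

The loop in main() is where a per-character action, such as setting a width attribute, would go. Note that utf8_width reports bytes on the wire and has nothing to do with display width.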