> Peter Robinson wrote:
> > Dear list
> > I am at my wits end on what seemed a very simple task:
> > I have some greek text, nicely encoded in utf8, going in and out of a
> > xml database, being passed over and beautifully displayed on the web.
> > For example: the most common greek word of all 'kai' (or και if your
> > mailer can see utf8)
> > So all I want to do is:
> > step through this string a character at a time, and do something for
> > each character (actually set a width attribute somewhere else for each
> > character)
> As John already said: UTF-8 ain't unicode. UTF-8 is an encoding similar
> to ASCII or Latin-1 but different in its inner workings. A single
> character may be encoded by up to 6 bytes.

 Up to 4 bytes in the current definition of UTF-8 (RFC 3629). The largest
code point is U+10FFFF, which is encoded as 0xF4 0x8F 0xBF 0xBF.
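
 To make the byte counts concrete, here is a small sketch (Python 3
syntax assumed; the helper name utf8_len is made up for this example,
it is not a library function) that encodes a few code points and counts
the resulting bytes:

```python
def utf8_len(ch):
    """Return how many bytes UTF-8 needs to encode one character."""
    return len(ch.encode('utf-8'))

# Latin k, Greek kappa, euro sign, and the highest code point U+10FFFF.
samples = ['k', '\u03ba', '\u20ac', '\U0010FFFF']
for ch in samples:
    print('U+%04X -> %d byte(s)' % (ord(ch), utf8_len(ch)))
```

 The four samples need 1, 2, 3 and 4 bytes respectively, so nothing
goes beyond 4.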

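 I believe the proper way to count the characters in Greek text is to
normalize it first, because an accented letter can be stored either as
one precomposed code point or as a base letter plus a combining accent: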

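from unicodedata import normalize

def greek_text_length(utf8_string):
    # Decode the UTF-8 bytes into a unicode string (a Python 2 str or a
    # Python 3 bytes object both have .decode).
    u = utf8_string.decode('utf-8')
    # Compose base letters and combining accents into single code points.
    u = normalize('NFC', u)
    return len(u)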
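
 For the original question (stepping through the string one character
at a time), a rough sketch along the same lines might look like this.
The function name greek_chars and the sample word are my own
assumptions for illustration; the input is assumed to be UTF-8 encoded
bytes, and the code is written for Python 3:

```python
from unicodedata import normalize

def greek_chars(utf8_bytes):
    """Decode UTF-8 bytes, compose to NFC, and return a list of characters."""
    text = normalize('NFC', utf8_bytes.decode('utf-8'))
    return list(text)

# 'kai' with an accent, stored decomposed:
# kappa, alpha, combining acute accent (U+0301), iota.
decomposed = '\u03ba\u03b1\u0301\u03b9'.encode('utf-8')

print(len(decomposed))                   # number of bytes: 8
print(len(decomposed.decode('utf-8')))   # code points before NFC: 4
for ch in greek_chars(decomposed):       # 3 characters after NFC
    print('U+%04X' % ord(ch))
```

 Iterating over the decoded and normalized string, rather than over the
raw bytes, is what gives one loop step per letter here: alpha and the
combining accent are composed into the single code point U+03AC.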

 If there are pairs of characters that count as one, things may be

> I highly recommend Joel's article on unicode:
> The Absolute Minimum Every Software Developer Absolutely, Positively
> Must Know About Unicode and Character Sets (No Excuses!)
> http://www.joelonsoftware.com/articles/Unicode.html
> Christian


