On Mon, Oct 30, 2006 at 06:02:18PM +0100, Georg Baum wrote:
> Joost Verburg wrote:
>
> > Georg Baum wrote:
> >> For ucs4 -> utf8 we would have to use a result string with a length of 6
> >> times the input length, with the average length close to the input
> >> length if we want to be able to convert everything. That is probably too
> >> much to be efficient.
> >
> > ucs4 uses 4 bytes per character and utf8 1-4 bytes. I don't understand
> > where you get this number from.
>
> I read somewhere that the highest possible number of bytes for a single
> character in utf8 is 6, but I forgot where. Abdel reported the same, and
> now I am unsure, because wikipedia says 4. Does anybody know what is
> correct?

Maybe you got your info from here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Indeed, if only 21 bits are used, 4 bytes should suffice. See the table a
little further down:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

--
Enrico
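
P.S. To make the 6-versus-4 question concrete, here is a minimal sketch of
the byte-length arithmetic. It assumes the original 31-bit UTF-8 layout
(up to 6 bytes per character) and the later restriction to code points up
to U+10FFFF (21 bits, at most 4 bytes). The names utf8_length,
UNICODE_MAX and max_utf8_buffer are made up for illustration and are not
taken from the LyX sources.

```python
# Largest code point representable with 1..6 bytes in the original
# 31-bit UTF-8 scheme (7, 11, 16, 21, 26 and 31 payload bits).
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

# Highest code point Unicode actually assigns space for (fits in 21 bits).
UNICODE_MAX = 0x10FFFF


def utf8_length(code_point: int) -> int:
    """Return the number of bytes the 31-bit UTF-8 scheme needs for one code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for n_bytes, limit in enumerate(LIMITS, start=1):
        if code_point <= limit:
            return n_bytes
    raise ValueError("code point does not fit in 31 bits")


def max_utf8_buffer(n_chars: int) -> int:
    """Worst-case output size in bytes for n_chars characters, assuming
    every input value is a valid Unicode code point (<= UNICODE_MAX)."""
    return n_chars * utf8_length(UNICODE_MAX)


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, UNICODE_MAX, 0x7FFFFFFF):
        print(f"U+{cp:X}: {utf8_length(cp)} byte(s)")
```

So a result buffer of 4 times the input length (in characters) is enough
as long as the ucs4 input holds only valid Unicode code points. The factor
6 is needed only for values above 0x1FFFFF, which Unicode does not use.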