On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
> On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
> 
> 
> > (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> > UTF-16 or UTF-32.)
> >
> 
> http://www.fileformat.info/info/unicode/char/10000/index.htm

Oops, you're right: UTF-8 can use four code units (four bytes) too. I 
forgot about that. Thanks for the correction.

So in the worst case, if your string consists entirely of (let's say) 
Linear B syllables, UTF-8 will use four bytes per character, the same as 
UTF-32. But for strings consisting mostly of (say) ASCII and Latin-1 
characters, with only a few Linear B syllables, UTF-8 will use a lot less 
memory, since each ASCII character takes one byte and each of the other 
Latin-1 characters takes two.
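
A rough way to check the two cases above is to count encoded bytes 
directly. This is only a sketch: the helper name encoded_sizes and the 
sample strings are made up for illustration, and I use the "-le" codec 
variants so that no byte-order mark is counted.

```python
# U+10000 is LINEAR B SYLLABLE B008 A, the code point linked above.
LINEAR_B = "\U00010000"


def encoded_sizes(text):
    """Return the number of bytes needed to encode text, per encoding.

    The "-le" codecs are used so that no BOM is added to the count.
    (Hypothetical helper, for illustration only.)
    """
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


# Worst case: nothing but Linear B syllables (10 code points).
all_linear_b = LINEAR_B * 10

# Mixed case: mostly ASCII, one Latin-1 letter, two Linear B syllables
# (127 code points in total).
mostly_ascii = "hello world " * 10 + "caf\u00e9 " + LINEAR_B * 2

print(encoded_sizes(all_linear_b))
print(encoded_sizes(mostly_ascii))
```

For the all-Linear-B string, all three encodings come out at 40 bytes. 
For the mixed string, UTF-8 needs 134 bytes, UTF-16 needs 258 and UTF-32 
needs 508.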



-- 
Steven