Absolutely, UTF-8 is a wonderful encoding. And indeed, its worst case is the
same storage requirement as UTF-16 or UTF-32: four bytes per code point. For
O(1) random access into all strings, we have to eat 32 bits per character,
one way or the other, but of course there are space/speed trade-offs one
could make for intermediate behavior.
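
As a quick sanity check of the sizes being discussed, here is a minimal
sketch. The helper name encoded_sizes and the sample strings are just
illustrative; the "-le" codec variants are used only so that no byte order
mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# U+10000 is a Linear B syllable, outside the Basic Multilingual Plane.
linear_b = "\U00010000" * 10                  # ten code points
mostly_ascii = "hello world " + "\U00010000"  # twelve ASCII code points plus one

print(encoded_sizes(linear_b))      # all three encodings need 4 bytes per code point
print(encoded_sizes(mostly_ascii))  # UTF-8 is the smallest here
```

The all-Linear-B string costs the same in every encoding, while the mostly
ASCII one is much smaller in UTF-8. What the sketch does not show is the
other half of the trade-off: finding the n-th code point in the UTF-8 bytes
requires a scan, whereas in UTF-32 it is a fixed offset.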

On Sat, Oct 26, 2019, 7:58 PM Steven D'Aprano <[email protected]> wrote:

> On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
> > On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
> >
> >
> > > (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> > > UTF-16 or UTF-32.)
> > >
> >
> > http://www.fileformat.info/info/unicode/char/10000/index.htm
>
> Oops, you're right, UTF-8 can use four code units (four bytes) too, I
> forgot about that. Thanks for the correction.
>
> So in the worst case, if your string consists of all (let's say)
> Linear-B syllables, UTF-8 will use four bytes per character, the same as
> UTF-32. But for strings consisting of a mix of (say) ASCII, Latin-1, etc
> with only a few Linear-B syllables, UTF-8 will use a lot less memory.
>
>
>
> --
> Steven
> _______________________________________________
> Python-ideas mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/[email protected]/message/DNFYA7Z3IGDWYLNMKL7ITZ3AON6JJVKO/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/RMH7GU5JHZ7EW2E4DAFHITHQRYF6PJG4/
Code of Conduct: http://python.org/psf/codeofconduct/
