Absolutely, UTF-8 is a wonderful encoding. And indeed, its worst case is the same storage requirement as UTF-16 or UTF-32: four bytes per code point. For O(1) random access into all strings, we have to eat 32 bits per character, one way or the other, but of course there are space/speed trade-offs one could make for intermediate behavior.
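To put rough numbers on that, here is a small Python sketch. The helper name `encoded_sizes` and the two sample strings are made up for illustration; the little-endian codecs are used only so that no byte-order mark gets counted.

```python
def encoded_sizes(text):
    """Return the encoded size in bytes of `text` under each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants skip the BOM, so only the payload is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Worst case: ten copies of U+10000 (a Linear B syllable, outside the BMP).
linear_b = "\U00010000" * 10

# Mostly ASCII, one Latin-1 character, one Linear B syllable.
mixed = "hello caf\u00e9 " + "\U00010000"

print(encoded_sizes(linear_b))  # {'utf-8': 40, 'utf-16': 40, 'utf-32': 40}
print(encoded_sizes(mixed))     # {'utf-8': 16, 'utf-16': 26, 'utf-32': 48}
```

So the all-Linear-B string costs four bytes per character in every encoding, while the mixed string is much smaller in UTF-8, at the price of losing fixed-width indexing.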
On Sat, Oct 26, 2019, 7:58 PM Steven D'Aprano <[email protected]> wrote:

> On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
> > On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
> >
> > >
> > > (At worst, a code-point in UTF-8 takes three bytes, compared to four in
> > > UTF-16 or UTF-32.)
> > >
> >
> > http://www.fileformat.info/info/unicode/char/10000/index.htm
>
> Oops, you're right, UTF-8 can use four code units (four bytes) too, I
> forgot about that. Thanks for the correction.
>
> So in the worst case, if your string consists of all (let's say)
> Linear-B syllables, UTF-8 will use four bytes per character, the same as
> UTF-32. But for strings consisting of a mix of (say) ASCII, Latin-1, etc
> with only a few Linear-B syllables, UTF-8 will use a lot less memory.
>
>
>
> --
> Steven
> _______________________________________________
> Python-ideas mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/[email protected]/message/DNFYA7Z3IGDWYLNMKL7ITZ3AON6JJVKO/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/[email protected]/message/RMH7GU5JHZ7EW2E4DAFHITHQRYF6PJG4/
Code of Conduct: http://python.org/psf/codeofconduct/
