random...@fastmail.us wrote:

> On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
>
> UTF-8 is already restricted from representing values above 0x10FFFF,
> whereas UTF-8 can "naturally" represent values up to 0x1FFFFF in four
> bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If
> anything, the BMP represents a natural boundary, since it coincides with
> values that can be represented in three bytes. Likewise, UTF-32 can
> obviously represent values up to 0xFFFFFFFF. You're programming in
> artificial restrictions either way, it's just a question of what those
> restrictions are.
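For reference, the byte-length boundaries quoted above follow directly from the UTF-8 bit layout. Here is a minimal Python sketch; the helper names `max_code_point` and `utf8_encode` are made up for illustration and are not part of any library, and the encoder is a hand-rolled toy rather than how CPython does it:

```python
def max_code_point(nbytes: int) -> int:
    """Largest value an nbytes-long UTF-8-style sequence can carry."""
    if nbytes == 1:
        return 0x7F  # 0xxxxxxx: 7 payload bits
    # The lead byte carries (7 - nbytes) payload bits,
    # and each continuation byte (10xxxxxx) carries 6 more.
    bits = (7 - nbytes) + 6 * (nbytes - 1)
    return (1 << bits) - 1


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative only)."""
    # These two checks are the "artificial" restrictions under discussion.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded")
    if cp <= max_code_point(1):
        return bytes([cp])
    if cp <= max_code_point(2):
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= max_code_point(3):  # the whole BMP fits in three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Supplementary planes: one more continuation byte, same pattern.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for n in range(1, 7):
    print(n, hex(max_code_point(n)))
```

The loop prints the per-length ceilings (0xffff for three bytes, 0x1fffff for four, and so on up to 0x7fffffff for six). Note that the four-byte branch has the same shape as the three-byte one, just with one extra continuation byte.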
Good points, but they don't greatly change my conclusion. If you are implementing UTF-8 or UTF-32, it is no harder to deal with code points in the SMP than those in the BMP.

-- 
Steven