On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
> With that in mind, I, as many others, think that forcing Unicode bloat
> upon people by default is the most controversial feature of Python3.
> The reason is that you go very long way dealing with languages of the
> people of the world by just treating strings as consisting of 8-bit
> data. I'd say, that's enough for 90% of applications. Unicode is needed
> only if one needs to deal with multiple languages *at the same time*,
> which is fairly rare (remaining 10% of apps).
>
> And please keep in mind that MicroPython was originally intended (and
> should be remain scalable down to) an MCU. Unicode needed there is even
> less, and even less resources to support Unicode just because.

At some time (when jmf was making more intelligible noises) I had suggested
that the choice between 1/2/4 byte strings that happens at runtime in
python3's FSR can be made at python-start time with a command-line switch.

There are many combinations here; here is one in more detail:
Instead of having one (FSR) string engine, you have (up to) 4

- a pure 1 byte (ASCII)
- a pure 2 byte (BMP) with decode-failures for out-of-ranges
- a pure 4 byte -- everything UTF-32
- FSR dynamic switching at runtime (with massive moping from the world's jmfs)

The point is that only one of these engines would be brought into memory
based on command-line/config options.

Some more personal thoughts (that may be quite ill-informed!):

1. I regard myself as a unicode ignoramus+enthusiast. The world will be a
   better place if unicode is more pervasive.
   See http://blog.languager.org/2014/04/unicoded-python.html
   As it happens I am also a computer scientist -- I understand that in
   contexts where anything other than 8-bit chars is unacceptably
   inefficient, unicode-bloat may be a real thing.

2. My casual/cursory reading of the contents of the SMP-planes suggests
   that the stuff there is things like
   - egyptian hieroglyphics
   - mahjong characters
   - ancient greek musical symbols
   - alchemical symbols
   etc etc.
   IOW from pov of a universally acceptable character set this is mostly
   rubbish

And so a pure BMP-supporting implementation may be a reasonable compromise.
[As long as no surrogate-pairs are there]
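
To make the four-engine idea a bit more concrete, here is a rough sketch in plain Python. It is only an illustration under assumptions: the engine names ("ascii", "bmp", "utf32", "fsr") and the functions fsr_width and make_checker are hypothetical and exist neither in CPython nor in MicroPython. A real implementation would select the engine inside the interpreter at start-up, not in user code. The 1/2/4 byte thresholds in fsr_width follow what CPython's FSR (PEP 393) does, where the 1-byte form covers code points up to U+00FF; the "ascii" engine models the stricter pure-ASCII variant from the list above.

```python
# Hypothetical sketch: none of these names are real CPython/MicroPython APIs.

# Highest code point each fixed-width engine would accept.
ENGINE_LIMITS = {
    "ascii": 0x7F,      # pure 1 byte
    "bmp": 0xFFFF,      # pure 2 byte, nothing outside the BMP
    "utf32": 0x10FFFF,  # pure 4 byte, everything
}


def fsr_width(text):
    """Bytes per character an FSR-style engine would choose for this string."""
    top = max(map(ord, text), default=0)
    if top <= 0xFF:
        return 1
    if top <= 0xFFFF:
        return 2
    return 4


def make_checker(engine):
    """Return a function that accepts or rejects text for the chosen engine.

    This stands in for the start-time switch: the engine is picked once,
    and every string created afterwards goes through the same check.
    """
    if engine == "fsr":
        # FSR accepts everything; only the per-string width varies.
        return lambda text: text
    limit = ENGINE_LIMITS[engine]

    def check(text):
        for ch in text:
            if ord(ch) > limit:
                raise ValueError(
                    "U+%04X is out of range for the %r engine" % (ord(ch), engine)
                )
        return text

    return check


if __name__ == "__main__":
    accept = make_checker("bmp")      # would come from a command-line option
    accept("caf\u00e9 \u20ac")        # fine: everything is inside the BMP
    try:
        accept("\U0001F000")          # a mahjong tile, outside the BMP
    except ValueError as exc:
        print("rejected:", exc)
```

With the "bmp" engine the decode failure shows up as a ValueError here; the FSR engine would instead store the same string at 4 bytes per character.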