On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano
>> <steve+comp.lang.pyt...@pearwood.info> wrote:
>>
>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>>
>>> You are defining a variable/fixed width codepoint set. Many others
>>> want to deal with CHARACTER sets.
>>
>> Good luck coming up with a universal, objective, language-neutral,
>> consistent definition for a character.
>>
> Who says there needs to be one. A good engineer will use the definition
> that is most appropriate to the task at hand. Some things need very
> solid definitions, and some things don’t.
Then the problem is solved: we have a perfectly good de facto definition
of character: it is a synonym for "code point", and every single one of
Marko's objections disappears.

> This goes back to my original point, where I said some people consider
> UTF-32 as a variable width encoding. For very many things, practically,
> the ‘codepoint’ isn’t the important thing,

Ah, is this another one of those "let's pick a definition that nobody
else uses, and state it as a fact" like UTF-32 being variable width?

If by "very many things", you mean "not very many things", I agree with
you.

In my experience, dealing with code points is "good enough", especially
if you use Western European alphabets, and even more so if you're
willing to do a normalization step before processing text. But of course
other people's experience may vary.

I'm interested in learning about the library you use to process
graphemes in your software.

> so the fact that every UTF-32
> code point takes the same number of bytes or code words isn’t that
> important. They are dealing with something that needs to be rendered,
> and preserving larger units, like the grapheme, is important.

If you're writing a text widget or a shell, you need to worry about
rendering glyphs. Everyone else just delegates to their text widget, GUI
framework, or shell.

>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t
>>> the magical cure that some were hoping for.
>>
>> Nobody ever claimed it was, except for the people railing that since
>> it isn't a magical system we ought to go back to the Good Old Days of
>> code page hell, or even further back when everyone just used ASCII.
>>
> Sometimes ASCII is good enough, especially on a small machine with
> limited resources.

I doubt that there are many general purpose computers with resources
*that* limited. Even MicroPython supports Unicode, and that runs on
embedded devices with memory measured in kilobytes.
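[Aside for the archive: the "normalization step" mentioned above can be
sketched in a few lines with Python's standard-library `unicodedata`
module. This is just an illustration of why normalizing first makes
code-point handling good enough for many Western European texts; it is
not anyone's production code.]

```python
import unicodedata

# "é" spelled as TWO code points: LATIN SMALL LETTER E followed by
# COMBINING ACUTE ACCENT (U+0301).
decomposed = "e\u0301"
print(len(decomposed))  # counts code points: 2

# NFC normalization composes the pair into the single precomposed
# code point U+00E9, so len() now matches the visible character count.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))            # 1
print(composed == "\u00e9")     # True
```

[Note that NFC cannot compose everything -- some grapheme clusters, such
as emoji ZWJ sequences, have no single-code-point form, which is where a
dedicated grapheme library earns its keep.]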
8K is considered the smallest amount of memory usable with MicroPython,
although 128K is more realistic as the *practical* lower limit.

In the mid 1980s, I was using computers with 128K of RAM, and they were
still able to deal with more than just ASCII.

I think the "limited resources" argument is bogus.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it
everywhere." -- Jon Ronson
-- 
https://mail.python.org/mailman/listinfo/python-list