On Fri, Feb 27, 2015 at 10:09 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> Unicode
>> isn't about taking everyone's separate character sets and numbering
>> them all so we can reference characters from anywhere; if you wanted
>> that, you'd be much better off with something that lets you specify a
>> code page in 16 bits and a character in 8, which is roughly the same
>> size as Unicode anyway.
>
> Well, except for the approximately 25% of people in the world whose native
> language has more than 256 characters.
You could always allocate multiple code pages to one language. But since I'm not advocating this system, I'm only guessing at solutions to its problems.

> It sounds like you are referring to some sort of "shift code" system. Some
> legacy East Asian encodings use a similar scheme, and depending on how they
> are implemented they have great disadvantages. For example, Shift-JIS
> suffers from a number of weaknesses including that a single byte corrupted
> in transmission can cause large swaths of the following text to be
> corrupted. With Unicode, a single corrupted byte can only corrupt a single
> code point.

That's exactly what I was hinting at. There are plenty of systems like that, and they are badly flawed compared to a simple universal system for a number of reasons. One is the corruption issue you mention; another is that a simple memory-based text search becomes utterly useless (to locate text in a document, you'd need to do a whole lot of stateful parsing - not to mention the difficulties of doing "similar-to" searches across languages); concatenation of text also becomes a stateful operation, and so do all sorts of other simple manipulations. Unicode may demand a bit more storage in certain circumstances (where an eight-bit encoding might have handled your entire document), but it's so much easier for the general case.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
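To make the resynchronisation point concrete, here is a small Python sketch (the sample string and the dropped-byte position are arbitrary choices for illustration, not anything from the thread above): it drops one byte from the same text encoded as Shift-JIS and as UTF-8, then decodes what is left.

    # Illustrative demo only: drop one byte from the same string encoded
    # two different ways and see how much of the text survives decoding.
    text = "日本語のテキスト"

    for codec in ("shift_jis", "utf-8"):
        data = bytearray(text.encode(codec))
        del data[3]  # simulate a single byte lost in transmission
        print(codec, "->", bytes(data).decode(codec, errors="replace"))

    # shift_jis: the decoder mis-pairs the bytes after the gap, so several
    #            of the following characters come out as unrelated garbage
    #            before it happens to fall back into step.
    # utf-8:     only the code point whose byte was dropped is lost (its
    #            leftover bytes show up as U+FFFD); everything after it
    #            comes through intact.

UTF-8 recovers because it reserves disjoint byte ranges for lead and continuation bytes, so a decoder (or a plain byte-wise substring search) can always tell where a character starts; in Shift-JIS a trail byte can look like a lead byte or an ASCII/katakana byte, which is also why the byte-oriented searching and concatenation mentioned above become stateful.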