Rustom Mody wrote:

> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> On 2/26/2015 8:24 AM, Chris Angelico wrote:
>> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
>> >> Wrote something up on why we should stop using ASCII:
>> >> http://blog.languager.org/2015/02/universal-unicode.html
>>
>> I think that the main point of the post, that many Unicode chars are
>> truly planetary rather than just national/regional, is excellent.
>
> <snipped>
>
>> You should add emoticons, but not call them or the above 'gibberish'.
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks. It is very jarring and seems contrary to your main
>> point.
>
> Ok Done
>
> References to gibberish removed from
> http://blog.languager.org/2015/02/universal-unicode.html
I consider it unethical to make semantic changes to a published work in
place without acknowledgement. Fixing minor typos or spelling errors, or
dead links, is okay. But any edit that changes the meaning should be
commented on, either by an explicit note on the page itself, or by
striking out the previous content and inserting the new.

As for the content of the essay, it is currently rather unfocused. It
reads more like a list of "here are some Unicode characters I think are
interesting, divided into subgroups, oh and here are some I personally
don't have any use for, which makes them silly" than any sort of
discussion of the universality of Unicode. That makes it rather
idiosyncratic and parochial. Why should obscure maths symbols be given
more importance than obscure historical languages?

I think that the universality of Unicode can be explained in a single
sentence: "It is the aim of Unicode to be the one character set anyone
needs to represent every character, ideogram or symbol (but not
necessarily every distinct glyph) from any existing or historical human
language." I can expand on that, but in a nutshell that is it.

You state:

"APL and Z Notation are two notable languages (APL is a programming
language and Z a specification language) that did not tie themselves
down to a restricted charset ..."

but I don't think that is correct. I'm pretty sure that neither APL nor
Z allowed you to define new characters. They might not have used ASCII
alone, but they still had a restricted character set. It was merely
less restricted than ASCII.

You make a comment about Cobol's relative unpopularity, but (1) Cobol
does not require you to write out numbers as English words, and (2)
Cobol is still used: there are uncounted billions of lines of Cobol
code in production, and even if there are fewer Cobol programmers now
than there were 16 years ago, there are still a lot of them. Academics
and FOSS programmers don't think much of Cobol, but it has to count as
one of the most amazing success stories in the field of programming
languages, despite its lousy design.

You list ideographs such as Cuneiform under "Icons". They are not
icons. They are a mixture of logophonetic, consonantal alphabetic and
syllabic signs, which puts them firmly in the same categories as modern
consonant-based scripts, ideographic languages like Chinese, and
syllabary languages like Cheyenne.

Just because the native readers of Cuneiform are all dead doesn't make
Cuneiform unimportant. There are probably more people who need to write
Cuneiform than people who need to write APL source code.

You make a comment:

"To me – a unicode-layman – it looks unprofessional… Billions of
computing devices world over, each having billions of storage words
having their storage wasted on blocks such as these??"

But that is nonsense, and it contradicts your earlier quoting of Dave
Angel. Why are you so worried about an (illusory) minor optimization?

Whether code points are allocated or not doesn't affect how much space
they take up. There are millions of unused Unicode code points today.
If they are allocated tomorrow, the space your documents take up will
not increase by one byte. Allocating code points to Cuneiform has not
increased the space needed by Unicode at all.
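To make the (non-)cost concrete, here is a minimal sketch at a Python 3
prompt; the sample sentence is arbitrary, but the byte counts are not:

>>> text = "To be, or not to be"
>>> len(text.encode('utf-8'))     # pure ASCII: one byte per character
19
>>> len((text + '\U00012000').encode('utf-8'))   # append one cuneiform sign (U+12000, CUNEIFORM SIGN A)
23

The nineteen ASCII bytes stay nineteen bytes no matter how many other
code points the Unicode Consortium allocates; the only growth comes
from the one cuneiform sign you actually typed.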
Two bytes alone are not enough even for existing human languages
(thanks, China).

For hardware-related reasons, it is faster and more efficient to use
four bytes than three, so the obvious and "dumb" (as in "the simplest
thing that will work") way to store Unicode is UTF-32, which takes a
full four bytes per code point, regardless of whether there are 65537
code points or 1114112. That makes it less expensive than floating
point numbers, which take eight. Would you like to argue that floating
point doubles are "unprofessional" and wasteful?

As Dave pointed out, and you apparently agreed with him enough to quote
him TWICE (once in each of two blog posts), the history of computing is
full of premature optimizations for space. (In fact, some of them may
have been justified by the technical limitations of the day.)
Technically Unicode is also limited, but it is limited to over one
million code points, 1114112 to be exact, some of which are reserved as
invalid for technical reasons, and there is no indication that we'll
ever run out of space in Unicode.

In practice, there are three common Unicode encodings that nearly all
Unicode documents will use.

* UTF-8 uses between one and (from memory) four bytes per code point.
  For Western European languages, that will be mostly one or two bytes
  per character.

* UTF-16 uses a fixed two bytes per code point in the Basic
  Multilingual Plane, which is enough for nearly all Western European
  writing and much East Asian writing as well. For the rest, it uses a
  fixed four bytes per code point.

* UTF-32 uses a fixed four bytes per code point. Hardly anyone uses
  this as a storage format.

In *all three cases*, the existence of hieroglyphs and cuneiform in
Unicode doesn't change the space used. If you actually include a few
hieroglyphs in your document, the space increases only by the actual
space used by those hieroglyphs: four bytes per hieroglyph. At no time
does the existence of a single hieroglyph in your document force the
non-hieroglyph characters to take up more space.

> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html

You have at least two broken links, referring to a non-existent page:

http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

This essay seems to be even more rambling and unfocused than the first.
What does the cost of semiconductor plants have to do with whether or
not programmers support Unicode in their applications?

Your point about the UTF-8 "BOM" is valid only if you interpret it as a
Byte Order Mark. But if you interpret it as an explicit UTF-8 signature
or mark, it isn't so silly. If your text begins with the UTF-8 mark,
treat it as UTF-8. That's no more silly than any other heuristic, like
HTML encoding tags or text editors' encoding cookies.

Your discussion of "complexifiers and simplifiers" doesn't seem to be
terribly relevant, or at least if it is relevant, you don't give any
reason why.

The whole thing about Moore's Law and the cost of semiconductor plants
seems irrelevant to Unicode except in the most over-generalised sense
of "things are bigger today than in the past; we've gone from five-bit
Baudot codes to 21-bit Unicode". Yeah, okay. So what's your point?

You agree that 16 bits are not enough, and yet you criticise Unicode
for using more than 16 bits on wasteful, whimsical gibberish like
Cuneiform? That is an inconsistent position to take.

UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode
support.
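If you want to see what that looks like in practice, here is a minimal
sketch at a Python 3 prompt (Python 3.3 or later, where str handles the
supplementary planes transparently); I'm using CUNEIFORM SIGN A,
U+12000, purely as an example:

>>> len('A'.encode('utf-16-le'))           # BMP code point: one 16-bit unit
2
>>> '\U00012000'.encode('utf-16-le')       # SMP code point: a surrogate pair
b'\x08\xd8\x00\xdc'
>>> len('\U00012000'.encode('utf-16-le'))  # i.e. four bytes
4
>>> len('\U00012000')                      # but still a single character
1

A language or library that does that bookkeeping for you is handling
UTF-16 correctly; one that hands you the two surrogates as if they were
two separate characters is not.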
The problem is when your language treats UTF-16 as a fixed-width
two-byte format instead of a variable-width, two- or four-byte format.
(That's more or less the old, obsolete UCS-2 standard.) There are all
sorts of good ways to solve the problem of surrogate pairs and the SMPs
in UTF-16. If some programming languages or software fail to do so,
they are buggy, not UTF-16.

After explaining that 16 bits are not enough, you then propose a 16-bit
standard. /face-palm

UTF-16 cannot break the fixed-width invariant, because it has no
fixed-width invariant. That's like arguing against UTF-8 because it
breaks the fixed-width invariant "all characters are single-byte ASCII
characters". If you cannot handle SMP characters, you are not
supporting Unicode.

You suggest that Chinese users should be looking at Big5 or GB. I
really, really don't think so.

- Neither is universal. What makes you think that Chinese writers need
  to use maths symbols, or include (say) Thai or Russian in their work,
  any less than Western writers do?

- Neither even supports all of Chinese. Big5 supports Traditional
  Chinese, but not Simplified Chinese. GB supports Simplified Chinese,
  but not Traditional Chinese.

- Big5 likewise doesn't cover many place names, people's names, and
  other less common parts of Chinese.

- Big5 is a shift system, like Shift-JIS, and suffers from the same
  sort of data corruption issues.

- There is no one single Big5 standard, but a whole lot of vendor
  extensions.

You say: "I just want to suggest that the Unicode consortium going
overboard in adding zillions of codepoints of nearly zero usefulness,
is in fact undermining unicode’s popularity and spread."

Can you demonstrate this? Can you show somebody who says "Well, I was
going to support full Unicode, but since they added a snowman, I'm
going to stick to ASCII"?

The "whimsical" characters you are complaining about were important
enough to somebody that they spent significant amounts of time and
money writing up a proposal, putting it through the Unicode Consortium
bureaucracy, and eventually having it accepted. That's not easy or
cheap, and people didn't add a snowman on a whim. They did it because
there are a whole lot of people who want a shared standard for map
symbols.

It is easy to mock what is not important to you. I daresay kids adding
emoji to their 10-character tweets would mock all those useless maths
symbols in Unicode too.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list