On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote:
> On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> > Rustom Mody wrote:
> >
> > > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> > >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> > >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> > >> >> Wrote something up on why we should stop using ASCII:
> > >> >> http://blog.languager.org/2015/02/universal-unicode.html
> > >>
> > >> I think that the main point of the post, that many Unicode chars are
> > >> truly planetary rather than just national/regional, is excellent.
> > >
> > > <snipped>
> > >
> > >> You should add emoticons, but not call them or the above 'gibberish'.
> > >> I think that this part of your post is more 'unprofessional' than the
> > >> character blocks. It is very jarring and seems contrary to your main
> > >> point.
> > >
> > > Ok Done
> > >
> > > References to gibberish removed from
> > > http://blog.languager.org/2015/02/universal-unicode.html
> >
> > I consider it unethical to make semantic changes to a published work in place without acknowledgement. Fixing minor typos or spelling errors, or dead links, is okay. But any edit that changes the meaning should be commented on, either by an explicit note on the page itself, or by striking out the previous content and inserting the new.
>
> Dunno what you are grumping about…
>
> Anyway the attribution is made more explicit – footnote 5 in
> http://blog.languager.org/2015/03/whimsical-unicode.html.
>
> Note Terry Reedy, who mainly objected, was already acked earlier.
> I've just added one more ack¹
> And JFTR the 'publication' (O how archaic!) is the whole blog, not a single page, just as it is for any other dead-tree publication.
>
> > As for the content of the essay, it is currently rather unfocused.
>
> True.
>
> > It appears to be more of a list of "here are some Unicode characters I think are interesting, divided into subgroups, oh and here are some I personally don't have any use for, which makes them silly" than any sort of discussion about the universality of Unicode. That makes it rather idiosyncratic and parochial. Why should obscure maths symbols be given more importance than obscure historical languages?
>
> Idiosyncratic ≠ parochial
>
> > I think that the universality of Unicode could be explained in a single sentence:
> >
> > "It is the aim of Unicode to be the one character set anyone needs to represent every character, ideogram or symbol (but not necessarily distinct glyph) from any existing or historical human language."
> >
> > I can expand on that, but in a nutshell that is it.
> >
> > You state:
> >
> > "APL and Z Notation are two notable languages APL is a programming language and Z a specification language that did not tie themselves down to a restricted charset ..."
>
> Tsk tsk – dishonest snipping. I wrote
>
> | APL and Z Notation are two notable languages APL is a programming language
> | and Z a specification language that did not tie themselves down to a
> | restricted charset even in the day that ASCII ruled.
>
> so it's clear that 'restricted' applies to ASCII.
>
> > You list ideographs such as Cuneiform under "Icons". They are not icons. They are a mixture of symbols used for consonants, syllables, and logophonetic, consonantal alphabetic and syllabic signs.
> > That sits them firmly in the same categories as modern languages with consonants, ideogram languages like Chinese, and syllabary languages like Cheyenne.
>
> Ok changed to iconic.
> Obviously 2-3 millennia ago, when people spoke hieroglyphs or cuneiform they were languages.
> In 2015 when someone sees them and recognizes them, they are 'those things that Sumerians/Egyptians wrote'. No one except a rare expert knows those languages.
>
> > Just because native readers of Cuneiform are all dead doesn't make Cuneiform unimportant. There are probably more people who need to write Cuneiform than people who need to write APL source code.
> >
> > You make a comment:
> >
> > "To me – a unicode-layman – it looks unprofessional… Billions of computing devices world over, each having billions of storage words having their storage wasted on blocks such as these??"
> >
> > But that is nonsense, and it contradicts your earlier quoting of Dave Angel. Why are you so worried about an (illusionary) minor optimization?
>
> 2 < 4 as far as I am concerned.
> [If you disagree, one man's illusionary is another's waking]
>
> > Whether code points are allocated or not doesn't affect how much space they take up. There are millions of unused Unicode code points today. If they are allocated tomorrow, the space your documents take up will not increase one byte.
> >
> > Allocating code points to Cuneiform has not increased the space needed by Unicode at all. Two bytes alone is not enough for even existing human languages (thanks China). For hardware related reasons, it is faster and more efficient to use four bytes than three, so the obvious and "dumb" (in the sense of the simplest thing which will work) way to store Unicode is UTF-32, which takes a full four bytes per code point, regardless of whether there are 65537 code points or 1114112. That makes it less expensive than floating point numbers, which take eight. Would you like to argue that floating point doubles are "unprofessional" and wasteful?
> >
> > As Dave pointed out, and you apparently agreed with him enough to quote him TWICE (once in each of two blog posts), history of computing is full of premature optimizations for space. (In fact, some of these may have been justified by the technical limitations of the day.) Technically Unicode is also limited, but it is limited to over one million code points, 1114112 to be exact, although some of them are reserved as invalid for technical reasons, and there is no indication that we'll ever run out of space in Unicode.
> >
> > In practice, there are three common Unicode encodings that nearly all Unicode documents will use.
> >
> > * UTF-8 will use between one and (by memory) four bytes per code
> >   point. For Western European languages, that will be mostly one
> >   or two bytes per character.
> >
> > * UTF-16 uses a fixed two bytes per code point in the Basic Multilingual
> >   Plane, which is enough for nearly all Western European writing and
> >   much East Asian writing as well. For the rest, it uses a fixed four
> >   bytes per code point.
> >
> > * UTF-32 uses a fixed four bytes per code point. Hardly anyone uses
> >   this as a storage format.
> >
> > In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode doesn't change the space used.
> > If you actually include a few hieroglyphs in your document, the space increases only by the actual space used by those hieroglyphs: four bytes per hieroglyph. At no time does the existence of a single hieroglyph in your document force you to expand the non-hieroglyph characters to use more space.
> >
> > > What I was trying to say expanded here
> > > http://blog.languager.org/2015/03/whimsical-unicode.html
> >
> > You have at least two broken links, referring to a non-existent page:
> >
> > http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html
>
> Thanks, corrected.
>
> > This essay seems to be even more rambling and unfocused than the first. What does the cost of semi-conductor plants have to do with whether or not programmers support Unicode in their applications?
> >
> > Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte Order Mark. But if you interpret it as an explicit UTF-8 signature or mark, it isn't so silly. If your text begins with the UTF-8 mark, treat it as UTF-8. It's no more silly than any other heuristic, like HTML encoding tags or text editor's encoding cookies.
> >
> > Your discussion of "complexifiers and simplifiers" doesn't seem to be terribly relevant, or at least if it is relevant, you don't give any reason for it. The whole thing about Moore's Law and the cost of semi-conductor plants seems irrelevant to Unicode except in the most over-generalised sense of "things are bigger today than in the past, we've gone from five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?
>
> - Most people need only 16 bits.
> - Many notable examples of software fail going from 16 to 23.
> - If you are a software writer, and you fail going 16 to 23 it's ok, but try to give useful errors.
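Concretely, "fail but give useful errors" for software that only handles the 16-bit BMP might look something like the sketch below (a made-up helper, just to illustrate the point):

    def require_bmp(text):
        """Reject text outside the 16-bit Basic Multilingual Plane,
        with an error message that says exactly what went wrong."""
        for i, ch in enumerate(text):
            if ord(ch) > 0xFFFF:
                raise ValueError(
                    "character %r (U+%04X) at index %d is outside the BMP; "
                    "this program only handles 16-bit code points"
                    % (ch, ord(ch), i))
        return text

    # require_bmp("naive π")            # fine: both characters are in the BMP
    # require_bmp("snake \U0001F40D")   # ValueError naming the offending character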
Uh… 21 (not 23). That's what makes 3 chars per 64-bit word a possibility. A possibility that can become realistic if/when Intel decides to add 'packed-unicode' string instructions.
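To spell the arithmetic out: 3 x 21 = 63 bits, which leaves one bit spare in a 64-bit word. A rough Python sketch of such a packing (purely illustrative; the 'packed-unicode' instructions are hypothetical):

    def pack3(a, b, c):
        """Pack three Unicode code points (21 bits each) into one 64-bit word."""
        for cp in (a, b, c):
            if not 0 <= cp <= 0x10FFFF:
                raise ValueError("not a Unicode code point: %#x" % cp)
        return (a << 42) | (b << 21) | c

    def unpack3(word):
        """Inverse of pack3."""
        mask = (1 << 21) - 1
        return (word >> 42) & mask, (word >> 21) & mask, word & mask

    # 'a', 'π' and a snake emoji all fit in one 64-bit word under this scheme:
    w = pack3(ord('a'), ord('π'), ord('\U0001F40D'))
    assert w < 2**64
    assert unpack3(w) == (ord('a'), ord('π'), ord('\U0001F40D'))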