Rustom Mody wrote:

> Emoticons (or is it emoji) seems to have some (regional?) takeup?? Dunno…
> In any case I'd like to stay clear of political(izable) questions
Emoji is the term used in Japan, gradually spreading to the rest of the
world. Emoticons, I believe, should be restricted to the practice of using
ASCII-only digraphs and trigraphs such as :-) (colon, hyphen, right
parenthesis) to indicate "smileys".

I believe that emoji will eventually lead to Unicode's victory. People will
want smileys and piles of poo on their mobile phones, and from there it
will gradually spread everywhere. All they need to do to make victory
inevitable is add cartoon genitals...

>> I think that this part of your post is more 'unprofessional' than the
>> character blocks. It is very jarring and seems contrary to your main
>> point.
>
> Ok I need a word for
> 1. I have no need for this
> 2. 99.9% of the (living) on this planet also have no need for this

0.1% of the living is seven million people. I'll tell you what, you tell me
which seven million people should be relegated to second-class status, and
I'll tell them where you live. :-)

[...]

> I clearly am more enthusiastic than knowledgeable about unicode.
> But I know my basic CS well enough (as I am sure you and Chris also do)
>
> So I dont get how 4 bytes is not more expensive than 2.

Obviously it is. But it's only twice as expensive, and in computer science
terms that counts as "close enough". It's quite common for data structures
to "waste" space by using "no more than twice as much space as needed",
e.g. Python dicts and lists.

The whole Unicode range U+0000 to U+10FFFF needs only 21 bits, which fits
into three bytes. Nevertheless, there's no three-byte UTF encoding, because
on modern hardware it is more efficient to "waste" an entire extra byte per
code point and work with a nicely aligned four-byte unit. (There's a short
interpreter session further down with the actual per-code-point costs.)

> Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
> You could use a clever representation like UTF-8 or FSR.
> But I dont see how you can get out of this that full-unicode costs more
> than exclusive BMP.

Are you missing a word there? Costs "no more" perhaps?

> eg consider the case of 32 vs 64 bit executables.
> The 64 bit executable is generally larger than the 32 bit one
> Now consider the case of a machine that has say 2GB RAM and a 64-bit
> processor. You could -- I think -- make a reasonable case that all those
> all-zero hi-address-words are 'waste'.

Sure. But the whole point of 64-bit processors is to enable the use of more
than the 4GB of RAM that 32-bit addressing caps you at. One might as well
say that using 32-bit processors is wasteful if you only have 64K of
memory. Yes it is, but the only things which use 16-bit or 8-bit processors
these days are embedded devices.

[...]

> Math-Greek: Consider the math-alpha block
> http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block
>
> Now imagine a beginning student not getting the difference between font,
> glyph, character. To me this block represents this same error cast into
> concrete and dignified by the (supposed) authority of the unicode
> consortium.

Not being privy to the internal deliberations of the Consortium, I
sometimes find it difficult to tell why two symbols are declared to be mere
different glyphs for the same character in one case, and worthy of being
separate characters in another. E.g. I think we should all agree that the
English "A" and the French "A" shouldn't count as separate characters,
although the Greek "Α" and Russian "А" do.

In the case of the maths symbols, it isn't obvious to me what the deciding
factors were.
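Going back to the byte-count question for a moment, here is a quick
CPython 3 interpreter session showing what each standard UTF encoding
actually costs per code point (the sample characters are just ones I picked
to cover the ASCII, BMP and astral ranges):

    >>> chars = ["a", "Ω", "€", "😀"]   # U+0061, U+03A9, U+20AC, U+1F600
    >>> [len(c.encode("utf-8")) for c in chars]
    [1, 2, 3, 4]
    >>> [len(c.encode("utf-16-le")) for c in chars]   # -le so the BOM isn't counted
    [2, 2, 2, 4]
    >>> [len(c.encode("utf-32-le")) for c in chars]
    [4, 4, 4, 4]
    >>> (0x10FFFF).bit_length()   # the whole Unicode range fits in 21 bits
    21

Even in the worst case, UTF-32 is only twice the cost of UTF-16 for BMP
text, which is the "no more than twice as much space as needed" trade-off
described above.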
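For what it's worth, the standard library's unicodedata module will also
tell you what the Consortium decided about the two N's discussed below,
although of course not why:

    >>> import unicodedata
    >>> unicodedata.name("ℕ"), unicodedata.name("N")
    ('DOUBLE-STRUCK CAPITAL N', 'LATIN CAPITAL LETTER N')
    >>> unicodedata.normalize("NFKC", "ℕ")   # compatibility normalization folds it back
    'N'

So they are distinct code points, but U+2115 carries a compatibility
mapping back to the ordinary letter.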
I know that one of the criteria they use is whether or not users of the
symbols have a tradition of treating them as mere different glyphs, i.e.
stylistic variations. In this case, I'm pretty sure that mathematicians
would *not* consider:

    U+2115 DOUBLE-STRUCK CAPITAL N "ℕ"
    U+004E LATIN CAPITAL LETTER N "N"

as mere stylistic variations. If you defined a matrix called ℕ, you would
probably be told off for using the wrong symbol, not for using the wrong
formatting. On the other hand, I'm not so sure about U+210E PLANCK CONSTANT
"ℎ" versus a mere lowercase h (possibly in italic).

> There are probably dozens of other such stupidities like distinguishing
> kelvin K from latin K as if that is the business of the unicode consortium

But it *is* the business of the Unicode consortium. They have at least two
important aims:

- to be able to represent every possible human-language character;

- to allow lossless round-trip conversion to all existing legacy encodings
  (for the subset of Unicode handled by that encoding).

The second aim is why Unicode includes code points for degree-Celsius and
degree-Fahrenheit, rather than just using °C and °F like sane people: some
idiot^W code-page designer back in the 1980s or 90s decided to add ℃ and ℉
as single characters, so now Unicode has to be able to round-trip (say)
"°C℃" without loss. I imagine that the same applies to U+212A KELVIN SIGN
"K".

> My real reservations about unicode come from their work in areas that I
> happen to know something about
>
> Music: To put music simply as a few mostly-meaningless 'dingbats' like
> ♩ ♪ ♫ is perhaps ok However all this stuff
> http://xahlee.info/comp/unicode_music_symbols.html
> makes no sense (to me) given that music (ie standard western music written
> in staff notation) is inherently 2 dimensional -- multi-voiced,
> multi-staff, chordal

(1) Text can also be two dimensional.

(2) Where you put the symbol on the page is a separate question from
whether or not the symbol exists.

> Consists of bogus letters that dont exist in devanagari
> The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
> But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
> So I call it bogus-devanagari

Hmm, well I love Wikipedia as much as the next guy, but I think that even
Jimmy Wales would suggest that Wikipedia is not a primary source for what
counts as Devanagari vowels. What makes you think that Wikipedia is right
and Unicode is wrong?

That's not to say that Unicode hasn't made some mistakes. There are a few
deprecated code points, and code points that have been given the wrong
name. Oops. Mistakes happen.

> Contrariwise an important letter in vedic pronunciation the double-udatta
> is missing
> http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html

I quote:

    I do not see any need for a "double udaatta". Perhaps "double
    ANudaatta" is meant here?

I don't know Sanskrit, but if somebody suggested that Unicode doesn't
support English because the important letter "double-oh" (as in "moon",
"spoon", "croon" etc.) was missing, I wouldn't be terribly impressed. We
have a "double-u" letter, why not "double-oh"?

Another quote:

    I should strongly recommend not to hurry with a standardization
    proposal until the text collection of Vedic texts has been finished

In other words, even the experts in Vedic texts don't yet know all the
characters which they may or may not need.


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list