On Sat, 9 Apr 2016 03:21 am, Peter Pearson wrote:

> On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano <st...@pearwood.info>
> wrote:
>> On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote:
>>>
>>> The Unicode consortium was certifiably insane when it went into the
>>> typesetting business.
>>
>> They are not, and never have been, in the typesetting business. Perhaps
>> characters are not the only things easily confused *wink*
>
> Defining codepoints that deal with appearance but not with meaning is
> going into the typesetting business. Examples: ligatures, and spaces of
> varying widths with specific typesetting properties like being
> non-breaking.
Both of which are covered by the requirement that Unicode be capable of
representing legacy encodings/code pages. Examples: MacRoman contains fl
and fi ligatures, and NBSP.

Non-breaking space is not so much a typesetting property as a semantic
property; that is, it deals with *meaning* (exactly what you suggested it
doesn't deal with). It is a space which doesn't break words.

Ligatures are a good example -- the Unicode consortium have explicitly
refused to add other ligatures beyond the handful needed for backwards
compatibility, because they maintain that it is a typesetting issue best
handled by the font. There's even a FAQ about that very issue, and I
quote:

    "The existing ligatures exist basically for compatibility and
    round-tripping with non-Unicode character sets. Their use is
    discouraged. No more will be encoded in any circumstances."

http://www.unicode.org/faq/ligature_digraph.html#Lig2

Unicode currently contains something of the order of one hundred and ten
thousand defined code points. I'm sure that if you went through the
entire list, with a sufficiently loose definition of "typesetting", you
could probably find some that exist only for presentation and aren't
covered by the legacy encoding clause. So what? One swallow does not mean
the season is spring.

Unicode makes an explicit rejection of being responsible for typesetting.
See their discussion on presentation forms:

http://www.unicode.org/faq/ligature_digraph.html#PForms

But I will grant you that sometimes there's a grey area between
presentation and semantics, and the Unicode consortium has to make a
decision one way or another. Those decisions may not always be completely
consistent, and may be driven by political and/or popular demand. E.g.
the Consortium explicitly state that stylistic issues such as bold,
italic, superscript etc. are up to the layout engine or markup, and
shouldn't be part of the Unicode character set.
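For what it's worth, you can check both claims from Python's standard
library (a sketch only; it relies on the stdlib unicodedata module and
the mac-roman codec):

```python
import unicodedata

# The fi ligature (U+FB01) exists for round-tripping with legacy
# character sets; NFKC compatibility normalization folds it back
# to the plain two-character sequence "fi".
lig = "\N{LATIN SMALL LIGATURE FI}"
assert unicodedata.normalize("NFKC", lig) == "fi"

# MacRoman really does contain the fi/fl ligatures and NBSP:
assert "\N{LATIN SMALL LIGATURE FI}".encode("mac-roman") == b"\xde"
assert "\N{LATIN SMALL LIGATURE FL}".encode("mac-roman") == b"\xdf"
assert "\N{NO-BREAK SPACE}".encode("mac-roman") == b"\xca"
```

So without those code points, a MacRoman file could not round-trip
through Unicode losslessly.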
They insist that they only show representative glyphs for code points,
and that font designers and vendors are free (within certain limits) to
modify the presentation as desired. Nevertheless, there are specialist
characters with distinct formatting, variation selectors for specifying
a specific glyph, and emoji modifiers for specifying skin tone.

But when you get down to fundamentals, character sets and alphabets have
always blurred the line between presentation and meaning. W ("double-u")
was, once upon a time, literally UU, and & (ampersand) started off as a
ligature of "et" (Latin for "and"). There are always going to be cases
where well-meaning people can agree to disagree on whether adding a
particular character to Unicode was justified.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list