Tom Christiansen <tchr...@perl.com> added the comment:

>> Perl does not provide the old 1.0 names at all.  We don't have a Unicode
>> 1.0 legacy to support, which makes this cleaner.  However, we do provide
>> for the names of the C0 and C1 Control Codes, because apart from Unicode
>> 1.0, they don't condescend to name the ASCII or Latin1 control codes.

> If there would be a reasonably official source for these names, and one
> that guarantees that there is no collision with UCD names, I could
> accept doing so for Python as well.

The C0 and C1 control code names don't change.  There is/was one stability issue where they screwed up, because they ended up having a UAX (required) and a UTS (not required) fighting because of the dumb stuff they did with the Emoji names.  They neglected to prefix them with "Emoji ..." or some such, the way things like "GREEK ... LETTER ..." or "MATHEMATICAL ..." or "MUSICAL ..." did.  The problem is they stole BELL without calling it EMOJI BELL.  This is the C0 name for Control-G.  Dimwits.

The problem with official names is that they have things in them that you do not expect in names.  Do you really and truly mean to tell me you think it is somehow **good** that people are forced to write

    \N{LINE FEED (LF)}

rather than the more obvious pair of

    \N{LINE FEED}
    \N{LF}

??  If so, then I don't understand that.  Nobody in their right mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LF}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED (LF)}"'
    U+000A

    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEL}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE (NEL)}"'
    U+0085

>> We also provide for certain well known aliases from the Names file:
>> anything that says "* commonly abbreviated as ...", so things like LRO
>> and ZWJ and such.

> -1. Readability counts, writability not so much (I know this is
> different for Perl :-).

I actually very strongly resent and rebuff that entire mindset in the most extreme way possible.  Well-written Perl code is perfectly readable by people who speak that language.
If you find Perl code that isn't readable, it is by definition not well-written.

*PLEASE* don't start.  Yes, I just got done driving 16 hours and am overtired, but it's something I've been fighting against all of my professional career.  It's a "leyenda negra" (a black legend).

> If there is too much aliasing, people will
> wonder what these codes actually mean.

There are 15 "commonly abbreviated as" aliases in the Names.txt file.

    * commonly abbreviated as NBSP
    * commonly abbreviated as SHY
    * commonly abbreviated as CGJ
    * commonly abbreviated ZWSP
    * commonly abbreviated ZWNJ
    * commonly abbreviated ZWJ
    * commonly abbreviated LRM
    * commonly abbreviated RLM
    * commonly abbreviated LRE
    * commonly abbreviated RLE
    * commonly abbreviated PDF
    * commonly abbreviated LRO
    * commonly abbreviated RLO
    * commonly abbreviated NNBSP
    * commonly abbreviated WJ

All of the standards documents *talk* about things like LRO and ZWNJ.  I guess the standards aren't "readable" then, right? :)

From the charnames manpage, which shows that we really don't just make these up as we feel like (although we could; see below).  They're all from this or that standard:

   ALIASES
       A few aliases have been defined for convenience: instead of having to
       use the official names

           LINE FEED (LF)
           FORM FEED (FF)
           CARRIAGE RETURN (CR)
           NEXT LINE (NEL)

       (yes, with parentheses), one can use

           LINE FEED
           FORM FEED
           CARRIAGE RETURN
           NEXT LINE
           LF
           FF
           CR
           NEL

       All the other standard abbreviations for the controls, such as "ACK"
       for "ACKNOWLEDGE" also can be used.

       One can also use

           BYTE ORDER MARK
           BOM

       and these abbreviations

           Abbreviation        Full Name

           CGJ                 COMBINING GRAPHEME JOINER
           FVS1                MONGOLIAN FREE VARIATION SELECTOR ONE
           FVS2                MONGOLIAN FREE VARIATION SELECTOR TWO
           FVS3                MONGOLIAN FREE VARIATION SELECTOR THREE
           LRE                 LEFT-TO-RIGHT EMBEDDING
           LRM                 LEFT-TO-RIGHT MARK
           LRO                 LEFT-TO-RIGHT OVERRIDE
           MMSP                MEDIUM MATHEMATICAL SPACE
           MVS                 MONGOLIAN VOWEL SEPARATOR
           NBSP                NO-BREAK SPACE
           NNBSP               NARROW NO-BREAK SPACE
           PDF                 POP DIRECTIONAL FORMATTING
           RLE                 RIGHT-TO-LEFT EMBEDDING
           RLM                 RIGHT-TO-LEFT MARK
           RLO                 RIGHT-TO-LEFT OVERRIDE
           SHY                 SOFT HYPHEN
           VS1                 VARIATION SELECTOR-1
           .
           .
           .
           VS256               VARIATION SELECTOR-256
           WJ                  WORD JOINER
           ZWJ                 ZERO WIDTH JOINER
           ZWNJ                ZERO WIDTH NON-JOINER
           ZWSP                ZERO WIDTH SPACE

       For backward compatibility one can use the old names for certain C0
       and C1 controls

           old                                       new

           FILE SEPARATOR                            INFORMATION SEPARATOR FOUR
           GROUP SEPARATOR                           INFORMATION SEPARATOR THREE
           HORIZONTAL TABULATION                     CHARACTER TABULATION
           HORIZONTAL TABULATION SET                 CHARACTER TABULATION SET
           HORIZONTAL TABULATION WITH JUSTIFICATION  CHARACTER TABULATION WITH JUSTIFICATION
           PARTIAL LINE DOWN                         PARTIAL LINE FORWARD
           PARTIAL LINE UP                           PARTIAL LINE BACKWARD
           RECORD SEPARATOR                          INFORMATION SEPARATOR TWO
           REVERSE INDEX                             REVERSE LINE FEED
           UNIT SEPARATOR                            INFORMATION SEPARATOR ONE
           VERTICAL TABULATION                       LINE TABULATION
           VERTICAL TABULATION SET                   LINE TABULATION SET

       but the old names in addition to giving the character will also give
       a warning about being deprecated.

       And finally, certain published variants are usable, including some
       for controls that have no Unicode names:

           name                                   character

           END OF PROTECTED AREA                  END OF GUARDED AREA, U+0097
           HIGH OCTET PRESET                      U+0081
           HOP                                    U+0081
           IND                                    U+0084
           INDEX                                  U+0084
           PAD                                    U+0080
           PADDING CHARACTER                      U+0080
           PRIVATE USE 1                          PRIVATE USE ONE, U+0091
           PRIVATE USE 2                          PRIVATE USE TWO, U+0092
           SGC                                    U+0099
           SINGLE GRAPHIC CHARACTER INTRODUCER    U+0099
           SINGLE-SHIFT 2                         SINGLE SHIFT TWO, U+008E
           SINGLE-SHIFT 3                         SINGLE SHIFT THREE, U+008F
           START OF PROTECTED AREA                START OF GUARDED AREA, U+0096

   perl v5.14.0                      2011-05-07

Those are the defaults.  They are overridable.  That's because we feel that people should be able to name their character constants however they feel makes sense for them.  If they get tired of typing

    \N{LATIN SMALL LETTER U WITH DIAERESIS}

let alone

    \N{LATIN CAPITAL LETTER THORN WITH STROKE THROUGH DESCENDER}

then they don't have to, because there is a mechanism for making aliases:

    use charnames ":full", ":alias" => {
        U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
        u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
    };

That way you can do

    s/\N{U_uml}/UE/;
    s/\N{u_uml}/ue/;

This is probably not as persuasive as the private-use case described below.

It is important to remember that all charname bindings in Perl are attached to a *lexically-scoped declaration*.  Each binding is completely constrained to operate only within that lexical scope.  That's why the compiler replaces the names with their code points in things like

    use charnames ":full", ":alias" => {
        U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
        u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
    };

    my $find_u_uml = qr/\N{u_uml}/i;

    print "Search pattern is: $find_u_uml\n";

which dutifully prints out:

    Search pattern is: (?^ui:\N{U+FC})

So charname bindings are never "hard to read" because the effect is completely lexically constrained, and can never leak outside of the scope.

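For readers coming at this from the Python side, the alias table above can be roughly approximated with a plain dictionary in front of `unicodedata.lookup`.  This is only a minimal sketch: the names `ALIASES`, `resolve_name`, and `expand` are hypothetical, nothing here is part of Perl's charnames or of Python's `\N{...}` handling, and the lookup happens at run time rather than being compile-time and lexically scoped as in the Perl code.

```python
import re
import unicodedata

# Hypothetical alias table mirroring the Perl ":alias" example above.
ALIASES = {
    "U_uml": "LATIN CAPITAL LETTER U WITH DIAERESIS",
    "u_uml": "LATIN SMALL LETTER U WITH DIAERESIS",
}


def resolve_name(name, aliases=ALIASES):
    """Return the character for a user alias or an official Unicode name."""
    official = aliases.get(name, name)
    # unicodedata.lookup raises KeyError if the name is unknown.
    return unicodedata.lookup(official)


def expand(text, aliases=ALIASES):
    """Replace literal \\N{name} sequences in text, honouring the alias table."""
    return re.sub(
        r"\\N\{([^}]+)\}",
        lambda match: resolve_name(match.group(1), aliases),
        text,
    )


if __name__ == "__main__":
    # A raw string keeps \N{...} as literal text for expand() to process.
    print(expand(r"Gr\N{u_uml}n"))
```

Because the table is passed in as an argument, a caller can hand `expand` its own dictionary instead of the module-level default, which limits how far a given set of aliases reaches, although far less strictly than a lexical scope does.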
I realize (or at least, believe) that Python has no notion of nested lexical scopes, and like many things, this sort of thing can therefore never work there because of that.

The most persuasive use-case for user-defined names is for private-use area code points.  These will never have an official name.  But it is just fine to use them.  Don't they deserve a better name, one that makes sense within your own program that uses them?  Of course they do.

For example, Apple has a bunch of private-use glyphs they use all the time.  In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate logo/glyph thingie of an apple with a bite taken out of it.  (Microsoft also has a bunch of these.)  If you upgrade MacRoman to Unicode, you will find that that 0xF0 maps to code point U+F8FF using the regular converter.

Now what are you supposed to do in your program when you want a named character there?  You certainly do not want to make users put an opaque magic number as a Unicode escape.  That is always really lame, because the whole reason we have \N{...} escapes is so we don't have to put mysterious unreadable magic numbers in our code!!

So all you do is

    use charnames ":alias" => {
        "APPLE LOGO" => 0xF8FF,
    };

and now you can use \N{APPLE LOGO} anywhere within that lexical scope.  The compiler will dutifully resolve it to U+F8FF, since all name lookups happen at compile-time.  And it cannot leak out of the scope.

I assert that this facility makes your program more readable, and its absence makes your program less readable.

Private use characters are important in Asian texts, but they are also important for other things.  For example, Unicode intends to get around to allocating Tengwar up in the SMP.  However, lots of stupid old code can't use full Unicode, being constrained to UCS-2 only.  So many Tengwar fonts start at a different base, and put it in the private use area instead of the SMP.
Here are two constants:

    use constant {
        TB_CONSCRIPT_UNICODE_REGISTRY => 0x00_E000,  # private use
        TB_UNICODE_CONSORTIIUM        => 0x01_6080,  # where it will really go
    };

I have an entire Tengwar module that makes heavy use of named private-use characters.  All I do is this:

    use constant TENGWAR_BASE => TB_CONSCRIPT_UNICODE_REGISTRY;

    use charnames ":alias" => {
        reverse (
            (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
            (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
            (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
            (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
            (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
            ....
        )
    };

Now you can write \N{TENGWAR LETTER TINCO} etc.  See how slick that is?  Consider the alternative.  Magic numbers.  Worse, magic numbers with funny calculations in them.  That is just so wrong that it completely justifies letting people name things how they want to, so long as they don't make other people do the same.  What people do in the privacy of their own lexical scope is their own business.

It gets better.  Perl lets you define your character properties, too.  Therefore I can write things like \p{Is_Tengwar_Decimal} and such.  Right now I have these properties:

    In_Tengwar, Is_Tengwar
    In_Tengwar_Alphanumerics
    In_Tengwar_Consonants, In_Tengwar_Vowels, In_Tengwar_Alphabetics
    In_Tengwar_Numerals, Is_Tengwar_Decimal, Is_Tengwar_Duodecimal
    In_Tengwar_Punctuation
    In_Tengwar_Marks

So I have code in my Tengwar module that does stuff like this, using my own named characters (which again, are compile-time resolved and work only within this lexical scope):

    chr( $1 + ord("\N{TENGWAR DIGIT ZERO}") )

Not to mention this using my own properties:

    $TENGWAR_GRAPHEME_RX = qr/(?:(?=\p{In_Tengwar})\P{In_Tengwar_Marks}\p{In_Tengwar_Marks}*)|\p{In_Tengwar_Marks}/x;

Actually, I'm fibbing.  I *never* write regexes all on one line like that: they are abhorrent to me.
The pattern really looks like this in the code:

    $TENGWAR_GRAPHEME_RX = qr{
        (?:
            (?= \p{In_Tengwar} ) \P{In_Tengwar_Marks}   # Either one basechar...
            \p{In_Tengwar_Marks} *                      # ... plus 0 or more marks
        )
        |
        \p{In_Tengwar_Marks}                            # or else a naked unpaired mark.
    }x;

People who write patterns without whitespace for cognitive chunking (plus comments for explanation) are wicked wicked wicked.  Frankly I'm surprised Python doesn't require it. :)/2

Anyway, do you see how much better that is than opaque unreadable magic numbers?  Can you just imagine the sheer horror of writing that sort of code without the ability to define your own named characters *and* your own character properties?  It's beautiful, simple, clean, and readable.  I'll even go so far as to call it intuitive.

No, I don't expect Python to do this sort of thing.  You don't have proper scoping, so you can't ever do it cleanly the way Perl can.

I just wanted to give a concrete example where flexibility leads to a much more readable program than inflexibility ever can.

--tom

    "We hates magic numberses.  We hates them forevers!"
        --Sméagol the Hacker

----------
title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________