[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

Tom Christiansen Sat, 01 Oct 2011 22:33:51 -0700

Tom Christiansen <tchr...@perl.com> added the comment:

>> Perl does not provide the old 1.0 names at all.  We don't have a Unicode
>> 1.0 legacy to support, which makes this cleaner.  However, we do provide
>> for the names of the C0 and C1 Control Codes, because apart from Unicode
>> 1.0, they don't condescend to name the ASCII or Latin1 control codes.

> If there would be a reasonably official source for these names, and one
> that guarantees that there is no collision with UCD names, I could
> accept doing so for Python as well.

The C0 and C1 control code names don't change. There is/was one stability
issue where they screwed up, because they ended up having a UAX (required)
and a UTS (not required) fighting because of the dumb stuff they did with
the Emoji names. They neglected to prefix them with "Emoji ..." or some
such, the way things like "GREEK ... LETTER ..." or "MATHEMATICAL ..." or
"MUSICAL ..." did. The problem is they stole BELL without calling it EMOJI
BELL. That is the C0 name for Control-G. Dimwits.

The problem with official names is that they have things in them that you
do not expect in names. Do you really and truly mean to tell me you
think it is somehow **good** that people are forced to write

\N{LINE FEED (LF)}

rather than the more obvious pair of

\N{LINE FEED}
\N{LF}

If so, then I don't understand that. Nobody in their right
mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED}"'
U+000A
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LF}"'
U+000A
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED (LF)}"'
U+000A

% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE}"'
U+0085
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEL}"'
U+0085
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE (NEL)}"'
U+0085
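
For comparison, a rough Python sketch of what those six lookups amount to:
several spellings resolving to one code point. The table `CONTROL_ALIASES`
and the helper `code_point` are made up for illustration. This is not how
charnames or Python's `unicodedata` is implemented.

```python
# Hypothetical alias table: several spellings, one code point each.
CONTROL_ALIASES = {
    "LINE FEED (LF)": 0x000A,
    "LINE FEED": 0x000A,
    "LF": 0x000A,
    "NEXT LINE (NEL)": 0x0085,
    "NEXT LINE": 0x0085,
    "NEL": 0x0085,
}


def code_point(name):
    """Return the code point for a control-code name; KeyError if unknown."""
    return CONTROL_ALIASES[name]


for name in ("LINE FEED", "LF", "LINE FEED (LF)"):
    print("U+%04X" % code_point(name))
```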

>> We also provide for certain well known aliases from the Names file:
>> anything that says "* commonly abbreviated as ...", so things like LRO
>> and ZWJ and such.

> -1. Readability counts, writability not so much (I know this is
> different for Perl :-).

I actually very strongly resent and rebuff that entire mindset in the most
extreme way possible. Well-written Perl code is perfectly readable by
people who speak that language. If you find Perl code that isn't readable,
it is by definition not well-written.

*PLEASE* don't start.

Yes, I just got done driving 16 hours and am overtired, but it's
something I've been fighting against all of my professional career.
It's a "leyenda negra".

> If there is too much aliasing, people will
> wonder what these codes actually mean.

There are 15 "commonly abbreviated as" aliases in the Names.txt file.

* commonly abbreviated as NBSP
* commonly abbreviated as SHY
* commonly abbreviated as CGJ
* commonly abbreviated ZWSP
* commonly abbreviated ZWNJ
* commonly abbreviated ZWJ
* commonly abbreviated LRM
* commonly abbreviated RLM
* commonly abbreviated LRE
* commonly abbreviated RLE
* commonly abbreviated PDF
* commonly abbreviated LRO
* commonly abbreviated RLO
* commonly abbreviated NNBSP
* commonly abbreviated WJ

All of the standards documents *talk* about things like LRO and ZWNJ.
I guess the standards aren't "readable" then, right? :)

From the charnames manpage, which shows that we really don't just make
these up as we feel like (although we could; see below). They're all from
this or that standard:

ALIASES
A few aliases have been defined for convenience: instead
of having to use the official names

LINE FEED (LF)
FORM FEED (FF)
CARRIAGE RETURN (CR)
NEXT LINE (NEL)

(yes, with parentheses), one can use

LINE FEED
FORM FEED
CARRIAGE RETURN
NEXT LINE
LF
FF
CR
NEL

All the other standard abbreviations for the controls,
such as "ACK" for "ACKNOWLEDGE" also can be used.

One can also use

BYTE ORDER MARK
BOM

and these abbreviations

Abbreviation   Full Name

CGJ            COMBINING GRAPHEME JOINER
FVS1           MONGOLIAN FREE VARIATION SELECTOR ONE
FVS2           MONGOLIAN FREE VARIATION SELECTOR TWO
FVS3           MONGOLIAN FREE VARIATION SELECTOR THREE
LRE            LEFT-TO-RIGHT EMBEDDING
LRM            LEFT-TO-RIGHT MARK
LRO            LEFT-TO-RIGHT OVERRIDE
MMSP           MEDIUM MATHEMATICAL SPACE
MVS            MONGOLIAN VOWEL SEPARATOR
NBSP           NO-BREAK SPACE
NNBSP          NARROW NO-BREAK SPACE
PDF            POP DIRECTIONAL FORMATTING
RLE            RIGHT-TO-LEFT EMBEDDING
RLM            RIGHT-TO-LEFT MARK
RLO            RIGHT-TO-LEFT OVERRIDE
SHY            SOFT HYPHEN
VS1            VARIATION SELECTOR-1
.
.
.
VS256          VARIATION SELECTOR-256
WJ             WORD JOINER
ZWJ            ZERO WIDTH JOINER
ZWNJ           ZERO WIDTH NON-JOINER
ZWSP           ZERO WIDTH SPACE

For backward compatibility one can use the old names for
certain C0 and C1 controls

old                                        new

FILE SEPARATOR                             INFORMATION SEPARATOR FOUR
GROUP SEPARATOR                            INFORMATION SEPARATOR THREE
HORIZONTAL TABULATION                      CHARACTER TABULATION
HORIZONTAL TABULATION SET                  CHARACTER TABULATION SET
HORIZONTAL TABULATION WITH JUSTIFICATION   CHARACTER TABULATION WITH JUSTIFICATION
PARTIAL LINE DOWN                          PARTIAL LINE FORWARD
PARTIAL LINE UP                            PARTIAL LINE BACKWARD
RECORD SEPARATOR                           INFORMATION SEPARATOR TWO
REVERSE INDEX                              REVERSE LINE FEED
UNIT SEPARATOR                             INFORMATION SEPARATOR ONE
VERTICAL TABULATION                        LINE TABULATION
VERTICAL TABULATION SET                    LINE TABULATION SET

but the old names in addition to giving the character will
also give a warning about being deprecated.

And finally, certain published variants are usable,
including some for controls that have no Unicode names:

name                                  character

END OF PROTECTED AREA                 END OF GUARDED AREA, U+0097
HIGH OCTET PRESET                     U+0081
HOP                                   U+0081
IND                                   U+0084
INDEX                                 U+0084
PAD                                   U+0080
PADDING CHARACTER                     U+0080
PRIVATE USE 1                         PRIVATE USE ONE, U+0091
PRIVATE USE 2                         PRIVATE USE TWO, U+0092
SGC                                   U+0099
SINGLE GRAPHIC CHARACTER INTRODUCER   U+0099
SINGLE-SHIFT 2                        SINGLE SHIFT TWO, U+008E
SINGLE-SHIFT 3                        SINGLE SHIFT THREE, U+008F
START OF PROTECTED AREA               START OF GUARDED AREA, U+0096

perl v5.14.0 2011-05-07 2

Those are the defaults. They are overridable. That's because we feel that
people should be able to name their character constants however they feel
makes sense for them. If they get tired of typing

\N{LATIN SMALL LETTER U WITH DIAERESIS}

let alone

\N{LATIN CAPITAL LETTER THORN WITH STROKE THROUGH DESCENDER}

then they can, because there is a mechanism for making aliases:

use charnames ":full", ":alias" => {
    U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
    u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
};

That way you can do

s/\N{U_uml}/UE/;
s/\N{u_uml}/ue/;

This is probably not as persuasive as the private-use case described below.

It is important to remember that all charname bindings in Perl are attached
to a *lexically-scoped* declaration. Each binding is completely constrained
to operate only within that lexical scope, and the compiler replaces the
name with its code point at compile time. Take something like
use charnames ":full", ":alias" => {
    U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
    u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
};

my $find_u_uml = qr/\N{u_uml}/i;

print "Search pattern is: $find_u_uml\n";

Which dutifully prints out:

Search pattern is: (?^ui:\N{U+FC})

So charname bindings are never "hard to read" because the effect is
completely lexically constrained, and can never leak outside of the scope.

I realize (or at least, believe) that Python has no notion of nested
lexical scopes, and like many things, this sort of thing can therefore
never work there because of that.

The most persuasive use-case for user-defined names is for private-use
area code points. These will never have an official name. But it is
just fine to use them. Don't they deserve a better name, one that makes
sense within your own program that uses them? Of course they do.

For example, Apple has a bunch of private-use glyphs they use all the time.
In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate
logo/glyph thingie of an apple with a bite taken out of it. (Microsoft
also has a bunch of these.) If you upgrade MacRoman to Unicode, you will
find that that 0xF0 maps to code point U+F8FF using the regular converter.

Now what are you supposed to do in your program when you want a named character
there? You certainly do not want to make users put an opaque magic number
as a Unicode escape. That is always really lame, because the whole reason
we have \N{...} escapes is so we don't have to put mysterious unreadable magic
numbers in our code!!

So all you do is

use charnames ":alias" => {
    "APPLE LOGO" => 0xF8FF,
};

and now you can use \N{APPLE LOGO} anywhere within that lexical scope. The
compiler will dutifully resolve it to U+F8FF, since all name lookups happen
at compile-time. And it cannot leak out of the scope.

I assert that this facility makes your program more readable, and its
absence makes your program less readable.
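
A minimal Python sketch of the same idea, assuming a plain dictionary of
user-defined names that is consulted before the standard
`unicodedata.lookup`. `USER_NAMES` and `resolve` are hypothetical names, and
unlike the Perl pragma this resolves at run time and has no lexical scoping.

```python
import unicodedata

# Hypothetical user-defined names for private-use code points.
USER_NAMES = {
    "APPLE LOGO": 0xF8FF,  # where MacRoman byte 0xF0 lands in Unicode
}


def resolve(name):
    """Try the user-defined table first, then fall back to the UCD name."""
    if name in USER_NAMES:
        return chr(USER_NAMES[name])
    # unicodedata.lookup raises KeyError for names it does not know.
    return unicodedata.lookup(name)


print("U+%04X" % ord(resolve("APPLE LOGO")))
```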

Private use characters are important in Asian texts, but they are also
important for other things. For example, Unicode intends to get around
to allocating Tengwar up in the SMP. However, lots of stupid old code
can't use full Unicode, being constrained to UCS-2 only. So many Tengwar
fonts start at a different base, and put it in the private use area instead
of the SMP. Here are two constants:

use constant {
    TB_CONSCRIPT_UNICODE_REGISTRY => 0x00_E000, # private use
    TB_UNICODE_CONSORTIIUM        => 0x01_6080, # where it will really go
};

I have an entire Tengwar module that makes heavy use of named
private-use characters. All I do is this:

use constant TENGWAR_BASE => TB_CONSCRIPT_UNICODE_REGISTRY;

use charnames ":alias" => {
    reverse (
        (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
        (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
        (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
        (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
        (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
        ....
    )
};

Now you can write \N{TENGWAR LETTER TINCO} etc. See how slick that is?
Consider the alternative. Magic numbers. Worse, magic numbers with funny
calculations in them. That is just so wrong that it completely justifies
letting people name things how they want to, so long as they don't make
other people do the same. What people do in the privacy of their own
lexical scope is their own business.
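
The base-plus-offset table can be sketched in Python too. This assumes the
five letter names listed above sit at consecutive offsets from the base.
`LETTERS` and `tengwar_names` are illustrative names, not part of any
module, and the table is built at run time rather than compile time.

```python
# The two candidate bases, with the same values as the Perl constants.
TB_CONSCRIPT_UNICODE_REGISTRY = 0x00E000  # private use
TB_UNICODE_CONSORTIIUM = 0x016080         # proposed SMP location

# Only the first five letters; the rest of the list is elided in the source.
LETTERS = ["TINCO", "PARMA", "CALMA", "QUESSE", "ANDO"]


def tengwar_names(base):
    """Map 'TENGWAR LETTER X' to the character at base + offset."""
    return {
        f"TENGWAR LETTER {letter}": chr(base + offset)
        for offset, letter in enumerate(LETTERS)
    }


NAMES = tengwar_names(TB_CONSCRIPT_UNICODE_REGISTRY)
print("U+%04X" % ord(NAMES["TENGWAR LETTER TINCO"]))
```

Switching fonts then means passing a different base, and no offset
arithmetic leaks into the rest of the program.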

It gets better. Perl lets you define your own character properties, too.
Therefore I can write things like \p{Is_Tengwar_Decimal} and such.
Right now I have these properties:

In_Tengwar, Is_Tengwar
In_Tengwar_Alphanumerics
In_Tengwar_Consonants, In_Tengwar_Vowels, In_Tengwar_Alphabetics
In_Tengwar_Numerals, Is_Tengwar_Decimal, Is_Tengwar_Duodecimal
In_Tengwar_Punctuation
In_Tengwar_Marks

So I have code in my Tengwar module that does stuff like this, using
my own named characters (which again, are compile-time resolved and
work only within this lexical scope):

chr( $1 + ord("\N{TENGWAR DIGIT ZERO}") )

Not to mention this using my own properties:

$TENGWAR_GRAPHEME_RX =
qr/(?:(?=\p{In_Tengwar})\P{In_Tengwar_Marks}\p{In_Tengwar_Marks}*)|\p{In_Tengwar_Marks}/x;

Actually, I'm fibbing. I *never* write regexes all on one line like
that: they are abhorrent to me. The pattern really looks like this in
the code:

$TENGWAR_GRAPHEME_RX = qr{
    (?:
        (?= \p{In_Tengwar} ) \P{In_Tengwar_Marks}   # Either one basechar...
        \p{In_Tengwar_Marks} *                      # ... plus 0 or more marks
    ) |
    \p{In_Tengwar_Marks}                            # or else a naked unpaired mark.
}x;

People who write patterns without whitespace for cognitive chunking (plus
comments for explanation) are wicked wicked wicked. Frankly I'm surprised
Python doesn't require it. :)/2
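
As a rough Python counterpart: the standard `re` module has no user-defined
`\p{...}` properties, so explicit character ranges stand in for them here.
Both ranges are assumptions for illustration only, not the real layout of
any Tengwar font. `re.VERBOSE` gives the same whitespace-and-comments
layout.

```python
import re

# Assumed ranges, for illustration only.
TENGWAR = "\uE000-\uE07F"  # stand-in for \p{In_Tengwar}
MARKS = "\uE040-\uE05F"    # stand-in for \p{In_Tengwar_Marks} (hypothetical)

TENGWAR_GRAPHEME_RX = re.compile(
    rf"""
    (?:
        (?= [{TENGWAR}] ) [^{MARKS}]   # either one base character...
        [{MARKS}] *                    # ... plus 0 or more marks
    ) |
    [{MARKS}]                          # or else a naked unpaired mark
    """,
    re.VERBOSE,
)

# One stray mark, one base with two marks, one bare base.
sample = "\uE040\uE000\uE041\uE042\uE001"
print([len(g) for g in TENGWAR_GRAPHEME_RX.findall(sample)])
```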

Anyway, do you see how much better that is than opaque unreadable magic
numbers? Can you just imagine the sheer horror of writing that sort of
code without the ability to define your own named characters *and* your
own character properties? It's beautiful, simple, clean, and readable.
I'll even go so far as to call it intuitive.

No, I don't expect Python to do this sort of thing. You don't have proper
scoping, so you can't ever do it cleanly the way Perl can.

I just wanted to give a concrete example where flexibility leads to a
much more readable program than inflexibility ever can.

--tom

"We hates magic numberses. We hates them forevers!"
--Sméagol the Hacker

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________