Satisfactory answers, thank you very much.
Going back to doing more research. (Silence does not imply abandoning
the C1 Control Pictures project; just a lot to synthesize.)
Regarding the three code points U+0080, U+0081, and U+0099: the fact
that Unicode defers mostly to ISO 6429 and other standards that predate
it (e.g., ANSI X3.32 / ISO 2047 / ECMA-17) means that it is not
particularly urgent that those code points get Unicode names. I also do
not find that their lack of definition precludes pictorial
representations. In the current U+2400 block, the Standard says: "The
diagonal lettering glyphs are only exemplary; alternate representations
may be, and often are, used in the visible display of control codes"
(see also Section 22.7).
I am now in possession of a copy of ANSI X3.32-1973 and ECMA-17:1968
(the latter is available on ECMA's website). I find it worthwhile to
point out that the Transmission Controls and Format Effectors were not
standardized by the time of ECMA-17:1968, but the symbols are the same
nonetheless. ANSI X3.32-1973 has the standardized control names for
those characters.
Sean
On 10/6/2015 6:57 AM, Philippe Verdy wrote:
2015-10-06 14:24 GMT+02:00 Sean Leonard <[email protected]>:
2. The Unicode code charts are (deliberately) vague about U+0080,
U+0081, and U+0099. All other C1 control codes have aliases to the
ISO 6429 set of control functions, but in ISO 6429, those three
control codes don't have any assigned functions (or names).
On 10/5/2015 3:57 PM, Philippe Verdy wrote:
Also the aliases for C1 controls were formally registered in
1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F
for ISO 6429.
If I may, I would appreciate another history lesson:
In ISO 2022 / 6429 land, it is apparent that the C1 controls are
mainly aliases for ESC 4/0 - 5/15 (@ through _). This might vary
depending on what is loaded into the C1 register, but overall, it
just seems like saving one byte.
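The alias relationship described above is a fixed arithmetic mapping: the C1 control at 0x80 + n corresponds to ESC followed by the byte 0x40 + n. A minimal sketch of that correspondence (an illustration for this thread, not code from any standard):

```python
def c1_to_esc(byte: int) -> bytes:
    """Map an 8-bit C1 control (0x80..0x9F) to its 7-bit ESC Fe form."""
    assert 0x80 <= byte <= 0x9F
    return bytes([0x1B, byte - 0x40])  # ESC, then a final byte in @.._

def esc_to_c1(seq: bytes) -> int:
    """Map an ESC Fe sequence (final byte 0x40..0x5F) back to one C1 byte."""
    esc, final = seq
    assert esc == 0x1B and 0x40 <= final <= 0x5F
    return final + 0x40

# Example: CSI (0x9B) is ESC [ in the 7-bit representation.
print(c1_to_esc(0x9B))           # b'\x1b['
print(hex(esc_to_c1(b'\x1b[')))  # 0x9b
```

This also makes the "saving one byte" point concrete: each two-byte ESC Fe sequence collapses to a single C1 byte in an 8-bit environment.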
Why was C1 invented in the first place?
Look for the history of EBCDIC and its adaptation/conversion with
ASCII-compatible encodings: round-trip conversion was needed (using
only a simple reordering of byte values, with no duplicates). EBCDIC
used many controls that were not part of C0, and these were kept in
the C1 set. Ignore the 7-bit compatibility encoding using pairs; it
was only needed for ISO 2022, and ISO 6429 defines a profile where
those longer sequences are not needed and are even forbidden in 8-bit
contexts, or in contexts where aliases are undesirable and
invalidated, such as security environments.
With your thoughts, I would conclude that assigning characters in the
G1 set was also a duplicate, because it is reachable with a C0
"shifting" control plus a position of the G0 set. In that case, ISO
8859-1 or Windows-1252 was also an unneeded duplication! And we would
live today in a 7-bit-only world.
C1 controls have their own identity. The 7-bit encoding using ESC is
just a hack to make them fit in 7 bits, and it only works where the
ESC control is assumed to play this function according to ISO 2022,
ISO 6429, or other similarly old 7-bit protocols such as Videotext
(which was widely used in France with the free "Minitel" terminal,
long before the introduction of the Internet to the general public
around 1992-1995).
Today Videotext is definitively dead (the old call numbers for this
slow service are now defunct; the Minitels are recycled waste, no
longer distributed, and have been replaced by applications on PCs
connected to the Internet; all the old services are now directly on
the Internet, and none of them use 7-bit encodings for their HTML
pages or their mobile applications). France has also definitively
abandoned its old French version of ISO 646; there are no longer any
printers supporting versions of ISO 646 other than ASCII, though they
still support various 8-bit encodings.
7-bit encodings are things of the past (they were only justified at a
time when communication links were slow and generated lots of
transmission errors, and the only implemented mechanism to check them
was a single parity bit per character). Today we transmit long
datagrams and prefer using check codes over the whole message (such as
CRCs, or error-correcting codes). 8-bit encodings are much easier and
faster to process for transmitting not just text but also binary data.
Let's forget the 7-bit world for good. We have also abandoned the old
UTF-7 in Unicode! I've not seen it used anywhere except in a few old
emails sent at the end of the '90s, because many mail servers were
still not 8-bit clean and silently transformed non-ASCII bytes in
unpredictable ways or into unspecified encodings, or just silently
dropped the high bit, assuming it was just a parity bit: at that
time, emails were not sent with SMTP, but with the old UUCP protocol,
and could take weeks to be delivered to the final recipient, as there
was still no global routing infrastructure and many hops were
necessary via non-permanent modem links. My opinion of UTF-7 is that
it was just a temporary and experimental solution to help system
admins and developers adopt the new UCS, including for their old
7-bit environments.
On 10/6/2015 8:33 AM, Asmus Freytag (t) wrote:
On 10/6/2015 5:24 AM, Sean Leonard wrote:
And, why did Unicode deem it necessary to replicate the C1 block at
0x80-0x9F, when all of the control characters (codes) were equally
reachable via ESC 4/0 - 5/15? I understand why it is desirable to
align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with
Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the
other non-ISO-standardized 8-bit encodings got this much right:
duplicating control codes is basically a waste of very precious
character code real estate.
Because Unicode aligns with ISO 8859-1: transcoding from it was a
simple zero-fill to 16 bits.
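That zero-fill property can be demonstrated in a few lines (a sketch using Python's built-in 'latin-1' codec; the byte values chosen are arbitrary examples):

```python
# Because U+0000..U+00FF mirror ISO 8859-1 position for position,
# transcoding Latin-1 to 16-bit code units is a zero-fill: each byte
# value IS the code point, including the C1 range 0x80..0x9F.
latin1_bytes = bytes([0x41, 0xE9, 0x9B, 0xFF])  # 'A', 'e-acute', C1 CSI, 'y-diaeresis'

code_points = list(latin1_bytes)  # zero-extend each byte to a code point
decoded = [ord(c) for c in latin1_bytes.decode('latin-1')]

print([hex(cp) for cp in code_points])  # ['0x41', '0xe9', '0x9b', '0xff']
print(code_points == decoded)           # True
```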
8859-1 was the most widely used single byte (full 8-bit) ISO standard
at the time, and making that transition easy was beneficial, both
practically and politically.
Vendor standards all disagreed on the upper range, and it would not
have been feasible to single out any of them. Nobody wanted to follow
IBM code page 437 (then still the most widely used single-byte vendor
standard).
Note that by "then" I refer to dates earlier than the dates of the
final drafts, because many of those decisions date back to earlier
periods when the drafts were first developed. Also, the overloading of
0x80-0xFF by Windows did not happen all at once; earlier versions had
left much of that space open, but then people realized that, as long
as you were still limited to 8 bits, throwing away 32 codes was an
issue.
Now, for Unicode, 32 out of 64K values (initially) or 1,114,112 (now)
don't matter, so being "clean" didn't cost much. (Note that even for
UTF-8, there's no special benefit of a value being inside that second
range of 128 codes.)
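The UTF-8 point can be checked directly (a quick sketch, not part of the original message): the C1 code points encode as two bytes each, exactly like every other code point up to U+07FF, so the range receives no special treatment.

```python
# U+0080..U+009F (the C1 controls) take two bytes in UTF-8, the same
# as any other code point in U+0080..U+07FF.
for cp in (0x80, 0x9B, 0xA0, 0x7FF):
    print(hex(cp), chr(cp).encode('utf-8').hex())
# 0x80 c280
# 0x9b c29b
# 0xa0 c2a0
# 0x7ff dfbf
```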
Finally, even if the range had not been dedicated to C1, the 32 codes
would have had to be given space, because the translation into ESC
sequences is not universal, so, in transcoding data you needed to have
a way to retain the difference between the raw code and the ESC
sequence, or your round-trip would not be lossless.
A./