Re: Concise term for non-ASCII Unicode characters

Sean Leonard Tue, 22 Sep 2015 03:21:08 -0700

On 9/22/2015 1:45 AM, Philippe Verdy wrote:

I would not use the "clumsy 7-bit ASCII" due to the confusion created since long when it could refer to any national version of ISO 646, which reassign some code positions in the range 0x00 to 0x07F to other characters outside the range U+0000 to U+007F, while still remaining 7-bit encodings.
So instead of "7-bit ASCII" I highly prefer the term "US-ASCII" to make sure it refers to the encoding of 7-bit code positions effectively to U+0000..U+007F.
So for code positions outside 0x00..0x7F, I would call them "not US-ASCII" (none of them are bound to any Unicode "character" or "code point" or "scalar value", they are just "code positions" or more precisely "octet values with their most significant bit set to 1" which is really long: "not US-ASCII" is fine as a shorter term).

Again having just read through ANSI X3.4-1986 (R1997), I would like to clarify some things.


The standard itself is titled:

American National Standard for Information Systems - Coded Character Sets - 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII)


However, Clause 1.1 states:

This standard specifies a set of 128 characters (control characters and graphic characters, such as letters, digits, and symbols) with their coded representation. The American National Standard Code for Information Interchange may also be identified by the acronym ASCII (pronounced ask-ee). To explicitly designate a particular (perhaps prior) edition of this standard, the last two digits of the year of issue may be appended, as in "ASCII 68" or "ASCII 86".

According to the title, "7-Bit ASCII" is proper. However, according to the text, "ASCII" is sufficient. The "7-Bit" part really just emphasizes the fact that it is a 7-bit standard. The eighth bit is outside the scope of the standard (but see clause 2.1.1). (Incidentally, Clause 1.1 is not Y2K compliant! Thus you should '86 that part of ASCII 86...hehe)

The term "US-ASCII" (see also RFC 2046 for a lot of discussion) is similarly redundant. After all, it is the *American* *National* Standard Code for Information Interchange. Even if you remove the term "National" (which does not appear in ASCII 68 or ASCII 63), it's still American. However, ASCII 68 (partially reprinted in RFC 20: <https://tools.ietf.org/html/rfc20>) actually permits "the notation ASCII (pronounced as'-key) or USASCII (pronounced you-sas'-key) [...] to mean the code prescribed by the latest issue of the standard". That is probably the genesis of US-ASCII. I wasn't alive at the time so I don't know. My suspicion is that "US-ASCII" was meant to disambiguate ASCII 86 from ASCII 68 (which is referred to as "ASCII" in RFC 821) without referring to the year, and since 68 and 86 are transposed numerals, "US-ASCII" eliminates possible mix-ups.

My conclusion here is that "ASCII" is sufficient when talking about the range of (code or character) positions 0 - 127, regardless of how they are encoded, so long as they logically evaluate to the bit combinations of the 7-bit code described in ANSI X3.4-1986.
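
To make the "regardless of how they are encoded" point concrete, here is a rough Python sketch. The function names are mine, purely for illustration, and nothing here comes from the standard itself. It treats a position as ASCII when its logical value is 0 - 127, whatever the width of the encoded unit, and separately flags octets with the most significant bit set, which is the "not US-ASCII" case described in the quoted message.

```python
ASCII_MAX = 0x7F  # highest of the 128 positions (0..127) of the 7-bit code


def is_ascii(code_point):
    """Return True if the integer position lies in 0..127."""
    return 0 <= code_point <= ASCII_MAX


def non_ascii_chars(text):
    """Return the characters of text whose positions fall outside 0..127."""
    return [ch for ch in text if not is_ascii(ord(ch))]


def has_high_bit(octet):
    """Return True if an 8-bit value has its most significant bit set."""
    return (octet & 0x80) != 0


def positions_via(text, encoding):
    """Encode text, decode it again, and return the logical positions.

    The encoded bytes differ between encodings, but for ASCII input the
    logical values that come back are the same 0..127 positions.
    """
    return [ord(ch) for ch in text.encode(encoding).decode(encoding)]
```

For example, positions_via("Az", "utf-8") and positions_via("Az", "utf-16-le") return the same two values even though the byte sequences differ, and non_ascii_chars picks out only the characters beyond position 127.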

"Basic Latin" also works if you want to avoid the historic reference. But there are many systems in use that are ASCII-based (including the Internet, as RFC 20 is still in force), and the term "ASCII" is peppered throughout the Unicode Standard 8.0 with greater frequency than "Basic Latin" (which is acknowledged to be a synonym for "ASCII" in Sections 5.7 and 6.2).


Sean
