'AC'X is not valid as the first octet of a UTF-8 sequence, and the indicator is in the first octet, which must be one of

    0xxxxxxx
    110xxxxx
    1110xxxx
    11110xxx

'AC'X is valid in the second, third or fourth octet.

--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3

________________________________________
From: IBM Mainframe Discussion List [IBM-MAIN@LISTSERV.UA.EDU] on behalf of Rick Troth [tro...@gmail.com]
Sent: Thursday, May 11, 2023 1:14 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Logical Nor (¬) in ASCII-based code pages?

On 5/8/23 14:48, Phil Smith III wrote:
> Seymour J Metz wrote, in part:
>> You seem to be confirming what I wrote; if the locale is UTF-8 then
>> your character data should be UTF-8. The ¬ character in UTF-8 has a
>> different encoding from the ¬ character in Unicode, so there is no
>> issue of a zero octet. '00AC'X is not a valid UTF-8 string.
> There is no “encoding” “in Unicode”. That’s the point, and is why you
> can’t say “AC” and expect it to be meaningful. Folks might guess, but
> (especially with endianness) might get it wrong.
>
> ‘00ac’x is indeed invalid as a UTF-8 encoded value, but it’s not the 00
> that’s bad (that’s fine, it’s a null): it’s the AC, which is invalid
> because the top two bits are 10 and that means it’s a continuation byte.
> So a UTF-8 parser chugs along, sees the 00 and says “OK, good, that’s a
> single-byte encoding of a null”. But then it looks at the AC and says
> “Hey, this is supposed to be a continuation, and I’m not IN a multi-byte
> encoded character, that’s no good”. (One of the cool things about UTF-8
> is that, assuming proper UTF-8, you can start in the middle of a string
> and, if you find you’re at a continuation byte, you can back up to the
> first byte in the tuple and start from there!)
>
> The above assumes big-endian, of course.
>
> I’m’a keep harping on “Unicode is not an encoding” because it isn’t and
> it matters. Former manager beat that into my head, and it took a while
> for me to get it, so if you’ve bothered to read this far and feel like
> it’s an artificial distinction, I dig. It’s not.
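[Editor's note: the lead-octet table above can be checked mechanically. A minimal sketch, in Python purely for illustration, classifying a byte by its high bits the way a UTF-8 parser does; the function name `classify` is invented for this example:]

```python
def classify(byte: int) -> str:
    """Classify one byte by its high bits, per the UTF-8 lead-octet table."""
    if byte & 0b1000_0000 == 0b0000_0000:
        return "single-byte (ASCII)"        # 0xxxxxxx
    if byte & 0b1100_0000 == 0b1000_0000:
        return "continuation"               # 10xxxxxx
    if byte & 0b1110_0000 == 0b1100_0000:
        return "lead of 2-byte sequence"    # 110xxxxx
    if byte & 0b1111_0000 == 0b1110_0000:
        return "lead of 3-byte sequence"    # 1110xxxx
    if byte & 0b1111_1000 == 0b1111_0000:
        return "lead of 4-byte sequence"    # 11110xxx
    return "invalid"                        # 11111xxx never appears in UTF-8

print(classify(0xAC))   # continuation -- 'AC'X can never be a first octet
print(classify(0xC2))   # lead of 2-byte sequence
```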
Phil's right. Unicode is not an encoding. UTF-8 is an encoding of Unicode which fits nicely into many historically 8-bit spaces. UTF-8 allows the Ukrainians to use Cyrillic (one of many character sets in Unicode) in their web pages without having to play musical code pages.

If any given system is trying to process a UTF-8 stream and comes across 00AC (at end of input, THAT'S IMPORTANT), it will treat the 00 as NULL and move along, then will fail on AC and either 1) throw an error, 2) try to cope (treating it as ISO-8859-1, maybe), or 3) ignore that byte. In no case can a UTF-8 processor "do the right thing" with just AC.

In sequence, AC is one of many continuation indicators. The UTF-8 processor expects more. If it finds more, it WON'T interpret that as logical not. If it doesn't find more, then see previous paragraph.

> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN

--
R; <><

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
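[Editor's note: the 00AC walkthrough in the thread can be reproduced with any strict UTF-8 decoder. A short Python demonstration (illustrative only) showing that a decoder accepts the 00 as NUL, trips on the AC as an orphan continuation byte, and that the same code point U+00AC becomes C2 AC in UTF-8 versus 00 AC in UTF-16BE:]

```python
# '00AC'x fed to a strict UTF-8 decoder: 0x00 decodes fine as NUL,
# then 0xAC fails because it is a continuation byte with no lead byte.
try:
    b"\x00\xac".decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason, "at position", e.start)   # invalid start byte at position 1

# NOT SIGN (U+00AC), same code point, two different encodings:
print("¬".encode("utf-8").hex())      # c2ac -- here 0xAC is valid, as a continuation
print("¬".encode("utf-16-be").hex())  # 00ac -- this is UTF-16BE, not UTF-8
```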