Whoosh!

UTF-8 is not Unicode; it is a transformation format. It is also not UTF-16.

There is no need to guess WHEN THE CONTEXT IS UNICODE.


I wrote "'00AC'X is not a valid UTF-8 string." Please rebut that, not a claim
that I never made.

Big endian? For UTF-8? Shirley you're joking.

Why should you keep harping on “Unicode is not an encoding” when nobody
claimed that it was?

________________________________________
From: IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU> on behalf of 
Phil Smith III <li...@akphs.com>
Sent: Monday, May 8, 2023 2:48 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Logical Nor (¬) in ASCII-based code pages?

Seymour J Metz wrote, in part:
>You seem to be confirming what I wrote; if the locale is UTF-8 then
>your character data should be UTF-8. The ¬ character in UTF-8 has a
>different encoding from the ¬ character in Unicode, so there is no
>issue of a zero octet. '00AC'X is not a valid UTF-8 string.

There is no “encoding” “in Unicode”. That’s the point, and is why you can’t say 
“AC” and expect it to be meaningful. Folks might guess, but (especially with 
endianness) might get it wrong.
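
To make the ambiguity concrete, here is a minimal Python sketch (the variable names are illustrative only, not from anyone's post) showing the bytes that the single code point U+00AC turns into under three encoding forms. It relies only on the standard codecs built into Python.

```python
# The code point U+00AC (NOT SIGN) is one abstract character;
# the bytes depend entirely on which encoding form is used.
ch = "\u00ac"

utf8 = ch.encode("utf-8")          # two bytes: lead byte plus continuation byte
utf16_be = ch.encode("utf-16-be")  # big-endian 16-bit code unit
utf16_le = ch.encode("utf-16-le")  # little-endian 16-bit code unit

print("UTF-8:   ", utf8.hex())      # c2ac
print("UTF-16BE:", utf16_be.hex())  # 00ac
print("UTF-16LE:", utf16_le.hex())  # ac00
```

So '00AC'X is that character only if the reader assumes UTF-16 in big-endian order; under UTF-8 the same character is 'C2AC'X.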

‘00ac’x is indeed invalid as a UTF-8 encoded value, but it’s not the 00 that’s 
bad (that’s fine, it’s a null): it’s the AC, which is invalid because the top 
two bits are 10 and that means it’s a continuation byte. So a UTF-8 parser 
chugs along, sees the 00 and says “OK, good, that’s a single-byte encoding of a 
null”. But then it looks at the AC and says “Hey, this is supposed to be a 
continuation, and I’m not IN a multi-byte encoded character, that’s no good”. 
(One of the cool things about UTF-8 is that, assuming proper UTF-8, you can 
start in the middle of a string and, if you find you’re at a continuation byte, 
you can back up to the first byte in the tuple and start from there!)
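
A rough Python sketch of both points, for anyone who wants to try it: the first part feeds '00AC'X to the standard UTF-8 decoder and records where it fails; the second part is a hypothetical helper (`backup_to_lead` is my own name, not a library function) that backs up from a continuation byte to the start of its sequence. It assumes the buffer is otherwise well-formed UTF-8.

```python
# Part 1: '00AC'X fails at the second byte, not at the null.
data = bytes.fromhex("00ac")
try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # exc.start is the offset of the offending byte

print(error.start)  # 1, i.e. the AC byte


def is_continuation(byte: int) -> bool:
    """True if the top two bits are 10, marking a continuation byte."""
    return byte & 0xC0 == 0x80


def backup_to_lead(buf: bytes, index: int) -> int:
    """Hypothetical helper: step back from a continuation byte to the
    first byte of its sequence. Assumes buf is well-formed UTF-8."""
    while index > 0 and is_continuation(buf[index]):
        index -= 1
    return index


# Part 2: resynchronize from the middle of a multi-byte character.
text = "a\u00acb".encode("utf-8")  # bytes 61 c2 ac 62
start = backup_to_lead(text, 2)    # index 2 is the AC continuation byte
print(start)                       # 1, the C2 lead byte
```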

The above assumes big-endian, of course.

I’m’a keep harping on “Unicode is not an encoding” because it isn’t and it 
matters. Former manager beat that into my head, and it took a while for me to 
get it, so if you’ve bothered to read this far and feel like it’s an artificial 
distinction, I dig. It’s not.


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
