Re: unicode: UTF / UCS

'Johannes Koehler' via vim_use Thu, 05 Aug 2021 07:45:14 -0700

Thanks to Eike and Tony for the time and effort you put into your replies!

I was confused by reading the Unicode documentation, according to
which the UTF-32 code space is locally extensible, either 1) by blocks
within planes or 2) by whole planes... And intuitively I had in mind
that UTF-8 exists for backward compatibility with the US-ASCII
code space.
The "use case" with a bash script and US-ASCII suggested the same to
me. Q: Does bash read a script file much like binary data (since I am
not allowed to use a BOM)? I mean, is it not using a charset encoding
applied by Linux?
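
To illustrate the question above: as far as I know, the Linux kernel recognizes a script by comparing the first two raw bytes of the file with "#!", without decoding any charset. The sketch below only imitates that byte comparison; `shebang_status` is a made-up helper name, not a real system call.

```python
import codecs

def shebang_status(first_bytes: bytes) -> str:
    """Classify the start of a script as a byte-level loader would see it.

    Hypothetical helper: it mimics a raw comparison against b"#!" and
    does not call into the operating system.
    """
    if first_bytes.startswith(b"#!"):
        return "shebang"
    if first_bytes.startswith(codecs.BOM_UTF8 + b"#!"):
        # The BOM bytes EF BB BF come first, so the file does not begin with "#!".
        return "shebang hidden behind UTF-8 BOM"
    return "no shebang"

# A UTF-8 script with non-ASCII text in the body, with and without a BOM.
plain = "#!/bin/bash\necho 'grüße'\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

print(shebang_status(plain))     # shebang
print(shebang_status(with_bom))  # shebang hidden behind UTF-8 BOM
```

So the script may contain UTF-8 text after the first line; only the leading bytes have to be plain "#!".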


Then partition tables, which should be readable
on different systems, are encoded in UTF-16/UCS-2.

That implied to me that UCS-2 is a new standard for independent,
decentralized 2-byte charsets, and that UTF is the local interpreting
process...

In the end it doesn't matter, because the Linux decoder seems to
offer plenty of decision options (e.g. it creates a 1-byte UTF-8 file,
identical to US-ASCII, until I use a non-ASCII codepoint), and
therefore my files should be readable as 1-byte UTF-8 for my lifetime.
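
A small sketch of that behaviour, assuming Python 3.8 or later (for `bytes.hex` with a separator): text that uses only ASCII characters produces the same bytes in US-ASCII and UTF-8, and only characters above U+007F take two or more bytes. `describe` is a hypothetical helper name.

```python
def describe(text: str):
    """Return (codepoint, UTF-8 bytes in hex) for every character in text."""
    return [(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ")) for ch in text]

# ASCII-only text: the UTF-8 file is byte-for-byte a US-ASCII file.
ascii_text = "hello vim"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True

# One character each needing 1, 2, 3 and 4 bytes in UTF-8.
for codepoint, hex_bytes in describe("a\u00e4\u20ac\U0001F600"):
    print(codepoint, hex_bytes)
# U+0061 61
# U+00E4 c3 a4
# U+20AC e2 82 ac
# U+1F600 f0 9f 98 80
```

Every byte of a multi-byte sequence is 0x80 or higher, so it can never be mistaken for an ASCII character.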
 
But attention! ...the modern "Android smartphone" philosophy had me
brainwashed: at all costs, stay up to date with your software and
hardware systems, or else you are no longer with us (community, life,
etc.).
So I got _paranoid_ when I learned that a new charset encoding has
existed for years while my system falls back to the deprecated one ...
*take it as fun*

sincerely
-kefko

... 
http://www.johannes-koehler.de

[email protected] wrote on Monday, 2 August 2021 at 13:59:26 UTC+2:

> As some have said above, UTF-8 is a variable-length encoding, which
> encodes 7-bit ASCII characters exactly like us-ascii, and characters
> (codepoints) above U+007F in two or more bytes, each of them with the
> high bit set. Originally Unicode was foreseen to be able to go as far
> up as U+7FFFFFFF, but when UTF-16 was crafted and surrogate codepoints
> were assigned it was decided that codepoints higher than U+10FFFF
> would never mean anything (and U+F0000 to U+10FFFF are "for private
> use" anyway, i.e. transmitter and receiver have to agree on the
> values, which are not defined by Unicode). The Wikipedia page about it
> is well-written and I recommend reading it.
>
> The so-called "byte order mark" U+FEFF ZERO WIDTH NO-BREAK SPACE
> should more appropriately be called an "encoding mark" : it can
> discriminate most Unicode encodings and endiannesses from each other,
> including UTF-8, which has no byte-order ambiguity. At the head of a
> UTF-8 file (e.g. an HTML file or CSS script, whose syntaxes expressly
> support it), it means "This is UTF-8". However some programs which
> expect only US-ASCII will choke if they get a file headed by a BOM:
> for instance a #! "executable script" header will not be recognized if
> it is preceded by a BOM, so if you want to start your first line by
> #!/bin/bash or #!/bin/env python the file may be in UTF-8 (which
> encodes the 128 ASCII characters just like us-ascii) but without BOM.
>
> See:
> https://en.wikipedia.org/wiki/Unicode
> https://en.wikipedia.org/wiki/UTF-8
> and beware that the Microsoft Windows documentation usually says
> "Unicode" when what it means is "UTF-16" which represents each
> codepoint in one, or sometimes two, 16-bit words.
>
> Best regards,
> Tony.
>
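
Two points from the quoted reply can be sketched in a few lines, using only Python's standard `codecs` constants: the BOM acting as an "encoding mark", and UTF-16 using one or two 16-bit words per codepoint. `sniff_bom` and `utf16_units` are hypothetical helper names, and the detection is only a heuristic.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so the longer marks are tested first.
# Caveat: UTF-16 LE text whose first character is U+0000 looks the same.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

def utf16_units(ch: str) -> int:
    """Number of 16-bit words UTF-16 needs for one character."""
    return len(ch.encode("utf-16-le")) // 2

print(sniff_bom(codecs.BOM_UTF8 + b"<html>"))                    # utf-8
print(sniff_bom(codecs.BOM_UTF16_BE + "x".encode("utf-16-be")))  # utf-16-be
print(sniff_bom(b"#!/bin/bash\n"))                               # None
print(utf16_units("\u20ac"))      # 1 (inside the Basic Multilingual Plane)
print(utf16_units("\U0001F600"))  # 2 (a surrogate pair)
```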

