Thanks to Eike and Tony for the time and effort you put into your replies!

I was confused after reading the Unicode documentation, from which I understood that UTF-32 codepoints can be extended locally, either 1) by blocks within planes or 2) by whole planes... And intuitively I had in mind that UTF-8 exists for backward compatibility with the US-ASCII code space. The "use case" of a bash script in US-ASCII suggested the same to me.

Q: Does bash read a script file much like binary data (given that I am not allowed to use a BOM)? I mean: are the bytes read as they are, without Linux applying any charset encoding?
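
A minimal Python sketch may make the question concrete. It is only an illustration: the sample script text and all variable names are made up, and it assumes that whatever loads the script merely compares the first two raw bytes of the file with "#!". It checks that pure ASCII text has the same bytes in US-ASCII and UTF-8, that UTF-8 needs more bytes only for codepoints above U+007F, and that a leading BOM moves the "#!" away from the start of the file.

```python
import codecs

# Made-up sample script text, used only for this illustration.
script_text = "#!/bin/bash\necho hello\n"

# Pure ASCII text yields the same byte sequence under both encodings.
same_bytes = script_text.encode("ascii") == script_text.encode("utf-8")

# UTF-8 is variable-length: 1 to 4 bytes per codepoint.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# A UTF-8 BOM is the three bytes EF BB BF placed in front of the content.
plain = script_text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

# Assumption: the loader looks only at the first two raw bytes for "#!".
shebang_plain = plain[:2] == b"#!"
shebang_with_bom = with_bom[:2] == b"#!"

print(same_bytes, utf8_lengths, shebang_plain, shebang_with_bom)
# prints: True [1, 2, 3, 4] True False
```

Under these assumptions, a script without a BOM looks exactly like a US-ASCII file at the byte level as long as it contains only ASCII characters.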
Then there are partition tables, which must be readable on different systems and are encoded in UTF-16/UCS-2. That suggested to me that UCS-2 is a new standard for independent, decentralized 2-byte charsets, and that UTF is the local interpreting process...

In the end it doesn't matter, because the Linux decoder seems to have a rich set of decision options (e.g. it creates a 1-byte UTF-8 file, identical to US-ASCII, until I use a codepoint outside US-ASCII), so my files should stay readable as 1-byte UTF-8 for my lifetime.

But attention! The modern "Android smartphone" philosophy has brainwashed me: at all costs, stay up to date with your software and hardware, or else you are no longer with us (community, life, etc.). So I got _paranoid_ when I realized that a new charset encoding has existed for years while my system falls back to the deprecated one... *take it as fun*

Sincerely,
-kefko
...
http://www.johannes-koehler.de

[email protected] wrote on Monday, 2 August 2021 at 13:59:26 UTC+2:
> As some have said above, UTF-8 is a variable-length encoding, which
> encodes 7-bit ASCII characters exactly like us-ascii, and characters
> (codepoints) above U+007F in two or more bytes, each of them with the
> high bit set. Originally Unicode was foreseen to be able to go as far
> up as U+3FFFFFFF, but when UTF-16 was crafted and surrogate codepoints
> were assigned it was decided that codepoints higher than U+10FFFF
> would never mean anything (and U+F0000 to U+10FFFF are "for private
> use" anyway, i.e. transmitter and receiver have to agree on the
> values, which are not defined by Unicode). The Wikipedia page about it
> is well-written and I recommend reading it.
>
> The so-called "byte order mark" U+FEFF ZERO-WIDTH NO-BREAK SPACE
> should more appropriately be called an "encoding mark": it can
> discriminate most Unicode encodings and endiannesses from each other,
> including UTF-8, which has no byte-order ambiguity. At the head of a
> UTF-8 file (e.g. an HTML file or CSS script, whose syntaxes expressly
> support it), it means "This is UTF-8". However some programs which
> expect only US-ASCII will choke if they get a file headed by a BOM:
> for instance a #! "executable script" header will not be recognized if
> it is preceded by a BOM, so if you want to start your first line by
> #!/bin/bash or #!/bin/env python the file may be in UTF-8 (which
> encodes the 128 ASCII characters just like us-ascii) but without BOM.
>
> See:
> https://en.wikipedia.org/wiki/Unicode
> https://en.wikipedia.org/wiki/UTF-8
> and beware that the Microsoft Windows documentation usually says
> "Unicode" when what it means is "UTF-16" which represents each
> codepoint in one, or sometimes two, 16-bit words.
>
> Best regards,
> Tony.

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
