-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Apr 02, 2018 at 03:18:38PM -0300, Henrique de Moraes Holschuh wrote:
> On Mon, 02 Apr 2018, rhkra...@gmail.com wrote:
> > The wikipedia article is rather interesting, in a quick skim, I learned some
> > interesting things about UTF-8, especially the property of
> > self-synchronization.
>
> Yes, UTF-8 is a brilliant design.
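
For what it's worth, here is a minimal Python sketch of the self-synchronization property mentioned above. Continuation bytes in UTF-8 always look like 10xxxxxx, so a reader dropped at an arbitrary offset can skip forward to the next character boundary without scanning from the start. The function name next_char_start and the sample string are made up for illustration.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a
    UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Sample: 'a' (1 byte), e-acute (2 bytes), euro sign (3 bytes), 'b' (1 byte).
sample = "a\u00e9\u20acb".encode("utf-8")

# Offset 4 falls in the middle of the euro sign; resynchronize from there.
start = next_char_start(sample, 4)
print(start, sample[start:].decode("utf-8"))
```

This only finds a boundary in well-formed input; it does not check that the bytes are valid UTF-8.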

Possibly relevant, definitely entertaining, Rob Pike's account of
UTF-8's gestation [1]

Yeah. Elegant design. Until the Unicode Consortium left Microsoft near it
(Byte Order Mark, I'm looking at you!).

[...]

> > I guess I have a followup question--are those two bytes (or either one of
> > them) also unused in all possible "code pages"?

I'm not sure what you mean here: there are two layers at work (at least
if you have UTF-8 encoded Unicode). As Henrique says, if you assume both
to be "correct" then you get more illegal things. But sometimes UTF-8
encoding is used for other things (notably Emacs encodes a superset of
Unicode, to be able to express "raw byte values" next to "Unicode
characters").

> > The problem is that I copy snippets of text from all kinds of sources into
> > those text files (which are formatted like mbox files), so I might find one
> > or both of those bytes in the file already.
>
> Then it isn't a valid unicode text file in UTF-8 format, and it needs to
> be converted (or fixed) first to be encoded in UTF-8 :-)

Agreed: if you don't know what's coming in, you better plan for
anything :)

Cheers
- --
t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlrCeTcACgkQBcgs9XrR2kbRtgCfaRHoodlkFFt8Gm0Oq438ymvg
0oMAn2NkpsqMJ3Tcy5BvAJIpTvfG8mdj
=iVqF
-----END PGP SIGNATURE-----
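
P.S. A rough Python sketch of "plan for anything": check incoming bytes before treating them as UTF-8. This message does not name the two bytes under discussion, so the sketch assumes they are among the byte values that can never occur in well-formed UTF-8 (0xC0, 0xC1 and 0xF5 through 0xFF, which includes 0xFE and 0xFF). The names first_invalid_utf8, has_forbidden_byte and NEVER_IN_UTF8 are hypothetical.

```python
# Byte values that never occur anywhere in well-formed UTF-8.
NEVER_IN_UTF8 = frozenset([0xC0, 0xC1]) | frozenset(range(0xF5, 0x100))


def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that breaks strict UTF-8
    decoding, or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def has_forbidden_byte(data: bytes) -> bool:
    """True if data contains a byte that well-formed UTF-8 never uses."""
    return any(b in NEVER_IN_UTF8 for b in data)


snippet = b"ok \xff end"  # made-up input with a stray 0xFF
print(first_invalid_utf8(snippet), has_forbidden_byte(snippet))
```

If first_invalid_utf8 returns None, none of those bytes can be present, so they would be free to use as markers; otherwise the snippet needs converting or fixing first, as Henrique says. Note this says nothing about legacy "code pages", where any byte value may be a legitimate character.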