Hi,

At Thu, 19 Oct 2000 22:12:07 +0200 (CEST), Werner LEMBERG <[EMAIL PROTECTED]> wrote:
> This is not true. Encoding does *not* imply the character set.
> You are talking about charset/encoding tags.

Hmm, I cannot understand your idea...

In Emacs, charsets such as ISO8859-1, JISX0208.1990, and BIG5 are defined. Using these charsets, encodings such as euc-japan, iso-2022-jp, and iso-2022-7bit are defined. A user can select these encodings with, for example, M-x set-buffer-file-coding-system, M-x set-terminal-coding-system, and so on. (Names for 'coding-system' can be followed by '-unix', '-dos', or '-mac', which specify the line-break convention.) I can also specify the encoding of a file by writing '-*-coding: euc-jp;-*-' in the first line of the file.

You said that encoding names can be specified by MIME charset tag names. I write mails in 'charset=us-ascii' or in 'charset=iso-2022-jp', and web pages in 'charset=us-ascii', 'charset=iso-2022-jp', 'charset=euc-jp', 'charset=utf-8', and so on. I never specify encoding and charset separately. Nor do I write 'charset=euc'.

ISO-2022 is an encoding which includes many charsets. Using ISO-2022, I can write a multilingual text including US-ASCII, ISO 646-*, ISO 8859-*, JIS X * (Japanese), CNS 11643 (traditional Chinese), GB 2312 (simplified Chinese), TIS620 (Thai), and so on. GL, GR, G0, G1, G2, and G3 can be used for these charsets with clearly defined escape sequences and other control codes. Since the escape sequences and control codes are clearly defined, we don't need 'charset=' information to read ISO-2022 text. The preprocessor can work without it (though I won't implement an ISO-2022 converter; I will implement only four converters, from Latin-1, EBCDIC, UTF-8 (no conversion), and the locale encoding (via iconv(3)), to UTF-8).

# Note that conversion from Unicode variants to ISO-2022 (not ISO-2022-JP,
# ISO-2022-CN, and so on) is problematic and almost impossible.
# However, we are now discussing reading the roff source, not writing it.

Indicating JIS-X-0208 and EUC is insufficient to specify an encoding.
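
As a small illustration of this point (a sketch using Python's built-in codecs, not part of the preprocessor discussed here), the same two JIS X 0208 characters produce different byte streams under EUC-JP and ISO-2022-JP. The sample text and the helper name strip_escapes are chosen only for this example.

```python
# Illustration only: two JIS X 0208 characters serialized by two
# different encoding schemes, using Python's standard codecs.
text = "\u65e5\u672c"  # U+65E5 U+672C, both present in JIS X 0208

euc = text.encode("euc_jp")      # JIS X 0208 placed in GR (high bit set)
iso = text.encode("iso2022_jp")  # JIS X 0208 designated to G0 by an escape

ESC_JISX0208 = b"\x1b$B"  # ESC $ B: designate JIS X 0208 to G0
ESC_ASCII = b"\x1b(B"     # ESC ( B: designate ASCII to G0


def strip_escapes(data: bytes) -> bytes:
    """Return the 7-bit payload between the two designation sequences."""
    if not (data.startswith(ESC_JISX0208) and data.endswith(ESC_ASCII)):
        raise ValueError("expected a single JIS X 0208 run")
    return data[len(ESC_JISX0208):-len(ESC_ASCII)]


payload = strip_escapes(iso)

print("EUC-JP     :", euc.hex())
print("ISO-2022-JP:", iso.hex())
# The ISO-2022-JP payload is the EUC-JP byte sequence with the high bit cleared.
print("same code points, GL vs GR:", payload == bytes(b & 0x7F for b in euc))
```

Both byte strings decode back to the same text, but only the ISO-2022-JP stream carries escape sequences that announce which character set is in use.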
Also, naming JIS-X-0208 and ISO-2022 does not give enough information. In the former case, EUC can handle four character sets, for GL, GR, SS2, and SS3. EUC-JP is: ASCII for GL and JIS-X-0208 for GR. ISO-2022-JP is more complex.

From a practical point of view: according to your idea, a user can specify 'charset: KOI8-R; encoding: EUC', which cannot be specified with my idea. However, I don't think this can be a reason why your idea is superior. Rather, IMHO, such a usage is harmful.

By these terms I mean:
  - character set: CCS (Coded Character Set) in RFC 2130
  - encoding: CES (Character Encoding Scheme) in RFC 2130

I don't understand in what context you say 'EUC' is an encoding.

And, which I think is the most important point, what is the merit of your idea?

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/