> > There is a serious problem with coding tags and utf-16 encodings
> > of any flavour: Emacs simply can't recognize the tag.  This is a
> > non-trivial problem.
>
> Sorry for the late reply, but I think a coding tag is useless for a
> file encoded in one of the utf-16 variants.
>
> If a file has a BOM at its head, the BOM tells the exact encoding,
> whatever is specified in the coding tag.
>
> If a file is encoded without a BOM, we must use less reliable
> heuristics to guess between utf-16be and utf-16le.  If you find a
> coding-tag spec by ignoring all zero bytes at even byte indexes, the
> file is in all likelihood utf-16be, whatever the tag value is.  If
> you find a coding-tag spec by ignoring all zero bytes at odd byte
> indexes, the file is utf-16le, whatever the tag value is.
>
> So in any case, the tag value itself is useless.  [...]
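For illustration, here is a minimal C++ sketch of the zero-byte
heuristic described above.  It is not code from Emacs or groff, and
the function names are invented; it also only handles the clear-cut
case where all bytes on one side are zero, rather than actually
locating a coding-tag spec among the remaining bytes.

    #include <cstddef>

    enum guess { GUESS_NONE, GUESS_UTF16BE, GUESS_UTF16LE };

    // A BOM, if present, settles the byte order outright.
    guess guess_from_bom(const unsigned char *buf, size_t len)
    {
      if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return GUESS_UTF16BE;
      if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return GUESS_UTF16LE;
      return GUESS_NONE;
    }

    // Without a BOM, look at where the zero bytes fall.  ASCII-range
    // text encoded as UTF-16BE has a zero byte at every even index
    // (0x00 'a' 0x00 'b' ...), while UTF-16LE has it at every odd
    // index, so the position of the zeros reveals the byte order
    // regardless of what a coding tag inside the file claims.
    guess guess_from_zero_bytes(const unsigned char *buf, size_t len)
    {
      bool even_zero = true, odd_zero = true;
      for (size_t i = 0; i + 1 < len; i += 2) {
        if (buf[i] != 0)
          even_zero = false;
        if (buf[i + 1] != 0)
          odd_zero = false;
      }
      if (even_zero && !odd_zero)
        return GUESS_UTF16BE;
      if (odd_zero && !even_zero)
        return GUESS_UTF16LE;
      return GUESS_NONE;
    }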
I'll do the following for groff's preprocessor, preconv:

  . If the data starts with a BOM, use it and ignore the coding tag.

  . Otherwise, if there are zero bytes in the first two lines, ignore
    those zero values, emit a warning, and use the coding tag, if any.

  . Otherwise, use the default encoding -- this will normally lead to
    a wrong result and make groff explode, but I consider it better
    than applying heuristics, especially if you have to recognize both
    the UTF-16 and the UTF-32 variants.

This is probably a suboptimal solution, but it is quite easy to
implement, and the user can always select an encoding explicitly on
the command line.  Perhaps someone finds (and implements) a better
way, which I can then adapt to preconv.


    Werner
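As a rough illustration, that three-step decision order could look
like the following C++ sketch; the names (choose_encoding,
bom_encoding, and so on) are invented, and this is not preconv's
actual implementation:

    #include <cstddef>
    #include <cstdio>

    // Return an encoding name if the data starts with a BOM we
    // recognize, otherwise 0.  The four-byte UTF-32 BOMs must be
    // checked before the two-byte UTF-16 ones, since the UTF-32LE
    // BOM starts with the UTF-16LE one.
    static const char *bom_encoding(const unsigned char *buf, size_t len)
    {
      if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
          && buf[2] == 0 && buf[3] == 0)
        return "UTF-32LE";
      if (len >= 4 && buf[0] == 0 && buf[1] == 0
          && buf[2] == 0xFE && buf[3] == 0xFF)
        return "UTF-32BE";
      if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
      if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
      if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
      return 0;
    }

    // True if a zero byte occurs within the first two lines.
    static bool zero_in_first_two_lines(const unsigned char *buf, size_t len)
    {
      int newlines = 0;
      for (size_t i = 0; i < len && newlines < 2; i++) {
        if (buf[i] == 0)
          return true;
        if (buf[i] == '\n')
          newlines++;
      }
      return false;
    }

    const char *choose_encoding(const unsigned char *buf, size_t len,
                                const char *coding_tag,  // tag from the file, or 0
                                const char *default_enc) // fallback encoding
    {
      // 1. A BOM wins; any coding tag is ignored.
      if (const char *enc = bom_encoding(buf, len))
        return enc;

      // 2. Zero bytes in the first two lines: warn, and use the
      //    coding tag (read with the zero bytes skipped) if one was
      //    found.
      if (zero_in_first_two_lines(buf, len)) {
        fprintf(stderr, "warning: zero bytes in input; "
                        "skipping them while looking for a coding tag\n");
        if (coding_tag)
          return coding_tag;
      }

      // 3. Otherwise fall back to the default encoding.  For BOM-less
      //    UTF-16 or UTF-32 input this is normally wrong, but it
      //    avoids fragile byte-order heuristics.
      return default_enc;
    }

An encoding given explicitly on the command line, as mentioned above,
would simply bypass a function like this altogether.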