> > There is a serious problem with coding tags and utf-16 encodings
> > of any flavour: Emacs simply can't recognize the tag.  This is a
> > non-trivial problem.
>
> Sorry for the late reply, but I think a coding tag is useless for a
> file encoded in one of the utf-16 variants.
>
> If a file has a BOM at its head, the BOM tells the exact encoding,
> whatever is specified in the coding tag.
>
> If a file is encoded without a BOM, we must use less reliable
> heuristics to guess between utf-16be and utf-16le.  If you find a
> coding-tag spec by ignoring all zero bytes at even byte indexes, the
> file is in all likelihood utf-16be, whatever the tag value is.  If
> you find a coding-tag spec by ignoring all zero bytes at odd byte
> indexes, the file is utf-16le, whatever the tag value is.
>
> So in any case, the tag value itself is useless.  [...]
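For illustration, here is a minimal C++ sketch of the zero-byte
heuristic described above.  It is not code from Emacs or groff, and
the function names are invented; it also only handles the clear-cut
case where all bytes on one side are zero, rather than actually
locating a coding-tag spec among the remaining bytes.

    #include <cstddef>

    enum guess { GUESS_NONE, GUESS_UTF16BE, GUESS_UTF16LE };

    // A BOM, if present, settles the byte order outright.
    guess guess_from_bom(const unsigned char *buf, size_t len)
    {
      if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return GUESS_UTF16BE;
      if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return GUESS_UTF16LE;
      return GUESS_NONE;
    }

    // Without a BOM, look at where the zero bytes fall.  ASCII-range
    // text encoded as UTF-16BE has a zero byte at every even index
    // (0x00 'a' 0x00 'b' ...), while UTF-16LE has it at every odd
    // index, so the position of the zeros reveals the byte order
    // regardless of what a coding tag inside the file claims.
    guess guess_from_zero_bytes(const unsigned char *buf, size_t len)
    {
      bool even_zero = true, odd_zero = true;
      for (size_t i = 0; i + 1 < len; i += 2) {
        if (buf[i] != 0)
          even_zero = false;
        if (buf[i + 1] != 0)
          odd_zero = false;
      }
      if (even_zero && !odd_zero)
        return GUESS_UTF16BE;
      if (odd_zero && !even_zero)
        return GUESS_UTF16LE;
      return GUESS_NONE;
    }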
I'll do the following for groff's preprocessor, preconv:

  . If the data starts with a BOM, use it and ignore the coding tag.

  . Otherwise, if there are zero bytes in the first two lines, ignore
    those zero values, emit a warning, and use the coding tag, if any.

  . Otherwise, use the default encoding -- this will normally lead to
    a wrong result and make groff explode, but I consider it better
    than applying heuristics, especially if you have to recognize both
    the UTF-16 and the UTF-32 variants.

This is probably a suboptimal solution, but it is quite easy to
implement, and the user can always select an encoding explicitly on
the command line.  Perhaps someone finds (and implements) a better
way, which I can then adapt to preconv.


    Werner
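As a rough illustration, that three-step decision order could look
like the following C++ sketch; the names (choose_encoding,
bom_encoding, and so on) are invented, and this is not preconv's
actual implementation:

    #include <cstddef>
    #include <cstdio>

    // Return an encoding name if the data starts with a BOM we
    // recognize, otherwise 0.  The four-byte UTF-32 BOMs must be
    // checked before the two-byte UTF-16 ones, since the UTF-32LE
    // BOM starts with the UTF-16LE one.
    static const char *bom_encoding(const unsigned char *buf, size_t len)
    {
      if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
          && buf[2] == 0 && buf[3] == 0)
        return "UTF-32LE";
      if (len >= 4 && buf[0] == 0 && buf[1] == 0
          && buf[2] == 0xFE && buf[3] == 0xFF)
        return "UTF-32BE";
      if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
      if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
      if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
      return 0;
    }

    // True if a zero byte occurs within the first two lines.
    static bool zero_in_first_two_lines(const unsigned char *buf, size_t len)
    {
      int newlines = 0;
      for (size_t i = 0; i < len && newlines < 2; i++) {
        if (buf[i] == 0)
          return true;
        if (buf[i] == '\n')
          newlines++;
      }
      return false;
    }

    const char *choose_encoding(const unsigned char *buf, size_t len,
                                const char *coding_tag,  // tag from the file, or 0
                                const char *default_enc) // fallback encoding
    {
      // 1. A BOM wins; any coding tag is ignored.
      if (const char *enc = bom_encoding(buf, len))
        return enc;

      // 2. Zero bytes in the first two lines: warn, and use the
      //    coding tag (read with the zero bytes skipped) if one was
      //    found.
      if (zero_in_first_two_lines(buf, len)) {
        fprintf(stderr, "warning: zero bytes in input; "
                        "skipping them while looking for a coding tag\n");
        if (coding_tag)
          return coding_tag;
      }

      // 3. Otherwise fall back to the default encoding.  For BOM-less
      //    UTF-16 or UTF-32 input this is normally wrong, but it
      //    avoids fragile byte-order heuristics.
      return default_enc;
    }

An encoding given explicitly on the command line, as mentioned above,
would simply bypass a function like this altogether.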