Re: [bug #66287] help preconv guess the correct encoding of shipped files

G. Branden Robinson Fri, 04 Oct 2024 04:00:53 -0700

At 2024-10-03T18:35:37-0400, Dave wrote:
> Follow-up Comment #2:
> [comment #1 comment #1:]
> > [comment #0 original submission:]
> > > Unfortunately, preconv looks only at the first two lines of a file
> > > for encoding information.
> > 
> > Only if the file isn't seekable...
> 
> preconv looks at 0 lines if the file isn't seekable, and 2 lines if it
> is.  Per its man page: "If the input stream is seekable, check the
> first two input lines for a GNU Emacs file-local variable identifying
> the character encoding." Under no circumstances will preconv find the
> tag if it appears after the first two lines.


Hmm, right.  Thanks for reminding me.  I feel pulled in several
directions lately...
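
For anyone wanting to check a file against the rule quoted above, here
is a rough sketch.  It is not preconv's actual parser: the function name
coding_tag_in_head is made up for illustration, and matching the bare
string "coding:" is a simplifying assumption.

```shell
# Hypothetical helper, not part of groff.
# Succeeds if a "coding:" tag appears in the first two lines of the
# file named by $1, which is the only place the preconv man page says
# the tag is looked for (and only when the input is seekable).
coding_tag_in_head() {
    head -n 2 "$1" | grep -q 'coding:'
}
```

Something like `coding_tag_in_head contrib/mm/groff_mmse.7.man || echo
"tag out of reach"` would then flag a file whose tag sits further down.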

> I don't desire any change in preconv.  I merely desire to change
> shipped groff files to give preconv a greater chance of getting the
> encoding right.

This is fine if it doesn't fool Emacs into ignoring the local variables
at the end of the file and making the overall file editing experience
_worse_ for people who _do_ have uchardet installed.

I reckon I'll test that.

> Putting the "coding:" tag in the first two lines, where preconv will
> find it, is a small change to two shipped files and no executables.
> 
> > Hmm, can't reproduce a problem here with _groff_ 1.23.0 or Git HEAD.
> 
> Ah, probably you have a uchardet library, which is preconv's next step
> after checking the first two lines for an encoding tag.

I assuredly do.

> > Can you do some experiments with `preconv -d` and see what it says?
> 
> Sure.  On a UTF-8 terminal, absent uchardet, preconv guesses the wrong
> encoding for groff_mmse.7.man:
> 
> $ fgrep 'coding: ' contrib/mm/groff_mmse.7.man 
> .\" coding: latin-1
> $ echo $LC_CTYPE
> en_US.utf8
> $ preconv -d contrib/mm/groff_mmse.7.man > /dev/null
> fallback encoding: 'UTF-8'
> processing 'contrib/mm/groff_mmse.7.man'
>   no coding tag
>   could not detect encoding with uchardet
>   encoding used: 'UTF-8'
>   incomplete UTF-8 sequence(s) in input stream: replacing each such sequence
> with 0xFFFD
> $ preconv --version
> GNU preconv (groff) version 1.23.0.1624-4d251-dirty with iconv support and
> without uchardet support
> 
> And on a latin-1 terminal, it guesses the wrong encoding for
> meintro_fr.me.in:
> 
> $ fgrep 'coding: ' doc/meintro_fr.me.in
> .\" coding: utf-8
> $ echo $LC_CTYPE
> en_US.iso88591
> $ preconv -d doc/meintro_fr.me.in > /dev/null
> fallback encoding: 'ISO-8859-1'
> processing 'doc/meintro_fr.me.in'
>   no coding tag
>   could not detect encoding with uchardet
>   encoding used: 'ISO-8859-1'
> 
> Putting the coding: tag at the tops of the files, following the
> examples of the two .mom files I cited, fixes both of these.

Hrm, yup.  If that provokes GNU Emacs into bad ergonomics as noted
above, it may be time to migrate at least these two files to UTF-8 in
the source tree.  That day is coming one way or the other...
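
A minimal sketch of such a migration for a Latin-1 file, assuming
iconv(1) is available.  The function name to_utf8 is hypothetical, and
the file's "coding:" tag would still need updating by hand afterward.

```shell
# Hypothetical helper, not part of groff.
# Convert the ISO-8859-1 file named by $1 to UTF-8, writing the result
# to the file named by $2.  The input file is left untouched.
to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
}
```

Writing to a separate output file keeps the original around for
comparison until the result has been checked.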

>  {savane: user = 108747; tracker = bugs; item = 66287}
