Re: normalization problem with `@anchor` targets

pertusus Thu, 06 Feb 2025 09:40:39 -0800

On Thu, Feb 06, 2025 at 05:02:30PM +0000, Werner LEMBERG wrote:
> 
> I don't understand.  How does 'ö' get mapped to 'o'?  It is not NFC, but
> it can't be transliteration either, can it?  I thought that the idea
> of transliteration is to map non-ASCII characters to ASCII strings
> unambiguously,


No, the idea is to map to ASCII strings that are not too long and such
that looking at the file name, it is possible to imagine what
node/anchor it corresponds to at least to some extent.

>  so I could imagine that 'Bögen' gets mapped to
> 'B_oe_gen' or something similar.  Stripping off the umlaut dots from
> the 'ö' character to convert 'Bögen' to 'Bogen' can never be the right
> solution.

That is what we do.  We remove diacritics ourselves in some cases, but
we mainly use Text::Unidecode (https://metacpan.org/pod/Text::Unidecode)
or, in C, iconv with //TRANSLIT.

The objective is not to have a unique filename nor to be perfect, but to
have ASCII that is recognizable.

> I have no opinion here.  IMHO, it should work with or without
> `--no-transliterate-file-names`.  What I currently see is a strange
> Texinfo warning message that implies that I can't use `@anchor{Bogen}`
> and `@anchor{Bögen}` at the same time.  There are many languages where
> you have different words – probably also in French – that become
> identical if you strip off the diacritics.

We did not try to make the transliteration unambiguous or reversible.
My guess is that this is impossible when any Unicode character may
appear in the same document.

And, indeed, there are many such words in French.  I used Prés and Près
in the test for clashes (and an invented Prês).
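
As a rough illustration of how such clashes arise, here is a minimal
Python sketch.  It is not the texi2any code and does not use
Text::Unidecode or iconv; it only approximates diacritic removal with
NFD decomposition from the standard unicodedata module.  The names
strip_diacritics and find_clashes are made up for this example.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Approximate an ASCII transliteration by dropping combining marks.

    Assumption: this is a stand-in for a real transliterator. Characters
    with no canonical decomposition (for example 'ß' or 'ø') are simply
    dropped here, which a real transliterator would handle differently.
    """
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.encode("ascii", "ignore").decode("ascii")


def find_clashes(names):
    """Group names by their ASCII form and keep groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        groups[strip_diacritics(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


clashes = find_clashes(["Bogen", "Bögen", "Prés", "Près", "Prês", "Index"])
for ascii_name, originals in clashes.items():
    print(ascii_name, "<-", ", ".join(originals))
```

With this input, "Bogen" and "Bögen" end up under the same ASCII name,
as do the three Pr?s variants, which is the kind of situation the
warning reports.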

-- 
Pat
