Re: Encoding error for HTML output (Solaris 11)

Patrice Dumas Sat, 22 Nov 2025 05:49:58 -0800

On Sat, Nov 22, 2025 at 10:03:12AM +0000, Gavin Smith wrote:
> On Sat, Nov 22, 2025 at 10:29:24AM +0100, Patrice Dumas wrote:
> > On Fri, Nov 21, 2025 at 11:04:36PM +0000, Gavin Smith wrote:
> > > Is there some internal conversion done on section titles that doesn't show
> > > up in the output?
> > 
> > Indeed, there is, as shown by the trace in your other email, the
> > normalization of 'HTML Cross-references' is used for section arguments
> > to get a string that can be used as target.  This would also happen with
> > @expansion on @node line.


Actually, it would not happen for @expansion on a @node line, as this
only happens with transliteration.  And even with transliteration, with
the recent changes, I believe that a node would get two identifiers: a
transliterated one, which is not necessarily reproducible, and the
Xref-compliant one, which is reproducible.

> Indeed, there is a difference between Solaris 11 output and GNU/Linux.
> On Solaris 11:
> 
> <h2 class="chapter subsection-level-set-chapter" 
> id="g_t_0040expansion_007b_007d-_0028_0029_003a-Indicating-an-Expansion"
> 
> On GNU/Linux:
> 
> <h2 class="chapter subsection-level-set-chapter" 
> id="g_t_0040expansion_007b_007d-_0028_003f_0029_003a-Indicating-an-Expansion">
> 
> The difference is in the "_0028_0029" section of the header 'id' attribute.
> This gives the ASCII values for "()".   On GNU/Linux it is _0028_003f_0029
> which corresponds to "(?)" - here, "?" is evidently used as a replacement
> character for the right arrow character.
> 
> Neither is good output.

The ids of sectioning commands are not specified.  They are only supposed
to be consistent "internally", i.e. the same id should be the one used
in the generated HTML (sectioning command ids are not often used) or
available to user-defined code through the HTML customization API.

Therefore, in normal runs, right now speed is favored over consistency
and the iconv "us-ascii//TRANSLIT" output is used.

In tests, the Perl code is called.

> If I run with TEXINFO_XS=omit, the output is different: _0028_21a6_0029.
> Here _21a6 refers to the correct character.  This is the same on both
> Solaris 11 and GNU/Linux.

The same output is also obtained with TEST set.
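
As a reading aid for the three ids quoted above, here is a minimal Python
sketch that decodes them back to text.  It is not Texinfo code, and
decode_id is a hypothetical helper.  It assumes the escaping of the 'HTML
Cross-references' normalization: '_' followed by four lowercase hex digits
for a code point ('__' and six digits above U+FFFF), '-' for a space, and
a 'g_t' prefix added when the id would not start with a letter.

```python
import re

# '_' + 4 hex digits, or '__' + 6 hex digits (assumed escaping scheme).
_ESCAPE = re.compile(r"__([0-9a-f]{6})|_([0-9a-f]{4})")


def decode_id(ident):
    """Turn a sectioning command id back into readable text (sketch)."""
    # Assumption: a leading 'g_t' is only the prefix added to ids that
    # would otherwise not start with a letter.
    if ident.startswith("g_t"):
        ident = ident[3:]
    # Spaces are written as '-'; a literal '-' would be escaped as _002d,
    # so this replacement has to happen before the escapes are decoded.
    ident = ident.replace("-", " ")
    return _ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), ident
    )


SOLARIS = "g_t_0040expansion_007b_007d-_0028_0029_003a-Indicating-an-Expansion"
LINUX = "g_t_0040expansion_007b_007d-_0028_003f_0029_003a-Indicating-an-Expansion"
PERL = "g_t_0040expansion_007b_007d-_0028_21a6_0029_003a-Indicating-an-Expansion"

for name, ident in (("Solaris XS", SOLARIS), ("GNU/Linux XS", LINUX), ("Perl", PERL)):
    print(name, "->", decode_id(ident))
```

Only the Perl id decodes to a title that still contains U+21A6; the
GNU/Linux one has "?" in its place and the Solaris one has nothing.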

> Hence there is a clear bug here with inconsistent output between XS and
> pure Perl code, with the pure Perl output being superior.

To me it is not so clear that it is a bug.  It could be, but it is
debatable.

> It appears to be from the use of the "us-ascii//TRANSLIT" encoding in
> 'unicode_to_transliterate' in main/node_name_normalization.c.  My
> guess is that this system either doesn't have such an encoding or doesn't
> support some characters for transliteration.

Indeed, there is no guarantee about the output with "us-ascii//TRANSLIT"
iconv transliteration, while the Perl code is more consistent.
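
To illustrate how the three ids could arise, here is a small Python model.
It is only a sketch under assumptions, not the code in
node_name_normalization.c nor the Perl code: the three translit_*
functions are hypothetical stand-ins for what a transliteration step might
do with a character it has no mapping for (drop it, replace it by "?", or
leave it alone), and escape_for_id assumes the '_' plus hex escaping of
the 'HTML Cross-references' normalization.  The title is reconstructed
from the ids quoted in this thread.

```python
def translit_drop(text):
    # Hypothetical: unmapped non-ASCII characters are silently removed.
    return "".join(c for c in text if ord(c) < 128)


def translit_question(text):
    # Hypothetical: unmapped non-ASCII characters become "?".
    return "".join(c if ord(c) < 128 else "?" for c in text)


def translit_keep(text):
    # Hypothetical: unmapped characters are left for the escaping step.
    return text


def escape_for_id(text):
    """Escape text into an id (assumed scheme, see the lead-in)."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch == " ":
            out.append("-")
        elif ord(ch) <= 0xFFFF:
            out.append("_%04x" % ord(ch))
        else:
            out.append("__%06x" % ord(ch))
    ident = "".join(out)
    # Prefix ids that do not start with a letter.
    if not ident[:1].isalpha():
        ident = "g_t" + ident
    return ident


TITLE = "@expansion{} (\u21a6): Indicating an Expansion"

for translit in (translit_drop, translit_question, translit_keep):
    print(translit.__name__, escape_for_id(translit(TITLE)))
```

In this model the only difference between the three ids is what happens to
U+21A6 before the escaping step, which is the part iconv does not specify.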

> I found the use of this encoding was introduced in commit 1c9a5f283:
> 
> Author: Patrice Dumas <[email protected]>
> Date:   2023-10-11 15:11:11 +0200
> 
>     * tp/Texinfo/Convert/HTML.pm (_set_root_commands_targets_node_files):
>     remove unused $output_units argument.  Remove unused $no_unidecode.
>     Put $extension in if.
>     
>     * tp/Texinfo/XS/main/errors.c (reallocate_error_messages)
>     (message_list_line_error_internal)
>     (message_list_document_error_internal, message_list_document_error)
>     (message_list_document_warn), tp/Texinfo/XS/main/get_perl_info.c
>     (html_converter_initialize): add message_list_document_warn and
>     message_list_document_error and add error messages in converter.
>     
>     * tp/Texinfo/XS/main/convert_utils.c, tp/Texinfo/XS/main/utils.c
>     (output_conversions, input_conversions, decode_string, encode_string):
>     move output_conversions, input_conversions, decode_string, encode_string
>     to utils.c.
>     
>     * tp/Texinfo/XS/parsetexi/input.c (parser_input_conversions): rename
>     input_conversions as parser_input_conversions.
>     
>     * tp/Texinfo/XS/convert/convert_html.c (normalized_to_id)
>     (normalized_label_id_file, unique_target)
>     (new_sectioning_command_target, set_root_commands_targets_node_files)
>     (html_prepare_conversion_units_targets),
>     tp/Texinfo/XS/convert/converter.c (id_to_filename)
>     (normalized_sectioning_command_filename, node_information_filename),
>     tp/Texinfo/XS/main/call_perl_function.c
>     (call_file_id_setting_label_target_name)
>     (call_file_id_setting_node_file_name)
>     (call_file_id_setting_sectioning_command_target_name),
>     tp/Texinfo/XS/main/node_name_normalization.c
>     (unicode_to_transliterate, normalize_transliterate_texinfo)
>     (normalize_transliterate_texinfo_contents): implement
>     set_root_commands_targets_node_files.
> 
> I don't get any understanding by looking at that commit why
> "us-ascii//TRANSLIT" was used.  It seems likely that such an encoding
> wouldn't be supported or work identically on different systems.

At that time it was simply used as a C replacement for Text::Unidecode.
Later on, I added the possibility to call Perl to do the transliteration
reproducibly.  But as I said above, reproducibility is not guaranteed for
sectioning command identifiers, so it remained as is in that case.


-- 
Pat
