On Fri, Feb 07, 2025 at 10:34:47AM +0000, Gavin Smith wrote:
> On Wed, Feb 05, 2025 at 05:05:33PM -0500, Daniel Macks wrote:
> > For the first one, test-suite.log has:
> > 
> > FAIL: test_scripts/encoded_non_ascii_command_line.sh
> > ====================================================
> > 
> > D: encoded/diffs/non_ascii_command_line.diff (printed below)
> > Only in ./encoded/res_parser/non_ascii_command_line: int%c3%a9rnal.txt
> > Only in encoded/out_parser/non_ascii_command_line: inte%cc%81rnal.txt
> > Only in ./encoded/res_parser/non_ascii_command_line: os%c3%a9-texinfo.texi
> > Only in ./encoded/res_parser/non_ascii_command_line: os%c3%a9_utf8.1
> > Only in ./encoded/res_parser/non_ascii_command_line: os%c3%a9_utf8.2
> > Only in ./encoded/res_parser/non_ascii_command_line: os%c3%a9_utf8_abt.html
> > Only in encoded/out_parser/non_ascii_command_line: ose%cc%81-texinfo.texi
> > Only in encoded/out_parser/non_ascii_command_line: ose%cc%81_utf8.1
> > Only in encoded/out_parser/non_ascii_command_line: ose%cc%81_utf8.2
> > Only in encoded/out_parser/non_ascii_command_line: ose%cc%81_utf8_abt.html
> > D: encoded/diffs/non_ascii_command_line.diff (printed above)
> > testdir: encoded
> > driving_file: ./encoded/list-of-tests
> > made result dir: ./encoded/res_parser/
> > 
> > doing test non_ascii_command_line, src_file 
> > built_input/non_ascii/osé_utf8.texi
> > format_option: 
> > texi2any.pl non_ascii_command_line -> 
> > encoded/out_parser/non_ascii_command_line
> >  /usr/bin/perl -w ./..//texi2any.pl  --force --conf-dir ./../t/init/ 
> > --conf-dir ./../init --conf-dir ./../ext -I ./encoded -I encoded/ -I ./ -I 
> > . -I built_input -I built_input/non_ascii --error-limit=1000 -c TEST=1  
> > --output encoded/out_parser/non_ascii_command_line/ --html --no-split -c 
> > DO_ABOUT=1 -c COMMAND_LINE_ENCODING=UTF-8 -c MESSAGE_ENCODING=UTF-8 -c 
> > OUTPUT_FILE_NAME_ENCODING=UTF-8 --split=Mekanïk 
> > --document-language=Destruktïw -c 'Kommandöh vâl' -D TÛT -D 'vùr ké' -U 
> > ôndef -c 'FORMAT_MENU mînù' 
> > --macro-expand=encoded/out_parser/non_ascii_command_line/osé-texinfo.texi 
> > --internal-links=encoded/out_parser/non_ascii_command_line/intérnal.txt 
> > --css-include çss.css --css-include cêss.css --css-ref=rëf --css-ref=öref 
> > -D 'neednonasciifilenames Need non-ASCII file names' 
> > built_input/non_ascii/osé_utf8.texi > 
> > encoded/out_parser/non_ascii_command_line/osé_utf8.1 
> > 2>encoded/out_parser/non_ascii_command_line/osé_utf8.2
> > 
> > all done, exiting with status 1
> > 
> > and nearly identical information about the others. In all cases, the 
> > filenames differ by %a9 vs %81.
> > 
> 
> It is not just that one byte.  The reference results have %c3%a9 and what
> was produced is e%cc%81.  This is different ways of outputting the é character
> (e with acute accent).  You can check this on the command line with
> 
>   LC_ALL=C printf 'e\xCC\x81'
> 
> and
> 
>   LC_ALL=C printf '\xC3\xA9'
> 
> Perhaps some Unicode normalisation step is missing and/or faulty.

Indeed, the output uses e followed by a combining acute accent instead of
the precomposed character, which is what Normalization Form C (NFC)
typically produces.  For output file names, we probably normalize and
output ASCII only for most HTML names (with a normalization to NFC to
avoid inconsistencies).  Otherwise, I think that there is no specific
normalization, only a conversion to UTF-8, because the options passed on
the command line force that encoding, but nothing more.
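
To make the difference concrete, here is a small Python sketch (Python
only for illustration, it is not texi2any code) that checks that the two
byte sequences from the log are the decomposed and precomposed forms of
the same character.  The variable names are made up for the example.

```python
import unicodedata

# Bytes seen in the produced file names: 'e' + U+0301 COMBINING ACUTE ACCENT
decomposed = b"e\xcc\x81".decode("utf-8")
# Bytes seen in the reference file names: U+00E9, the precomposed character
precomposed = b"\xc3\xa9".decode("utf-8")

# The two strings render the same but are different code point sequences.
print(decomposed == precomposed)

# NFC composes, NFD decomposes; each maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

The first comparison is false and the other two are true, which matches
the idea that only a normalization step separates the two sets of names.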

We could normalize the file names produced by texi2any, possibly
differently for tests and for regular output.  However, some file names
are not under texi2any control and come from the shell only, mainly
from redirections:
  > encoded/out_parser/non_ascii_command_line/osé_utf8.1
  2>encoded/out_parser/non_ascii_command_line/osé_utf8.2

I think that we need to modify tta/tests/escape_file_names.pl in any
case, adding a step that decodes the file name, normalizes it to NFC and
encodes it again before the percent encoding.  We could also change the
file names output by texi2any, but I am not sure about that.
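
A rough sketch of the kind of step I have in mind, written in Python
rather than Perl and not taken from escape_file_names.pl.  The function
name escape_file_name is hypothetical, and I assume here that only
non-ASCII bytes are percent encoded, in lowercase hex as in the log
above; the real script may differ.

```python
import unicodedata

def escape_file_name(raw: bytes) -> str:
    """Hypothetical helper: NFC-normalize a UTF-8 file name, then
    percent encode its non-ASCII bytes (assumed escaping rule)."""
    text = raw.decode("utf-8")                      # bytes -> characters
    composed = unicodedata.normalize("NFC", text)   # e + U+0301 -> U+00E9
    encoded = composed.encode("utf-8")              # characters -> bytes
    # Keep ASCII bytes as is, write the others as %xx in lowercase hex.
    return "".join(chr(b) if b < 0x80 else "%%%02x" % b for b in encoded)

# Both forms of the name end up with the same escaped name.
print(escape_file_name(b"ose\xcc\x81_utf8.1"))  # os%c3%a9_utf8.1
print(escape_file_name(b"os\xc3\xa9_utf8.1"))   # os%c3%a9_utf8.1
```

With such a step, the names created by the shell redirections would
compare equal to the reference names whatever form was used to create
them.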

-- 
Pat
