Hi Alex,

At 2022-01-24T22:13:32+0100, Alejandro Colomar wrote:
> Hi Branden,
>
> And another html bug; however, this one seems to be a browser bug, but
> please confirm.

Maybe not.

> For the following code:
>
> [
> .TP
> .B \(aq\-\(aq
> Empty white cell.
> ]
>
> groff(1) generates the following HTML code:
>
> [
> <p><b>'&minus;'</b></p></td>
> <td width="5%"></td>
> <td width="22%">
> ]
>
> However, both firefox and chrome show something that if copy&pasted to
> a terminal is different from ASCII 45, and is longer than the proper
> minus sign.

If your system works like mine, it _is_ a "proper minus sign".

$ lynx -dump EXPERIMENTS/chess-init.6.html | sed -n '21p' | xxd
00000000: 2020 2063 6865 7373 e288 9269 6e69 740a     chess...init.

And UTF-8 E2 88 92 is...

$ unicode −
U+2212 MINUS SIGN
UTF-8: e2 88 92 UTF-16BE: 2212 Decimal: &#8722; Octal: \021022
−
Category: Sm (Symbol, Math); East Asian width: N (neutral)
Unicode block: 2200..22FF; Mathematical Operators
Bidi: ES (European Number Separator)

> Should I report a bug to firefox?

No, you're getting correct output...almost.

\- to U+2212 is a wholly legitimate mapping for troff typesetting going
back to 1973.

But man(7) pages are an issue.  There, a "real" minus sign is almost
never wanted.  It makes sense for the man(7) package to have a bespoke
mapping for the minus sign glyph to the basic Latin hyphen-minus on
devices that distinguish them.

I see the following in /etc/groff/man.local on my Debian system with
groff 1.22.4:

.  \" Debian: "\-" is more commonly used for option dashes than for minus
.  \" signs in manual pages, so map it to plain "-" for HTML/XHTML output
.  \" rather than letting it be rendered as "&minus;".
.  ie '\*[.T]'html' \
.    char \- \N'45'
.  el \{\
.    if '\*[.T]'xhtml' \
.      char \- \N'45'
.  \}

Debian shouldn't have to do that; groff should, and moreover should move
this character definition into the an.tmac file and apply it to the utf8
groff output device as well, not just (x)html.

There is a related bug in that groff's html device maps regular '-' to
the basic Latin hyphen-minus when it should become the HTML &hyphen;
entity instead.
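
To keep the three characters in this thread apart (hyphen-minus U+002D,
hyphen U+2010, minus sign U+2212), here is a small Python sketch that
prints each one's code point, Unicode name, and UTF-8 bytes.  It is not
part of groff, and the helper name describe() is made up purely for
illustration.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))


# The three characters that are easy to confuse in this discussion:
# hyphen-minus (ASCII 45), hyphen, and minus sign.
for ch in ("\u002d", "\u2010", "\u2212"):
    print(*describe(ch), sep="  ")
```

The bytes reported for U+2212 are the same e2 88 92 sequence that shows
up in the xxd dump, which is a quick way to tell what a browser or
terminal actually handed you on copy and paste.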

Here's partial output from a slightly modified version of your page.

$ lynx -dump EXPERIMENTS/chess-init.6.html | sed -n '17p' | xxd
00000000: 2020 2063 6865 7373 e288 9269 6e69 7420     chess...init 
00000010: e288 9220 696e 6974 6961 6c69 7a65 2061  ... initialize a
00000020: 2063 6865 7373 2067 616d 6520 666f 7220   chess game for 
00000030: 796f 7572 206d 6f74 6865 722d 696e 2d6c  your mother-in-l
00000040: 6177 0a                                  aw.

That's wrong, but I understand why it happened.

If a man(7) page author truly wants a Unicode minus sign--perhaps for an
expansion of the unicode(7) page--they can obtain it with a special
character escape sequence: \[u2212].

So this is a bug, too: the grohtml output device needs to map - to
&hyphen; and \- to &minus;, and groff's an.tmac needs to override that
mapping of \- to point it at \N'45'.

In groff Git HEAD, we have this in an.tmac:

.\" === Define/remap characters. ===
.
.\" For UTF-8, map the minus sign to the hyphen-minus to facilitate
.\" copy and paste of code examples, file names, and URLs embedding it.
.if '\*[.T]'utf8' \{\
.  char \- \N'45'
.  char - \N'45'
.\}

As a related matter I would kill the second 'char' request (remapping
the unescaped input dash).  The first should be done not just for
'utf8', but for 'html' and 'xhtml' as well.

Would you like to file this one as well?

Regards,
Branden