Re: Special characters

G. Branden Robinson Fri, 22 Sep 2023 03:03:20 -0700

At 2023-09-22T10:56:06+0200, H.Merijn Brand wrote:
> Shorted reply. Might expand on this later


No worries.  I acknowledge that my emails sometimes resemble homework
assignments.

> I realized when I re-read that this morning and apologized in my reply
> I'll apologize again if that was not clear

I'm not upset with you.  I get frustrated with software as well.
(Frequently.)

If someone loses their temper, they should expect to be ignored,
reproached, or gently steered back to equanimity.  That's just social
dynamics.

> Thanks for the long and clear answer, which I have to re-read a few
> times to get all of the implications. Thanks for the time and effort
> you have put into it to clarify all the points.

You're welcome.  I regard it as part of the job.  ;-)

> nroff2man added for your amusement. It has never been my intention to
> make that public, but feel free to do with it whatever you like.

I'll offer you some feedback on it that might make your life easier.

> Perl5 uses '~' and '^' quite a lot. '~' as part of '=~' operator is
> probably the most widespread use and '^' inside regular expressions.

It certainly does.  But I did feel the need to point out that the
universe of discourse (man pages) is broader than POD.

So, nroff2man...

> #!/pro/bin/perl
> open my $fh, "-|", "nroff", "-mandoc", $nfn;
> print map {
>        s{(?:\x{02dc}|\xcb\x9c         )}{~}grx        # ~

Okay, it takes me some time to parse perlre, not being a full-time Perl
programmer.  I'll try to decode this for the benefit of self and others.

?: makes a capture group "clustering" instead of "capturing"; in other
words, no \1, etc., backreference is produced for the () pair.

Braces following \x appear to admit (arbitrarily?) long hexadecimal code
points instead of the byte-oriented \xXX syntax.

Braces are also used here instead of more traditional delimitation[1]
where opening and closing delimiters are identical.

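Finally we see 'g', 'r', and 'x' options on the substitution.  'g'
replaces every match, not just the first.  'r' performs "non-destructive
substitution": it returns the changed copy and leaves its operand alone,
which is what allows these substitutions to be chained with `=~`.  'x'
treats much whitespace as discardable, for readability.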

Thus, what the above does is replace U+02DC, whether it arrives as a
decoded character or as its two-byte UTF-8 encoding, with an ASCII tilde.
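
To see both alternatives fire, here is a minimal sketch that wraps the
quoted substitution in a function.  fix_tilde() and the sample strings
are my own inventions for illustration, not part of nroff2man, and the
'r' flag assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# fix_tilde() is a hypothetical wrapper around the first substitution.
sub fix_tilde {
    my ($line) = @_;
    # /g: every occurrence; /r: return a changed copy; /x: ignore spaces.
    return $line =~ s{(?: \x{02dc} | \xcb\x9c )}{~}grx;
}

my $decoded = "a \x{02dc} b";    # U+02DC as a single character
my $bytes   = "a \xcb\x9c b";    # U+02DC as two UTF-8 bytes

print fix_tilde($decoded), "\n";
print fix_tilde($bytes),   "\n";
print length($decoded), "\n";    # still 5: /r left the original alone
```

Both calls should print "a ~ b", one via each branch of the alternation.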

>     =~ s{(?:\x{02c6}|\xcb\x86         )}{^}grx        # ^

Similar: circumflex accent -> caret.

>     =~ s{(?:\x{2018}|\xe2\x80\x98
>          |\x{2019}|\xe2\x80\x99       )}{'}grx        # '

Similar: left and right single quotation marks -> neutral apostrophe.

>     =~ s{(?:\x{201c}|\xe2\x80\x9c
>          |\x{201d}|\xe2\x80\x9d       )}{"}grx        # "

Similar: left and right double quotation marks -> neutral double quote.

>     =~ s{(?:\x{2212}|\xe2\x88\x92
>          |\x{2010}|\xe2\x80\x90       )}{-}grx        # -

Similar: minus sign and hyphen -> hyphen-minus.
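
The five substitutions so far share one shape, so they could in
principle collapse into a lookup table.  The sketch below assumes the
input has already been decoded into characters, which would make the
raw-byte alternatives unnecessary; that is a different design from the
quoted script, not a description of it.  %ascii_for, $class, and
to_ascii() are hypothetical names.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical lookup table; the code points are those in the quoted script.
my %ascii_for = (
    "\x{02dc}" => '~',    # SMALL TILDE
    "\x{02c6}" => '^',    # MODIFIER LETTER CIRCUMFLEX ACCENT
    "\x{2018}" => "'",    # LEFT SINGLE QUOTATION MARK
    "\x{2019}" => "'",    # RIGHT SINGLE QUOTATION MARK
    "\x{201c}" => '"',    # LEFT DOUBLE QUOTATION MARK
    "\x{201d}" => '"',    # RIGHT DOUBLE QUOTATION MARK
    "\x{2212}" => '-',    # MINUS SIGN
    "\x{2010}" => '-',    # HYPHEN
);
my $class = join '', keys %ascii_for;

sub to_ascii {
    my ($text) = @_;
    # One pass: each listed character is replaced by its table entry.
    return $text =~ s/([$class])/$ascii_for{$1}/gr;
}

# Sample input as UTF-8 octets, decoded to characters before mapping.
my $octets = "\xe2\x80\x9cuse \xe2\x88\x92w\xe2\x80\x9d";
my $text   = decode('UTF-8', $octets);
print to_ascii($text), "\n";
```

The trade-off is that the decoding step has to happen somewhere, whether
explicitly as here or through an encoding layer on the filehandle.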

>     =~ s{(?:\e\[|\x9b)[0-9;]*m}                 {}grx         # colors

Very different.  Attempts to match ECMA-48 SGR ("select graphic
rendition") control sequences, which set colors and other text
attributes, and remove them.
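
A minimal sketch of what that pattern does and does not catch follows.
strip_sgr() and the sample string are hypothetical; the pattern itself
is the one quoted above, with spaces added under 'x'.

```perl
use strict;
use warnings;

# strip_sgr() is a hypothetical wrapper around the quoted substitution.
sub strip_sgr {
    my ($line) = @_;
    # CSI (ESC [ or the single byte 0x9B), parameters, then a final "m".
    return $line =~ s{(?: \e\[ | \x9b ) [0-9;]* m}{}grx;
}

# Bold, underline, and one non-SGR sequence (ESC [ 2 J, erase display).
my $styled = "\e[1mNAME\e[0m \e[4mfile\e[24m \e[2Jrest";
print strip_sgr($styled) eq "NAME file \e[2Jrest" ? "ok\n" : "not ok\n";
```

Only sequences whose final byte is "m" are removed; any other control
sequence, like the erase-display one in the sample, passes through.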

You can prevent these from showing up in the input in the first place by
passing the '-c' flag to nroff(1).  That runs groff(1) and ultimately
grotty(1) with the same option.  See grotty(1).

On the other hand, that resorts to overstriking, which you also might
not want.  In that case, you can tell grotty(1) to shut off _all_
attempts to represent style changes.

$ nroff -mandoc -P -cbou

nroff's support for `-P` is new in groff 1.23.0.
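
Applied to the script's open call, that might look like the sketch
below.  nroff_command() and the file name 'example.1' are hypothetical;
$nfn is the script's own variable, and the '-P -cbou' pair assumes
groff 1.23.0 or later is what nroff resolves to.

```perl
use strict;
use warnings;

# nroff_command() is a hypothetical helper; "-P -cbou" needs groff 1.23.0+.
sub nroff_command {
    my ($file, $plain) = @_;
    my @cmd = ('nroff', '-mandoc');
    push @cmd, '-P', '-cbou' if $plain;    # handed on to grotty(1)
    return (@cmd, $file);
}

my @cmd = nroff_command('example.1', 1);   # 'example.1' is a placeholder
print "@cmd\n";

# In the script, the list form of open would then become:
#   open my $fh, "-|", nroff_command($nfn, 1) or die "nroff: $!";
```

With style changes suppressed at the source, the "colors" substitution
should have nothing left to match.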

Regards,
Branden

[1] nonce word?
