Re: Warn on mid-input line sentence endings

Alejandro Colomar Sat, 29 Apr 2023 18:05:02 -0700

Hi Branden,

On 4/30/23 02:05, G. Branden Robinson wrote:
> I should clarify a couple of points here since I was feeling grumpy when
> I wrote the following, and that made me forget things.
> 
> At 2023-04-27T09:45:40-0500, G. Branden Robinson wrote:
>> We're re-covering some familiar ground here.
>>
>> I have a few points I'd like to make.
>>
>> 1.  "Semantic newlines" is a terrible term.
> 
> I should have said "_Warn on_ semantic newlines" is a terrible
> instruction/summary.


That's why I've been using the phrase "warn on S. N. violations" (or at
least I've tried to use it consistently recently).

> 
> They are what we _don't_ want to warn about upon encountering them.
> 
> If man-pages(7) or other people continue to call the practice of
> breaking *roff input lines after sentence-ending punctuation "semantic
> newlines", I have no complaint.  It could also be called "Kernighan
> breaking", in honor of an early popularizer of the practice.

You could use it for the warning name ;).

> 
>> 2.  Bjarni's comment '"groff" is not the right tool for such things,
>>     but "grep" is.' is thoroughly wrong-headed and Ingo was right to
>>     reject it with great force.  Here are a few reasons why.  I don't
>>     think any of B through D are relevant to mandoc(1) since it
>>     doesn't support the features in question (as far as I know).
>>
>>     A.  The formatter decides where sentence boundaries are based on
>>     its input.
>>
>>     B.  Use of the `cflags` request can change the characters that
>>     have sentence-ending semantics.  grep(1) cannot know this.
>>
>>     C.  Sentence-ending characters are subject to character
>>     translation (the `tr` request).  grep(1) cannot know this.
>>
>>     D.  The user/document could define a special character that is a
>>     sentence-ending character (with `char` and `cflags`).  grep(1)
>>     cannot know this.
> 
>       E.  Because '.', '?', and '!' are valid characters in *roff
>       identifiers, grep(1) can be fooled by special character, register,
>       or string interpolations in the input if their identifiers use
>       those characters.
> 
> Example:
> 
> I can't believe \*(I.  ate the whole thing.
> 
> It is only valid to detect the end of a sentence here if the (recursive)
> _expansion_ of the `I.` string ends with a sentence-ending punctuation
> character.
> 
> Further, since string interpolations can result in further string
> interpolations, a finite-state automaton will not suffice to analyze
> this input.  You need a stack machine.  (IIRC, a stack machine
> recognizes "recursively enumerable" languages.)
> 
> This is categorically not what regular expressions can cope with,
> formally.

Well, formally yes.  A regex can't find every C function definition in a
source tree either, at least not if you try to fool it by writing the
most horrible code in the universe.  But I wrote a relatively small
script[1] that finds a lot of C code with pcre2grep(1), and it works
most of the time.  It has limitations: some can be fixed by improving
the regexes (read: making them even more unreadable); others are likely
impossible to fix with a regex.  The biggest limitation I've met is
K&R-style function definitions: I don't think a regex can cope with
them.

I believe a regex-based script can be good enough for some purposes,
even if it's not perfect.
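
A minimal sketch of what such a "good enough" regex check could look
like, in Python.  It is only an illustration under stated assumptions:
the names MID_LINE_END and find_violations are hypothetical, the
pattern is a guess at a useful heuristic, and none of it is taken from
grepc, groff, or mandoc.  It flags sentence-ending punctuation that is
followed by more text on the same input line, and it knows nothing
about `cflags`, `tr`, `char`, or string interpolation.

```python
import re

# Heuristic (an assumption, not groff's actual rule): '.', '?' or '!',
# optionally followed by closing quotes or brackets, then blanks, then
# more text on the same input line.
MID_LINE_END = re.compile(r"""[.?!]['")\]]*[ \t]+\S""")


def find_violations(text):
    """Return (line, column) pairs for suspected mid-line sentence endings."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Skip control lines (requests and macro calls).  A real checker
        # would also need to look at macro arguments.
        if line.startswith((".", "'")):
            continue
        for match in MID_LINE_END.finditer(line):
            hits.append((lineno, match.start() + 1))
    return hits


if __name__ == "__main__":
    sample = (
        "This is one sentence. This is another.\n"
        "This line is fine.\n"
        ".B bold. text\n"
        "I can't believe \\*(I.  ate the whole thing.\n"
    )
    for lineno, column in find_violations(sample):
        print(f"{lineno}:{column}: possible mid-line sentence ending")
```

The last sample line is the `\*(I.` example quoted above.  The sketch
reports it even though the period is part of a string name, which is
exactly the kind of false positive a regex cannot rule out.
Abbreviations such as "e.g." followed by a space would be flagged as
well.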

Cheers,
Alex

[1]:  <http://www.alejandro-colomar.es/src/alx/alx/grepc.git/tree/bin/grepc>

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5
