(Some of this is off-topic for the groff list.)

Hi Alex,
At 2022-11-08T22:05:25+0100, Alejandro Colomar wrote:
> Okay, here we go for a rant.
>
> Let's say there's some software with cowboy programmers, which has
> things like:
>
>     typedef struct {
>         size_t  length;
>         u_char *start;
>     } str_t;

Those who do not learn Pascal strings are condemned to reinvent them.

>     #define length(s)  (sizeof(s) - 1)
>
>     #define str_set(str, text) do          \
>     {                                      \
>         (str)->length = length(text);      \
>         (str)->start = (u_char *) text;    \
>     }
>
> (Of course, cowboy programmers don't need terminating NUL bytes,
> that's for newbies, but that's not today's rant.)

There are some advantages to Pascal strings.  Having determination of
any string's size be O(1) is a big win over C string-scanning functions
(or loops doing repeated "*ch != '\0'" comparisons), especially when
these are used by the naïve or slack-jawed.[1]

But they have their downsides too.  You truly do need a struct for
them, as they are not homogeneous sequences.  The length is a different
type, and may be a different width, than the subsequent (or pointed-to)
array.

Pascal strings were more obviously advantageous back in the 8-bit era,
when string lengths and chars were both single bytes.  A string length
limit of 255 was not considered terribly limiting on machines where you
could fit only 256 such maximum-length strings in addressable memory
anyway (16-bit address space).  In practice, fewer--often far fewer.

But doing Pascal strings in C is a delicate undertaking due to
impedance mismatching, as you showed below.

> And then some new programmer in the team writes a line of code that's
> something like:
>
>     str = str_set(cond ? "someword" : "another");

Right.  C doesn't _really_ have strings, except at the library level.
It has character arrays and one grain of syntactic sugar for encoding
"string literals", which should not have been called that, because
whether they get the null terminator is context-dependent.

    char a[5] = "fooba";
    char *b = "bazqux";

I see some Internet sources claim that C is absolutely reliable about
null-terminating such literals, but I can't agree.  The assignment to
`b` above adds a null terminator, and the one to `a` does not.  This is
the opposite of absolute reliability.  Since I foresee someone calling
me a liar for saying that, I'll grant that if you carry a long enough
list of exceptional cases for the syntax in your head, both are
predictable.  But it's simply a land mine for the everyday programmer.

Worse, the hype around C that "arrays are really pointers in
disguise!1!  They're interchangeable!1!" constitutes a neon sign
steering the learner straight into the mine field.  As I recall, Peter
van der Linden railed against this folk wisdom way back in 1994
(_Expert C Programming: Deep C Secrets_, Pearson).  Despite writing a
fine book, his efforts did little to quieten this or other examples of
braying hype by C partisans.

Because we teach and practice similarly imprecise nomenclature
regarding "strings" in C, we encourage programmers to make mistakes
like the one you encountered.  This is one of the reasons I'm such a
stickler about terminology in the groff documentation, and am willing
to break with tradition (to Ralph's oft-voiced derision) in a search
for terms that resist extrapolative interpretation.

"There are two fundamentally difficult problems in computer science:
cache invalidation, naming things, and off-by-one errors." -- anon.
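If you want to watch that land mine go off, here is a minimal program
you can compile yourself (the variable names and the commented results
are mine, added for illustration; the last one depends on your
machine's pointer width):

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        char a[5] = "fooba"; /* fills the array exactly; NO terminator */
        char *b = "bazqux";  /* points to a 7-byte array: 6 chars + '\0' */

        /* a holds 5 bytes of char data but is not a C string at all;
           calling strlen(a) would be undefined behavior. */
        printf("%zu\n", sizeof a);         /* 5 */

        /* The literal that b points to did get its terminator. */
        printf("%zu\n", sizeof "bazqux");  /* 7 */
        printf("%zu\n", strlen(b));        /* 6 */
        printf("%zu\n", sizeof b);         /* pointer width, not 7 */

        return 0;
    }

Depending on your compiler and warning flags, the initialization of `a`
may not draw so much as a peep of diagnostic.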
Before getting off of strings, I wanted to point out a third mechanism
of encoding them--one that is usually forgotten nowadays.

Back when 64kB was a lot of memory, both Pascal- and C-style strings
could be considered wasteful.  Zero-terminated strings even more so,
because they consumed an _entire byte_ with their invariant zero bits.
So what a lot of people did back in the days before ISO 8859 was to
flag the end of the string by setting the high bit in its last
character.

    #include <stdio.h>

    int
    main(void)
    {
        /* High-bit termination: the final character carries 0x80.
           Assumes 8-bit, signed-by-default chars. */
        char s[] = { 'H', 'e', 'l', 'l', 'o', ',', ' ',
                     'W', 'o', 'r', 'l', 'd', '!', '\n' + 0x80 };
        char *p = s;

        do {
            putchar(*p & 0x7F); /* strip the flag bit before printing */
        } while (*p++ >= 0);    /* (*p++ < 0x80) on systems w/ unsigned chars */
    }

You may notice that there's no way of encoding an empty string with
this mechanism.  That was by design.  Why would you ever point to an
empty string?  That wastes not one but TWO bytes (16-bit pointers)!

> The author of the patch decides to completely rewrite that line even
> if the bug is not really understood, and it just works after it.

Yes.  A sloppy lexicon, combined with cultural and managerial
preoccupations with "cadence" (always implicitly a higher one), manures
the ground thickly for kludges, black magic, and a habit of individual
contributors abandoning projects so that they experience no
accountability for their coding errors.  And I don't mean "punishment"
as a synonym for "accountability"--though that is a substitution
typical of hard-driving, "type A", "get 'er done" engineers and
managers alike.  I mean accountability in terms of someone being able
to find out _that_ they erred, and _learning_ from it, without a thick
gravy of operant conditioning ladled over it.  God forbid we have
_that_ sort of personal development in our industry.

I may have said this before on this list, since it's one of my favorite
things to hold forth about, but, at least in the U.S., civilian air
traffic controllers have a maxim: safe, orderly, efficient.[2]  You
meet these criteria in order from left to right, and you satisfy one
completely, or to some accepted, documented, and well-known standard
measure, before you move on to the next.  The obvious reason for this
is that when aircraft meet each other at cruise altitudes, many people
die.

I haven't yet settled on a counterpart for software engineering that I
like, but my latest stab at it is this: comprehensible, correct,
efficient.  Incomprehensible code is useless.[3][4]  Even code that is
proven correct by formal methods is fragile if human maintainers are
defeated by its esoteric expression.[5]  (And formal verification can't
save you from an incorrect specification in the first place.)  Richard
Feynman once said something along the lines of: if there is any
phenomenon in physics that he can't successfully explain to an audience
of freshmen (first-year undergraduates), then we don't really
understand it yet.  We use subtle, complex tools to solve problems only
when we haven't worked out ways to overcome them with simple,
straightforward ones.

> I then investigate that line again, and being told it has a bug, but
> that the bug is not known, I quickly realize that it is due to the
> ternary operator decaying the array into a pointer and sizeof later
> doing shit.

Everybody takes a bullet from array decay and sizeof.  At least once.
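To make that bullet concrete, here is a small demonstration (the `cond`
flag and the commented values are mine, for illustration; the second
result is your platform's pointer width):

    #include <stdio.h>

    int
    main(void)
    {
        int cond = 1;

        /* A string literal is an array, so sizeof sees the whole
           thing, terminating null included. */
        printf("%zu\n", sizeof "someword");                      /* 9 */

        /* Both arms of a conditional expression decay to char *, so
           sizeof now measures a pointer, not whichever literal won. */
        printf("%zu\n", sizeof (cond ? "someword" : "another"));

        return 0;
    }

Feed that conditional expression to the quoted length() macro and you
get sizeof(char *) - 1 regardless of which word was selected--an answer
that is right only by accident.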
> And by doing such a change (with a single reviewer that approved it,
> of course), some old developer (who happens to be one of the
> reviewers that happened to be reviewing the patch that almost
> introduced a bug and didn't catch the bug) complains that I'm touching
> sacred code written by god,

Good code does not further improve in quality with knowledge of who
wrote it.  It must speak for itself.

> and I am blaspheming by insinuating that it was unsafe code.

Write and demonstrate an exploit.[6]  It won't make you any more
popular than your present approach, but it will knock the jocks' cowboy
hats askew.

> Then I need to defend my one-line patch (I already defended it in the
> commit message with a somewhat extended explanation, including a
> dissection of the bug that would have been prevented by a compiler
> warning) 2 times, with more text than I would write in two
> hypothetical manual pages about sizeof() and the ternary operator.
> Just imagine around 10 terminal pages of rationale for that change.
> And then 3 meetings with different people.  And so we decide to bring
> this issue to one of the oldest programmers in the group.  Then things
> go as follows.

Some people don't like reading long messages, and will hector you about
opportunity costs while spending time in code reviews (or on mailing
lists) that they might prefer to spend on a golf course.

> I get a review that starts by saying that this makes the macro
> unreadable (seriously, wtf?  I mean, the length() name is probably the
> least useful name that could be given to such a macro, and my change
> is making it unreadable?  okay, okay).

"nitems" is unreadable?  I guess if emails and code review web forms
try the patience of a reader, books are right out.

https://www.google.com/search?q=%22nitems%22+variable&tbm=bks

It's not my favorite name for an lvalue but I've seen it my entire
career.  It's hard to read much C without hitting it.

> Then the review continues by saying that the reviewers are so bad
> that they "actually do allow such trivial bugs to happen".
>
> And goes on to say that it's sad, but it's expected of "new
> developers".

You might ask this person what they believe the purpose of code reviews
to be.  Don't bring preconceptions to this conversation, and don't get
drawn into a discussion with them right away.  Find out what they
think.  There's a chance they'll have some great insight, but many
times I've found that people with great reputations have a shockingly
superficial understanding of certain things.  (We all do about
_something_.)  Passive reception of answers to open-ended questions can
tell you a lot about a person.

> I feel about your INSTALL.* (and other files) what I felt about the
> same man-pages files.  RST is not the easiest thing to read.

I'd hesitate to call it RST.  It's a format that antedates computers,
and goes back to informal memoranda composed on typewriters.  I don't
personally find the retrojection of semantics, inferred by a simple
machine parser, upon such a format to be a satisfactory approach.
Informal memoranda were devised by slippery and flexible human brains
through social rather than engineering processes, and in my view, all
of these "plain text, honest!" markup languages inevitably dash
themselves on rocks when a document steers into a narrow channel where
simple machine parsing does not suffice to resolve ambiguity.  Human
readers have great capacity for inferring exceptional cases (often on
scant evidence) and re-parsing our inputs.  This is laborious in code.
The remedies are always the same: add weird-ass computerese to your
"plain text" format, or simply refuse to admit their expression within
the format.  I find the latter approach more honest, which may explain
why it's less popular.  Many people prefer to claim that they've solved
a problem once and for all; they've got a hustle going, and if you mess
with it they'll get aggressive with you.

Personally, apart from human foibles, I blame SGML for this state of
affairs.  It was too chatty and too ugly for human composition and
construction--for which, I gather, it was not designed.  Why _this_
initiative by big companies and institutional players struck people as
so appealing when, in an adjacent discipline, the excellent Ada
language was received mainly with hostility illustrates well how
terrible middlebrow taste is.

> If you're reading it like a book, it might make sense.  If you have
> technical documentation, which is likely to be organized in unrelated
> sections that you may want to consult independently,

There may be a generational shift in evidence here.  :)  Roughly, each
plain text documentation file in the root of the groff source tree
should be read in its entirety if it need be read at all.[7]  True,
there are cases where you can bail out early, or skip a section if its
title suggests irrelevance to the reader's needs.  It's pretty recently
that I began seriously attacking this aspect of the groff
documentation, having started out much more concerned with end-user-
(rather than developer-) facing materials like man pages.  Maybe I can
further improve this stuff.

> indentation can make a big difference.

Possibly.  I don't see much of a role for it in these text files at
present, but others may have a different vision.

> That's why I rewrote the man-pages repo documentation in a
> man-pages\[en]like (:P) document.  I find it much easier now to see
> the organization of the files at a short glance, and look for what
> you need.
>
> Does it make sense to you?

Somewhat.  There is a place for plain text (_truly_ plain text)
documentation, and with groff there's a bit of a bootstrapping issue;
a configuration and installation manual written in a roff macro
language would deter users who thought they needed to have the system
built first before they could read it.[8]  (Some people are easily
discouraged.)  While I find *roff source documents plenty readable
as-is in a text editor, I acknowledge that I may have been corrupted by
experience.  :P

Regards,
Branden

[1] https://en.wikichip.org/wiki/schlemiel_the_painter%27s_algorithm

[2] https://www.avweb.com/features/say-again-8air-traffic-chaos/

[3] Literally useless, especially once something that "just works" is
    ported to a new context.  "The real problem is that we didn't
    understand what was going on either."
    https://www.bell-labs.com/usr/dmr/www/odd.html

[4] Except for constructing streams of self-lauding bullshit before
    promotion committees comprised of people who themselves attained,
    and will further advance, their status predicated on the audacity
    of their bullshit.

[5] And once something's _that_ solid, it may be time to consider
    etching it in silicon rather than keeping it in primary or
    secondary storage.

[6] But do it in a sandbox lest you become the next Tom Christiansen.

[7] Yeah, the "PROBLEMS" file should be pruned of its advice for
    working around 30-year-old compiler problems.  ("OMG, I can't get
    groff to compile on 386BSD 0.1!!1!"  Well, go get support from Bill
    Jolitz.)

[8] Our Texinfo manual had a section for this.  It sat empty for over
    20 years.
https://git.savannah.gnu.org/cgit/groff.git/commit/?id=e6736968552aa98b0aa602460a3c08de47adfe87