(Some of this is off-topic for the groff list.)

Hi Alex,
At 2022-11-08T22:05:25+0100, Alejandro Colomar wrote:
> Okay, here we go for a rant.
>
> Let's say there's some software with cowboy programmers, which has
> things like:
>
>     typedef struct {
>         size_t  length;
>         u_char *start;
>     } str_t;

Those who do not learn Pascal strings are condemned to reinvent them.

>     #define length(s)  (sizeof(s) - 1)
>
>     #define str_set(str, text) do          \
>     {                                      \
>         (str)->length = length(text);      \
>         (str)->start = (u_char *) text;    \
>     }
>
> (Of course, cowboy programmers don't need terminating NUL bytes,
> that's for newbies, but that's not today's rant.)

There are some advantages to Pascal strings.  Having determination of
any string's size be O(1) is a big win over C string-scanning functions
(or loops doing repeated "*ch != '\0'" comparisons), especially when
these are used by the naïve or slack-jawed.[1]

But they have their downsides too.  You truly do need a struct for
them, as they are not homogeneous sequences.  The length is a different
type, and may be a different width, than the subsequent (or pointed-to)
array.

Pascal strings were more obviously advantageous back in the 8-bit era,
when string lengths and chars were both single bytes.  A string length
limit of 255 was not considered terribly limiting on machines where you
could fit only 256 such maximum-length strings in addressable memory
anyway (16-bit address space).  In practice, fewer--often far fewer.

But doing Pascal strings in C is a delicate undertaking due to
impedance mismatching, as you showed below.

> And then some new programmer in the team writes a line of code that's
> something like:
>
>     str = str_set(cond ? "someword" : "another");

Right.  C doesn't _really_ have strings, except at the library level.
It has character arrays and one grain of syntactic sugar for encoding
"string literals", which should not have been called that, because
whether they get the null terminator is context-dependent.

    char a[5] = "fooba";
    char *b = "bazqux";

I see some Internet sources claim that C is absolutely reliable about
null-terminating such literals, but I can't agree.  The assignment to
`b` above adds a null terminator, and the one to `a` does not.  This is
the opposite of absolute reliability.  Since I foresee someone calling
me a liar for saying that, I'll grant that if you carry a long enough
list of exceptional cases for the syntax in your head, both are
predictable.  But it's simply a land mine for the everyday programmer.

Worse, the hype around C that "arrays are really pointers in
disguise!1!  They're interchangeable!1!" constitutes a neon sign
steering the learner straight into the mine field.  As I recall, Peter
van der Linden railed against this folk wisdom way back in 1994
(_Expert C Programming: Deep C Secrets_, Pearson).  Despite writing a
fine book, his efforts did little to quieten this or other examples of
braying hype by C partisans.

Because we teach and practice similarly imprecise nomenclature
regarding "strings" in C, we encourage programmers to make mistakes
like the one you encountered.  This is one of the reasons I'm such a
stickler about terminology in the groff documentation, and am willing
to break with tradition (to Ralph's oft-voiced derision) in a search
for terms that resist extrapolative interpretation.

"There are two fundamentally difficult problems in computer science:
cache invalidation, naming things, and off-by-one errors." -- anon.
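If you want to watch that land mine go off, here is a minimal program
you can compile yourself (the variable names and the commented results
are mine, added for illustration; the last one depends on your
machine's pointer width):

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        char a[5] = "fooba"; /* fills the array exactly; NO terminator */
        char *b = "bazqux";  /* points to a 7-byte array: 6 chars + '\0' */

        /* a holds 5 bytes of char data but is not a C string at all;
           calling strlen(a) would be undefined behavior. */
        printf("%zu\n", sizeof a);         /* 5 */

        /* The literal that b points to did get its terminator. */
        printf("%zu\n", sizeof "bazqux");  /* 7 */
        printf("%zu\n", strlen(b));        /* 6 */
        printf("%zu\n", sizeof b);         /* pointer width, not 7 */

        return 0;
    }

Depending on your compiler and warning flags, the initialization of `a`
may not draw so much as a peep of diagnostic.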
Before getting off of strings, I wanted to point out a third mechanism
of encoding them--one that is usually forgotten nowadays.

Back when 64kB was a lot of memory, both Pascal- and C-style strings
could be considered wasteful.  Zero-terminated strings even more so,
because they consumed an _entire byte_ with their invariant zero bits.
So what a lot of people did back in the days before ISO 8859 was to
flag the end of the string by setting the high bit in its last
character.

    #include <stdio.h>

    int
    main(void)
    {
        /* High-bit termination: the final character carries 0x80.
           Assumes 8-bit, signed-by-default chars. */
        char s[] = { 'H', 'e', 'l', 'l', 'o', ',', ' ',
                     'W', 'o', 'r', 'l', 'd', '!', '\n' + 0x80 };
        char *p = s;

        do {
            putchar(*p & 0x7F); /* strip the flag bit before printing */
        } while (*p++ >= 0);    /* (*p++ < 0x80) on systems w/ unsigned chars */
    }

You may notice that there's no way of encoding an empty string with
this mechanism.  That was by design.  Why would you ever point to an
empty string?  That wastes not one but TWO bytes (16-bit pointers)!

> The author of the patch decides to completely rewrite that line even
> if the bug is not really understood, and it just works after it.

Yes.  A sloppy lexicon, combined with cultural and managerial
preoccupations with "cadence" (always implicitly a higher one), manures
the ground thickly for kludges, black magic, and a habit of individual
contributors abandoning projects so that they experience no
accountability for their coding errors.  And I don't mean "punishment"
as a synonym for "accountability"--though that is a substitution
typical of hard-driving, "type A", "get 'er done" engineers and
managers alike.  I mean accountability in terms of someone being able
to find out _that_ they erred, and _learning_ from it, without a thick
gravy of operant conditioning ladled over it.  God forbid we have
_that_ sort of personal development in our industry.

I may have said this before on this list, since it's one of my favorite
things to hold forth about, but, at least in the U.S., civilian air
traffic controllers have a maxim: safe, orderly, efficient.[2]  You
meet these criteria in order from left to right, and you satisfy one
completely, or to some accepted, documented, and well-known standard
measure, before you move on to the next.  The obvious reason for this
is that when aircraft meet each other at cruise altitudes, many people
die.

I haven't yet settled on a counterpart for software engineering that I
like, but my latest stab at it is this: comprehensible, correct,
efficient.  Incomprehensible code is useless.[3][4]  Even code that is
proven correct by formal methods is fragile if human maintainers are
defeated by its esoteric expression.[5]  (And formal verification can't
save you from an incorrect specification in the first place.)  Richard
Feynman once said something along the lines of: if there is any
phenomenon in physics that he can't successfully explain to an audience
of freshmen (first-year undergraduates), then we don't really
understand it yet.  We use subtle, complex tools to solve problems only
when we haven't worked out ways to overcome them with simple,
straightforward ones.

> I then investigate that line again, and being told it has a bug, but
> that the bug is not known, I quickly realize that it is due to the
> ternary operator decaying the array into a pointer and sizeof later
> doing shit.

Everybody takes a bullet from array decay and sizeof.  At least once.
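To make that bullet concrete, here is a small demonstration (the `cond`
flag and the commented values are mine, for illustration; the second
result is your platform's pointer width):

    #include <stdio.h>

    int
    main(void)
    {
        int cond = 1;

        /* A string literal is an array, so sizeof sees the whole
           thing, terminating null included. */
        printf("%zu\n", sizeof "someword");                      /* 9 */

        /* Both arms of a conditional expression decay to char *, so
           sizeof now measures a pointer, not whichever literal won. */
        printf("%zu\n", sizeof (cond ? "someword" : "another"));

        return 0;
    }

Feed that conditional expression to the quoted length() macro and you
get sizeof(char *) - 1 regardless of which word was selected--an answer
that is right only by accident.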
> And by doing such a change (with a single reviewer that approved it,
> of course), some old developer (who happens to be one of the
> reviewers that happened to be reviewing the patch that almost
> introduced a bug and didn't catch the bug) complains that I'm touching
> sacred code written by god,

Good code does not further improve in quality with knowledge of who
wrote it.  It must speak for itself.

> and I am blaspheming by insinuating that it was unsafe code.

Write and demonstrate an exploit.[6]  It won't make you any more
popular than your present approach, but it will knock the jocks' cowboy
hats askew.

> Then I need to defend my one-line patch (I already defended it in the
> commit message with a somewhat extended explanation, including a
> dissection of the bug that would have been prevented by a compiler
> warning) 2 times, with more text than I would write in two
> hypothetical manual pages about sizeof() and the ternary operator.
> Just imagine around 10 terminal pages of rationale for that change.
> And then 3 meetings with different people.  And so we decide to bring
> this issue to one of the oldest programmers in the group.  Then things
> go as follows.

Some people don't like reading long messages, and will hector you about
opportunity costs while spending time in code reviews (or on mailing
lists) that they might prefer to spend on a golf course.

> I get a review that starts by saying that this makes the macro
> unreadable (seriously, wtf?  I mean, the length() name is probably the
> least useful name that could be given to such a macro, and my change
> is making it unreadable?  okay, okay).

"nitems" is unreadable?  I guess if emails and code review web forms
try the patience of a reader, books are right out.

https://www.google.com/search?q=%22nitems%22+variable&tbm=bks

It's not my favorite name for an lvalue but I've seen it my entire
career.  It's hard to read much C without hitting it.

> Then the review continues by saying that the reviewers are so bad
> that they "actually do allow such trivial bugs to happen".
>
> And goes on to say that it's sad, but it's expected of "new
> developers".

You might ask this person what they believe the purpose of code reviews
to be.  Don't bring preconceptions to this conversation, and don't get
drawn into a discussion with them right away.  Find out what they
think.  There's a chance they'll have some great insight, but many
times I've found that people with great reputations have a shockingly
superficial understanding of certain things.  (We all do about
_something_.)  Passive reception of answers to open-ended questions can
tell you a lot about a person.

> I feel about your INSTALL.* (and other files) what I felt about the
> same man-pages files.  RST is not the easiest thing to read.

I'd hesitate to call it RST.  It's a format that antedates computers,
and goes back to informal memoranda composed on typewriters.  I don't
personally find the retrojection of semantics, inferred by a simple
machine parser, upon such a format to be a satisfactory approach.
Informal memoranda were devised by slippery and flexible human brains
through social rather than engineering processes, and in my view, all
of these "plain text, honest!" markup languages inevitably dash
themselves on rocks when a document steers into a narrow channel where
simple machine parsing does not suffice to resolve ambiguity.  Human
readers have great capacity for inferring exceptional cases (often on
scant evidence) and re-parsing our inputs.  This is laborious in code.
The remedies are always the same: add weird-ass computerese to your
"plain text" format, or simply refuse to admit their expression within
the format.  I find the latter approach more honest, which may explain
why it's less popular.  Many people prefer to claim that they've solved
a problem once and for all; they've got a hustle going, and if you mess
with it they'll get aggressive with you.

Personally, apart from human foibles, I blame SGML for this state of
affairs.  It was too chatty and too ugly for human composition and
construction--for which, I gather, it was not designed.  Why _this_
initiative by big companies and institutional players struck people as
so appealing when, in an adjacent discipline, the excellent Ada
language was received mainly with hostility illustrates well how
terrible middlebrow taste is.

> If you're reading it like a book, it might make sense.  If you have
> technical documentation, which is likely to be organized in unrelated
> sections that you may want to consult independently,

There may be a generational shift in evidence here.  :)  Roughly, each
plain text documentation file in the root of the groff source tree
should be read in its entirety if it need be read at all.[7]  True,
there are cases where you can bail out early, or skip a section if its
title suggests irrelevance to the reader's needs.  It's pretty recently
that I began seriously attacking this aspect of the groff
documentation, having started out much more concerned with end-user-
(rather than developer-) facing materials like man pages.  Maybe I can
further improve this stuff.

> indentation can make a big difference.

Possibly.  I don't see much of a role for it in these text files at
present, but others may have a different vision.

> That's why I rewrote the man-pages repo documentation in a
> man-pages\[en]like (:P) document.  I find it much easier now to see
> the organization of the files at a short glance, and look for what
> you need.
>
> Does it make sense to you?

Somewhat.  There is a place for plain text (_truly_ plain text)
documentation, and with groff there's a bit of a bootstrapping issue;
a configuration and installation manual written in a roff macro
language would deter users who thought they needed to have the system
built first before they could read it.[8]  (Some people are easily
discouraged.)  While I find *roff source documents plenty readable
as-is in a text editor, I acknowledge that I may have been corrupted by
experience.  :P

Regards,
Branden

[1] https://en.wikichip.org/wiki/schlemiel_the_painter%27s_algorithm

[2] https://www.avweb.com/features/say-again-8air-traffic-chaos/

[3] Literally useless, especially once something that "just works" is
    ported to a new context.  "The real problem is that we didn't
    understand what was going on either."
    https://www.bell-labs.com/usr/dmr/www/odd.html

[4] Except for constructing streams of self-lauding bullshit before
    promotion committees comprised of people who themselves attained,
    and will further advance, their status predicated on the audacity
    of their bullshit.

[5] And once something's _that_ solid, it may be time to consider
    etching it in silicon rather than keeping it in primary or
    secondary storage.

[6] But do it in a sandbox lest you become the next Tom Christiansen.

[7] Yeah, the "PROBLEMS" file should be pruned of its advice for
    working around 30-year-old compiler problems.  ("OMG, I can't get
    groff to compile on 386BSD 0.1!!1!"  Well, go get support from Bill
    Jolitz.)

[8] Our Texinfo manual had a section for this.  It sat empty for over
    20 years.
https://git.savannah.gnu.org/cgit/groff.git/commit/?id=e6736968552aa98b0aa602460a3c08de47adfe87