Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page
I'm not trying to stop you committing whatever you want to your
repository, of course, but I want to be clear that this doesn't actually
solve the right problem for manual page indexing. The point of the
parsing code in mandb(8) - and I'm not claiming that it's great code or
the perfect design, just that it works most of the time - is to extract
the names and summary-descriptions from each page so that they can be
used by tools such as apropos(1) and whatis(1). Splitting on section
boundaries is just the simplest part of that problem, and I don't think
that doing it in a separate program really gains anything.

(That's leaving aside things like localized man pages, which I know some
folks on the groff list tend to sniff at but I think they're important,
and the fact that the NAME section has both semantic and presentational
meaning means that like it or not the parser needs to be aware of this.)

--
Colin Watson (he/him) [cjwat...@debian.org]
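[Editorial illustration, not part of the original message.] To make the distinction above concrete, here is a deliberately naive Python sketch of the two steps: splitting out the NAME section, then pulling the names and the summary description out of it. It is not how mandb(8) or lexgrog(1) is implemented (those are C and handle far more *roff syntax); the function names `extract_name_section` and `parse_whatis` are hypothetical, and the sketch assumes an English "NAME" heading and a single `\-` separator, so it ignores localized pages and pages that define their own macros.

```python
import re

# Hypothetical sample page, used only to exercise the sketch.
SAMPLE = r""".TH FOO 1
.SH NAME
foo, bar \- frobnicate widgets
.SH SYNOPSIS
.B foo
"""


def extract_name_section(source):
    """Return the raw lines between `.SH NAME` and the next `.SH`."""
    lines = []
    in_name = False
    for line in source.splitlines():
        heading = re.match(r"^[.']\s*SH\s+(.*)$", line)
        if heading:
            if in_name:
                break  # the next section heading ends NAME
            title = heading.group(1).strip().strip('"').upper()
            in_name = title == "NAME"  # assumption: English heading only
            continue
        if in_name:
            lines.append(line)
    return lines


def parse_whatis(name_lines):
    """Split 'name1, name2 \\- summary' into (names, summary)."""
    # Skip requests and macro calls; a real parser must interpret many of them.
    text = " ".join(
        l for l in name_lines if l and not l.startswith((".", "'"))
    )
    text = text.replace("\\-", "-")
    names, sep, summary = text.partition(" - ")
    if not sep:
        return [], ""
    return [n.strip() for n in names.split(",") if n.strip()], summary.strip()
```

The first function is the "splitting on section boundaries" part; the second is where the real difficulty lies, and the sketch sidesteps it by discarding every line that starts with a control character.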
Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page
On Sat, Nov 02, 2024 at 07:50:23PM -0500, G. Branden Robinson wrote:
> At 2024-11-02T19:06:53+0000, Colin Watson wrote:
> > How embarrassing. Could somebody please file a bug on
> > https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?
>
> Done; <https://gitlab.com/man-db/man-db/-/issues/46>.

Thanks, working on it.

> > I already know that getting acceptable performance for
> > this requires care, as illustrated by one of the NEWS entries for
> > man-db 2.10.0:
> >
> > * Significantly improve `mandb(8)` and `man -K` performance in the
> >   common case where pages are of moderate size and compressed using
> >   `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
> >   system.
> >
> > ... so I'm prepared to bet that forking nroff one page at a time will
> > be unacceptably slow.
>
> Probably, but there is little reason to run nroff that way (as of groff
> 1.23). It already works well, but I have ideas for further hardening
> groff's man(7) and mdoc(7) packages such that they return to a
> well-defined state when changing input documents.

Being able to keep track of which output goes with which input pages is
critical to the indexer, though (as you acknowledge later in your
reply). It can't just throw the whole lot at nroff and call it a day.

One other thing: mandb/lexgrog also looks for preprocessing filter hints
in pages (`'\" te` and the like). This is obscure, to be sure, but
either a replacement would need to do the same thing or we'd need to be
certain that it's no longer required.

> > and of course care would be needed around error handling and so on.
>
> I need to give this thought, too. What sorts of error scenarios do you
> foresee? GNU troff itself, if it can't open a file to be formatted,
> reports an error diagnostic and continues to the next `argv` string
> until it reaches the end of input.

That might be sufficient, or man-db might need to be able to detect
which pages had errors. I'm not currently sure.

> > but on the other hand this starts to feel like a much less natural fit
> > for the way nroff is run in every other situation, where you're
> > processing one document at a time.
>
> This I disagree with. Or perhaps more precisely, it's another example
> of the exception (man(1)) swallowing the rule (nroff/troff). nroff and
> troff were written as Unix filters; they read the standard input stream
> (and/or argument list)[1], do some processing, and write to standard
> output.[2]
>
> Historically, troff (or one of its preprocessors) was commonly used with
> multiple input files to catenate them.

But this application is not conceptually like catenation (even if it
might be possible to implement it that way). The collection of all
manual pages on a system is not like one long document that happens to
be split over multiple files, certainly not from an indexer's point of
view.

--
Colin Watson (he/him) [cjwat...@debian.org]
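[Editorial illustration, not part of the original message.] The "preprocessing filter hints" mentioned above are the letters that may follow `'\" ` on the first line of a page. A minimal Python sketch of reading them follows. The letter-to-program table reflects the convention described in man-db's man(1) documentation as I understand it (e for eqn, g for grap, p for pic, r for refer, t for tbl, v for vgrind); treat it, and the function name `preprocessor_hints`, as assumptions. This is not man-db's actual code.

```python
# Assumed mapping from hint letters to preprocessors (see lead-in).
PREPROCESSORS = {
    "e": "eqn",
    "g": "grap",
    "p": "pic",
    "r": "refer",
    "t": "tbl",
    "v": "vgrind",
}


def preprocessor_hints(source):
    """Return the preprocessors requested on the first line of a page."""
    first_line = source.split("\n", 1)[0]
    prefix = "'\\\" "  # apostrophe, backslash, double quote, space
    if not first_line.startswith(prefix):
        return []
    letters = first_line[len(prefix):].strip()
    # Unknown letters are silently ignored in this sketch.
    return [PREPROCESSORS[c] for c in letters if c in PREPROCESSORS]
```

For a page whose first line is `'\" te`, this returns `["tbl", "eqn"]`; a page with no hint line yields an empty list. Any replacement for the existing parser would need equivalent logic unless the hints can be shown to be unnecessary.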
Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page
On Sat, Nov 02, 2024 at 10:36:20PM +0100, Alejandro Colomar wrote:
> This is quite naive, and will not work with pages that define their own
> stuff, since this script is not groff(1). But it should be as fast as
> is possible, which is what Colin wants, is as simple as it can be (and
> thus relatively safe), and should work with most pages (as far as
> indexing is concerned, probably all?).

I seem to be being invoked here for something I actually don't think I
want at all, which suggests that wires have been crossed somewhere. Can
you explain why I'd want to replace some part of a fairly well-optimized
and established C program with a shell pipeline? I'm pretty certain it
would not be faster, at least.

Thanks,

--
Colin Watson (he/him) [cjwat...@debian.org]
Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page
(now with some local vim macros fixed to stop accidentally corrupting
the To: lines of some of my outgoing emails ...)

On Sat, Nov 02, 2024 at 08:09:29PM -0500, G. Branden Robinson wrote:
> At 2024-11-03T00:47:23+0000, Colin Watson wrote:
> > and the fact that the NAME section has both semantic and
> > presentational meaning means that like it or not the parser needs to
> > be aware of this.)
>
> Even if mandb(8) doesn't run groff to extract the summary descriptions/
> apropos lines, I think this feature might be useful to you for
> coverage/regression testing. Presumably, for valid inputs, groff and
> mandb(8) should reach similar conclusions about how the text of a "Name"
> section is to be formatted.

Yes, that's a good point and I agree with that.

--
Colin Watson (he/him) [cjwat...@debian.org]
Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page
On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote:
> On GNU/Linux systems, the only man page indexer I know of is Colin
> Watson's man-db--specifically, its mandb(8) program. But it's nicely
> designed so that the "topic and summary description extraction" task is
> delegated to a standalone tool, lexgrog(1), and we can use that.
>
> $ lexgrog /tmp/proc_pid_fdinfo_mini.5
> /tmp/proc_pid_fdinfo_mini.5: parse failed
>
> Oh, damn. I wasn't expecting that. Maybe this is what defeats Michael
> Kerrisk's scraper with respect to groff's man pages.[1]

How embarrassing. Could somebody please file a bug on
https://gitlab.com/man-db/man-db/-/issues to remind me to fix that? (Of
course there'll be a lead time for fixes to get into distributions.)

> Well, I can find a silver lining here, because it gives me an even
> better reason than I had to pitch an idea I've been kicking around for a
> while. Why not enhance groff man(7) to support a mode where _it_ will
> spit out the "Name"/"NAME" section, and only that, _for_ you?
>
> This would be as easy as checking for an option, say '-d EXTRACT=Name',
> and having the package's "TH" and "SH" macro definitions divert
> (literally, with the `di` request) everything _except_ the section of
> interest to a diversion that is then never called/output. (This is
> similar to an m4 feature known as the "black hole diversion".)
>
> All of the features necessary to implement this[2] were part of troff
> as far back as the birth of the man(7) package itself. It's not clear
> to me why it wasn't done back in the 1980s.
>
> lexgrog(1) itself will of course have to stay around for years to come,
> but this could take a significant distraction off of Colin's plate--I
> believe I have seen him grumble about how much *roff syntax he has to
> parse to have the feature be workable, and that's without upstart groff
> maintainers exploring up to every boundary that existed even in 1979 and
> cheerfully exercising their findings in man pages.

lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
you focus on that then you'll end up with a design that's not very
useful. What really matters is indexing the whole system's manual
pages, and mandb(8) does not do that by invoking lexgrog(1) one page at
a time, but rather by running more or less the same code in-process. I
already know that getting acceptable performance for this requires care,
as illustrated by one of the NEWS entries for man-db 2.10.0:

 * Significantly improve `mandb(8)` and `man -K` performance in the
   common case where pages are of moderate size and compressed using
   `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
   system.

... so I'm prepared to bet that forking nroff one page at a time will be
unacceptably slow. (This also combines with the fact that man-db
applies some sandboxing when it's calling nroff just in case it might
happen that a moderately-sized C++ project has less than 100% perfect
security when doing text processing, which I'm sure everyone agrees
would never happen.)

If it were possible to run nroff over a whole batch of pages and get
output for each of them in one go, then maybe. man-db would need a
reliable way to associate each line (or sometimes multiple lines) of
output with each source file, and of course care would be needed around
error handling and so on.

I can see the appeal, in terms of processing the actual language rather
than a pile of hacks that try to guess what to do with it - but on the
other hand this starts to feel like a much less natural fit for the way
nroff is run in every other situation, where you're processing one
document at a time.

Cheers,

--
Colin Watson (he/him) [cjwat...@debian.org]
Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page
On Sun, Nov 03, 2024 at 01:05:34AM +0100, Alejandro Colomar wrote:
> Are you sure? With a small tweak, I get the following comparison:
>
> alx@devuan:~/src/linux/man-pages/man-pages/main$ time lexgrog man/*/* |
> wc
> lexgrog: can't resolve man7/groff_man.7
> 12475 99295 919842

Comparing anything to lexgrog isn't very interesting; it's a debugging
tool and is not in itself very performance-sensitive. As I've explained
elsewhere, the interesting thing is mandb, which uses the same code
in-process to scan a whole tree of pages in one go. I do not expect to
ever want to replace that with a shell pipeline.

--
Colin Watson (he/him) [cjwat...@debian.org]