[fair warning: _gigantic_ message, 5.7k words]

Hi Deri & Dave,
I'll quote Dave first, since his message was brief and permits me to
make a concession early.

At 2024-09-04T15:05:38-0500, Dave Kemper wrote:
> On Wed, Sep 4, 2024 at 11:04 AM Deri <d...@chuzzlewit.myzen.co.uk>
> wrote:
> > The example using \[u012F] is superior (in my opinion) because it
> > is using a single glyph the font designer intended for that
> > character rather than combining two glyphs that don't marry up
> > too well.
>
> I agree with this opinion.

Me too.  I can't deny that the precomposed Ů looks much better than
the constructed one.

> > If you know of any fonts which include combining diacritics but
> > don't provide single glyphs with the base character and the
> > diacritic combined, please correct me.
>
> My go-to example here is the satirical umlaut over the n in the
> canonical rendering of the band name Spinal Tap.  Combining
> diacritics can form glyphs that no natural language uses, so no
> font will supply a precomposed form.

It does happen, and as a typesetting application I think we _can_
expect people to try such things.  Ugly rendering is better than no
rendering at all, and it's not our job to make Okular render complex
characters prettily in its navigation pane.  That said, we _can_
throw it a bone, and it seems easy enough to do so.
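To make Dave's example concrete in groff terms, here's a toy input
(a sketch only: whether the last line renders at all, or renders
uglily, depends on the font's coverage of U+0308; the "café"
spellings are per groff_char(7)):

  .\" All three spellings are accepted and render "café"; the third
  .\" is the maximally decomposed (NFD) Unicode form.
  caf\['e], caf\[e aa], caf\[u0065_0301].
  .\" Unicode has no precomposed "n with diaeresis", so a combining
  .\" sequence is the only possible spelling.
  Spi\[u006E_0308]al Tap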
> > This result may surprise users, that entering exactly the same
> > keystrokes as they used when writing the document finds the text
> > in the document, but fails to find the bookmark.
>
> I also agree this is less than ideal.

This, I'm not sure is our fault either.  Shouldn't the CMap be
getting applied to navigation pane text just as to document text?  I
confess that it did not occur to me that one would _want_ to do a
full-text search on the navigation pane contents themselves.  When I
search a PDF, I want to search _the document_.  But that may simply
be a failure of my imagination.

At 2024-09-04T17:03:09+0100, Deri wrote:
> On Sunday, 1 September 2024 06:09:17 BST G. Branden Robinson wrote:
> > But, is doing the decomposition wrong?  I think it's intended.
> >
> > Here's what our documentation says.
> >
> > groff_char(7):
> >
> >     Unicode code points can be composed as well; when they are,
> >     GNU troff requires NFD (Normalization Form D), where all
> >     Unicode glyphs are maximally decomposed.  (Exception:
> >     precomposed characters in the Latin-1 supplement described
> >     above are also accepted.  Do not count on this exception
> >     remaining in a future GNU troff that accepts UTF-8 input
> >     directly.)  Thus, GNU troff accepts “caf\['e]”, “caf\[e aa]”,
> >     and “caf\[u0065_0301]”, as ways to input “café”.  (Due to its
> >     legacy 8-bit encoding compatibility, at present it also
> >     accepts “caf\[u00E9]” on ISO Latin-1 systems.)
>
> Exactly, it says it "can" be composed, not that it must be (this
> text was added by you post-1.22.4),

...to groff_char(7), yes.  Those lunatics who aren't allergic to GNU
Texinfo would find it familiar from long ago.

commit 3df65a650247b1dd872b7afd4706ebbbfdd93982
Author:     Werner LEMBERG <w...@gnu.org>
AuthorDate: Sun Mar 2 10:10:17 2003 +0000
Commit:     Werner LEMBERG <w...@gnu.org>
CommitDate: Sun Mar 2 10:10:17 2003 +0000

    Document composite glyphs and the `composite' request.

    * man/groff.man, man/groff_diff.man, doc/groff.texinfo: Do it.

[...]

+For simplicity, all Unicode characters which are composites must be
+decomposed maximally (this is normalization form@tie{}D in the Unicode
+standard); for example, @code{u00CA_0301} is not a valid glyph name
+since U+00CA (@sc{latin capital letter e with circumflex}) can be
+further decomposed into U+0045 (@sc{latin capital letter e}) and U+0302
+(@sc{combining circumflex accent}).  @code{u0045_0302_0301} is thus the
+glyph name for U+1EBE, @sc{latin capital letter e with circumflex and
+acute}.

[...]

That's a good 14 years before I wandered in and ruined all our docs.

> in fact most \[uxxxx] input to groff is not composed (comes from
> preconv).

Yes.  Werner wrote preconv, too, so I reckon he made it produce what
he designed GNU troff to consume.[1]

> Troff then performs NFD conversion (so that it matches a named
> glyph in the font).

Fair.  I can concede that the primary purpose of NFD decomposition
in groff is to facilitate straightforward and unambiguous glyph
lookup.

> Conceptually this is a stream of named glyphs; there is a second
> stream of device control text,

My conceptualization doesn't quite match yours.  I think of a grout
document as consisting of _one_ stream, the sequence of bytes from
its start to its finish.  There are, however, multiple
interpretation contexts.  Special character names are one.  't' (and
'u') command arguments are another (no special characters
allowed)...

$ echo "hello, world" | groff -T ascii -Z \
    | sed 's/world/\\[uHAHAHAHA CHECK THIS OUT]/' | grep '^t'
thello,
t\[uHAHAHAHA CHECK THIS OUT]

> and this has nothing to do with fonts or glyph names.

Here I simply must push back.  This depends entirely on what the
device extension does, and we have _no control_ over that.
Excessive presumptions about such open-ended language structures get
us into trouble, as with the question of whether setting the line
drawing thickness in a '\D' escape sequence should move the drawing
position.  Coming at the question ab initio, there's no reason to
suppose that it should.  I will boldly assert that one negative
precedent was a blunder of Kernighan's: assuming that all future
drawing commands would exclusively comprise sequences of coordinate
pairs reflecting page motions.  Some geometric objects aren't
usefully parameterized that way, as Kernighan should have realized
from his own circle-drawing command, '\D'c d''.  Apart from line
thickness, possibilities like configuration of broken-line rendering
(a nearly limitless variety of dotted, dashed, dash-dotted, solid,
and, maybe, invisible) should have been obvious even at the time.

Opinions?  I got 'em.

Anyway, I think it's a bad idea to assume that a device extension
will never have anything to do with fonts or glyph names.  In fact,
GNU troff already assumes that they might, and has done for over 20
years.
1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000 880) void troff_output_file::start_special(tfont *tf, color *gcol, color *fcol,
1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000 881)                                       int no_init_string)
037ff7dfcf src/roff/troff/node.cc (Werner LEMBERG 2001-01-17 14:17:26 +0000 882) {
6f6302b0af src/roff/troff/node.cc (Werner LEMBERG 2002-10-26 12:26:12 +0000 883)   set_font(tf);
1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000 884)   glyph_color(gcol);
1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000 885)   fill_color(fcol);
6f6302b0af src/roff/troff/node.cc (Werner LEMBERG 2002-10-26 12:26:12 +0000 886)   flush_tbuf();
6f6302b0af src/roff/troff/node.cc (Werner LEMBERG 2002-10-26 12:26:12 +0000 887)   do_motion();
7ae95d63be src/roff/troff/node.cc (Werner LEMBERG 2001-04-06 13:03:18 +0000 888)   if (!no_init_string)
7ae95d63be src/roff/troff/node.cc (Werner LEMBERG 2001-04-06 13:03:18 +0000 889)     put("x X ");
037ff7dfcf src/roff/troff/node.cc (Werner LEMBERG 2001-01-17 14:17:26 +0000 890) }

A fortiori, the formatter seems to assume that a "special" (device
extension command) will dirty everything about the drawing context
that can possibly be made dirty.  This is causing me considerable
grief, as you've seen in <https://savannah.gnu.org/bugs/?64484>.

> Device controls are passed by .device (and friends).

You'd think, wouldn't you?  And I'd love to endorse that viewpoint.
But I can't, and your own preferences are erecting a barrier to my
doing so.  I'll come back to that with an illustration below.

> \[u0069_0328] is a named glyph in a font, \[u012F] is a 7bit ascii
> representation (provided by preconv) of the unicode code point.

Okay, a couple of things.

First, 0x12F does not fit in 7 bits, nor even in 8.  It's
precomposed.  It's enough to say that.

Second, \[u0069_0328] is not _solely_ a named glyph in a font.  We
use it that way, yes, and for a good reason as far as I can tell
(noted above), but it is a _general syntax for combining Unicode
characters in the groff language_.  Not only can you use it to
express "base characters" combined with one or more characters to
which Unicode assigns "combining" (non-spacing[2]) semantics, but
you can also use it to express ligatures.  And GNU troff does.  And
has, since long before I got here.  groff_char(7) again:

   Ligatures and digraphs
   Output   Input    Unicode            Notes
   ──────────────────────────────────────────────────────
   ff       \[ff]    u0066_0066         ff ligature +
   fi       \[fi]    u0066_0069         fi ligature +
   fl       \[fl]    u0066_006C         fl ligature +
   ffi      \[Fi]    u0066_0066_0069    ffi ligature +
   ffl      \[Fl]    u0066_0066_006C    ffl ligature +
   Æ        \[AE]    u00C6              AE ligature
   æ        \[ae]    u00E6              ae ligature
   Œ        \[OE]    u0152              OE ligature
   œ        \[oe]    u0153              oe ligature
   IJ       \[IJ]    u0132              IJ digraph
   ij       \[ij]    u0133              ij digraph

How one decomposes such a composite character depends.  Ligatures
should be, and are, broken up and written one-by-one as their
constituents, all base characters.  Accented characters may have to
be degraded to the base character alone; one would certainly not
serialize them like a ligature.  How nai¨ve!  ;-)

> The groff_char(7) you quote is simply saying that input to groff
> can be composited or not.

I don't know about "simply", but yes.

> How has that any bearing on how troff talks to its drivers?

Of itself, it doesn't.
But because the output language, which I call "grout", affords
extension in certain ways, including a general-purpose escape hatch
for device capabilities lacking abstraction in the formatter, it
_can_ come up.  As in two of the precise situations that lifted the
lid on this infernal cauldron: the annotation and rendering _outside
of a document's text_ of section headings, and document metadata
naming authors, who might foolishly choose to be born to parents
that don't feel bound by the ASCII character set, and as such can
appear spattered with diacritics in an "info" dialog.  If GNU troff
is to have any influence over how such things appear, we must
consider the problem of how to express text there, and preferably do
so in ways that aren't painful for document authors to use.

> If a user actually wants to use a composite character, this is
> saying you can enter \[u0069_0328] or you can leave it to preconv
> to use \[u012F].  Unfortunately, the way you intend to change
> groff, document text will always use the single glyph (if
> available)

Eh what?  Where is this implied by anything I've committed or
proposed?  (It may not end up mattering given the point I'm
conceding.)

> and meta-data will always use a composite glyph.

Strictly, it will always use whatever I get back from certain
"libgroff" functions.  But I'm willing to flex on that.  Your "Se
ocksŮ" example is persuasive.  Though some irritated Swede is bound
to knock us about like tenpins if we keep deliberately misspelling
"också" like that.

> So there is no real choice for the user.

Okay, how about a more pass-through approach when it comes to byte
sequences of the form '\[uxxxx]' (where 'xxxx' is 4 to 6 uppercase
hexadecimal digits)?  I will have to stop using
`valid_unicode_code_sequence()` from libgroff.  But that can be
done.  And I need multiple validators regardless (or flags to a
common one), as there's no sensible way to handle code points above
U+00FF in file names, shell commands, or terminal messages, because
they all consist of C 'const char *' strings (that moreover will
require transformation to C language character escapes--I hope only
the octal sort, though).  For more on this, see my conversation with
Dave in <https://savannah.gnu.org/bugs/?65108>.

> User facing programs use NFD, since it makes it easier to sort and
> search the glyph stream.  Neither grops nor gropdf are "user
> facing"; they are generators of documents which require a viewer
> or printer to render them.  The only user facing driver is
> possibly X11.  There is a visible difference between using NFD and
> using the actual unicode text character when specifying pdf
> bookmarks.  The attached PDF has screenshots of the bookmark
> panel, using \[u0069_0328] NFD and \[u012F] NFC.  The example
> using \[u012F] is superior (in my opinion) because it is using a
> single glyph the font designer intended for that character rather
> than combining two glyphs that don't marry up too well.

Setting aside the term "user-facing programs", which you and I might
define differently, I find the above argument sound.  (Well, I'm a
_little_ puzzled by how precomposed characters are so valuable for
searching bookmarks since the PDF standard already had the CMap
facility lying right there.)

> This has no bearing on whether it is sensible to use NFD to send
> text to output drivers rather than the actual unicode value of the
> character.

That's vaguely worded.  I assume you mean "text in device extension
commands" here.  If so, conceded.
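Concretely, here's the shape of the difference as I understand it,
reusing your example (a sketch: the "bookmark" subcommand spelling
is illustrative rather than necessarily gropdf's real interface, and
which grout you get depends on the formatter revision in play):

  .\" Input, after preconv has already turned "Ů" into \[u016E]:
  .device pdf: bookmark ocks\[u016E]

  .\" Copy-in mode passes the escape sequence through untouched:
  x X pdf: bookmark ocks\[u016E]

  .\" NFD normalization instead hands the driver the maximal
  .\" decomposition, U+0055 plus U+030A (combining ring above):
  x X pdf: bookmark ocks\[u0055_030A]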
> > It seems like a good thing to hold onto for the misty future
> > when we get TTF/OTF font support.
>
> So, it does not make sense now, but might in the future.

This isn't a makeweight argument.  We know such font _formats_
exist, regardless of the repertoires that their specimens have
conventionally supported to date.  I think we'd be wise not to nail
this door shut, even if we don't walk through it today.

> I would concede here if composited glyphs were as good as the
> single glyph provided in the font, but the PDF attached shows this
> is not always true.  Also, from TTF/OTF fonts I've examined, if
> the font contains combining diacritics it also contains glyphs for
> all the base characters which can use a diacritic, since it is
> just calls to subroutines with any necessary repositioning.  If
> you know of any fonts which include combining diacritics but don't
> provide single glyphs with the base character and the diacritic
> combined, please correct me.

I know of none, and I am confident your experience in font perusal
and evaluation is vastly broader than mine.

> > > Given that the purpose of \X is to pass meta-data to output
> > > drivers,

I agree with this earlier statement of yours, but I want to seize on
it.  Here's why.  This is going to take a while.

commit f2a92911c552c3995c010f8beb9b89de3612e95a
Author: Deri James <d...@chuzzlewit.myzen.co.uk>
Date:   Thu Mar 1 15:16:11 2018 +0000

    Add page transitions to pdfs created with gropdf.

    * src/devices/gropdf.pl: Handle new '\X' commands to support
      page transitions in presentation mode pdfs.  These commands
      are a subset of the commands used in present.tmac allowing
      slideshows to be directly produced from -Tpdf without using
      postscript -> gpresent.pl -> ghostscript.

    * tmac/pdf.tmac: New macros '.pdfpause' and '.pdftransition' to
      support page transitions.

    * src/devices/gropdf.1.man: Document the '\X' commands
      supported.

diff --git a/tmac/pdf.tmac b/tmac/pdf.tmac
index 4a002c37c..350f78391 100644
--- a/tmac/pdf.tmac
+++ b/tmac/pdf.tmac
@@ -18,7 +18,7 @@
[...]
@@ -799,6 +799,12 @@ .de pdfpagename
 .de pdfswitchtopage
 .nop \!x X pdf: switchtopage \\$*
 ..
+.de pdfpause
+.nop \!x X ps: exec %%%%PAUSE
+..
+.de pdftransition
+.nop \!x X pdf: transition \\$1 \\$2 \\$3 \\$4 \\$5 \\$6 \\$7 \\$8
+..
[...]

Now, I don't want to beat you up about this, but your commit message
said you did one thing (handling '\X' commands) and, as I read it,
the code _did_ another.  I'm a bit puzzled by the phrase "Handle new
\X commands".  As a macro file, pdf.tmac can't "handle" '\X' escape
sequences in any way[3]--not as input.  Those are interpreted
directly by the formatter.

That's a minor point.  But neither do these macros use '\X'
themselves!  Recall the foregoing:

> Device controls are passed by .device (and friends).

But they're not!  You don't use them that way!  Why not?  _Because
they didn't work!_

'\X' and '.device' get only slight use in pdf.tmac:

.char \[lh] \X'pdf: xrev'\[rh]\X'pdf: xrev'

(Werner put that in.)

. device pdf: markstart \\n[rst] \\n[rsb] \\n[PDFHREF.LEADING] \\*[pdf:href.link]

That's you, followed intriguingly but mysteriously by

' fl

...a non-breaking flush, a thing for whose purpose one would search
groff's documentation for 30 years in vain.  And then

. device pdf: markend
' fl

by me (just a tweak to something of yours, probably), and finally

.device pdf: background \\$*
.device pdf: pagenumbering \\$*

...which are both more recent additions, from 2021 and 2023
respectively.

So why the repeated triple axel hacks with "\!x X pdf:"?
I think it's because of that chunk of code I "git blamed" earlier.
In fact I'll include a bit more, because the second version of this
overloaded function goes all the way back to 1991 and is a James
Clark original.

void troff_output_file::start_special(tfont *tf, color *gcol,
                                      color *fcol,
                                      bool omit_command_prefix)
{
  set_font(tf);
  stroke_color(gcol);
  fill_color(fcol);
  flush_tbuf();
  do_motion();
  if (!omit_command_prefix)
    put("x X ");
}

void troff_output_file::start_special()
{
  flush_tbuf();
  do_motion();
  put("x X ");
}

Remember, now, "special" here refers to a device extension command,
and only a laughable naïf would assume it had anything to do with
special characters...  :-|

Really, if we banned the words "special" and "transparent" from the
lexicon of all troff developers, we'd make the world a better place.
Whenever you can't think of what to call something, just pick one of
those two words.  Everything will be fine.  >:-(

I swear, all software engineers should be fitted with shock collars.

To get back on track, consider what's going on with the above code.
We've got two ways we can "start [a] special [device extension
command]".  One updates five pieces of state, the other two.  Now,
maybe that's not crazy, but consider what it means.  Your call site
determines which one you get.  There are only a few call sites, and
all are within the same file, "node.cpp".  (That means these
functions could and should be marked `static`,[4] and I will do that
after I finish this gargantuan email.)

`special_node::tprint_start()` calls the complex form.  The code
handling the `\O[5]` escape sequence calls the simpler one, along a
few different paths based on conditionals.

...and that's it.  Ponder the consequences.  There's no way _within
a device extension command_ (whether by `\X` _or_ `.device`) to tell
the formatter which of the five elements of state need to be
updated.  The groff language doesn't expose this.

In generality, a device extension command _could_ do _anything_, as
I emphasized above, and very old GNU troff code shares that
assumption.  Mess with colors?  Maybe!  Change the font?  Could be!
Need to wrap up a grout extension 't' (or 'u') command for writing a
sequence of ordinary glyphs?  Definitely![5]  Move the drawing
position?  It's a possibility!

Considering these matters led me to realize at long last why GNU
troff output seems to have so many seemingly superfluous cursor
motions, and, at least in part, why so many of them are in absolute
coordinates (cf. relative ones) when there seems to be no motivating
reason.

Meanwhile, the ultra-specialized `\O5` escape sequence, which our
documentation refuses to explain without accessory garlic and
crucifixes to encourage anyone who isn't developing grohtml to stay
far away from it, knows what it's going to get dirty: namely, not
the font and not the colors, but it definitely needs any pending
't'/'u' command to wrap up, and it certainly moves the drawing
position.  (`\O5` is the means by which rasterized images of tbl
tables and eqn equations are inserted into HTML documents produced
by groff.)

So `.device` and `\X` can give you more than you ask for, more than
you want, and, worse, that excess can lead to bad rendering.  And
there's NO WAY in the groff language at present to tell the
formatter what kind of business your device extension command is
going to get up to.  When grohtml had more modest needs, it hunted
around and grafted on `\O5`.  That doesn't scale.

But!
What if your device extension command makes _nothing_ dirty, and
requires nothing about rendering state to be aware that it's even
there?

Well, then, by golly, you can do something mightily clever.  And
that is to synthesize your _own_ device control command in the grout
language, 'x', by smuggling it across the border of the formatting
language, thanks to our old friends `\!` ("transparent
throughput"--with astonishing chutzpah, the most opaquely named
escape sequence in the language) and the slightly more recent
groff-ism, and brother in request form, `.output`.  And it has
worked great for years.

The pièce de résistance, of course, is, having figured out this
trick, to document it nowhere, tell no one, and undertake no effort
to attack the problem at the formatter language level so that
everyone can benefit.  Some resourceful people might copy it, but
it's best left as an "expert mode" trick kept among the cognoscenti.

Now, I don't know who _exactly_ to blame for this state of affairs;
it's better for my blood pressure not to speculate or research the
issue, and even saying as much as I have is likely to make people
mad at me.  But it was not a good call.  It created painful
technical debt, and we should fix it.  When we solve a problem with
a technique that is weird or fishy, we should cry out in protest,
because we're likely not the only ones who have struggled with that
type of problem.  (It's one of those odd cases where the fewer
people who have to deal with it, the _worse_ the problem is, until n
becomes 1, at which point it bothers no one else by definition.  But
when n reaches two, it's instantly a major nightmare, because
relevant knowledge is so scarce and un-socialized.)
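For the archive, then, here is the trick spelled out, shorn of
pdf.tmac's particulars (a sketch: the "note" subcommand is invented
for illustration, and a real driver would have to implement it):

  .\" The documented interface.  The formatter brackets the device
  .\" extension command with font, color, and motion
  .\" synchronization, whether the driver needs it or not.
  .de mynote-device
  .  device pdf: note \\$*
  ..
  .\" The smuggled version.  \! injects the grout 'x X' command
  .\" verbatim, bypassing start_special() and all of its state
  .\" flushing.
  .de mynote-smuggled
  .  nop \!x X pdf: note \\$*
  ..

Both arrange for an "x X pdf: note ..." line in grout; the
difference is all the synchronization grout that `start_special()`
emits around the first form and not the second.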
Okay.  Popping that 55-gallon burning oil drum off the stack...

> > ...which _should_ be able to handle NFD if they handle Unicode
> > at all, right?
>
> Of course, much better for any sort, which grops/gropdf do not do,
> and if they did, they would of course change the given text to NFD
> prior to sorting.  As regards searching, it's a bit of a two-edged
> sword.  For example, if the word "ocksŮ" in a utf8 document is
> used as a text heading and a bookmark entry (think .SH "ocksŮ"),
> preconv converts the "Ů" to \[u016E], and troff then applies NFD
> to match a glyph name in the U-TR font - \[u0055_030A].  When
> .device and .output used "copy in" mode, the original unicode code
> point \[u016E] was passed to the device, but with the recent
> changes, "new mode 3", to \X rolled out to the other 7(?) commands
> which communicate text to the device drivers, they receive instead
> \[u0055_030A].  If this composite code (4 bytes in UTF16) is used
> as the bookmark text, we have seen it can produce less optimal
> results in the bookmark pane

As noted above, I am persuaded that I should abandon decomposition
of Unicode special character escape sequences in device extension
commands.

> but it also can screw up searching in the pdf viewer.  Okular (a
> pdf viewer) has two search boxes, one for the text; entering
> "ocksŮ" here will find the heading.  The second search box is for
> the bookmarks, and entering "ocksŮ" will fail to find the
> bookmark, since the final character is in fact two characters.
> This result may surprise users, that entering exactly the same
> keystrokes as they used when writing the document finds the text
> in the document, but fails to find the bookmark.

As noted above, _some_ of this seems to me like a deficiency in PDF,
either the standard or the tools.

But, if the aforementioned abandonment makes the problem less
vexing, cool.

> Then why does it work in the text search, you may ask, since they
> have both been passed an NCD composite code.

What's NCD?  Do you mean NFD?  The former usage persists through the
remainder of your email.

> The answer is because in the grout passed to the driver it becomes
> "Cu0055_030A" and although this looks like unicode it is just the
> name of a glyph in the font, just as "Caq" in grout will find the
> "quotesingle" glyph.  The font header in the pdf identifies the
> postscript name of each glyph used for the document text, and the
> pdf viewer has a lookup table which converts postscript name
> "Uring" to U+016E "Ů" (back where we started).

[...]

> As I've shown, the NCD used in grout (Cuxxxx_xxxx) is simply a key
> to a font glyph; the information that this glyph is a composite is
> entirely unnecessary for device control text.  I need to know the
> unicode code point delivered by preconv, so I can deliver that
> single character back as UTF16 text.

Okay.  I'd still like to do _some_ validation of Unicode special
character escape sequences in device extension commands.  I would
feel like a crappy engineer if I permitted GNU troff to hand gropdf
the sequence "x X ps:exec [\[u012Fz] pdfmark".  But gropdf should do
validation too.

> I used round-tripping in the general sense that, after processing,
> you end up back where you started (the same as you used it).  Why
> does groff have to be involved for something to be considered a
> round-trip?

I guess we were thinking about the problem in different ways.  I am
pretty deeply concerned about input to and output from the GNU troff
program specifically in this discussion.

> Ok, if it can't be done, just leave what you have changed in \X,
> but leave .device and .output (plus friends) to the current
> copy-in mode, which seems to be working fine as it is now,

Here are the coupled pairs as I conceive them.

  \X and .device
  \! and .output

And then we have `.cf` and `.trf`, which are vanishingly little
used.  I need to understand them better, but if `cf` is as
laissez-faire as I'm starting to think it is, we should gate it
behind unsafe mode.

I have an ultra-strong preference for making the coupled pairs
behave the same way.  There is substantial precedent for this in GNU
troff.

  \f and .ft
  \s and .ps
  \m and .gcolor
  \M and .fcolor
  \p and .brp

I omit `\v` and `.sp`, since the former cannot spring a trap and the
latter can, and that fact is by deliberate design with
well-established use cases.

I don't see any reason why the coupled pairs above should have
different interpretation rules (beyond those inherent to the
syntactical differences of escape sequences and requests).  Most
importantly, I want document and macro package authors to be able to
switch between them at their convenience.  Telling them they need to
remember, or look up, which one reads in copy mode or which one
flushes which aspects of grout state strikes me as an emphatic
anti-feature.
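By way of illustration with the oldest such pair (nothing novel
here; this is bedrock groff, shown only to make the
interchangeability point concrete):

  .\" Request form:
  .ft B
  the same text, emboldened
  .ft P
  .\" Escape sequence form, with identical rendering:
  \fBthe same text, emboldened\fP

Authors switch between those two spellings all day long without
consulting a manual.  I want `\X` and `.device` to earn the same
casual trust.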
> unless you have an example which demonstrates a problem which your
> code solves.  The only example you gave of what you are "fixing",
> the .AUTHOR line in a mom example doc, actually works fine, so is
> probably not a good example to justify your changes.

The problem I'm trying to solve is that nearly no one seems to
understand how the formatter works in the area under discussion, and
the few who have known or figured it out to date ain't tellin'.

If they had cared to, they could have stopped me in my tracks months
to years ago, when I started complaining about this stuff.

That's an undesirable property of a software system.  What we ended
up with was, in effect if not in intent, "<snort> Go ahead,
documentation guy--just you figure it out."

That's fine.  Challenge accepted.

In rejoinder to your implicit scold of undertaking pointless
efforts, let me offer the following quotes from familiar personages.

  "...I was able to make an initial release of Mom after about three
  years.  From the beginning, I followed a self-imposed rule: Write
  the documentation as it would appear in the manual before defining
  a macro.  These weren't descriptions of what I intended to do, but
  careful instructions for using as-yet unwritten macros.
  Documenting an already-written macro can lead to getting all
  twisted up, but implementing a macro that has to follow the
  documentation keeps you on top of things."

  https://technicallywewrite.com/2023/09/30/groffmom

  "Most details of the constant questioning and experimentation
  during the early period of rapid change are long forgotten, as are
  hundreds of transitory states that were recorded in the on-line
  manual.  From time to time, however, a snapshot was taken in the
  form of a new printed edition.  Quite contrary to commercial
  practice, where a release is supposed to mark a stable,
  shaken-down state of affairs, the very act of preparing a new
  edition often caused a flurry of improvements simply to forestall
  embarrassing admissions of imperfection."

  https://www.cs.dartmouth.edu/~doug/reader.pdf

Documentation, like automated testing, keeps honest engineers
honest.  If any aspect of a system is infeasible to describe, either
without wincing at how many caveats and asides one has to make, or
altogether, that aspect bears reconsideration.

Hence this thread.  Still callow and green, I recall asking this
list years ago what the warnings at issue meant.  No one would, or
maybe could, answer me.  I resolved to find my own answers.  I've
learned a tremendous amount.  But some of what I have discovered is
less than exemplary.

So, yeah, that's the problem I'm trying to solve.

> > Because all of
> >
> >   \X
> >   .device
> >   \!
> >   .output
> >   .cf
> >   .trf

> Why are two missing?

Which two did you have in mind?  If I'm overlooking something, you'd
be doing me a favor in telling me.[6]

> > can inject stuff into "grout".
> >
> > That seems like a perilous path to me.
>
> Not if you restrict the changes to \X only, and document the
> difference in behaviour from the other 7 methods.

That's the status quo, but for the reasons I think I have thoroughly
aired above, I believe it's a bad one.  Authors of interfaces to
device features that _you'd think_ would suggest the use of the
"device-related" escape sequence and request have avoided them to
date because of the undesirable side effects.  "Yeah, we have >this<
for that, but nobody uses it.  Instead we just go straight to page
description assembly language."  Is no one ashamed of this?

> It is not a problem.  I can certainly embed a composite glyph as
> part of a bookmark; the problem is that it does not always look
> very good (see pdf) and messes up searching for bookmarks.

For the sake of a thorough reply, I acknowledge again that the
constraint of running all the Unicode special character escape
sequences through the normalization facilities offered by libgroff
is unnecessary here.  I turned to that resource because it was there
and I didn't want to reinvent any wheels.  As we say again and
again, DRY.  ;-)
> Have a go if you want, I've got it down to 10 extra lines, but the
> results may be depressing (see PDF).

The good news is that you've shifted me.  I hope I can make `\X` and
`.device` language features that you can happily employ to greater
effect in "pdf.tmac".

Thank you for your patience.

Regards,
Branden

[1]
commit e7c9dbd201a241e8c42f34ef09acbc16584f16c3
Author: Werner LEMBERG <w...@gnu.org>
Date:   Fri Dec 30 09:31:50 2005 +0000

    New preprocessor `preconv' to convert input encodings to
    something groff can understand.  Not yet integrated within
    groff.  Proper autoconf stuff is missing too.

    Tomohiro Kubota has written a first draft of this program, and
    some ideas have been reused (while almost no code has been taken
    actually).

    * src/preproc/preconv/preconv.cpp,
      src/preproc/preconv/Makefile.sub: New files.

    * MANIFEST, Makefile.in (CCPROGDIRS), test-groff.in
      (GROFF_BIN_PATH): Add preconv.

commit e9a1d5af572610f8ad80a0c18a0f6b02306fed03
Author: Werner LEMBERG <w...@gnu.org>
Date:   Sun Jan 1 16:31:01 2006 +0000

    * src/preproc/preconv/preconv.cpp (emacs_to_mime): Various
      corrections:
      . Don't map ascii to latin-1.
      . Don't use IBMxxx encodings but cpxxx for portability.
      . Map cp932, cp936, cp949, cp950 to itself.
      (emacs2mime): Protect calls to strcasecmp.
      (conversion_iconv): Add missing call to iconv_close.
      (do_file): Emit error message in case of unsupported encoding.

    [and so on]

[2] ...plus programmable positioning tricks that advanced font file
formats employ, as I understand it, so you can render pretty
Vietnamese, among other things.

[3] Short of, maybe, writing them into a diversion, which has been
done, and selectively filtering them based on node identity, for
which insufficient facilities in the groff language are available to
date.  Historically, we throw the `unformat` and `asciify` requests
at such diversions and pray that they do what we need.

You can also "handle" an escape sequence by it not _being_ an escape
sequence in the first place--for instance, by changing or disabling
the escape character.  But string handling facilities are few in the
groff language.  As I keep saying, I hope to fix that.

[4] In C, I'm certain of that.  In C++, the fact that they're member
functions of a class may have some bearing.  Static member functions
are conceivable, as these need no specialization by object identity.
Moreover, there is only ever one `troff_output_file` object in
existence during the lifetime of any GNU troff process anyway.  My
attempt at a minor cleanup might explode in my face regardless.  C++
is a language meticulously accreted from chewed bubble gum and
whatever could be methodically swept from the floors of jail houses
and crack dens, augmented with the glittering chrome of
revolutionary innovations the occasional hacker from Microsoft or
Sun wangled in on the force of his boundless ambition to get
promoted up the Principal Engineer/Distinguished Engineer/Fellow
ladder.

[5] It seems that `flush_tbuf()` is the only thing that really needs
to be unconditional.  It refers to the buffer of ordinary characters
being assembled into a 't' or 'u' grout command.  This is an aspect
of formatter state, not document state, and the casual commingling
of these matters is yet another frustration.  Concretely, if we've
got a 't' command in progress when we hit a device extension
command, we _have_ to finish that 't' command.

  t fooba
  x X pdf: 12 double chocolate chip /bakecookies
  c r

is correct, whereas

  t foobax X pdf: 12 double chocolate chip /bakecookies
  c r

...would be riotously wrong.

[6] If `\?` is one of them--inapplicable.
It's explicitly prevented from bubbling its argument out of the top-level diversion to grout.