Follow-up Comment #4, bug #65601 (group groff):

[comment #3 comment #3:]
> It is very simple!
> [derij@pip build (master)]$ echo "This is the arabic alef with a madda above -> آ"|groff -Tutf8 -Kutf8 -z
> troff:<standard input>:1: error: cannot format glyph: 'u0627_0653' is not a valid composite character

Yes, thank you--that's helpful.

> It is no longer possible to output this particular utf-8 character
> (along with hundreds of others), all perfectly valid utf-8 to the
> utf-8 device.

Apparently.  Bummer.

> I can't believe that is your intention, to police which particular
> parts of unicode are acceptable to you.

But you can contemplate it seriously enough to put it into words.

It would help if you didn't assume autocratic motives behind my code
changes.

This is the complete repertoire of composite components known to
_groff_ for quite a while now.

87909b1715 (Werner LEMBERG      2003-03-01 07:34:52 +0000  1) .\" composite.tmac
87909b1715 (Werner LEMBERG      2003-03-01 07:34:52 +0000  2) .
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000  3) .do composite ga u0300
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000  4) .do composite ` u0300
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000  5) .do composite aa u0301
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000  6) .do composite ' u0301
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000  7) .do composite a^ u0302
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000  8) .do composite ^ u0302
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000  9) .do composite a~ u0303
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 10) .do composite ~ u0303
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 11) .do composite a- u0304
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 12) .do composite - u0304
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 13) .do composite ab u0306
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 14) .do composite a. u0307
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 15) .do composite . u0307
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 16) .do composite ad u0308
94c91fca8c (Werner LEMBERG      2006-02-28 13:04:27 +0000 17) .do composite : u0308
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 18) .do composite ao u030A
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 19) .do composite a" u030B
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 20) .do composite " u030B
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 21) .do composite ah u030C
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 22) .do composite ac u0327
be90ad7557 (Werner LEMBERG      2003-12-19 23:30:02 +0000 23) .do composite , u0327
77d9af6df8 (Werner LEMBERG      2003-03-12 23:00:25 +0000 24) .do composite ho u0328
87909b1715 (Werner LEMBERG      2003-03-01 07:34:52 +0000 25) .
47fc0a18b8 (G. Branden Robinson 2017-11-18 17:49:36 -0500 26) .\" Local Variables:
47fc0a18b8 (G. Branden Robinson 2017-11-18 17:49:36 -0500 27) .\" mode: nroff
47fc0a18b8 (G. Branden Robinson 2017-11-18 17:49:36 -0500 28) .\" fill-column: 72
47fc0a18b8 (G. Branden Robinson 2017-11-18 17:49:36 -0500 29) .\" End:
47fc0a18b8 (G. Branden Robinson 2017-11-18 17:49:36 -0500 30) .\" vim: set filetype=groff textwidth=72:

If a policeman's baton is being swung here, it's not mine.  (Though
Keith Marshall might disagree as regards editor settings.)

> Just in case the character in the above example gets mangled by
> savannah, here's the equivalent:

It came through fine for me, fortunately.

> I hope this gives you enough information to fix this bug.

Yup, I'll revert the change.

At some point I would like to validate the segregated/spacey/whatever
form of escape sequence differently.  Or, if it's too hard, not at all.
That also would be a bummer.

Some notes, probably mainly for my own benefit.  Not particularly
PDF-relevant.
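
An aside on the glyph name in the error message quoted above:
'u0627_0653' matches the Unicode canonical decomposition of U+0622
(ARABIC LETTER ALEF WITH MADDA ABOVE) into U+0627 plus the combining
mark U+0653.  The following is a minimal Python sketch, not groff code,
that builds such a name and checks the combining mark against the code
points listed in composite.tmac above.  The names decomposed_glyph_name
and COMPOSITE_TMAC_MARKS are made up for illustration, and the idea
that the formatter's validation consulted exactly this list is an
assumption.

```python
import unicodedata

# Code points that appear as targets in the composite.tmac listing above.
# Assumption: this set stands in for "components known to groff".
COMPOSITE_TMAC_MARKS = {
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307,
    0x0308, 0x030A, 0x030B, 0x030C, 0x0327, 0x0328,
}


def decomposed_glyph_name(ch: str) -> str:
    """Build a 'uXXXX_YYYY' style name from the NFD form of one character.

    Hypothetical helper; it only mimics the naming seen in the error
    message and does not call into groff.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    return "u" + "_".join(f"{ord(c):04X}" for c in decomposed)


def marks_outside_list(ch: str) -> list[int]:
    """Return combining marks in the NFD form that are not in the list."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [
        ord(c)
        for c in decomposed[1:]  # skip the base character
        if ord(c) not in COMPOSITE_TMAC_MARKS
    ]


if __name__ == "__main__":
    alef_madda = "\u0622"
    print(decomposed_glyph_name(alef_madda))
    print([f"U+{cp:04X}" for cp in marks_outside_list(alef_madda)])
```

Under that assumption, U+0653 is simply absent from the list, which
would account for this character (and others like it) being rejected.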

This problem has clarified my thinking around what the formatter needs
to know to delegate grapheme cluster composition to the device,[1][2]
and strengthens my suspicion that HTML output should take place in
nroff mode.  This is because font families and type sizes are less
HTML's job than CSS's, and we might as well communicate such things to
the document via different means ("tags", but not exactly as the
Mulley/Lemberg solution has applied them).  Moreover, stylesheet
selection should probably be an option in the output driver, with a
stock one generated or embedded in the absence of a user's choice.
(This isn't too different from how _grops_ uses a PostScript prologue,
for example.)

[1] We won't have problems if grapheme cluster composition can't turn a
half-width character into a full-width one or vice versa (mostly an
issue for terminals), and moreover the formatter doesn't need to know
how wide a grapheme cluster is if it's not responsible for line
breaking decisions.  And for HTML, it isn't.

[2] Typesetting devices will still have to get such information back to
the formatter.  If a composite character has different metrics from the
base character, the formatter **must** know this; hence the warnings
already produced.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?65601>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/