Hi,

I'm working on resurrecting Brian M. Carlson's work on character classes, and attempting to update it in the light of Werner's comments in the short thread on the subject in January 2008. I'm attempting first to get the input side working, and will revisit the font side after that.

I have a query about my planned design that I'd like to run past you first, though. After playing about with a few possibilities, I noted that, at least in theory, we would want to be able to apply several of the same sets of attributes to character classes as we do to individual groff entities: character classes need to have flags and hyphenation codes, and I could imagine that you might want to apply a translation to a whole character class (for example, "translate all CJK characters to the Unicode replacement symbol since my output device is too stupid to understand them").

As such, rather than having separate .classflags, .classhcode, etc. requests, and rather than having to duplicate several bits of state from groff entities in character classes, it seems sensible to simply put character classes in the same symbol table as ordinary groff entities, and add character-range and class-nesting support to 'class charinfo'. Obviously a class that consisted of more than just a single character wouldn't have a Unicode codepoint or a glyph number or anything, and \[CJKprepunct] wouldn't produce any output, but '.cflags 2 \[CJKprepunct]' or whatever would be a sensible thing to write.

A simple initial implementation would essentially just change the accessor methods of 'class charinfo' to look through all registered character classes for ones that include the current character (intentionally vague here as I haven't yet worked out how to deal with ranges of Unicode codepoints that haven't been given entity indices). For flags, we'd need to take the disjunction of all flags set on character classes including the character in question; for other characteristics I suppose we'd need to take the "most restrictive" one (e.g. single character wins over character range) and if there's ambiguity, well, you shouldn't have done that then. For a small number of classes this ought to be perfectly adequate, and the lookups can be optimised later.

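To make the lookup rules above a bit more concrete, here is a rough, standalone C++ sketch of the linear scan. It is only an illustration under assumptions: none of these names (char_class, class_registry, flags_for, translation_for) exist in groff, it works on raw Unicode codepoints rather than entity indices, it assumes class nesting is acyclic, and it reads "most restrictive" as "narrowest matching range", with the first registered class winning a tie.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a character class; not groff's 'class charinfo'.
struct char_class {
  std::string name;                                   // e.g. "CJKprepunct"
  std::vector<std::pair<uint32_t, uint32_t>> ranges;  // inclusive codepoint ranges
  std::vector<const char_class *> nested;             // classes this one includes
  unsigned flags = 0;                                 // as .cflags would set
  std::optional<uint32_t> translation;                // as .tr would set, if any

  // Width of the narrowest range (here or in a nested class) containing c,
  // or nullopt if c is not a member. Assumes nesting has no cycles.
  std::optional<uint64_t> narrowest_match(uint32_t c) const {
    std::optional<uint64_t> best;
    for (const auto &r : ranges) {
      assert(r.first <= r.second);  // ranges must be well-formed
      if (r.first <= c && c <= r.second) {
        uint64_t width = uint64_t(r.second) - r.first + 1;
        if (!best || width < *best) best = width;
      }
    }
    for (const char_class *n : nested) {
      std::optional<uint64_t> w = n->narrowest_match(c);
      if (w && (!best || *w < *best)) best = w;
    }
    return best;
  }
};

// Hypothetical registry scanned linearly on every lookup.
struct class_registry {
  std::vector<const char_class *> classes;

  // Flags: disjunction over every registered class that contains c.
  unsigned flags_for(uint32_t c) const {
    unsigned f = 0;
    for (const char_class *k : classes)
      if (k->narrowest_match(c)) f |= k->flags;
    return f;
  }

  // Translation: the class with the narrowest matching range wins;
  // on a tie the earliest registered class is kept.
  std::optional<uint32_t> translation_for(uint32_t c) const {
    std::optional<uint32_t> result;
    std::optional<uint64_t> best;
    for (const char_class *k : classes) {
      if (!k->translation) continue;
      std::optional<uint64_t> w = k->narrowest_match(c);
      if (w && (!best || *w < *best)) {
        best = w;
        result = k->translation;
      }
    }
    return result;
  }
};
```

Each lookup costs time proportional to the number of classes and ranges, which matches the "adequate for a small number of classes, optimise later" assumption; an interval tree or a sorted range table would be the obvious later replacement.
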
My immediate needs (CJK support, of course!) only seem to require classes for no-break-before and no-break-after kinsoku shori, a general notion of "CJK character" so that we can adjust kerning between CJK and Latin characters, and a class for double-width characters. I assume that the latter two would need to be done on the font side, so we'd only be starting with two necessary classes on either side plus whatever nesting is used.

Does all this make sense? I've been staring at this on and off for a few days now in between regular work, changing the baby's nappy, and so on :-), and my attention has been a bit fragmented. I'd like to check that I'm on roughly the right track before continuing.

BTW, in light of Werner's comments that glyphs are strictly an output notion, it isn't half confusing that 'class charinfo' is based on 'struct glyph' ...

Thanks,

-- 
Colin Watson [cjwat...@debian.org]