> I'm working on resurrecting Brian M. Carlson's work on character
> classes, and attempting to update it in the light of Werner's
> comments in the short thread on the subject in January 2008.

Great!

> I have a query about my planned design that I'd like to run past you
> first, though. After playing about with a few possibilities, I noted
> that, at least in theory, we would want to be able to apply several of
> the same sets of attributes to character classes as we do to individual
> groff entities: [...]

Yes.

> [...] it seems sensible to simply put character classes in the same
> symbol table as ordinary groff entities, and add character-range and
> class-nesting support to 'class charinfo'.

Good idea. It's so simple that no one has had this idea before.

> Obviously a class that consisted of more than just a single
> character wouldn't have a Unicode codepoint or a glyph number or
> anything, and \[CJKprepunct] wouldn't produce any output, but
> '.cflags 2 \[CJKprepunct]' or whatever would be a sensible thing to
> write.

We could introduce a naming convention for character classes, say,
starting such names with a dot, having the word `class' in the name,
or something similar. Since the list of groff entities is not
extensible, we have a broad range of possibilities. We could even use
names similar to POSIX character ranges, e.g.,

  .char \C'[:digit:]' 0123456789

  abc\C'[:digit:]'abc

Note that entities with a `]' in their names can't be accessed with
\[...]; this might work as an additional protection against accidental
misuse.

> A simple initial implementation would essentially just change the
> accessor methods of 'class charinfo' to look through all registered
> character classes for ones that include the current character
> (intentionally vague here as I haven't yet worked out how to deal
> with ranges of Unicode codepoints that haven't been given entity
> indices).

This should probably support fall-back classes too, similar to the
current mechanism for ordinary entities.

> For a small number of classes this ought to be perfectly adequate,
> and the lookups can be optimised later. My immediate needs (CJK
> support, of course!) only seem to require classes for
> no-break-before and no-break-after kinsoku shori, a general notion
> of "CJK character" so that we can adjust kerning between CJK and
> Latin characters,

This notion is also necessary to indicate that a break after the
current CJK character is allowed.

> and a class for double-width characters.

Not on the input side.

> I assume that the latter two would need to be done on the font side,

Exactly.

> BTW, in light of Werner's comments that glyphs are strictly an output
> notion, it isn't half confusing that 'class charinfo' is based on
> 'struct glyph' ...

Well, those names are historical, and while James Clark implemented
the character/glyph separation quite cleanly, he didn't pay much
attention to proper structure and class names.


    Werner
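
To make the "simple initial implementation" discussed above more
concrete, here is a minimal C++ sketch of a linear lookup through all
registered classes, with character-range and class-nesting support.
It is not groff code: `char_class', `add_range', `contains', and
`class_flags' are hypothetical names, characters are assumed to be
plain integer code points, the flag values are arbitrary, and
fall-back classes are left out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a character class; this is NOT groff's
// 'class charinfo', only an illustration of ranges plus nesting.
struct char_class {
  std::string name;                         // e.g. "CJKprepunct"
  unsigned flags = 0;                       // value as given to '.cflags'
  std::vector<std::pair<int, int>> ranges;  // inclusive code point ranges
  std::vector<const char_class *> nested;   // classes included by reference

  // Add the inclusive range [first, last] to this class.
  void add_range(int first, int last) {
    assert(first <= last);
    ranges.emplace_back(first, last);
  }

  // True if 'code' lies in one of the ranges or in any nested class.
  // Assumes that the nesting graph contains no cycles.
  bool contains(int code) const {
    for (const auto &r : ranges)
      if (r.first <= code && code <= r.second)
        return true;
    for (const char_class *c : nested)
      if (c->contains(code))
        return true;
    return false;
  }
};

// Linear scan over all registered classes, OR-ing together the flags
// of every class that includes 'code'.  Adequate for a small number
// of classes; a sorted interval table could replace it later.
unsigned class_flags(const std::vector<const char_class *> &registry,
                     int code) {
  unsigned result = 0;
  for (const char_class *c : registry)
    if (c->contains(code))
      result |= c->flags;
  return result;
}
```

With one class covering U+3001..U+3002 (flags 2) and a second class
covering U+4E00..U+9FFF (flags 4) that nests the first, looking up
U+3001 yields the flags of both classes, while U+4E2D yields only
those of the second.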