[Groff] Character class query

Colin Watson Sun, 01 Mar 2009 13:53:57 -0800

Hi,

I'm working on resurrecting Brian M. Carlson's work on character
classes, and attempting to update it in the light of Werner's comments
in the short thread on the subject in January 2008. I'm attempting first
to get the input side working, and will revisit the font side after
that.


I have a query about my planned design that I'd like to run past you
first, though. After playing about with a few possibilities, I noted
that, at least in theory, we would want to apply several of the same
attributes to character classes as we do to individual groff entities:
character classes need to have flags and hyphenation codes, and I could
imagine that you might want to apply a translation to a whole character
class (for example, "translate all CJK characters to the Unicode
replacement symbol since my output device is too stupid to understand
them").

As such, rather than having separate .classflags, .classhcode, etc.
requests, and rather than having to duplicate several bits of state from
groff entities in character classes, it seems sensible to simply put
character classes in the same symbol table as ordinary groff entities,
and add character-range and class-nesting support to 'class charinfo'.
Obviously a class that consisted of more than just a single character
wouldn't have a Unicode codepoint or a glyph number or anything, and
\[CJKprepunct] wouldn't produce any output, but '.cflags 2
\[CJKprepunct]' or whatever would be a sensible thing to write.
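
To make the idea concrete, here is a minimal sketch of an entry that can
hold single characters, code point ranges, and nested classes. It is
only an illustration under stated assumptions: 'char_entry' and its
members are hypothetical names, not groff's actual 'class charinfo', and
nesting is assumed to be free of cycles.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a symbol-table entry that may describe a
// single character or a whole class.  Not groff's real 'class charinfo'.
struct char_entry {
  unsigned flags = 0;  // bit mask of the kind set by .cflags
  std::vector<std::pair<uint32_t, uint32_t>> ranges;  // inclusive ranges
  std::vector<const char_entry *> nested;             // member classes

  void add_char(uint32_t c) { ranges.emplace_back(c, c); }
  void add_range(uint32_t lo, uint32_t hi) { ranges.emplace_back(lo, hi); }
  void add_class(const char_entry *e) { nested.push_back(e); }

  // True if code point c belongs to this entry, either directly or
  // through a nested class.  Assumes the nesting graph has no cycles.
  bool contains(uint32_t c) const {
    for (const auto &r : ranges)
      if (r.first <= c && c <= r.second)
        return true;
    for (const char_entry *e : nested)
      if (e->contains(c))
        return true;
    return false;
  }
};
```

With something like this, a class built from U+3001 and U+3002 plus a
nested class covering a range of brackets would answer membership
queries for any of those code points, while carrying its own flags just
as a single-character entry does.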

A simple initial implementation would essentially just change the
accessor methods of 'class charinfo' to look through all registered
character classes for ones that include the current character
(intentionally vague here as I haven't yet worked out how to deal with
ranges of Unicode codepoints that haven't been given entity indices).
For flags, we'd need to take the disjunction of all flags set on
character classes including the character in question; for other
characteristics I suppose we'd need to take the "most restrictive" one
(e.g. single character wins over character range) and if there's
ambiguity, well, you shouldn't have done that then.
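
The lookup rules above could be sketched roughly as follows. This is an
assumption-laden illustration, not the planned implementation: the
'char_class' record and both function names are made up for the example,
classes are flattened to a single inclusive range each, "most
restrictive" is interpreted as "narrowest range", and ties simply go to
whichever class was registered first.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flat record for one registered class.
struct char_class {
  uint32_t lo, hi;  // inclusive code point range; lo == hi for one char
  unsigned flags;   // bit mask of the kind set by .cflags
  int hcode;        // hyphenation code, 0 meaning "not set"
};

// Flags: the disjunction (bitwise OR) over every class containing c.
unsigned effective_flags(const std::vector<char_class> &classes,
                         uint32_t c) {
  unsigned f = 0;
  for (const auto &k : classes)
    if (k.lo <= c && c <= k.hi)
      f |= k.flags;
  return f;
}

// Other attributes: take the value from the narrowest class that
// contains c and has the attribute set.  On a tie the first registered
// class wins, which is an arbitrary choice made for this sketch.
int effective_hcode(const std::vector<char_class> &classes, uint32_t c) {
  const char_class *best = nullptr;
  for (const auto &k : classes) {
    if (c < k.lo || c > k.hi || k.hcode == 0)
      continue;
    if (!best || k.hi - k.lo < best->hi - best->lo)
      best = &k;
  }
  return best ? best->hcode : 0;
}
```

A linear scan like this costs one pass over the registered classes per
query, which is only reasonable while the number of classes stays small.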

For a small number of classes this ought to be perfectly adequate, and
the lookups can be optimised later. My immediate needs (CJK support, of
course!) only seem to require classes for no-break-before and
no-break-after kinsoku shori, a general notion of "CJK character" so
that we can adjust kerning between CJK and Latin characters, and a class
for double-width characters. I assume that the latter two would need to
be done on the font side, so we'd only be starting with two necessary
classes on either side plus whatever nesting is used.

Does all this make sense? I've been staring at this on and off for a few
days now in between regular work, changing the baby's nappy, and so on
:-), and my attention has been a bit fragmented. I'd like to check that
I'm on roughly the right track before continuing.

BTW, in light of Werner's comments that glyphs are strictly an output
notion, it isn't half confusing that 'class charinfo' is based on
'struct glyph' ...

Thanks,

-- 
Colin Watson                                       [cjwat...@debian.org]
