diacritics (combining characters) are a real mess in Unicode. with so
much space in the format why did they have to go this route, i wonder?

erik mentioned cyrillic. i had an old church slavonic bible text
that i was trying to display correctly on Plan 9 sometime in 2003-4.
in the screenshot, the top is x11 with correctly (i presume) combined
characters, and below it is the Plan 9 rendering:
http://mirtchovski.com/screenshots/x-p9-diacritics.jpg

there's a pattern there, as you can see: the combining char always
follows the char it's combined with, so you can try simply not
advancing forward as a first draft of implementing char combinations
in Plan 9. there doesn't seem to be a separate list of "combining"
characters in Unicode, so you'll have to pick them out yourself (the
general category and combining class fields of UnicodeData.txt should
identify them) and check for them as you draw each character. fun and
slow :)
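
a rough sketch of the "don't advance" idea, in python rather than
Plan 9 C only because its stdlib ships the Unicode character database.
the names is_combining, advances and display_width are made up for
illustration, and it assumes every non-combining character takes
exactly one cell, which isn't true for wide CJK glyphs:

```python
import unicodedata

def is_combining(ch):
    """True if ch should be drawn over the previous character."""
    # combining() returns the canonical combining class (0 = not reordered).
    # Some marks have class 0 but are still non-spacing or enclosing,
    # so the general category (Mn, Me) is checked as well.
    return unicodedata.combining(ch) != 0 or unicodedata.category(ch) in ("Mn", "Me")

def advances(text):
    """Pen advance, in cells, after each code point: 0 for combining marks."""
    return [0 if is_combining(ch) else 1 for ch in text]

def display_width(text):
    """Total width in cells, under the one-cell-per-base-character assumption."""
    return sum(advances(text))

# 'e' followed by U+0308 COMBINING DIAERESIS occupies the same single
# cell as the precomposed U+00EB.
print(advances("ae\u0308b"))
print(display_width("e\u0308"), display_width("\u00eb"))
```

this only gets the pen position right. actually placing the mark nicely
over the base glyph is the font's problem and isn't attempted here.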

the real problem isn't in viewing them, however, but comes when you
start searching for them: it's easy to search for ë (e-umlaut), for
example, but what if it's encoded as e+"U+0308 COMBINING DIAERESIS"?
the answer is UTS #18, the Unicode Regular Expressions technical
standard, which probably contributes at least half of the slowness of
gnu grep discussed in another thread. http://www.unicode.org/reports/tr18/
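
to make the search problem concrete, here's a small python sketch. note
that it is not UTS #18 matching: it swaps in plain Unicode normalization
(NFC, from UAX #15) on both the pattern and the text before a substring
search, which is the cheap way to make the two spellings of ë compare
equal. the helper names canon and contains are made up for this example:

```python
import unicodedata

def canon(s):
    """Normalize to NFC so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", s)

def contains(haystack, needle):
    """Substring search that ignores precomposed/decomposed differences."""
    return canon(needle) in canon(haystack)

precomposed = "\u00eb"   # LATIN SMALL LETTER E WITH DIAERESIS
decomposed = "e\u0308"   # 'e' + COMBINING DIAERESIS

# The two spellings render the same but are different code point
# sequences, so a naive comparison fails.
print(precomposed == decomposed)

# After normalization either spelling finds the other.
print(contains("no" + decomposed + "l", precomposed))
print(contains("no" + precomposed + "l", decomposed))
```

the cost is the one hinted at above: every string has to be normalized
before it can be compared, and a search for plain "e" will no longer
match inside ë under NFC, which may or may not be what you want.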
