On Sun Jul 26 14:40:56 EDT 2009, knapj...@gmail.com wrote:
> If I'm reading you right, you're saying it might be easier if
> everything were encoded as combining (or maybe more aptly
> non-combining) codes, regardless of language?
> 
> So, we might encode 'Waffles' as w+upper a f f l e s and let the
> renderer (if there is one) handle the presentation of the case shift
> and the potential ligature, but things like grep get noticeably easier
> with no overlap of ő and o+umlaut.
> 
> Again, oversimplified, with no real understanding on my part of the
> depth or breadth of the problem space.

you understand.  except, i was taking the opposite position.

if you did for english what is done for indic languages,
then when you typed 'this is a sentence.' the 't' would be
capitalized as soon as you typed the '.'.  there's no hint that
this rule needs to be applied; the renderer would just have to
know it.  in ak's example a certain combination of codepoints
yields a specific 'letter'.  (i hope i have that right.)  the
renderer is just supposed to know this.  so for consistency,
and to reduce the need for complicated language-specific rules
(how do we know that the text represented is actually from the
language we think it is?), i would force the producer to
declare the combinations.
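
to make knapjack's grep point concrete, a little python sketch
(mine, using the standard unicodedata module): precomposed
o-umlaut and o plus a combining diaeresis are different codepoint
sequences, so a byte-wise search finds only one spelling unless
something normalizes first.  one declared form would make the
normalize step unnecessary.

import unicodedata

# 'ö' two ways: one precomposed codepoint vs. base + combiner
precomposed = '\u00f6'
combining = 'o\u0308'

print(precomposed == combining)         # False: distinct sequences
print(unicodedata.normalize('NFC', combining) == precomposed)  # True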

btw, the search problem is not at all solved by standardizing
(or is that standardising?) the combiners.  consider the
following bits of unicode fun:

; grep 'zero width' /lib/unicode
200b    zero width space
200c    zero width non-joiner
200d    zero width joiner
feff    zero width no-break space
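
to see why those bite, a small python sketch (mine, not from the
thread): a zero-width space hidden in a word defeats a plain
substring match, and normalization doesn't strip it, since zwsp
has no decomposition and survives nfc untouched.

import unicodedata

plain = 'unicode'
hidden = 'uni\u200bcode'        # zwsp between 'uni' and 'code'

print(plain in hidden)                                # False
print(unicodedata.normalize('NFC', hidden) == plain)  # False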

i'm sure that someone more conversant in unicode could
point out other areas of real difficulty.

how do you tell unicode from uni\ufeffcode?  not only
is that an annoyance, but it could be a pretty interesting
security problem.  and what a gift for spammers!
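
one possible defense, sketched in python (again mine, and only a
partial one): drop the format-control characters, unicode
category Cf, which covers all four zero-width codepoints above,
before comparing.

import unicodedata

def strip_format(s):
	# remove format controls: zwsp, zwnj, zwj, feff, etc.
	return ''.join(c for c in s if unicodedata.category(c) != 'Cf')

print(strip_format('uni\ufeffcode') == 'unicode')   # True

of course, blindly stripping zwj/zwnj breaks exactly the indic
(and persian) text that needs them, which is rather the point:
there's no cheap fix.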

- erik
