On Sun Jul 26 14:40:56 EDT 2009, knapj...@gmail.com wrote:
> If I'm reading you right, you're saying it might be easier if
> everything were encoded as combining (or maybe more aptly
> non-combining) codes, regardless of language?
>
> So, we might encode 'Waffles' as w+upper a f f l e s and let the
> renderer (if there is one) handle the presentation of the case shift
> and the potential ligature, but things like grep get noticeably easier
> with no overlap of ő and o+umlaut.
>
> Again, oversimplified, with no real understanding on my part of the
> depth or breadth of the problem space.
you understand. except, i was taking the opposite position. if you did
for english what is done for indic languages, then if you typed 'this is
a sentence.' the 't' would be capitalized as soon as you typed the '.'.
there's no hint that this rule needs to be applied; the renderer would
just have to know it. in ak's example a certain combination of
codepoints yields a specific 'letter'. (i hope i have that right.) the
renderer is just supposed to know this. so for consistency, and to
reduce the need for complicated language-specific rules (how do we know
that the text represented is actually from the language we think it
is?), i would force the producer to declare the combinations.

btw, the search problem is not at all solved by standardizing (or is
that standardising?) the combiners. consider the following bits of
unicode fun:

; grep 'zero width' /lib/unicode
200b	zero width space
200c	zero width non-joiner
200d	zero width joiner
feff	zero width no-break space

i'm sure that someone more conversant in unicode could point out other
points of real difficulty. how do you tell unicode from uni\ufeffcode?
not only is that an annoyance, but it could be a pretty interesting
security problem. and what a gift for spammers!

- erik
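[a minimal sketch of both problems, in python for convenience; nothing
here was proposed in the thread. it shows that normalization does
reconcile precomposed ő with o plus a combining double acute, but leaves
the zero-width characters listed above untouched, so 'uni\ufeffcode'
still does not match 'unicode'.]

    import unicodedata

    # precomposed vs. combining: a byte-wise grep sees two different strings
    pre = "\u0151"      # ő as a single codepoint (o with double acute)
    comb = "o\u030b"    # 'o' followed by combining double acute accent
    print(pre == comb)                                # False
    print(unicodedata.normalize("NFC", comb) == pre)  # True: NFC composes it

    # zero-width characters: normalization does not remove them
    sneaky = "uni\ufeffcode"    # zero width no-break space hidden inside
    print(sneaky == "unicode")  # False, though the two render identically
    print(unicodedata.normalize("NFC", sneaky) == "unicode")  # still False

[the upshot matches erik's point: agreeing on one encoding for the
combiners helps search, but invisible codepoints still have to be
stripped or rejected by some separate rule.]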