On Fri, 22 Mar 2013 13:08:01 -0600 Karl Williamson <[email protected]> wrote:
> This is the first time I've heard someone suggest that one can
> "tailor" normalizations.

I think the officially acceptable term is 'folding'. One would not be 'tailoring a Unicode normalisation', but subverting the code to do what you need. However, in my cases I've also wanted rearrangement as though the characters had what I consider useful canonical combining classes.

> Handling Greek shouldn't require having to fake UnicodeData.txt.

It does if you have problems with the normalisations. Now it can be argued that the problem is with you if you have difficulty treating U+003B SEMICOLON as indicating a question, but there are many ways of doing most tasks.

> And
> writing normalization code is complex and tricky, so people use
> pre-written code libraries to do this. What you're suggesting says
> that one can't use such a library as-is, but you would have to write
> your own.

From your description of what you were doing, I assumed you were in charge, rather than the subcontractor being in charge. However, some utilities have the nasty habit of hiding the key data where users can't get at it. One very legitimate reason for changing the data is to test a proposed change to the standard. Myself, I've been pleasantly surprised at how quick it is to parse UnicodeData.txt or even to loop through all codepoints.

> I suppose another option is to translate all the
> characters you care about into non-characters before calling the
> normalization library, and then translate back afterwards, and hope
> that the library doesn't use the same non-character(s) internally.

With over two planes of Private Use Area at your disposal, you needn't resort to non-characters.

> If I'm right, then it would be better
> for most normalization routines to ignore/violate the Standard, and
> not do this normalization.

It is certainly true that normalising everything can be a bad idea. Normalising CJK compatibility characters is a very good way of preventing round-tripping!

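The Private Use Area detour discussed above can be sketched roughly as follows. This is only an illustration, assuming Python's standard unicodedata module as the pre-written normalisation library; the helper name normalize_shielded and the base U+F0000 (start of the plane 15 Private Use Area) are arbitrary choices made here, not anything taken from the thread. It also assumes the input does not already use the chosen private-use code points, and it refuses to run if it does.

```python
import unicodedata

# Arbitrary choice for this sketch: start of the plane 15 Private Use Area.
PUA_BASE = 0xF0000

def normalize_shielded(text, protected, form="NFC"):
    """Normalise `text`, leaving every character in `protected` untouched.

    Hypothetical helper: protected characters are swapped for private-use
    code points, the library normaliser is called, and the swap is undone.
    """
    chars = sorted(set(protected))
    to_pua = {ord(ch): PUA_BASE + i for i, ch in enumerate(chars)}
    from_pua = {pua: cp for cp, pua in to_pua.items()}

    # Assumption: the input must not already contain the stand-in code points.
    if any(ord(ch) in from_pua for ch in text):
        raise ValueError("input already uses the private-use stand-ins")

    shielded = text.translate(to_pua)
    normalised = unicodedata.normalize(form, shielded)
    return normalised.translate(from_pua)

# U+037E GREEK QUESTION MARK normally becomes U+003B SEMICOLON under NFC.
plain = unicodedata.normalize("NFC", "\u037e")
kept = normalize_shielded("e\u0301\u037e", "\u037e")
print(f"U+{ord(plain):04X}", [f"U+{ord(c):04X}" for c in kept])
```

One caveat with this approach: private-use code points have canonical combining class 0, so a shielded combining mark acts as a starter while the library runs, and marks on either side of it will not be reordered across it. Whether that is wanted depends on the task.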
As to normalisation in general, if one's input were normalised immediately upon receipt, one would not be able to memorise how many deletions were needed to cancel a key stroke, and some input methods would go badly wrong. In general, one Unicode-compliant process cannot instruct another to do something like 'delete the last 5 characters' - sometimes a process needs to not be Unicode-compliant.

Richard.

