New submission from Hammerite:

Unicode Standard Annex #15 (http://unicode.org/reports/tr15/#Stable_Code_Points) describes how, for each of the four normalisation forms, every Unicode character has a "Quick_Check" value that helps determine whether a given string is already in that form. Section 9.1 goes on to describe how this value can be used to optimise concatenating a string onto an already-normalised string so that the result is also normalised: normalisation need only be performed from the last "stable" character of the left-hand string onwards, where a character is "stable" if it has the "Quick_Check" property and a canonical combining class of 0. When the strings involved are long, this will generally be more efficient than re-running the normalisation algorithm on the entire concatenated string.
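As a small illustration of why the boundary needs attention at all, the sketch below joins two strings that are each in NFD on their own and shows that the result is not. It uses only the existing unicodedata.normalize(); the variable names are chosen purely for illustration.

```python
import unicodedata

# Both inputs are already in NFD when taken on their own.
s1 = "a\u0301"  # 'a' + COMBINING ACUTE ACCENT (combining class 230)
s2 = "\u0323"   # COMBINING DOT BELOW (combining class 220)

assert unicodedata.normalize("NFD", s1) == s1
assert unicodedata.normalize("NFD", s2) == s2

joined = s1 + s2
renormalised = unicodedata.normalize("NFD", joined)

# Canonical ordering sorts adjacent combining marks by combining class,
# so the two marks swap places and the plain concatenation is not NFD.
joined_is_nfd = (joined == renormalised)
```

Only the characters after the last character with combining class 0 ('a' here) are affected, which is the part the proposed optimisation would re-normalise.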
The unicodedata standard-library module does not, as far as I can tell, expose this information. I would like to see a new function added that reports whether a given character has the "Quick_Check" property for a given normalisation form. It might accept two parameters, the first a string naming the normalisation form and the second the character being tested (similar to unicodedata.normalize()).

Suppose we need to accept text data in chunks, and each time a new chunk arrives we must append it to the string so far and ensure the result is normalised to a particular form (NFD, say). We would therefore like to concatenate the new chunk (which may not be normalised) onto the string so far (which is) and have the result be normalised, but without normalising the whole string over again, as this might be inefficient. Following the linked UAX, this might be achieved like this, where unicodedata.quick_check() is the requested function:

    import unicodedata

    def concat(s1, s2):
        LSCP = len(s1)  # Last stable character position
        while LSCP > 0:
            LSCP -= 1
            if unicodedata.combining(s1[LSCP]) == 0 and unicodedata.quick_check('NFD', s1[LSCP]):
                break
        return s1[:LSCP] + unicodedata.normalize('NFD', s1[LSCP:] + s2)

----------
components: Library (Lib), Unicode
messages: 236901
nosy: Hammerite, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Add to unicodedata a function to query the "Quick_Check" property for a character
type: enhancement

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23550>
_______________________________________
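For the NFD case only, the idea in this report can be tried out with what unicodedata already offers. The sketch below is not the requested API: quick_check_nfd() and concat_nfd() are hypothetical names, and the stand-in rests on the assumption that a single character has an NFD Quick_Check value of Yes exactly when NFD normalisation leaves it unchanged. That assumption does not carry over to NFC or NFKC, where the value can also be Maybe.

```python
import unicodedata


def quick_check_nfd(ch):
    """Hypothetical stand-in for the requested quick_check('NFD', ch).

    Assumption: a lone character is NFD_QC=Yes exactly when NFD
    normalisation returns it unchanged.
    """
    return unicodedata.normalize("NFD", ch) == ch


def concat_nfd(s1, s2):
    """Append s2 to s1 (assumed to be in NFD already); return an NFD string."""
    lscp = len(s1)  # last stable character position
    while lscp > 0:
        lscp -= 1
        if unicodedata.combining(s1[lscp]) == 0 and quick_check_nfd(s1[lscp]):
            break
    # Only the tail from the last stable character onwards is re-normalised.
    return s1[:lscp] + unicodedata.normalize("NFD", s1[lscp:] + s2)
```

Calling normalize() on one character at a time is itself not free, so this only demonstrates the logic; a built-in property lookup, as requested, would avoid that cost.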