New submission from Hammerite:

Unicode Standard Annex #15 
(http://unicode.org/reports/tr15/#Stable_Code_Points) describes how each 
character in Unicode, for each of the four normalisation forms, has a 
"Quick_Check" value that aids in determining whether a given string is in that 
normalisation form. It goes on to describe, in section 9.1, how this 
"Quick_Check" value may be used to optimise the concatenation of a string onto 
a normalised string to produce another normalised string: normalisation need 
only be performed from the last "stable" character in the left-hand string 
onwards, where a character is "stable" if its "Quick_Check" value is Yes and it 
has a canonical combining class of 0. This will generally be more efficient 
than re-running the normalisation algorithm on the entire concatenated string, 
if the strings involved are long.
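
To illustrate why concatenation needs this care at all, here is a minimal sketch using only the existing unicodedata.normalize(). The characters are simply an example I picked: two strings that are each already in NFD, whose concatenation is not, because the combining marks have to be reordered.

```python
import unicodedata

s1 = "a\u0301"   # 'a' + COMBINING ACUTE ACCENT (combining class 230)
s2 = "\u0323"    # COMBINING DOT BELOW (combining class 220)

# Each piece is unchanged by NFD normalisation on its own.
assert unicodedata.normalize("NFD", s1) == s1
assert unicodedata.normalize("NFD", s2) == s2

# The concatenation is not: canonical ordering moves the dot below first.
joined = s1 + s2
renormalised = unicodedata.normalize("NFD", joined)
print(renormalised == joined)            # False
print(renormalised == "a\u0323\u0301")   # True
```

Here the only stable character in s1 is the 'a', so normalisation has to restart from there rather than from the start of s2.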

The unicodedata standard-library module does not, in my understanding, expose 
this information. I would like to see a new function added that allows us to 
determine whether a given character has the "Quick_Check" property for a given 
normalisation form. This function might accept two parameters, the first of 
which is a string indicating the normalisation form and the second of which is 
the character being tested (similar to unicodedata.normalize()).

Suppose we have a need to accept text data, receiving chunks of it at a time, 
and every time we receive a new chunk we need to append it to the string so far 
and also make sure that the resulting string is normalised to a particular 
normalisation form (NFD, say). This implies that we would like to be able to 
concatenate the new chunk (which may not be normalised) onto the string "so 
far" (which is) and have the result be normalised, but without renormalising 
the whole string each time, as this might be inefficient. 
From the linked UAX, this might be achieved like this, where 
unicodedata.quick_check() is the requested function:

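    import unicodedata

    def concat(s1, s2):
        # Scan backwards for the last stable character position in s1.
        LSCP = len(s1)
        while LSCP > 0:
            LSCP -= 1
            # unicodedata.quick_check() is the requested (not yet existing) function.
            if (unicodedata.combining(s1[LSCP]) == 0
                    and unicodedata.quick_check('NFD', s1[LSCP])):
                break
        # Only the tail from the last stable character onwards is renormalised.
        return s1[:LSCP] + unicodedata.normalize('NFD', s1[LSCP:] + s2)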
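
Since unicodedata.quick_check() does not exist, the function above cannot be run as written. The following is a rough sketch of how it might be tried out today, under an assumption that I believe holds for NFD and NFKD only: in those forms Quick_Check is either Yes or No, and a single character is Yes exactly when normalising it on its own leaves it unchanged. The name quick_check_standin is hypothetical, and the stand-in does not model the Maybe value that NFC and NFKC use.

```python
import unicodedata

def quick_check_standin(form, ch):
    """Hypothetical stand-in for the requested unicodedata.quick_check().

    Assumption: only meaningful for 'NFD' and 'NFKD', where Quick_Check is
    either Yes or No. It does not model the Maybe value of NFC/NFKC.
    """
    if form not in ("NFD", "NFKD"):
        raise ValueError("stand-in only supports NFD and NFKD")
    return unicodedata.normalize(form, ch) == ch

def concat(s1, s2, form="NFD"):
    """Append s2 to the already-normalised s1, renormalising only the tail."""
    lscp = len(s1)  # last stable character position
    while lscp > 0:
        lscp -= 1
        if (unicodedata.combining(s1[lscp]) == 0
                and quick_check_standin(form, s1[lscp])):
            break
    return s1[:lscp] + unicodedata.normalize(form, s1[lscp:] + s2)

# Feed chunks one at a time and compare with normalising everything at once.
chunks = ["caf", "e\u0301 a\u0301", "\u0323 \u00e9t\u00e9", "\u0301"]
so_far = ""
for chunk in chunks:
    so_far = concat(so_far, chunk)

print(so_far == unicodedata.normalize("NFD", "".join(chunks)))  # True
```

The stand-in normalises one character at a time, so it is slower than a direct lookup of the Quick_Check property would be, which is part of the motivation for exposing the property itself.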

----------
components: Library (Lib), Unicode
messages: 236901
nosy: Hammerite, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Add to unicodedata a function to query the "Quick_Check" property for a 
character
type: enhancement

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23550>
_______________________________________