[issue30717] Add unicode grapheme cluster break algorithm

Manish Mon, 06 Jan 2020 01:14:08 -0800


Manish <[email protected]> added the comment:


> one never needs to look at more than two adjacent code points to tell
> whether or not a grapheme break will occur between them, so this ought
> to be pretty efficient.


That note is outdated (and has been outdated since Unicode 9). The regional 
indicator rules (GB12 and GB13) and the emoji rule (GB11) require arbitrary 
lookbehind (though thankfully not arbitrary lookahead).
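
To make the lookbehind concrete, here is a rough sketch of the regional indicator case only. The function names are made up for illustration and this is not proposed as the actual implementation; it assumes the caller already knows that both neighbours are regional indicators. GB12/GB13 say not to break between two regional indicators when an odd number of them precede the candidate break, so you have to count backwards through the whole run:

```python
def is_regional_indicator(ch: str) -> bool:
    # Regional Indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def break_between_regional_indicators(text: str, pos: int) -> bool:
    """Return True if a grapheme break falls before text[pos].

    Hypothetical helper: only valid when text[pos - 1] and text[pos]
    are both regional indicators (the GB12/GB13 situation).
    """
    if not (0 < pos < len(text)
            and is_regional_indicator(text[pos - 1])
            and is_regional_indicator(text[pos])):
        raise ValueError("pos must sit between two regional indicators")

    # Count the run of regional indicators that ends just before pos.
    # The run can be arbitrarily long, hence the unbounded lookbehind.
    count = 0
    i = pos - 1
    while i >= 0 and is_regional_indicator(text[i]):
        count += 1
        i -= 1

    # Odd count: text[pos] completes a flag pair, so no break (GB12/GB13).
    # Even count: the previous pair is complete, so a break is allowed.
    return count % 2 == 0
```

With two flags back to back (four regional indicators), the only break inside the run is in the middle, and finding that out at any position means walking back to where the run started.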

I think the ideal API surface is an iterator and nothing else. Everything else 
can be derived from the iterator. It's theoretically possible to expose an 
is_grapheme_break that's faster than just iterating -- look at the code in 
unicode-segmentation's _reverse_ iterator to see how -- but it's going to be 
tricky to get right. Building the iterator on top of is_grapheme_break is not a 
good idea.
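
As a sketch of what "derived from the iterator" could look like: toy_graphemes below is a stand-in I made up, not a real segmenter. It only pairs up regional indicators and treats every other code point as its own cluster, so it ignores most of the rules in UAX #29. The point is just that break offsets and a (slow, linear) is_grapheme_break fall out of any cluster iterator:

```python
from typing import Callable, Iterator


def _is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def toy_graphemes(text: str) -> Iterator[str]:
    # Stand-in iterator (assumption): pairs regional indicators and
    # yields every other code point alone. Not a UAX #29 implementation.
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        if (_is_regional_indicator(text[i]) and j < n
                and _is_regional_indicator(text[j])):
            j += 1
        yield text[i:j]
        i = j


def break_offsets(
    text: str,
    graphemes: Callable[[str], Iterator[str]] = toy_graphemes,
) -> Iterator[int]:
    # Derived API 1: code point offsets of every cluster boundary.
    offset = 0
    yield offset
    for cluster in graphemes(text):
        offset += len(cluster)
        yield offset


def is_grapheme_break(
    text: str,
    pos: int,
    graphemes: Callable[[str], Iterator[str]] = toy_graphemes,
) -> bool:
    # Derived API 2: walk forward from the start until we reach pos.
    for offset in break_offsets(text, graphemes):
        if offset >= pos:
            return offset == pos
    return False
```

Going this direction is straightforward. Going the other way, an iterator that calls a standalone is_grapheme_break at every position would redo the lookbehind each time.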

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue30717>
_______________________________________
