[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Tom Christiansen Sat, 13 Aug 2011 18:11:59 -0700

Tom Christiansen <[email protected]> added the comment:

Antoine Pitrou <[email protected]> wrote
   on Sat, 13 Aug 2011 21:09:52 -0000:


> And/or a lookup table giving the byte offset of, say, every 16th
> character. It gives you a O(1) lookup with a relatively reasonable
> constant cost (you have to scan for less than 16 characters after the
> lookup).

> On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
> table would be 1/16. It could also be constructed lazily whenever more
> than 2 positions are cached.

You really should talk to the Perl 6 people to see whether their current
strategy for caching offset maps for grapheme positions might be of use to
you.  Larry explained it to me once but I no longer recall any details.

I notice though that they don't seem to think it worth doing for UTF-8 
or UTF-16, just for their synthetic "NFG" (Grapheme Normalization Form)
strings, where it would be needed even if they used UTF-32 underneath.

--tom

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Reply via email to