Josh Rosenberg added the comment:

I actually have a patch (still requiring a little cleanup) that makes non-ASCII 
and 1-n translations substantially faster. I've been delaying posting it largely 
because it makes significant changes to str.maketrans: it returns a special 
mapping that can be used far more efficiently than a Python dict. The effects of 
this are:

1. str.maketrans takes longer to run (about 6x as long when mappings are defined 
outside the latin-1 range), and technically its runtime is unbounded. I'm using 
perfect hashing to build a chaining-free lookup table, which involves randomly 
generating hash parameters until they produce a collision-free set of mappings; 
the number of rounds of generation is probabilistically very small (IIRC, even 
for pathological cases, any random set of parameters still has a >50% chance of 
success, so the odds of failing after more than a dozen or so attempts are 
infinitesimal)
2. The resulting object, while it obeys the contract of 
collections.abc.Mapping, is neither a dict nor mutable, which is a 
backwards-incompatible change.

Under the current design, the mapping uses ~2x the space of the old dict 
(largely because it stores the dict internally to preserve references and 
simplify basic lookups).
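A minimal sketch of the parameter-search idea described above (the function 
name and the universal hash family h(k) = ((a*k + b) % p) % size are my own 
illustration, not the patch's actual code):

```python
import random

MERSENNE = (1 << 61) - 1  # prime modulus for a universal hash family

def build_perfect_table(codepoints, max_tries=1000):
    """Randomly pick (a, b) until h(k) = ((a*k + b) % MERSENNE) % size
    places every key in its own slot -- a chaining-free lookup table."""
    size = 2 * len(codepoints)  # a sparse table keeps collision odds low
    for _ in range(max_tries):
        a = random.randrange(1, MERSENNE)
        b = random.randrange(MERSENNE)
        slots = {}
        for k in codepoints:
            idx = ((a * k + b) % MERSENNE) % size
            if idx in slots:
                break  # collision: discard these parameters and retry
            slots[idx] = k
        else:
            return a, b, size, slots  # collision-free: done
    raise RuntimeError("failed to find collision-free parameters")
```

Each attempt succeeds with decent probability, so in practice only a handful of 
rounds are needed, though (as noted above) the loop is technically unbounded.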

In exchange for the longer time to do str.maketrans and the slightly higher 
memory, it provides:

1. Roughly 15-20x faster ASCII->Unicode (and vice versa) translation
2. Similar improvements for 1-n translations (whether or not non-ASCII is 
involved)
3. In general, much more consistent translation performance: the variance based 
on the contents of the mapping and of the string is much lower, making it behave 
more like the old Py2 str.translate (and Py3 bytes.translate). Translation is 
almost always faster than any other approach, instead of being a pessimization.
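For context, the translation shapes in question are visible with the existing 
API (a small usage example, not part of the patch):

```python
# Non-ASCII keys and 1-n values -- the cases the patch targets.
table = str.maketrans({
    "é": "e",    # non-ASCII -> ASCII (diacritic stripping)
    "æ": "ae",   # 1-n translation: one character expands to two
    "ß": "ss",
})
print("résumé".translate(table))  # -> resume
print("straße".translate(table))  # -> strasse
```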

I don't know how to float fairly substantial changes to existing APIs though, 
so I'm not sure how to proceed. I'd like translation to be beneficial (the 
optimization made in #21118 didn't actually improve my use case of stripping 
diacritics to convert latin-1 and related characters to their ASCII 
equivalents), but I have no good solutions that don't mess with the API (I'd 
considered internally caching "compiled" translation tables the way the re 
module does, but the tables are mutable dicts, so caching can't be based on 
identity and can't use the dicts as keys, which makes it difficult).

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21165>
_______________________________________