Tim Peters <t...@python.org> added the comment:

Dennis, I'm delighted that the timing harness pointed out an actual glitch, and 
that it was so (seemingly) straightforward to identify the algorithmic cause. 
This gives me increased confidence that this project can be pushed to adoption, 
and your name will be hallowed in Python's history :-)

I have no problem with increasing the constant space. Fredrik was Python's 
Unicode pioneer too, so was naturally repelled by any scheme that required 
additional space proportional to the alphabet size. The "Bloom filter" is the 
tiny bit left of Daniel Sunday's algorithm, which actually has no such thing 
;-) Along the lines you suggested, Sunday precomputes a vector indexed by 
characters, each entry giving the distance to that character's rightmost index 
in the needle.  The "Bloom filter" throws that vector away and only saves 
hashes of the indices where the vector holds its maximum possible value.

Note that there's nothing magical about "right to left" in Sunday's algorithm 
(or, really, in our current use of the Bloom filter). Characters can be 
compared in any order, and when there's a mismatch, the skip vector can be 
indexed by the character one beyond the search window to often find a decent 
amount to skip.  Indeed, in apps where the expected frequency of characters is 
known, Sunday's algorithm is often adapted to compare the least-frequently 
expected needle character first.

The downside isn't really the space, but that stack space is uninitialized 
trash. Initializing it to a known value increases preprocessing overhead, 
albeit independent of needle length. So the relative overhead is higher the 
shorter the needle.

I agree expanding it beyond the tiny bit vector is likely to be significantly 
helpful.  There's no discomfort at all to me if, e.g., it stored 32-bit counts 
and is indexed by the last 6 bits of the character.  That's a measly 256 bytes 
in all.

It's also possible that a more capable "Sunday-ish vector" of this kind would 
render the current `skip` trick usually useless by comparison. Or not ;-)

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41972>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to