Dennis Sweeney <sweeney.dennis...@gmail.com> added the comment:

In "Fast String Searching" (1991), A. Hume and D. Sunday emphasize that most of 
the runtime of string-search algorithms occurs in a code path that skips over 
immediately-disqualified alignments. As such, that paper recommends extracting 
a hot "Horspool" code path and keeping it small. For that reason, I'm posting a 
PR which boils down that fast path to the following:

    while (skip > 0 && window_last < haystack_end) {
        window_last += skip;
        skip = table[(*window_last) & 63];
    }

This uses two memory accesses, whereas the last implementation used three (the 
extra to get window[cut]). This requires the skip table to change slightly, 
since it is indexed by the last character in the current haystack window rather 
than one character later ("fast" versus "sd1" in the paper). The paper also 
recommends unrolling this loop 3x, but my benchmarks found no benefit to 
unrolling.

The PR also refactors some of the main fastsearch code into separate functions, 
and computes and then uses a similar gap/skip integer used already in the 
default implementation (Hume and Sunday call this "md2"). It retains a 
linear-time constant space worst-case with the Two-Way algorithm.

There are a bunch of benchmarks here, both on Zipf-distributed random strings 
and on a few c/rst/python/binary files in the cpython repo. They compare the 
current main branch to the PR:

    https://gist.github.com/sweeneyde/fc20ed5e72dc6b0f3b41c0abaf4ec3be

Summary of those results:

On the Zipf strings:
    * 83 cases slower (at most 1.59x slower)
    * 654 cases faster (up to 4.49x faster)
    * Geometric mean: 1.10x faster

On the "Real world" source files:
    * 76 cases slower (at most 2.41x slower)
    * 420 cases faster (up to 18.12x faster)
    * Geometric mean: 1.33x faster

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41972>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to