Dennis Sweeney <sweeney.dennis...@gmail.com> added the comment:

> Offhand do you know what the _best_ timing for two-way search is in a 
> pattern-not-found case?

Looking at the glibc implementation, in the top-level "else" clause
(for when the string isn't completely periodic),
we have:

    period = MAX (suffix, needle_len - suffix) + 1;

so period > needle_len / 2.

Then during each iteration of the j-loop (where j is the index into the 
haystack),
j is sometimes incremented by `period`, which would probably give O(n/m) best 
case.

Here's a sublinear example for the two-way algorithm:

    N = 10**7
    haystack = "ABC" * N
    needle1 = "ABC" * (N // 10) + "DDD"
    needle2 = "ABC" * 10 + "DDD"
    
    === Results ===
    Pure Python Two-Way, shorter needle:     1.7142754
    Pure Python Two-Way. Longer needle:      0.0018845000000000667
    Python builtin, shorter needle:          0.031071400000000082
    Python builtin, longer needle:           0.030566099999999707

This case is surprisingly better than the builtin:

    N = 10**7
    haystack = "ABC" * N
    needle1 = "DDD" + "ABC" * (N // 10)
    needle2 = "DDD" + "ABC" * 10

    === Results ===
    Pure Python Two-Way, shorter needle:     0.0020895000000000774
    Pure Python Two-Way. Longer needle:      0.0017878999999999534
    Python builtin, shorter needle:          0.029998100000000028
    Python builtin, longer needle:           0.02963439999999995

This was measured using the attached fastsearch.py. There are some 
manipulations in there like string slicing that would make more sense as 
pointer math in C, so especially accounting for that, I'm guessing it would be 
pretty competitive in a lot of cases.

A lot of the implementations look like they use a complete Boyer-Moore table to 
go sublinear in even more cases, which it seems we want to avoid. I'm not 
certain, but it's conceivable that the existing techniques of using just one 
one delta-value or the "Bloom filter" could be thrown in there to get some of 
the same improvements. Although that could be redundant -- I'm not sure.

----------
Added file: https://bugs.python.org/file49507/fastsearch.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41972>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to