[adding bug-gnulib]

On 06/20/2010 10:00 PM, Carlos wrote:
> Howdy -- I'm writing an article for Code Quarterly on pattern
> matching, featuring the Two-Way algorithm, which you added to libc
> and gnulib a couple of years back.
>
> There is one result of the critical factorization phase I haven't
> been able to understand, and I was hoping you could give me a clue.
> Two strings that would appear to have identical periodicity, eg
> "banana" and "zanana", result in different factorizations when I run
> them through the example implementation you cite on the libc
> bugzilla.
Thanks again for the report.  This affects gnulib's memmem, strstr,
strcasestr, and c-strcasestr modules (it would also affect memcasemem,
if we ever have any reason to implement that counterpart to
memcasecmp).

> I have reimplemented the algorithm from the description and I get the
> same result as the Java demo at
> http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
>
> Factorization:
> xl="ba", xr="nana"
> xl="z", xr="anana"

I already explained to Carlos off-list that the difference here comes
from the fact that the example code on this web page (which is where I
started when writing the gnulib implementation) includes the entire
string in the search for the maximal suffix, and that both results
(2/4 for banana and 1/5 for zanana) are the only two valid critical
factorizations for any 3-element alphabet with that 6-letter pattern
XYZYZY, even though it does indeed look odd that which of the two
factorizations is chosen depends on how X, Y, and Z sort
lexicographically.  For another example, consider "abcabc", which has
three critical factorizations where a non-empty left half still fits
within the period of the overall string (1/5, 2/4, 3/3), but sorting
lexicographic suffixes can find at most two of those factorizations.

> I'm not exactly sure what's going on here, but I wonder if it's a bug
> in the more efficient implementation. Another (likely) possibility is
> a subtlety of the algorithm that I am not understanding. Can you shed
> some light on what might be going on?

Ultimately, it is NOT a behavioral bug.  But it IS an optimization
bug.  For any needle of length 1 or 2, we don't need to do any
comparisons to determine a critical factorization (for "", 0/0 is
critical, but empty needles are already special-cased; for "a", both
0/1 and 1/0 are critical, but only 0/1 plays well with the rest of the
code, which assumes a non-empty suffix; for "aa", both 0/2 and 1/1 are
critical; and for "ab", only 1/1 is critical).  Needles of length 0
are already forbidden by the documented two_way_* interface.  And
anything of length 3 or longer can safely ignore the suffix formed by
the entire needle itself (the 0/n factorization always has a local
period of 1, and can only be critical if the entire needle consists of
the same byte; but in that case, the 1/n-1 factorization is also
critical).  Therefore, instead of checking the entire needle for a
maximal suffix (that is, starting with a comparison of x[0] and x[1]),
we can instead start with only reduced-length suffixes (that is, start
with a comparison of x[1] and x[2]), for one less comparison and a
slightly faster factorization.
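To make that concrete, here is a minimal sketch of the factorization
step, following the classic presentation on that web page rather than
the actual gnulib code (the function name and the demo strings in
main() are mine, purely for illustration).  It computes the maximal
suffix under the normal and the inverted alphabet ordering, then keeps
whichever suffix starts later:

#include <stdio.h>
#include <string.h>

/* Return the index of the last byte of the left half; the maximal
   suffix under the chosen ordering starts at the next byte.  *PERIOD
   receives the period of that suffix.  REVERSED selects the inverted
   alphabet ordering.  */
static int
maximal_suffix (const char *x, int m, int *period, int reversed)
{
  int ms = -1;   /* index just before the best suffix found so far */
  int j = 0;     /* index just before the candidate suffix */
  int k = 1;     /* offset of the bytes being compared */
  int p = 1;     /* period of the best suffix so far */

  while (j + k < m)
    {
      int cmp = reversed ? x[ms + k] - x[j + k] : x[j + k] - x[ms + k];
      if (cmp > 0)
        {
          /* Candidate is larger: it becomes the new best suffix.  */
          ms = j;
          j = ms + 1;
          k = p = 1;
        }
      else if (cmp < 0)
        {
          /* Candidate is smaller: restart it past the mismatch.  */
          j += k;
          k = 1;
          p = j - ms;
        }
      else
        {
          /* Equal bytes: keep walking through the current period.  */
          if (k != p)
            ++k;
          else
            {
              j += p;
              k = 1;
            }
        }
    }
  *period = p;
  return ms;
}

int
main (void)
{
  const char *needles[] = { "banana", "zanana", "abcabc" };
  for (int n = 0; n < 3; n++)
    {
      const char *x = needles[n];
      int m = (int) strlen (x);
      int p, q;
      int i = maximal_suffix (x, m, &p, 0);
      int j = maximal_suffix (x, m, &q, 1);
      int ell = i > j ? i : j;        /* last index of the left half */
      printf ("%s: xl=\"%.*s\", xr=\"%s\" (period of xr: %d)\n",
              x, ell + 1, x, x + ell + 1, i > j ? p : q);
    }
  return 0;
}

This prints xl="ba"/xr="nana" for banana, xl="z"/xr="anana" for
zanana, and xl="abc"/xr="abc" for abcabc.  In terms of this sketch,
the tweak described above would amount to initializing ms = 0 and
j = 1 for needles of length 3 or more, so that the whole-needle suffix
is never a candidate and the first comparison is x[1] against x[2]
(with that change, "zanana" would come out as 2/4 instead, which is
still one of the two valid factorizations).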
Ultimately, I implemented my enhanced string search code for gnulib
first, then ported it to glibc later, so I will be posting a patch
soon, then filing a glibc bug to get it folded in there.

Meanwhile, glibc bug 11607 already complains that the time spent on
factorization is extremely noticeable for short strings:

http://sourceware.org/bugzilla/show_bug.cgi?id=11607

I have not benchmarked things myself (partly because I don't know how
cache-line effects would play into a benchmark), but I do wonder
whether tweaking the heuristics to use the known-quadratic naive
algorithm for short needles would end up averaging less work than what
we currently spend on factorization.  For a worst-case needle like
"aaaa", the naive algorithm has no initialization cost, but ends up
making almost 4 comparisons per byte of the haystack, while more
common needles like "abcd" easily approach one comparison per haystack
byte.  The two-way algorithm guarantees no more than two comparisons
per haystack byte, but also costs 8 comparisons and several
conditional jumps up front for any 4-byte needle.  So, given that
probability favors a non-periodic needle, at what needle length does
the two-way algorithm save rather than cost time on average use?  I'm
also thinking that 4- or 8-byte needles might be able to take
advantage of SSE-type vector operations for some assembly-based
speedups to naive searches, when the entire needle fits in
special-purpose large registers.
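To give a feel for those numbers, here is a rough sketch of the naive
search with a comparison counter; it is not glibc code, and the
all-'a' haystack and the needles are made up purely for illustration
(I use "aaab" rather than "aaaa" so that the search never succeeds and
every alignment pays the full cost):

#include <stdio.h>
#include <string.h>

static long comparisons;

/* Try every alignment in turn; no preprocessing, O(n*m) worst case.  */
static const char *
naive_search (const char *hay, size_t n, const char *needle, size_t m)
{
  for (size_t i = 0; i + m <= n; i++)
    {
      size_t j = 0;
      while (j < m)
        {
          comparisons++;
          if (hay[i + j] != needle[j])
            break;
          j++;
        }
      if (j == m)
        return hay + i;             /* match found */
    }
  return NULL;                      /* no match */
}

static void
report (const char *hay, size_t n, const char *needle)
{
  comparisons = 0;
  naive_search (hay, n, needle, strlen (needle));
  printf ("\"%s\": %.2f comparisons per haystack byte\n",
          needle, (double) comparisons / n);
}

int
main (void)
{
  static char hay[4096];
  memset (hay, 'a', sizeof hay - 1);  /* haystack of 4095 'a' bytes */
  size_t n = sizeof hay - 1;

  report (hay, n, "aaab");  /* periodic, almost matching: ~4 per byte */
  report (hay, n, "bcde");  /* first byte usually mismatches: ~1 per byte */
  return 0;
}

On this input it reports roughly 4 comparisons per haystack byte for
the periodic needle and 1 for the needle whose first byte usually
mismatches, which is the gap a heuristic would have to weigh against
the constant up-front factorization cost of two-way.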
> Thank you!
> Carlos Bueno
> Facebook Engineering
> c...@facebook.com

--
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org