[adding bug-gnulib]

On 06/20/2010 10:00 PM, Carlos wrote:
> Howdy -- I'm writing an article for Code Quarterly on pattern
> matching, featuring the Two-Way algorithm, which you added to libc
> and gnulib a couple of years back.
>
> There is one result of the critical factorization phase I haven't
> been able to understand, and I was hoping you could give me a clue.
> Two strings that would appear to have identical periodicity, eg
> "banana" and "zanana", result in different factorizations when I run
> them through the example implementation you cite on the libc
> bugzilla.
Thanks again for the report.  This affects gnulib's memmem, strstr,
strcasestr, and c-strcasestr modules (it would also affect memcasemem,
if we ever have any reason to implement that counterpart to
memcasecmp).

> I have reimplemented the algorithm from the description and I get the
> same result as the Java demo at
> http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
>
> Factorization:
> xl="ba", xr="nana"
> xl="z", xr="anana"

I already explained to Carlos off-list that the difference here comes
from the fact that the example code on this web page (which is where I
started when writing the gnulib implementation) includes the entire
string in the search for the maximal suffix, and that both results
(2/4 for banana and 1/5 for zanana) are the only two valid critical
factorizations for any 3-element alphabet with that 6-letter pattern
XYZYZY, even though it does indeed look odd that which of the two
factorizations is chosen depends on how X, Y, and Z sort
lexicographically.  For another example, consider "abcabc", which has
three critical factorizations where a non-empty left half still fits
within the period of the overall string (1/5, 2/4, 3/3), but sorting
lexicographic suffixes can find at most two of those factorizations.

> I'm not exactly sure what's going on here, but I wonder if it's a bug
> in the more efficient implementation. Another (likely) possibility is
> a subtlety of the algorithm that I am not understanding. Can you shed
> some light on what might be going on?

Ultimately, it is NOT a behavioral bug.  But it IS an optimization
bug.  For any needle of length 1 or 2, we don't need to do any
comparisons to determine a critical factorization (for "", 0/0 is
critical, but empty needles are already special-cased; for "a", both
0/1 and 1/0 are critical, but only 0/1 plays well with the rest of the
code, which assumes a non-empty suffix; for "aa", both 0/2 and 1/1 are
critical; and for "ab", only 1/1 is critical).  Needles of length 0
are already forbidden by the documented two_way_* interface.  And
anything of length 3 or longer can safely ignore the suffix formed by
the entire needle itself (the 0/n factorization always has a local
period of 1, and can only be critical if the entire needle consists of
the same byte; but in that case, the 1/n-1 factorization is also
critical).  Therefore, instead of checking the entire needle for a
maximal suffix (that is, starting with a comparison of x[0] and x[1]),
we can instead start with only reduced-length suffixes (that is, start
with a comparison of x[1] and x[2]), for one less comparison and a
slightly faster factorization.
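To make that concrete, here is a minimal sketch of the factorization
step, following the classic presentation on that web page rather than
the actual gnulib code (the function name and the demo strings in
main() are mine, purely for illustration).  It computes the maximal
suffix under the normal and the inverted alphabet ordering, then keeps
whichever suffix starts later:

#include <stdio.h>
#include <string.h>

/* Return the index of the last byte of the left half; the maximal
   suffix under the chosen ordering starts at the next byte.  *PERIOD
   receives the period of that suffix.  REVERSED selects the inverted
   alphabet ordering.  */
static int
maximal_suffix (const char *x, int m, int *period, int reversed)
{
  int ms = -1;   /* index just before the best suffix found so far */
  int j = 0;     /* index just before the candidate suffix */
  int k = 1;     /* offset of the bytes being compared */
  int p = 1;     /* period of the best suffix so far */

  while (j + k < m)
    {
      int cmp = reversed ? x[ms + k] - x[j + k] : x[j + k] - x[ms + k];
      if (cmp > 0)
        {
          /* Candidate is larger: it becomes the new best suffix.  */
          ms = j;
          j = ms + 1;
          k = p = 1;
        }
      else if (cmp < 0)
        {
          /* Candidate is smaller: restart it past the mismatch.  */
          j += k;
          k = 1;
          p = j - ms;
        }
      else
        {
          /* Equal bytes: keep walking through the current period.  */
          if (k != p)
            ++k;
          else
            {
              j += p;
              k = 1;
            }
        }
    }
  *period = p;
  return ms;
}

int
main (void)
{
  const char *needles[] = { "banana", "zanana", "abcabc" };
  for (int n = 0; n < 3; n++)
    {
      const char *x = needles[n];
      int m = (int) strlen (x);
      int p, q;
      int i = maximal_suffix (x, m, &p, 0);
      int j = maximal_suffix (x, m, &q, 1);
      int ell = i > j ? i : j;        /* last index of the left half */
      printf ("%s: xl=\"%.*s\", xr=\"%s\" (period of xr: %d)\n",
              x, ell + 1, x, x + ell + 1, i > j ? p : q);
    }
  return 0;
}

This prints xl="ba"/xr="nana" for banana, xl="z"/xr="anana" for
zanana, and xl="abc"/xr="abc" for abcabc.  In terms of this sketch,
the tweak described above would amount to initializing ms = 0 and
j = 1 for needles of length 3 or more, so that the whole-needle suffix
is never a candidate and the first comparison is x[1] against x[2]
(with that change, "zanana" would come out as 2/4 instead, which is
still one of the two valid factorizations).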
Ultimately, I implemented my enhanced string search code for gnulib
first, then ported it to glibc later, so I will be posting a patch
soon, then filing a glibc bug to get it folded in there.

Meanwhile, glibc bug 11607 already complains that the time spent on
factorization is extremely noticeable for short strings:

http://sourceware.org/bugzilla/show_bug.cgi?id=11607

I have not benchmarked things myself (partly because I don't know how
cache-line effects would play into a benchmark), but I do wonder
whether tweaking the heuristics to use the known-quadratic naive
algorithm for short needles would end up averaging less work than what
we currently spend on factorization.  For a worst-case needle like
"aaaa", the naive algorithm has no initialization cost, but ends up
making almost 4 comparisons per byte of the haystack, while more
common needles like "abcd" easily approach one comparison per haystack
byte.  The two-way algorithm guarantees no more than two comparisons
per haystack byte, but also costs 8 comparisons and several
conditional jumps up front for any 4-byte needle.  So, given that
probability favors a non-periodic needle, at what needle length does
the two-way algorithm save rather than cost time on average use?  I'm
also thinking that 4- or 8-byte needles might be able to take
advantage of SSE-type vector operations for some assembly-based
speedups to naive searches, when the entire needle fits in
special-purpose large registers.
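To give a feel for those numbers, here is a rough sketch of the naive
search with a comparison counter; it is not glibc code, and the
all-'a' haystack and the needles are made up purely for illustration
(I use "aaab" rather than "aaaa" so that the search never succeeds and
every alignment pays the full cost):

#include <stdio.h>
#include <string.h>

static long comparisons;

/* Try every alignment in turn; no preprocessing, O(n*m) worst case.  */
static const char *
naive_search (const char *hay, size_t n, const char *needle, size_t m)
{
  for (size_t i = 0; i + m <= n; i++)
    {
      size_t j = 0;
      while (j < m)
        {
          comparisons++;
          if (hay[i + j] != needle[j])
            break;
          j++;
        }
      if (j == m)
        return hay + i;             /* match found */
    }
  return NULL;                      /* no match */
}

static void
report (const char *hay, size_t n, const char *needle)
{
  comparisons = 0;
  naive_search (hay, n, needle, strlen (needle));
  printf ("\"%s\": %.2f comparisons per haystack byte\n",
          needle, (double) comparisons / n);
}

int
main (void)
{
  static char hay[4096];
  memset (hay, 'a', sizeof hay - 1);  /* haystack of 4095 'a' bytes */
  size_t n = sizeof hay - 1;

  report (hay, n, "aaab");  /* periodic, almost matching: ~4 per byte */
  report (hay, n, "bcde");  /* first byte usually mismatches: ~1 per byte */
  return 0;
}

On this input it reports roughly 4 comparisons per haystack byte for
the periodic needle and 1 for the needle whose first byte usually
mismatches, which is the gap a heuristic would have to weigh against
the constant up-front factorization cost of two-way.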
> Thank you!
> Carlos Bueno
> Facebook Engineering
> c...@facebook.com

--
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org