Hello everybody, I have done small comparison of the 3 algorithms: the original zend_memnstr implementation, KMP, and Boyer-Moore (inefficient, "book" implementation).
The first part - real life data. I have collected the data about the parameters passed to ZEND_MEMNSTR (not only while it is called from strpos) on my own server. Data is collected over 1h ~ 2 000 000 calls. (if somebody thinks how -> "the hard way" just fwrite() added to zend_memnstr to save in file data about the call) Here are some statistics: - average haystack length: 624.2 - average needle length: 1.9 ! -> 63% of needles of length 1 - avg length of haystacks shorter than avg: 41.0 -> 85% of all haystacks - avg length of haystacks longer than avg: 5685.11 I did several tests - both "human readable" and artificial texts and here are (some) results: In case of searching of words in text (piece of book copied into test bench): For short haystacks (<1000) the original implementation is actually the best one. KMP is slower due to preprocessing time and some overhead in the loop. BM is 10-30 times slower, thus not really applicable. In case of artificial texts (like search of params in kind of config - taken real life example from Joomla) KMP and original implementation are almost the same (the latter a little bit faster) BM is still much slower some results (times in seconds) - only 5 shown. If somebody wants to play with it - I can post small program with implementation of all 3 algorithms) ORIGINAL ------------------- test 1: 2.450000 test 2: 5.380000 test 3: 3.890000 test 4: 2.870000 test 5: 11.840000 KMP ------------------- test 1: 2.910000 test 2: 11.370000 test 3: 7.000000 test 4: 3.970000 test 5: 11.660000 BM ------------------- test 1: 35.030000 test 2: 35.460000 test 3: 8.570000 test 4: 30.500000 test 5: 31.860000 For longer text (<5000) KMP and original zend_memnstr are basically similar, BM starts to show it is better, the real fun is when haystack is loner than 5000 chars - than suddenly BM starts to be really good. Some examples for texts longer than 3000 chars ORIGINAL ------------------- test 1: 0.110000 <- long text with, strpos = 0 test 2: 97.620000 <- really long text (~10000chars, match ~8000) test 3:4400, 17.220000 <- match ~4400 test 4:0, 157.990000 <- degraded long text test 5:0, 99.920000 KMP ------------------- test 1:0, 0.550000 test 2:8596, 36.670000 test 3:4400, 50.610000 test 4:0, 122.000000 test 5:0, 93.800000 BM ------------------- test 1:0 28.690000 test 2:8596, 23.230000 test 3:4400, 18.420000 test 4:0 38.890000 test 5:0, 157.920000 <- this one is strange! As I haven't taken into account the information about haystack/needle KMP seemed to be good choice. But now think about 63% of calls of ZEND_MEMNSTR with needle_len = 1!! I think it is the main place for optimization. Although strpos implements fix for that, some other functions don't. My idea is than to implement ZEND_MEMNSTR once again in shape: if (needle_len = 1) here just linear sweep else if haystack_len < 5000 (5000 is arbitrary - maybe some more tests needed to choose good value) original implementation (as it is the best one in this case) else BM/KMP (i think BM will be better in this case, as some people suggested) Then cleanup all the functions that use zend_memnstr, as they can be this way simpler. I did similar test for strripos - in this case KMP is good enough (the original implementation ALWAYS run in time o(nm) - it is just memcmp at eachposition no matter if the first chars match, thus is outperformed by KMP even for short strings). Implementing BM here is not necessary in my opinion - the match is usually quite close to the end of haystack (average ~80% of haystack length). But fix for needle_len = 1 should be left as here also many searches are with short needles. Do you think it is good idea to rewrite it this way? Cheers, Michal