Stephen,

I didn't actually say anything about Boyer-Moore looking at the ends of words. The issue with suffixing languages isn't caused by the algorithm's interaction with the document being searched. It is caused by the search patterns themselves. In other words--it's not because the algorithm skips from word to word in the document--it's because search patterns will predominantly be words and in English they are often suffixed.

Your summary of the algorithm itself is helpful, but I still wonder at why the abstract itself specifically cites English data--maybe that's not relevant, but I wonder why it was cited in the abstract if that's so.

My statement that I couldn't understand how this speeds search times reflects that suffixing actually decreases the variety of characters you will find towards the ends of English words. Cross-linguistically, you tend to find the richest set of sounds (and hence characters) in the roots of words, so I would expect that starting with the n/2 character of a string of n characters would probably yield better search times overall, using a Boyer-Moore-like algorithm.

But to address the real issue here: there are about 20 million things to do for Sword that are more important than changing the unindexed search algorithm. We already addressed search times by adding Lucene.

--Chris

Stephen Denne wrote:
Chris,

The Boyer-Moore algorithm is only language specific in that it works better
in languages that have larger alphabets, and consequently less chance of a
single character matching a searched for character.

It is a general purpose search algorithm.

It does not look at the end of each word, that is a misunderstanding of the
algorithm. It is not based on word suffixes.

Basically, if it is looking for a 10 character long sequence, it checks the
10th character, the 20th character, the 30th character, the 40th character,
etc, until it finds a character that is in the 10 character long sequence it
is looking for. If it isn't in there, then there was no point checking any
of the previous 9 characters.

Stephen Denne.

_______________________________________________ sword-devel mailing list sword-devel@crosswire.org http://www.crosswire.org/mailman/listinfo/sword-devel

Reply via email to