On 6 Apr 2009, at 14:59, Glyn Darkin wrote:

Hi Glyn,

> to be able to spell check phrases
>
> E.g.
>
> "Harry Poter" is converted to "Harry Potter"
>
> We have a fixed dataset so we can build indexes/dictionaries from our
> own data.

The most obvious solution is to build your contrib/spell checker index from shingles. This will, however, probably only help with exact phrases. Perhaps that is enough for you.
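From memory, something along these lines should get you started (an untested sketch against the Lucene 2.x contrib APIs; the "title" field and the directories are made up for the example):

import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ShingleSpellIndex {
    public static void main(String[] args) throws Exception {
        Directory shingleDir = new RAMDirectory();
        Directory spellDir = new RAMDirectory();

        // Analyze titles into word shingles (up to three words) so that
        // phrases such as "harry potter" become single dictionary terms.
        IndexWriter writer = new IndexWriter(shingleDir,
                new ShingleAnalyzerWrapper(new StandardAnalyzer(), 3),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "Harry Potter",
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Feed the shingled terms to the contrib/spell checker.
        SpellChecker spellChecker = new SpellChecker(spellDir);
        IndexReader reader = IndexReader.open(shingleDir);
        spellChecker.indexDictionary(new LuceneDictionary(reader, "title"));

        // "harry poter" is now within edit distance of the shingle term
        // "harry potter", so the phrase can be suggested as a whole.
        String[] suggestions = spellChecker.suggestSimilar("harry poter", 5);
        for (String suggestion : suggestions) {
            System.out.println(suggestion);
        }
    }
}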

If your example is a real one that you came up with by analyzing query logs, then you might want to consider creating a "stemmed" index to handle the various problems associated with reading and writing disorders. Dyslexic people often miss out on vowels, people who suffer from dysgraphia have problems with q/p/d/b, others have problems with recurring characters, etc. A combination of these problems could end up in a secondary "fuzzy" index that contains weighted shingles like this for the document that points at "harry potter":

"hary poter"^0.9
"harry #otter"^0.8
"hary #oter"^0.7
"hrry pttr"^0.7
"hry ptr"^0.5

In order to get good precision/recall, your query against such an index would have to produce a boolean query containing all of the "stems" above, even if the input was spelled correctly.
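With the classic query API that could look something like this (again a hedged, untested sketch; "fuzzy" is the made-up field name from above):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class FuzzyStemQueryFactory {
    /**
     * One weighted SHOULD-clause per "stem" of the input phrase, so both
     * correctly and incorrectly spelled input can match the same document.
     */
    public static BooleanQuery create(Map<String, Float> weightedStems) {
        BooleanQuery query = new BooleanQuery();
        for (Map.Entry<String, Float> stem : weightedStems.entrySet()) {
            TermQuery clause = new TermQuery(new Term("fuzzy", stem.getKey()));
            clause.setBoost(stem.getValue());
            query.add(clause, BooleanClause.Occur.SHOULD);
        }
        return query;
    }

    public static void main(String[] args) {
        Map<String, Float> stems = new LinkedHashMap<String, Float>();
        stems.put("hary poter", 0.9f);
        stems.put("harry #otter", 0.8f);
        stems.put("hrry pttr", 0.7f);
        System.out.println(create(stems));
    }
}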


One alternative to the contrib/spell checker is Spelt (http://groups.google.com/group/spelt/), and I believe it is supposed to handle phrases.


Note the difference between spell checking and suggestion schemes. Something can be wrong even though the spelling is correct. Consider the game "Heroes of might and magic": people might have forgotten what it was called and search for "Heroes of light and magic" instead. Hopefully your query would still yield a fairly good result for the correct document if the latter was entered, but if you require all terms or something similar then it might return no hits.


More advanced strategies for contextual spell checking of phrases usually involve statistical models such as neural networks, hidden Markov models, etc. LingPipe contains such an implementation.
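To illustrate the general idea (a toy noisy-channel scorer I just made up, not LingPipe's actual API): rank candidate phrases by P(candidate) * P(observed | candidate), where the first factor comes from corpus counts and the second penalizes edit distance:

import java.util.HashMap;
import java.util.Map;

public class NoisyChannelToy {
    private final Map<String, Integer> phraseCounts; // from your corpus
    private final int totalCount;

    public NoisyChannelToy(Map<String, Integer> phraseCounts) {
        this.phraseCounts = phraseCounts;
        int sum = 0;
        for (int count : phraseCounts.values()) sum += count;
        this.totalCount = sum;
    }

    /** Source model: relative frequency of the candidate phrase. */
    private double source(String candidate) {
        Integer count = phraseCounts.get(candidate);
        return count == null ? 0.0 : (double) count / totalCount;
    }

    /** Channel model: a crude exponential penalty per edit. */
    private double channel(String observed, String candidate) {
        return Math.pow(0.1, editDistance(observed, candidate));
    }

    /** Returns the best-scoring candidate, or the input itself. */
    public String correct(String observed) {
        String best = observed;
        double bestScore = 0.0;
        for (String candidate : phraseCounts.keySet()) {
            double score = source(candidate) * channel(observed, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    /** Plain Levenshtein distance. */
    private static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + substitution);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("harry potter", 120);
        counts.put("harry potter lexicon", 3);
        System.out.println(new NoisyChannelToy(counts).correct("harry poter"));
        // prints "harry potter"
    }
}

A real system would score token by token with context rather than whole-phrase counts, which is where the hidden Markov model comes in.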


You can also take a look at reinforcement learning, learning from the mistakes and corrections made by your users. It requires a lot of data (user query logs) in order to work, but will yield very cool results, such as handling acronyms.
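In its simplest form this is just bookkeeping over query sessions. A made-up sketch of the core idea (all names are hypothetical):

import java.util.HashMap;
import java.util.Map;

public class ReformulationSuggester {
    /** original query -> (reformulated query -> frequency) */
    private final Map<String, Map<String, Integer>> counts =
            new HashMap<String, Map<String, Integer>>();

    /** Record that a session went from 'original' to 'reformulated',
     *  e.g. that the user rewrote the query and then clicked a hit. */
    public void observe(String original, String reformulated) {
        Map<String, Integer> perQuery = counts.get(original);
        if (perQuery == null) {
            perQuery = new HashMap<String, Integer>();
            counts.put(original, perQuery);
        }
        Integer count = perQuery.get(reformulated);
        perQuery.put(reformulated, count == null ? 1 : count + 1);
    }

    /** The most frequent reformulation seen for this query, or null. */
    public String suggest(String query) {
        Map<String, Integer> perQuery = counts.get(query);
        if (perQuery == null) return null;
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : perQuery.entrySet()) {
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}

Trained on enough sessions, something like this will for instance start suggesting the expansion of an acronym once users keep rewriting it that way.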

LUCENE-626 is a multi-layered spell checker with reinforcement learning at the top, backed by an a priori corpus (that can be compiled from old user queries) used to find context. It also uses a refactored version of the contrib/spell checker as a second-level suggester when there is nothing to pick up from previous user behaviour. I never deployed this in a real system; it does, however, seem to work great when trained with a few hundred thousand query sessions.


Finally, I recommend that you take some time to analyze user query sessions to find the most common problems your users have, and try to find a solution that best fits those problems. Too often, features are implemented because they are listed in a specification and not because the users need them.


I hope this helps.

     karl
