On 6 Apr 2009, at 14:59, Glyn Darkin wrote:

Hi Glyn,

> to be able to spell check phrases
>
> E.g.
>
> "Harry Poter" is converted to "Harry Potter"
>
> We have a fixed dataset so we can build indexes/dictionaries from our
> own data.

The most obvious solution is to build your contrib/spell checker index from shingles. This will, however, probably only help with exact phrases. Perhaps that is enough for you.
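From memory, something along these lines should get you started (an untested sketch against the Lucene 2.x contrib APIs; the "title" field and the directories are made up for the example):

import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ShingleSpellIndex {
    public static void main(String[] args) throws Exception {
        Directory shingleDir = new RAMDirectory();
        Directory spellDir = new RAMDirectory();

        // Analyze titles into word shingles (up to three words) so that
        // phrases such as "harry potter" become single dictionary terms.
        IndexWriter writer = new IndexWriter(shingleDir,
                new ShingleAnalyzerWrapper(new StandardAnalyzer(), 3),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "Harry Potter",
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Feed the shingled terms to the contrib/spell checker.
        SpellChecker spellChecker = new SpellChecker(spellDir);
        IndexReader reader = IndexReader.open(shingleDir);
        spellChecker.indexDictionary(new LuceneDictionary(reader, "title"));

        // "harry poter" is now within edit distance of the shingle term
        // "harry potter", so the phrase can be suggested as a whole.
        String[] suggestions = spellChecker.suggestSimilar("harry poter", 5);
        for (String suggestion : suggestions) {
            System.out.println(suggestion);
        }
    }
}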

If your example is a real one that you came up with by analyzing query logs, then you might want to consider creating a "stemmed" index to handle the various problems associated with reading and writing disorders. Dyslexic people often miss out on vowels, people who suffer from dysgraphia have problems with q/p/d/b, others have problems with recurring characters, etc. A combination of these problems could end up in a secondary "fuzzy" index that contains weighted shingles like this for the document that points at "harry potter":

"hary poter"^0.9
"harry #otter"^0.8
"hary #oter"^0.7
"hrry pttr"^0.7
"hry ptr"^0.5

In order to get good precision/recall, your query against such an index would have to produce a boolean query containing all of the "stems" above, even if the input was spelled correctly.
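With the classic query API that could look something like this (again a hedged, untested sketch; "fuzzy" is the made-up field name from above):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class FuzzyStemQueryFactory {
    /**
     * One weighted SHOULD-clause per "stem" of the input phrase, so both
     * correctly and incorrectly spelled input can match the same document.
     */
    public static BooleanQuery create(Map<String, Float> weightedStems) {
        BooleanQuery query = new BooleanQuery();
        for (Map.Entry<String, Float> stem : weightedStems.entrySet()) {
            TermQuery clause = new TermQuery(new Term("fuzzy", stem.getKey()));
            clause.setBoost(stem.getValue());
            query.add(clause, BooleanClause.Occur.SHOULD);
        }
        return query;
    }

    public static void main(String[] args) {
        Map<String, Float> stems = new LinkedHashMap<String, Float>();
        stems.put("hary poter", 0.9f);
        stems.put("harry #otter", 0.8f);
        stems.put("hrry pttr", 0.7f);
        System.out.println(create(stems));
    }
}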


One alternative to the contrib/spell checker is Spelt (http://groups.google.com/group/spelt/), and I believe it is supposed to handle phrases.


Note the difference between spell checking and suggestion schemes. Something can be wrong even though the spelling is correct. Consider the game "Heroes of might and magic": people might have forgotten what it was called and search for "Heroes of light and magic" instead. Hopefully your query would still yield a fairly good result for the correct document if the latter was entered, but if you require all terms or something similar then it might return no hits.


More advanced strategies for contextual spell checking of phrases usually involve statistical models such as neural networks, hidden Markov models, etc. LingPipe contains such an implementation.
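To illustrate the general idea (a toy noisy-channel scorer I just made up, not LingPipe's actual API): rank candidate phrases by P(candidate) * P(observed | candidate), where the first factor comes from corpus counts and the second penalizes edit distance:

import java.util.HashMap;
import java.util.Map;

public class NoisyChannelToy {
    private final Map<String, Integer> phraseCounts; // from your corpus
    private final int totalCount;

    public NoisyChannelToy(Map<String, Integer> phraseCounts) {
        this.phraseCounts = phraseCounts;
        int sum = 0;
        for (int count : phraseCounts.values()) sum += count;
        this.totalCount = sum;
    }

    /** Source model: relative frequency of the candidate phrase. */
    private double source(String candidate) {
        Integer count = phraseCounts.get(candidate);
        return count == null ? 0.0 : (double) count / totalCount;
    }

    /** Channel model: a crude exponential penalty per edit. */
    private double channel(String observed, String candidate) {
        return Math.pow(0.1, editDistance(observed, candidate));
    }

    /** Returns the best-scoring candidate, or the input itself. */
    public String correct(String observed) {
        String best = observed;
        double bestScore = 0.0;
        for (String candidate : phraseCounts.keySet()) {
            double score = source(candidate) * channel(observed, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    /** Plain Levenshtein distance. */
    private static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + substitution);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("harry potter", 120);
        counts.put("harry potter lexicon", 3);
        System.out.println(new NoisyChannelToy(counts).correct("harry poter"));
        // prints "harry potter"
    }
}

A real system would score token by token with context rather than whole-phrase counts, which is where the hidden Markov model comes in.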


You can also take a look at reinforcement learning, learning from the mistakes and corrections made by your users. It requires a lot of data (user query logs) in order to work, but will yield very cool results, such as handling acronyms.
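In its simplest form this is just bookkeeping over query sessions. A made-up sketch of the core idea (all names are hypothetical):

import java.util.HashMap;
import java.util.Map;

public class ReformulationSuggester {
    /** original query -> (reformulated query -> frequency) */
    private final Map<String, Map<String, Integer>> counts =
            new HashMap<String, Map<String, Integer>>();

    /** Record that a session went from 'original' to 'reformulated',
     *  e.g. that the user rewrote the query and then clicked a hit. */
    public void observe(String original, String reformulated) {
        Map<String, Integer> perQuery = counts.get(original);
        if (perQuery == null) {
            perQuery = new HashMap<String, Integer>();
            counts.put(original, perQuery);
        }
        Integer count = perQuery.get(reformulated);
        perQuery.put(reformulated, count == null ? 1 : count + 1);
    }

    /** The most frequent reformulation seen for this query, or null. */
    public String suggest(String query) {
        Map<String, Integer> perQuery = counts.get(query);
        if (perQuery == null) return null;
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : perQuery.entrySet()) {
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}

Trained on enough sessions, something like this will for instance start suggesting the expansion of an acronym once users keep rewriting it that way.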

LUCENE-626 is a multi-layered spell checker with reinforcement learning at the top, backed by an a priori corpus (that can be compiled from old user queries) used to find context. It also uses a refactored version of the contrib/spell checker as a second-level suggester when there is nothing to pick up from previous user behaviour. I never deployed this in a real system; it does, however, seem to work great when trained with a few hundred thousand query sessions.


Finally, I recommend that you take some time to analyze user query sessions to find the most common problems your users have, and try to find a solution that best fits those problems. Too often, features are implemented because they are listed in a specification and not because the users need them.


I hope this helps.

     karl
