On 6 Apr 2009, at 14:59, Glyn Darkin wrote:
Hi Glyn,
> to be able to spell check phrases,
> e.g. "Harry Poter" is converted to "Harry Potter".
> We have a fixed dataset so we can build indexes/dictionaries from our
> own data.
the most obvious solution is to index your contrib/spell checker with
shingles. This will, however, probably only help with exact phrases.
Perhaps that is enough for you.
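
As a rough sketch of that approach (using the 2.4-era contrib classes; the
field name "titleShingles" is my own invention and is assumed to be a field
in your main index analyzed with ShingleAnalyzerWrapper/ShingleFilter, so
that its terms are word n-grams such as "harry potter"):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;

    public class PhraseSpellIndexer {
      // feeds every term of the shingled field to the spell checker, so that
      // whole phrases such as "harry potter" become dictionary entries
      public static SpellChecker buildPhraseDictionary(
          IndexReader reader, Directory spellIndexDirectory) throws Exception {
        SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
        spellChecker.indexDictionary(new LuceneDictionary(reader, "titleShingles"));
        return spellChecker;
      }
    }

After that, spellChecker.suggestSimilar("harry poter", 5) should contain
"harry potter", provided that shingle occurs in your data.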
If your example is a real one that you came up with by analyzing query
logs, then you might want to consider creating a "stemmed" index to
handle various problems associated with reading and writing disorders.
Dyslexic people often miss out on vowels, people who suffer from
dysgraphia have problems with q/p/d/b, others have problems with
recurring characters, etc. A combination of these problems could be
captured in a secondary "fuzzy" index that contains weighted shingles
like these for the document that points at "harry potter":
"hary poter"^0.9
"harry #otter"^0.8
"hary #oter"^0.7
"hrry pttr"^0.7
"hry ptr"^0.5
In order to get good precision/recall, your query against such an index
would have to produce a boolean query containing all of the "stems"
above, even if the input was spelled correctly.
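
As a toy illustration of that expansion (the field name "fuzzyShingles" is
made up, and the two transformation rules are just examples of the
degradations listed above, written against the 2.4-era query API):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class FuzzyStemQueryFactory {
      public static BooleanQuery expand(String phrase) {
        BooleanQuery query = new BooleanQuery();
        // the phrase as the user entered it
        addStem(query, phrase);
        // doubled characters collapsed: "harry potter" -> "hary poter"
        addStem(query, phrase.replaceAll("(.)\\1+", "$1"));
        // non-leading vowels stripped: "harry potter" -> "hrry pttr"
        addStem(query, phrase.replaceAll("(?<=\\w)[aeiou]", ""));
        return query;
      }

      private static void addStem(BooleanQuery query, String stem) {
        query.add(new TermQuery(new Term("fuzzyShingles", stem)),
            BooleanClause.Occur.SHOULD);
      }
    }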
One alternative to the contrib/spell checker is Spelt: http://groups.google.com/group/spelt/
and I believe it is supposed to handle phrases.
Note the difference between spell checking and suggestion schemes.
Something can be wrong even though the spelling is correct. Consider
the game "Heroes of might and magic", people might have fogotten what
it was called and search for "Heroes of light and magic" instead.
Hopefully your query would still yield a fairly good result for the
correct document if the latter was entered, but if you require all
terms or something similar then it might return no hits.
More advanced strategies for contextual spell checking of phrases
usually involve statistical models such as neural networks, hidden
Markov models, etc. LingPipe contains such an implementation.
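
To give a feel for the statistical part without tying it to any particular
library, here is a toy bigram language model that you could train on your
corpus or query logs and use to score candidate corrections; a correctly
spelled but contextually wrong phrase such as "heroes of light and magic"
would then score lower than "heroes of might and magic":

    import java.util.HashMap;
    import java.util.Map;

    public class BigramPhraseScorer {

      private final Map<String, Integer> bigramCounts = new HashMap<String, Integer>();
      private final Map<String, Integer> unigramCounts = new HashMap<String, Integer>();

      // train on one phrase or title from the corpus (or from query logs)
      public void train(String phrase) {
        String[] tokens = phrase.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
          increment(unigramCounts, tokens[i]);
          if (i > 0) {
            increment(bigramCounts, tokens[i - 1] + " " + tokens[i]);
          }
        }
      }

      // a higher score means the candidate is more plausible given the corpus
      public double score(String candidate) {
        String[] tokens = candidate.toLowerCase().split("\\s+");
        double logProbability = 0;
        for (int i = 1; i < tokens.length; i++) {
          int bigram = count(bigramCounts, tokens[i - 1] + " " + tokens[i]);
          int unigram = count(unigramCounts, tokens[i - 1]);
          // add-one smoothing so an unseen bigram does not zero out the phrase
          logProbability += Math.log((bigram + 1.0) / (unigram + unigramCounts.size() + 1.0));
        }
        return logProbability;
      }

      private static void increment(Map<String, Integer> counts, String key) {
        Integer seen = counts.get(key);
        counts.put(key, seen == null ? 1 : seen + 1);
      }

      private static int count(Map<String, Integer> counts, String key) {
        Integer seen = counts.get(key);
        return seen == null ? 0 : seen;
      }
    }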
You can also take a look at reinforcement learning, i.e. learning from
the mistakes and corrections made by your users. It requires a lot of
data (user query logs) in order to work, but it will yield very cool
results, such as handling acronyms.
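
A simple way to harvest such training data (the class and field names below
are made up for illustration) is to treat a query that was quickly
reformulated into another query that resulted in a click as a
mistake/correction pair:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SessionCorrectionMiner {

      // one logged query within a user session
      public static class LoggedQuery {
        final String query;
        final boolean resultClicked;
        public LoggedQuery(String query, boolean resultClicked) {
          this.query = query;
          this.resultClicked = resultClicked;
        }
      }

      // correction -> (mistake -> number of times observed)
      private final Map<String, Map<String, Integer>> corrections =
          new HashMap<String, Map<String, Integer>>();

      public void observeSession(List<LoggedQuery> session) {
        for (int i = 1; i < session.size(); i++) {
          LoggedQuery previous = session.get(i - 1);
          LoggedQuery current = session.get(i);
          // a reformulation that finally got a click is weak evidence that
          // previous.query should be corrected to current.query
          if (!previous.resultClicked && current.resultClicked
              && !previous.query.equals(current.query)) {
            Map<String, Integer> byMistake = corrections.get(current.query);
            if (byMistake == null) {
              byMistake = new HashMap<String, Integer>();
              corrections.put(current.query, byMistake);
            }
            Integer seen = byMistake.get(previous.query);
            byMistake.put(previous.query, seen == null ? 1 : seen + 1);
          }
        }
      }
    }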
LUCENE-626 is a multi-layered spell checker with reinforcement
learning at the top, backed by an a priori corpus (which can be
compiled from old user queries) used to find context. It also uses a
refactored version of the contrib/spell checker as a second-level
suggester when there is nothing to pick up from previous user
behaviour. I never deployed this in a real system, but it does seem to
work great when trained with a few hundred thousand query sessions.
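
The layering itself is little more than a fallback chain. A hedged sketch
(the interfaces are invented for illustration, this is not the actual
LUCENE-626 code):

    import org.apache.lucene.search.spell.SpellChecker;

    public class LayeredSuggester {

      // hypothetical lookup backed by corrections mined from query logs
      public interface QueryLogSuggester {
        String suggest(String query);
      }

      private final QueryLogSuggester queryLogLayer;
      private final SpellChecker dictionaryLayer;

      public LayeredSuggester(QueryLogSuggester queryLogLayer, SpellChecker dictionaryLayer) {
        this.queryLogLayer = queryLogLayer;
        this.dictionaryLayer = dictionaryLayer;
      }

      public String suggest(String query) throws Exception {
        // first layer: corrections learned from previous user behaviour
        String learned = queryLogLayer.suggest(query);
        if (learned != null) {
          return learned;
        }
        // second layer: plain dictionary-based suggestion from contrib
        String[] suggestions = dictionaryLayer.suggestSimilar(query, 1);
        return suggestions.length > 0 ? suggestions[0] : null;
      }
    }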
Finally, I recommend that you take some time to analyze user query
sessions to find the most common problems your users have, and try to
find a solution that best fits those problems. Too often features are
implemented because they are listed in a specification and not because
the users need them.
I hope this helps.
karl