Anti-phrasing feature
---------------------
Key: SOLR-2150
URL: https://issues.apache.org/jira/browse/SOLR-2150
Project: Solr
Issue Type: New Feature
Components: SearchComponents - other
Reporter: Jan Høydahl
Add an anti-phrasing feature to Solr.
Definition: Anti-phrasing identifies word sequences in a query that contribute
little to the query's meaning, such as "Where can I find" or "Where is".
(Source: http://www.google.com/search?q=define%3Aanti+phrasing)
On general-purpose search services, such as web, intranet, or shopping search,
some users phrase their query as a question, for example "how much is an ipod
nano". One straightforward way to reduce the number of zero-hit queries in such
environments is anti-phrasing: a dictionary of common sentence prefixes is
consulted, and any matching prefix is stripped from the incoming query before
the query is passed on to search.
This can be implemented as a Search Component in Solr. The dictionary can be
language independent. We can encourage users to submit their tested
anti-phrasing dictionaries for various languages, and include those. The
dictionary can be a set of simple .txt files, loaded into memory at startup in
an efficient data structure such as a B-tree or a finite state automaton, to
avoid redundancy and ensure quick matching.

The procedure for detecting an anti-phrase in the incoming query is to first
look up the full query phrase. If there is no match, remove a word from the
end and look up again, repeating until there is either a match or no words are
left. Example for the query "Who is Einstein?", where "Who is" is defined as an
anti-phrase:
1. Lookup "Who is Einstein"
2. Lookup "Who is" (match), remove this prefix
3. Issue the query "Einstein" to search
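
The lookup procedure above can be sketched in a few lines of Python. This is
only an illustration of the idea, not the proposed Solr component: the names
tokenize, load_anti_phrases, strip_anti_phrase and ANTI_PHRASES are
hypothetical, a plain in-memory set stands in for the B-tree or automaton, and
case-insensitive matching with punctuation ignored is an assumption that the
issue text does not spell out.

```python
import re


def tokenize(text):
    """Split text into word tokens, dropping punctuation (assumed behaviour)."""
    return re.findall(r"\w+", text)


def load_anti_phrases(lines):
    """Build a set of lowercased token tuples, one anti-phrase per input line.

    In a real setup the lines would come from the .txt dictionary files.
    """
    phrases = set()
    for line in lines:
        tokens = tokenize(line.lower())
        if tokens:
            phrases.add(tuple(tokens))
    return phrases


def strip_anti_phrase(query, phrases):
    """Remove the longest matching anti-phrase prefix from the query."""
    words = tokenize(query)
    lowered = [w.lower() for w in words]
    # Look up the full query first, then drop one word from the end each round.
    for end in range(len(words), 0, -1):
        if tuple(lowered[:end]) in phrases:
            return " ".join(words[end:])
    # No anti-phrase found: pass the query through unchanged.
    return query


# Hypothetical dictionary content, taken from the examples in this issue.
ANTI_PHRASES = load_anti_phrases(
    ["Who is", "Where can I find", "Where is", "How much is"]
)

print(strip_anti_phrase("Who is Einstein?", ANTI_PHRASES))  # Einstein
```

Because the loop starts with the full query and shortens it from the end, the
longest matching prefix wins. Note that a query consisting only of an
anti-phrase would be reduced to an empty string in this sketch; a real
component would have to decide how to handle that case.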