[
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436045#comment-13436045
]
Jack Krupansky commented on SOLR-3589:
--------------------------------------
bq. If I use a smart Chinese tokenizer to split up a Chinese sentence into
words, why can't the query parser treat those words exactly the same way it
treats words from an English sentence?
Indexing of whole documents can in fact treat text as if it were words from an
English sentence, and split tokens do in fact behave as such in that context,
but a query is not an English sentence, or indeed a sentence in any natural language.
Rather, a query is a structured expression composed of terms and operators,
typically separated by whitespace or special operators such as parentheses.
Portions of queries may look like natural language phrases or even whole
sentences, but in reality they are sequences of terms and operators.
In addition to being parsed according to the syntax of queries, as opposed to
natural language processing or the raw token stream processing of an indexer,
each of the query terms must be "analyzed" before the final form of the term
can be generated into a Lucene Query structure. That analysis is performed
separately from the "parsing" of the structured user query expression. That means
that the processing of sub-terms that result from analysis is handled at a
different level than source-level query terms that happen to "look" like
English words. In other words, the sub-terms are processed by the "query
generator" while the source terms are processed by the "query parser". We
loosely refer to the combination of (user) query parsing and (Lucene) query
generation as "the query parser", but it is important to distinguish (user
query) "parsing" from (Lucene Query) "generation".
The query parser does its best to handle sub-terms reasonably, but expecting
that they will magically be handled in exactly the same way as source terms is somewhat
impractical. That doesn't mean that there can't be improvement, but simply that
a dose of realism is needed when considering the potential, challenges, and
limits of query parsing/processing/generation.
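To make the distinction concrete, here is a deliberately simplified model of the two levels described above: a "parser" level that splits the structured query expression into source terms (to which mm is applied), and an "analysis" level that may split a single source term into sub-terms (which fall outside mm's reach). All function names are illustrative; this is a sketch of the behavior being discussed, not the actual edismax code.

```python
def parse(user_query):
    """Parser level: split the structured query expression into source terms."""
    return user_query.split()

def analyze(term):
    """Analysis level: e.g. a hyphenated term is split into sub-terms."""
    return term.split("-")

def generate(user_query, mm_percent):
    """Generator level: build clauses and apply mm to parser-level clauses only."""
    source_terms = parse(user_query)
    # mm is computed against the number of parser-level clauses ...
    required = max(1, round(len(source_terms) * mm_percent / 100))
    clauses = []
    for term in source_terms:
        sub_terms = analyze(term)
        if len(sub_terms) == 1:
            clauses.append(sub_terms[0])
        else:
            # ... but sub-terms from analysis collapse into one OR'd clause,
            # so mm=100% no longer forces every sub-term to match.
            clauses.append("(" + " OR ".join(sub_terms) + ")")
    return {"clauses": clauses, "minimum_should_match": required}

result = generate("fire-fly", 100)
```

In this model, "fire-fly" is a single parser-level clause, so mm=100% requires only that one clause to match; but that clause is "(fire OR fly)", which matches documents containing either word. The same input as two source terms ("fire fly") would instead require both to match.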
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.6
> Reporter: Tom Burton-West
>
> With edismax mm set to 100% if one of the tokens is split into two tokens by
> the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is
> ignored and the equivalent of OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]