[
https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436045#comment-13436045
]
Jack Krupansky commented on SOLR-3589:
--------------------------------------
bq. If I use a smart Chinese tokenizer to split up a Chinese sentence into
words, why can't the query parser treat those words exactly the same way it
treats words from an English sentence?
Indexing of whole documents can in fact treat text as if it were words from an
English sentence, and split tokens do in fact behave as such in that context,
but a query is not an English sentence, or indeed a sentence in any natural language.
Rather, a query is a structured expression composed of terms and operators,
typically separated by whitespace or special operators such as parentheses.
Portions of queries may look like natural language phrases or even whole
sentences, but in reality they are sequences of terms and operators.
In addition to being parsed according to the syntax of queries, as opposed to
natural language processing or the raw token stream processing of an indexer,
each of the query terms must be "analyzed" before the final form of the term
can be generated into a Lucene Query structure. That analysis is performed
separately from the "parsing" of the structured user query expression. That means
that the processing of sub-terms that result from analysis is handled at a
different level than source-level query terms that happen to "look" like
English words. In other words, the sub-terms are processed by the "query
generator" while the source terms are processed by the "query parser". We
loosely refer to the combination of (user) query parsing and (Lucene) query
generation as "the query parser", but it is important to distinguish (user
query) "parsing" from (Lucene Query) "generation".
The query parser does its best to handle sub-terms reasonably, but expecting
that they will magically be handled in exactly the same way as source terms is somewhat
impractical. That doesn't mean that there can't be improvement, but simply that
a dose of realism is needed when considering the potential, challenges, and
limits of query parsing/processing/generation.
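To make the distinction concrete, here is a deliberately simplified model of the two levels described above: a "parser" level that splits the structured query expression into source terms (to which mm is applied), and an "analysis" level that may split a single source term into sub-terms (which fall outside mm's reach). All function names are illustrative; this is a sketch of the behavior being discussed, not the actual edismax code.

```python
def parse(user_query):
    """Parser level: split the structured query expression into source terms."""
    return user_query.split()

def analyze(term):
    """Analysis level: e.g. a hyphenated term is split into sub-terms."""
    return term.split("-")

def generate(user_query, mm_percent):
    """Generator level: build clauses and apply mm to parser-level clauses only."""
    source_terms = parse(user_query)
    # mm is computed against the number of parser-level clauses ...
    required = max(1, round(len(source_terms) * mm_percent / 100))
    clauses = []
    for term in source_terms:
        sub_terms = analyze(term)
        if len(sub_terms) == 1:
            clauses.append(sub_terms[0])
        else:
            # ... but sub-terms from analysis collapse into one OR'd clause,
            # so mm=100% no longer forces every sub-term to match.
            clauses.append("(" + " OR ".join(sub_terms) + ")")
    return {"clauses": clauses, "minimum_should_match": required}

result = generate("fire-fly", 100)
```

In this model, "fire-fly" is a single parser-level clause, so mm=100% requires only that one clause to match; but that clause is "(fire OR fly)", which matches documents containing either word. The same input as two source terms ("fire fly") would instead require both to match.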
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
> Key: SOLR-3589
> URL: https://issues.apache.org/jira/browse/SOLR-3589
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 3.6
> Reporter: Tom Burton-West
>
> With edismax mm set to 100% if one of the tokens is split into two tokens by
> the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is
> ignored and the equivalent of OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to
> separate words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]