[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Michael McCandless (JIRA) Thu, 13 May 2010 02:39:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867093#action_12867093
 ]


Michael McCandless commented on LUCENE-2458:
--------------------------------------------

I'd like a solution that lets us have our cake and eat it too...

Ie, we clearly have to fix the disastrous out-of-the-box experience
that non-whitespace languages (CJK) now have with Lucene.  This is
clear.

But, when an analyzer that splits English-like compound words (eg
e-mail -> e mail) is used, I think this should also continue to create
a PhraseQuery, out-of-the-box.

Today when a user searches for "e-mail", s/he will correctly see only
"email/e-mail" hit & highlighted in the search results.  If we break
this behaviour, ie no longer produce a PQ out-of-the-box, suddenly
hits with just "mail" will be returned, which is bad.

So a single setter on QueryParser w/ a global default is not a good
enough solution -- whichever default we pick, either CJK or
English-like compound words will be handled badly.
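
To make the dilemma concrete, here is a minimal Java sketch of the
term-count heuristic behind a single global switch.  It does not use
Lucene's API: toyAnalyze, buildQuery and the autoPhrase flag are
hypothetical names made up for illustration, and the toy analyzer only
splits on hyphens and emits each Han character as its own token.

```java
import java.util.ArrayList;
import java.util.List;

public class GlobalSwitchSketch {

    // Hypothetical stand-in for an analyzer: splits on '-' and emits each
    // Han character as a separate token. Real analyzers do much more.
    static List<String> toyAnalyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            boolean han =
                Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
            if (c == '-' || han) {
                // A split point: emit whatever was buffered so far.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (han) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                current.append(Character.toLowerCase(c));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Term-count heuristic: when one whitespace-delimited chunk analyzes to
    // several tokens, the global flag alone decides phrase vs. OR-ed terms.
    static String buildQuery(String chunk, boolean autoPhrase) {
        List<String> tokens = toyAnalyze(chunk);
        if (tokens.size() <= 1) {
            return tokens.isEmpty() ? "" : tokens.get(0);
        }
        if (autoPhrase) {
            return "\"" + String.join(" ", tokens) + "\"";
        }
        return String.join(" OR ", tokens);
    }

    public static void main(String[] args) {
        String cjk = "\u4e2d\u56fd"; // two Han characters, no whitespace
        for (boolean autoPhrase : new boolean[] {true, false}) {
            System.out.println("autoPhrase=" + autoPhrase
                + "  e-mail => " + buildQuery("e-mail", autoPhrase)
                + "  CJK => " + buildQuery(cjk, autoPhrase));
        }
    }
}
```

With the flag on, the CJK chunk is forced into a phrase; with it off,
"e-mail" degrades to OR-ed terms and matches documents containing only
"mail".  Neither setting serves both inputs.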

This is why I like the token attr based solution -- those analyzers
that are doing "English-like" de-compounding can mark the tokens as
such.  Then QueryParser can notice this attr and (if configured to do so, via
setter), create a PhraseQuery out of that sequence of tokens.

This then pushes the decision of which series of Tokens are produced
via "English-like" de-compounding into the analyzer.  EG I think
WordDelimiterFilter should by default mark its tokens as such (the
majority of users use it this way).  When StandardAnalyzer splits a
part-number-like token, it should do so as well.
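
A rough Java sketch of the per-token idea follows.  It is not Lucene
code and no such attribute exists in the source being discussed: the
Tok class, its fromDecompound flag and the honorAttr setter are all
hypothetical.  The input list stands for the tokens an analyzer
produced from one whitespace-delimited chunk of query text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DecompoundAttrSketch {

    // Hypothetical token: its text plus a flag the analyzer would set when
    // the token came from splitting an "English-like" compound.
    static final class Tok {
        final String text;
        final boolean fromDecompound;

        Tok(String text, boolean fromDecompound) {
            this.text = text;
            this.fromDecompound = fromDecompound;
        }
    }

    // Groups each run of consecutive flagged tokens into one phrase clause;
    // unflagged tokens stay as individual OR-ed terms. honorAttr plays the
    // role of the QueryParser setter mentioned in the comment.
    static String buildQuery(List<Tok> tokens, boolean honorAttr) {
        List<String> clauses = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (Tok t : tokens) {
            if (honorAttr && t.fromDecompound) {
                run.add(t.text);
            } else {
                flush(run, clauses);
                clauses.add(t.text);
            }
        }
        flush(run, clauses);
        return String.join(" OR ", clauses);
    }

    // Emits the pending run as a quoted phrase (or a bare term if it holds
    // a single token) and clears it.
    static void flush(List<String> run, List<String> clauses) {
        if (run.isEmpty()) {
            return;
        }
        clauses.add(run.size() == 1
            ? run.get(0)
            : "\"" + String.join(" ", run) + "\"");
        run.clear();
    }

    public static void main(String[] args) {
        // "e-mail" split by a de-compounding filter: both tokens are marked.
        List<Tok> email = Arrays.asList(new Tok("e", true), new Tok("mail", true));
        // CJK text split per character: the analyzer leaves tokens unmarked.
        List<Tok> cjk = Arrays.asList(new Tok("\u4e2d", false), new Tok("\u56fd", false));
        System.out.println(buildQuery(email, true));
        System.out.println(buildQuery(cjk, true));
    }
}
```

With the setter on, the marked "e" and "mail" tokens still form a
phrase while the unmarked CJK tokens do not, so one configuration can
serve both cases.  The sketch merges any adjacent marked tokens into
one run, which is a simplification.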

This isn't a perfect solution: it's not easy, in general, for an
analyzer to "know" its splits are "English-like" de-compounding, but
this would still give us a solid step forward (progress not
perfection).  And, since the decision point is now in the analyzer,
per-token, it gives users complete flexibility to customize as needed.

BTW, this appears to not be an English-only need; this page
(http://www.seobythesea.com/?p=1206) lists these example languages as
also using "English-like" compound words: "Some example languages that
use compound words include: Afrikaans, Danish, Dutch-Flemish, English,
Faroese, Frisian, High German, Gutnish, Icelandic, Low German,
Norwegian, Swedish, and Yiddish."



> queryparser shouldn't generate phrasequeries based on term count
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>            Reporter: Robert Muir
>            Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as some heuristic to "second guess" the tokenizer and piece 
> back things it shouldn't have split, but for large collections, doing things 
> like generating phrasequeries because StandardTokenizer split a compound on a 
> dash can cause serious performance problems. Instead people should analyze 
> their text with the appropriate methods, and QueryParser should only generate 
> phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.
