RE: KeywordAnalyzer still getting tokenized on spaces

Uwe Schindler Tue, 09 Sep 2014 00:52:32 -0700

Hi,

The QueryParser does not pass the whole query text through the analyzer. It 
first parses the query syntax and then passes only those parts that the query 
parser itself considers "tokens" through the analyzer. If you want the query 
parser to respect such an analyzer, you may need another parser with a 
simplified syntax (e.g. SimpleQueryParser).
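
To make the ordering concrete, here is a rough toy model in plain Java. It is 
not Lucene code: the class, the method names and the "analyzer" function are 
all made up for illustration, and a real query parser handles far more syntax 
than whitespace. It only shows why a keyword-style analyzer never sees the 
spaces when the text is split before analysis:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Toy model only: none of these names are Lucene classes or methods. */
class ParseThenAnalyzeDemo {

    /** Hypothetical stand-in for a keyword analyzer: the whole input becomes one token. */
    static final Function<String, List<String>> KEYWORD_ANALYZER =
            text -> Collections.singletonList(text);

    /** Mimics a syntax-aware parser: split on whitespace first, then analyze each piece. */
    static List<String> parseThenAnalyze(String queryText,
                                         Function<String, List<String>> analyzer) {
        List<String> terms = new ArrayList<>();
        for (String piece : queryText.trim().split("\\s+")) {
            terms.addAll(analyzer.apply(piece));
        }
        return terms;
    }

    /** Mimics a syntax-free builder: hand the whole string to the analyzer unchanged. */
    static List<String> analyzeOnly(String queryText,
                                    Function<String, List<String>> analyzer) {
        return analyzer.apply(queryText);
    }

    public static void main(String[] args) {
        String q = "1023 4567 8765";
        // Split first: the analyzer is called three times, once per piece.
        System.out.println(parseThenAnalyze(q, KEYWORD_ANALYZER));
        // Analyze only: the analyzer is called once with the full string.
        System.out.println(analyzeOnly(q, KEYWORD_ANALYZER));
    }
}
```

With the query string from the original question, the first call yields three 
separate terms and the second yields one term containing the spaces.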


Ideally, if you want to just pass a text through an analyzer, you should not 
use a query parser (because there is nothing to parse, just to analyze). So 
approach #2 is the right one. To make it easier, Lucene contains the following 
class: 

http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/util/QueryBuilder.html

This one uses no syntax and just passes the string through the Analyzer to 
create the query. So solution #2 looks like:

Query currQuery = new QueryBuilder(theAnalyzer)
    .createBooleanQuery("sn", currQueryStr, BooleanClause.Occur.MUST);

In your case this would return a Boolean query with one clause, but that gets 
rewritten during query execution, so it's identical to a single term query. This 
approach is like Elasticsearch's "matchQuery" and is in most cases the 
approach you should use if you don't need "syntax".

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: atawfik [mailto:contact.txl...@gmail.com]
> Sent: Tuesday, September 09, 2014 9:37 AM
> To: java-user@lucene.apache.org
> Subject: Re: KeywordAnalyzer still getting tokenized on spaces
> 
> The result of QueryParser is confusing. The problem is that you assume the
> query parser uses the analyzer to parse your query. However, that is not the
> case. The query parser first parses the query string, then applies the
> analyzer.
> 
> In other words, the query parser will split the query string using spaces.
> So, you will get three terms: 1023, 4567 and 8765. In fact, you can see that
> in the output of the second query; you have three boolean clauses instead of
> one. After parsing the query, the query parser applies the analyzer.
> 
> To fix that, you have two solutions:
> 
> 1- Use term query instead directly without using query parser. In this case,
> you will not apply the analyzer.
>      Query currQuery = new TermQuery(new Term("sn", currQueryStr));
> 2- Analyze the query, then create the Term query:
>      TokenStream ts = theAnalyzer.tokenStream("sn", new StringReader(currQueryStr));
>      ts.reset();
>      ts.incrementToken();
>      CharTermAttribute ca = ts.getAttribute(CharTermAttribute.class);
>      String query = ca.toString();
>      ts.close();
>      Query currQuery = new TermQuery(new Term("sn", query));
>      System.out.println(currQuery.getClass() + ", " + currQuery);
> 
> I am not aware of any method that uses QueryParser to achieve that. Maybe
> someone here can correct me.
> 
> Regards
> Ameer
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/KeywordAnalyzer-still-getting-
> tokenized-on-spaces-tp4157537p4157560.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

