Re: Lucene QueryParser and Analyzer

Wei Ho Thu, 29 Apr 2010 16:00:54 -0700

I think I've figured out what the problem is. Given the inputs,


Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 gets parsed as
Query1: (text: "C1C2  C3C4  C5C6  C7  C8C9C10")
whereas Input2 gets parsed as

Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text: "C7") (text: "C8C9C10")

That is, Lucene constructs the query first and then passes the query
text through the analyzer. Is there any way to force QueryParser to pass
the input string through the analyzer before creating the query? That
is, to force Lucene to create Query2 for both Input1 and Input2.
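
A minimal sketch of one possible workaround, assuming the separators can
simply be treated as whitespace. The QueryParser grammar splits the input
on whitespace before any text reaches the analyzer, so a comma-joined
string arrives as a single term and its tokens end up in one phrase
query. Replacing the separators before parsing avoids that. The class
QueryTextNormalizer and its normalize method are hypothetical names used
for illustration, not part of Lucene, and the set of separator characters
is an assumption that would need adjusting for real data.

```java
import java.util.regex.Pattern;

public class QueryTextNormalizer {
    // Assumed separator set: ASCII comma and semicolon, plus the
    // full-width comma, ideographic comma, ideographic full stop and
    // full-width semicolon. Query syntax characters are left untouched.
    private static final Pattern SEPARATORS =
            Pattern.compile("[,;\\uFF0C\\u3001\\u3002\\uFF1B]+");

    // Replace each run of separators with a single space and trim the ends.
    public static String normalize(String input) {
        return SEPARATORS.matcher(input).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Turns the shape of Input1 into the shape of Input2.
        System.out.println(normalize("C1C2,C3C4,C5C6,C7,C8C9C10"));
    }
}
```

The normalized string would then be handed to qParser.parse(...) in place
of the raw input. Each whitespace-separated chunk still goes through the
analyzer on its own, so a chunk that the analyzer segments into several
words would still produce a phrase query for that chunk.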

Thanks,
Wei


-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D. <[email protected]>
To: [email protected]
Date: 4/29/2010 4:54 PM


-------sample code-------------

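// LingPipeAnalyzer, directory, SEARCH_FIELDS, queryLine and TOP_N are
// defined elsewhere in the application.
Analyzer analyzer = new LingPipeAnalyzer();
Searcher searcher = new IndexSearcher(directory);
QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
        SEARCH_FIELDS, analyzer);
// Parse the raw query string, then fetch the top TOP_N hits.
Query query = qParser.parse(queryLine[1]);
ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;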

qParser will use the analyzer LingPipeAnalyzer() before forming the
query.


Sincerely,
Sithu D Sudarsan


-----Original Message-----
From: Wei Ho [mailto:[email protected]]
Sent: Thursday, April 29, 2010 4:44 PM
To: [email protected]
Subject: Re: Lucene QueryParser and Analyzer

Sorry, I guess "discarding the punctuation" was a bit misleading.
I meant that given the two input strings,

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2",
"C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the
punctuation in the tokenization. I'm assuming that QueryParser is simply
passing the entire input string to the analyzer and taking the tokens,
in which case Input1 and Input2 should be considered identical. Is
QueryParser doing any sort of pre-processing or filtering beforehand? If
so, how can I turn it off?

Aside from ending tokens at punctuation, my analyzer is also doing
Chinese word segmentation, so I'd like to be sure that QueryParser is
using the analyzer the way I expect it to.

Thanks,
Wei



-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D.<[email protected]>
To: [email protected]
Date: 4/29/2010 4:08 PM

If so,

Input1:  c1c2c3c4c5c6c7....
Input2: c1c2 c3c4 ...

I guess they are different! Add a whitespace after the commas and see
if that works...

Sincerely,
Sithu D Sudarsan


-----Original Message-----
From: Wei Ho [mailto:[email protected]]
Sent: Thursday, April 29, 2010 4:04 PM
To: [email protected]
Subject: Re: Lucene QueryParser and Analyzer

No, there is no whitespace after the commas in Input1.

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 is basically one big long word with commas and Chinese characters
one after the other. Input2 is where I manually separated the string
into the component terms by replacing the commas with whitespace. My
confusion stems from the fact that I thought it should not matter, since
the analyzer should be discarding the punctuation anyway. So the
tokenization process should be the same for both Input1 and Input2? If
that is not the case, what do I need to change?

Thanks,
Wei Ho

-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D.<[email protected]>
To: [email protected]
Date: 4/29/2010 3:54 PM

Hi,

Is there a whitespace after the comma?


Sincerely,
Sithu D Sudarsan


-----Original Message-----
From: Wei Ho [mailto:[email protected]]
Sent: Thursday, April 29, 2010 3:51 PM
To: [email protected]
Subject: Lucene QueryParser and Analyzer

Hello,

I'm using Lucene to index and search through a collection of Chinese
documents. However, I'm noticing some odd behavior in query
parsing/searching.

Given the two queries below:

(Ci refers to Chinese character i)
Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 returns absolutely nothing, while Input2 (replacing the commas
with spaces) works as expected. I'm a bit confused why this would be
happening - it seems that QueryParser uses the Analyzer passed to it to
tokenize the input query string, so if the Analyzer ignores the
punctuation, it seems that Input1 and Input2 should return identical
results. Is there some pre-Analyzer filtering or whatever that
QueryParser does? I've tried this with the StandardAnalyzer,
SmartChineseAnalyzer, and an analyzer that I implemented which
explicitly skips over punctuation and whitespace in tokenizing the
query string, but to no avail.


-----------------------------------

I'm probably just doing something dumb, but any help would be greatly
appreciated!

Thanks,
Wei Ho

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


