Hello, I created an index over some XML files; each document contains 10 keyword fields and one text field.

I created a TermQuery for searching the text field, like this:

    Dim textQuery As TermQuery = New TermQuery(New Term("content", "school"))

I also created a query over the 10 keyword fields using QueryParser.Parse, like this:

    analyzer = New StandardAnalyzer()
    allkeywordsQuery = QueryParser.Parse("Secondname:Beckwith AND Firstname:Louise", "Firstname", analyzer)

I am trying to use a BooleanQuery so that the user can search the XML data with textQuery and filter the results with allkeywordsQuery, like this:

    myBooleanQuery.Add(allkeywordsQuery, True, False)
    myBooleanQuery.Add(textQuery, True, False)
    Dim hits As Hits = searcher.Search(myBooleanQuery)

The search should return the XML file that contains

    <Secondname>Beckwith</Secondname>
    <Firstname>Louise</Firstname>
    <content>...He used to pick us up from the first school...</content>
    ...

but I always get 0 hits. Please tell me where I went wrong, or suggest a better way to search and filter XML data.

I also tried the query "Secondname:Beckwith AND Firstname:Louise AND content:school" in Luke: with WhitespaceAnalyzer I get hits, but with StandardAnalyzer I get nothing.
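For reference, here is a consolidated, minimal version of what I am running, using the same static QueryParser.Parse and three-argument BooleanQuery.Add overloads as above. The index path and the result loop are illustrative:

    Imports Lucene.Net.Analysis.Standard
    Imports Lucene.Net.Documents
    Imports Lucene.Net.Index
    Imports Lucene.Net.QueryParsers
    Imports Lucene.Net.Search

    Module SearchXmlIndex
        Sub Main()
            ' Open the existing index (the path is illustrative).
            Dim searcher As New IndexSearcher("C:\myindex")
            Dim analyzer As New StandardAnalyzer()

            ' Exact term against the tokenized text field.
            Dim textQuery As New TermQuery(New Term("content", "school"))

            ' Parsed query over the keyword fields. QueryParser runs the
            ' analyzer over the query text, so the analyzer choice changes
            ' the terms actually searched, which is presumably why Luke
            ' behaves differently with WhitespaceAnalyzer and StandardAnalyzer.
            Dim allkeywordsQuery As Query = QueryParser.Parse( _
                "Secondname:Beckwith AND Firstname:Louise", "Firstname", analyzer)

            ' Both clauses required (required=True, prohibited=False).
            Dim myBooleanQuery As New BooleanQuery()
            myBooleanQuery.Add(allkeywordsQuery, True, False)
            myBooleanQuery.Add(textQuery, True, False)

            Dim hits As Hits = searcher.Search(myBooleanQuery)
            Console.WriteLine("Total hits: " & hits.Length())
            For i As Integer = 0 To hits.Length() - 1
                Dim doc As Document = hits.Doc(i)
                Console.WriteLine(doc.Get("Firstname") & " " & doc.Get("Secondname"))
            Next
            searcher.Close()
        End Sub
    End Module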
Thanks,
Yj

-----Original Message-----
From: Michael D. Curtin [mailto:[EMAIL PROTECTED]
Sent: 22 January 2007 13:53
To: java-user@lucene.apache.org
Subject: Re: Long Query Performance

Somnath Banerjee wrote:

> Thanks for the reply. Good guess, I think.
>
> The DB (index) is basically a collection of encyclopedia documents. The
> queries are also a collection of documents, but from various domains. My
> task is to find, for each "query document", the top 100 matching
> encyclopedia articles.
>
> I tried using only the titles (5-8 words) of the query documents instead
> of their full text, but that still takes 0.5-1 sec per query. That means
> it would take nearly six and a half days to run the 0.72M queries (and
> the precision would presumably suffer).

Thank you, the problem is a little clearer now.

This is a big search problem, so your biggest obstacle is running it on a tiny computer (from what you've said, only a fraction of the *queries* fit in your RAM budget, not to mention the database). I'm confident that spending a little $$ on your computing resources, particularly RAM, would be FAR, FAR more cost-effective than programming labor.

But if you can't get a bigger computer, then you can't. In that case, I'm not sure that Lucene is the best tool for this problem, at least not exclusively. You might find that some preprocessing of the encyclopedia and query documents could be used to quickly find the highest-probability candidates, and then use Lucene to score just those.

I'm thinking of a merge-sort kind of algorithm between the query "documents" and the encyclopedia documents. For each query, find the top several hundred candidate documents from the encyclopedia, perhaps by number of matching non-"stop words" (see below). You could also get more sophisticated by looking at word frequencies from your Lucene index, but I doubt it would make a huge difference for queries of hundreds of words that are looking for hundreds of hits. A rough sketch of this candidate-counting step appears at the end of this message.

A couple more suggestions:

- Search the archives for the topic "more documents like this" and variations on that theme. Several people have used Lucene in this way, with varying degrees of success.

- If you haven't already, ditch "stop words" like "a", "the", prepositions, etc. Your index will be smaller and so will your queries, making each search faster.

Good luck!
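To make the pre-filtering idea concrete, here is a rough sketch of the counting step, assuming you have already built a word-to-document map and removed stop words from the query in a preprocessing pass. All of the names here (TopCandidates, wordToDocIds, and so on) are illustrative, and none of this is Lucene API:

    Imports System.Collections.Generic

    Module CandidatePreFilter

        ' Rank encyclopedia documents by how many distinct non-stop query
        ' words they contain, and keep only the best few hundred for Lucene
        ' to score. queryWords is assumed to be de-duplicated already, and
        ' wordToDocIds maps each non-stop word to the ids of the encyclopedia
        ' documents containing it.
        Function TopCandidates(ByVal queryWords As IEnumerable(Of String), _
                               ByVal wordToDocIds As Dictionary(Of String, List(Of Integer)), _
                               ByVal maxCandidates As Integer) As List(Of Integer)

            Dim matchCounts As New Dictionary(Of Integer, Integer)()
            For Each word As String In queryWords
                Dim docIds As List(Of Integer) = Nothing
                If wordToDocIds.TryGetValue(word, docIds) Then
                    For Each docId As Integer In docIds
                        Dim count As Integer = 0
                        matchCounts.TryGetValue(docId, count)
                        matchCounts(docId) = count + 1
                    Next
                End If
            Next

            ' Sort candidate ids by descending shared-word count, then keep
            ' only the top slice.
            Dim ranked As New List(Of Integer)(matchCounts.Keys)
            ranked.Sort(Function(a, b) matchCounts(b).CompareTo(matchCounts(a)))
            If ranked.Count > maxCandidates Then
                ranked.RemoveRange(maxCandidates, ranked.Count - maxCandidates)
            End If
            Return ranked
        End Function

    End Module

The survivors of this pass would then be handed to Lucene for real scoring.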
--MDC

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------