Re: Fine Tuning Lucene implementation

Askar Zaidi Wed, 25 Jul 2007 09:31:33 -0700

Here's what I mean:

http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields


title:"The Right Way" AND text:go


Although I am not searching for the title "the right way", I am looking
for the score of one document, which I pick out by a unique field (itemID).

When I do System.out.println(query);

I get:

+contents:Harvard +contents:Business +contents:Review

Can I just add:

+contents:Harvard +contents:Business +contents:Review +itemID:<id>       ?

That query would just return one document.
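
A minimal sketch of that idea, assuming the id was indexed untokenized in a
field named itemID (the thread also calls this field "item" and "id", so the
name is a guess) and that the clause is appended to the query string before it
goes to QueryParser. Lucene's query syntax separates field and term with a
colon, so the clause is written +itemID:16 rather than +itemID=16. The class
and method names below (ItemQuery, withRequiredItem) are made up for
illustration; the sketch only builds the string and makes no Lucene calls:

```java
// Hypothetical helper, not part of Lucene: builds a query string that
// restricts a user query to the single document with a given unique id.
final class ItemQuery {

    private ItemQuery() {}

    /**
     * Wraps the user query in a required group and appends a required
     * clause on the id field, e.g. "+(harvard business review) +itemID:16".
     * The field name is an assumption; use the name chosen at index time.
     */
    static String withRequiredItem(String userQuery, String idField, int itemId) {
        if (userQuery == null || userQuery.trim().isEmpty()) {
            throw new IllegalArgumentException("userQuery must not be empty");
        }
        return "+(" + userQuery.trim() + ") +" + idField + ":" + itemId;
    }

    public static void main(String[] args) {
        // Prints: +(harvard business review) +itemID:16
        System.out.println(withRequiredItem("harvard business review", "itemID", 16));
    }
}
```

The resulting string would be passed to queryParser.parse(...) in place of the
plain search phrase. If itemID really is unique, at most one hit should come
back, so its score could be read without looping over the hits. The extra
clause also takes part in scoring, so the value may differ from the score the
unrestricted query gave the same document.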

On 7/25/07, Askar Zaidi <[EMAIL PROTECTED]> wrote:
>
> Instead of refactoring the code, would there be a way to just modify the
> query in each search routine ?
>
> Such as "search contents:<text> AND item:<itemID>". This means it would
> just collect the score of the one document whose itemID field equals the
> itemID passed in from the while (rs.next()) loop.
>
> I just need to collect the score of the <itemID> already in the index.
>
> Would there be a way to modify the query ? Add a clause ?
>
> thanks,
> Askar
>
>
> On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >
> > So, you really want a single Lucene score (based on the scores of
> > your 4 fields) for every itemID, correct?  And this score consists of
> > scoring the title, tag, summary and body against some keywords, correct?
> >
> > Here's what I would do:
> >
> > while (rs.next())
> > {
> >      doc = getDocument(itemId);  // Get your document, including contents
> >                                  // from your database; no need even to put
> >                                  // them in Lucene, although you could
> >      // Add the doc to a MemoryIndex (see contrib/memory)
> >      // Run your 4 searches against that memory index to get your score.
> >      // Even better, combine your query into a single query that searches
> >      // all 4 fields at once; then Lucene will combine the score for you
> > }
> >
> > MemoryIndex info can be found at
> > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
> >
> > -Grant
> >
> > On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
> >
> > > Hi Grant,
> > >
> > > Thanks for the response. Here's what I am trying to accomplish:
> > >
> > > 1. Iterate over itemID (unique) in the database using one SQL query.
> > > 2. For every itemID found, run 4 searches on Lucene Index.
> > > 3. doTagSearch(itemID....) ; collect score
> > > 4. doTitleSearch(itemID...) ; collect score
> > > 5. doSummarySearch(itemID...) ; collect score
> > > 6. doBodySearch(itemID....) ; collect score
> > >
> > > These scores are then added and I get a total score for each unique item
> > > in the database.
> > >
> > > Lucene Index has: <itemID><tags><title><summary><contents>
> > >
> > > So if I am running a body search, I have 92 hits from over 300 documents
> > > for a query. I already know my hit with the <itemID>.
> > >
> > > For instance, from step (1), if itemID 16 is passed to all 4 searches, I
> > > just need to get the score of the document which has itemID field = 16. I
> > > don't have to iterate over all the hits.
> > >
> > > I suppose I have to change my query to look for <contents> where itemID=16.
> > > Can you guide me as to how to do it?
> > >
> > > thanks a ton,
> > >
> > > Askar
> > >
> > > On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> > >>
> > >> Hi Askar,
> > >>
> > >> I suggest we take a step back, and ask the question, what are you
> > >> trying to accomplish?  That is, what is your application trying to
> > >> do?  Forget the code, etc. just explain what you want the end result
> > >> to be and we can work from there.   Based on what you have described,
> > >> I am not sure you need access to the hits.  It seems like you just
> > >> need to make better queries.
> > >>
> > >> Is your itemID a unique identifier?  If yes, then you shouldn't need
> > >> to loop over hits at all, as you should only ever have one result IF
> > >> your query contains a required term.  Also, if this is the case, why
> > >> do you need to do a search at all?  Haven't you already identified
> > >> the items of interest when you did your select query in the
> > >> database?  Or is it that you want to score the item based on some
> > >> terms as well.  If that is the case, there are other ways of doing
> > >> this and we can discuss them.
> > >>
> > >> -Grant
> > >>
> > >> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
> > >>
> > >>> Hey Guys,
> > >>>
> > >>> I need to know how I can use the HitCollector class. I am using Hits and
> > >>> looping over all the possible document hits (it turns out I am looping 92
> > >>> times; for 300 searches, that's 300*92!). Can I avoid this using
> > >>> HitCollector? I can't seem to understand how it's used.
> > >>>
> > >>> thanks a lot,
> > >>>
> > >>> Askar
> > >>>
> > >>> On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
> > >>>>
> > >>>> Askar,
> > >>>> why do you need to add +id:<idWeCareAbout>?
> > >>>> thanks,
> > >>>> dt,
> > >>>> www.ejinz.com
> > >>>> search engine news forms
> > >>>> ----- Original Message -----
> > >>>> From: "Askar Zaidi" <[EMAIL PROTECTED] >
> > >>>> To: <java-user@lucene.apache.org>; <[EMAIL PROTECTED]>
> > >>>> Sent: Wednesday, July 25, 2007 12:39 AM
> > >>>> Subject: Re: Fine Tuning Lucene implementation
> > >>>>
> > >>>>
> > >>>>> Hey Hira,
> > >>>>>
> > >>>>> Thanks so much for the reply. Much appreciated.
> > >>>>>
> > >>>>> Quote:
> > >>>>>
> > >>>>> Would it be possible to just include a query clause?
> > >>>>>   - i.e., instead of just contents:<userQuery>, also add
> > >>>>> +id:<idWeCareAbout>
> > >>>>>
> > >>>>> How can I do that ?
> > >>>>>
> > >>>>> I see my query as :
> > >>>>>
> > >>>>> +contents:harvard +contents:business +contents:review
> > >>>>>
> > >>>>> where the search phrase was: harvard business review
> > >>>>>
> > >>>>> Now how can I add +id:<idWeCareAbout>  ??
> > >>>>>
> > >>>>> This would give me the one exact document I am looking for, for that id.
> > >>>>> I don't have to iterate through hits.
> > >>>>>
> > >>>>> thanks,
> > >>>>>
> > >>>>> Askar
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On 7/24/07, N. Hira < [EMAIL PROTECTED]> wrote:
> > >>>>>>
> > >>>>>> I'm no expert on this (so please accept the comments in that context)
> > >>>>>> but 2 things seem weird to me:
> > >>>>>>
> > >>>>>> 1.  Iterating over each hit is an expensive proposition.  I've often
> > >>>>>> seen people recommending a HitCollector.
> > >>>>>>
> > >>>>>> 2.  It seems that doBodySearch() is essentially saying, do this search
> > >>>>>> and return the score pertinent to this ID (using an exhaustive loop).
> > >>>>>> Would it be possible to just include a query clause?
> > >>>>>>     - i.e., instead of just contents:<userQuery>, also add
> > >>>>>> +id:<idWeCareAbout>
> > >>>>>>
> > >>>>>> In general though, I think your algorithm seems inefficient (if I
> > >>>>>> understand it correctly): if I want to search for one term among 3 in
> > >>>>>> a "collection" of 300 documents (as defined by some external attribute),
> > >>>>>> I will wind up executing 300 x 3 searches, and for each search that is
> > >>>>>> executed, I will iterate over every Hit, even if I've already found the
> > >>>>>> one that I "care about".
> > >>>>>>
> > >>>>>> What would break if you:
> > >>>>>> 1.  Included "creator" in the Lucene index (or, filtered out the Hits
> > >>>>>> using a BitSet or something like it)
> > >>>>>> 2.  Executed 1 search
> > >>>>>> 3.  Collected the results of the first N Hits (where N is some
> > >>>>>> reasonable limit, like 100 or 500)
> > >>>>>>
> > >>>>>> -h
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
> > >>>>>>
> > >>>>>>> Sure.
> > >>>>>>>
> > >>>>>>>  public float doBodySearch(Searcher searcher, String query, int id) {
> > >>>>>>>      try {
> > >>>>>>>          score = search(searcher, query, id);
> > >>>>>>>      }
> > >>>>>>>      catch (IOException io) {}
> > >>>>>>>      catch (ParseException pe) {}
> > >>>>>>>
> > >>>>>>>      return score;
> > >>>>>>>  }
> > >>>>>>>
> > >>>>>>>  private float search(Searcher searcher, String queryString, int id)
> > >>>>>>>          throws ParseException, IOException {
> > >>>>>>>
> > >>>>>>>      // Build a Query object
> > >>>>>>>      QueryParser queryParser = new QueryParser("contents", new KeywordAnalyzer());
> > >>>>>>>      queryParser.setDefaultOperator(QueryParser.Operator.AND);
> > >>>>>>>      Query query = queryParser.parse(queryString);
> > >>>>>>>
> > >>>>>>>      // Search for the query
> > >>>>>>>      Hits hits = searcher.search(query);
> > >>>>>>>      Document doc = null;
> > >>>>>>>
> > >>>>>>>      // Examine the Hits object to see if there were any matches
> > >>>>>>>      int hitCount = hits.length();
> > >>>>>>>
> > >>>>>>>      for (int i = 0; i < hitCount; i++) {
> > >>>>>>>          doc = hits.doc(i);
> > >>>>>>>          String str = doc.get("item");
> > >>>>>>>          int tmp = Integer.parseInt(str);
> > >>>>>>>          if (tmp == id)
> > >>>>>>>              score = hits.score(i);
> > >>>>>>>      }
> > >>>>>>>
> > >>>>>>>      return score;
> > >>>>>>>  }
> > >>>>>>>
> > >>>>>>> I really need to optimize doBodySearch(...) as this takes the most
> > >>>>>>> time.
> > >>>>>>>
> > >>>>>>> thanks guys,
> > >>>>>>> Askar
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> > >>>>>>>
> > >>>>>>>         Could you show us the relevant source from doBodySearch()?
> > >>>>>>>
> > >>>>>>>         -h
> > >>>>>>>
> > >>>>>>>         On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
> > >>>>>>>> I ran some tests and it seems that the slowness is from Lucene calls
> > >>>>>>>> when I do "doBodySearch". If I remove that call, Lucene gives me
> > >>>>>>>> results in 5 seconds; otherwise it takes about 50 seconds.
> > >>>>>>>>
> > >>>>>>>> But I need to do Body search and that field contains lots of text.
> > >>>>>>>> The field is <contents>. How can I optimize that?
> > >>>>>>>>
> > >>>>>>>> thanks,
> > >>>>>>>> Askar
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> -------------------------------------------------------------------
> > >>>> --
> > >>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >>>> For additional commands, e-mail: [EMAIL PROTECTED]
> > >>>>
> > >>>>
> > >>
> > >> --------------------------
> > >> Grant Ingersoll
> > >> Center for Natural Language Processing
> > >> http://www.cnlp.org/tech/lucene.asp
> > >>
> > >> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/
> > >> LuceneFAQ
> > >>
> > >>
> > >>
> > >>
> > >>
> >
> > ------------------------------------------------------
> > Grant Ingersoll
> > http://www.grantingersoll.com/
> > http://lucene.grantingersoll.com
> > http://www.paperoftheweek.com/
> >
> >
> >
> >
> >
>
