Re: Fine Tuning Lucene implementation

Askar Zaidi Wed, 25 Jul 2007 09:26:44 -0700

Instead of refactoring the code, would there be a way to just modify the
query in each search routine?

Something like "contents:<text> AND item:<itemID>". That way the search
would return only the score of the one document whose itemID field equals
the itemID passed in from the while(rs.next()) loop.

I just need to collect the score of the <itemID> already in the index.

Would there be a way to modify the query? Add a clause?
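
To make the idea concrete, here is a minimal sketch of how such a clause
could be appended at the query-string level. It assumes the ID is indexed
untokenized in a field named "item" (the name used by doc.get("item") in
the code quoted below); ItemQueryStrings and withItemClause are made-up
names, not part of Lucene. The user text is not escaped, so any
query-syntax characters in it would still be interpreted by QueryParser.

```java
// Sketch only: builds a Lucene query string that requires both the user's
// text and one specific item. The field name "item" is an assumption taken
// from the doc.get("item") call in the quoted code; the class and method
// names are hypothetical.
final class ItemQueryStrings {

    private ItemQueryStrings() {}

    // Wraps the user's text in a required group and adds a required clause
    // on the item field, e.g. "+(harvard business review) +item:16".
    // The user text is trimmed but not escaped.
    static String withItemClause(String userQuery, int itemId) {
        return "+(" + userQuery.trim() + ") +item:" + itemId;
    }
}
```

The returned string would be handed to queryParser.parse(...) in place of
the bare search phrase. If the item field really is unique, the search
should then return at most one hit, so the loop over hits would no longer
be needed. The required item clause also takes part in scoring, so the
number will probably not be identical to the one from the original query.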

thanks,
Askar


On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> So, you really want a single Lucene score (based on the scores of
> your 4 fields) for every itemID, correct?  And this score consists of
> scoring the title, tag, summary and body against some keywords correct?
>
> Here's what I would do:
>
> while (rs.next())
> {
>      doc = getDocument(itemId);  // Get your document, including
>                                  // contents from your database; no need
>                                  // even to put them in Lucene, although
>                                  // you could
>      // add the doc to a MemoryIndex (see contrib/memory)
>      // Run your 4 searches against that memory index to get your
>      // score.  Even better, combine your query into a single query that
>      // searches all 4 fields at once, then Lucene will combine the score
>      // for you
> }
>
> MemoryIndex info can be found at
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
>
> -Grant
>
> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
>
> > Hi Grant,
> >
> > Thanks for the response. Here's what I am trying to accomplish:
> >
> > 1. Iterate over itemID (unique) in the database using one SQL query.
> > 2. For every itemID found, run 4 searches on Lucene Index.
> > 3. doTagSearch(itemID....) ; collect score
> > 4. doTitleSearch(itemID...) ; collect score
> > 5. doSummarySearch(itemID...) ; collect score
> > 6. doBodySearch(itemID....) ; collect score
> >
> > These scores are then added and I get a total score for each unique
> > item in the database.
> >
> > Lucene Index has: <itemID><tags><title><summary><contents>
> >
> > So if I am running a body search, I have 92 hits from over 300
> > documents for a query. I already know my hit with the <itemID>.
> >
> > For instance, from step (1) if itemID 16 is passed to all the 4
> > searches, I just need to get the score of the document which has
> > itemID field = 16. I don't have to iterate over all the hits.
> >
> > I suppose I have to change my query to look for <contents> where
> > itemID=16. Can you guide me as to how to do it?
> >
> > thanks a ton,
> >
> > Askar
> >
> > On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Askar,
> >>
> >> I suggest we take a step back, and ask the question, what are you
> >> trying to accomplish?  That is, what is your application trying to
> >> do?  Forget the code, etc. just explain what you want the end result
> >> to be and we can work from there.   Based on what you have described,
> >> I am not sure you need access to the hits.  It seems like you just
> >> need to make better queries.
> >>
> >> Is your itemID a unique identifier?  If yes, then you shouldn't need
> >> to loop over hits at all, as you should only ever have one result IF
> >> your query contains a required term.  Also, if this is the case, why
> >> do you need to do a search at all?  Haven't you already identified
> >> the items of interest when you did your select query in the
> >> database?  Or is it that you want to score the item based on some
> >> terms as well.  If that is the case, there are other ways of doing
> >> this and we can discuss them.
> >>
> >> -Grant
> >>
> >> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
> >>
> >>> Hey Guys,
> >>>
> >>> I need to know how I can use the HitCollector class. I am using Hits
> >>> and looping over all the possible document hits (it turns out I am
> >>> looping 92 times; for 300 searches, it's 300*92!). Can I avoid this
> >>> using HitCollector? I can't seem to understand how it's used.
> >>>
> >>> thanks a lot,
> >>>
> >>> Askar
> >>>
> >>> On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>> Askar,
> >>>> why do you need to add +id:<idWeCareAbout>?
> >>>> thanks,
> >>>> dt,
> >>>> www.ejinz.com
> >>>> search engine news forms
> >>>> ----- Original Message -----
> >>>> From: "Askar Zaidi" <[EMAIL PROTECTED]>
> >>>> To: <java-user@lucene.apache.org>; <[EMAIL PROTECTED]>
> >>>> Sent: Wednesday, July 25, 2007 12:39 AM
> >>>> Subject: Re: Fine Tuning Lucene implementation
> >>>>
> >>>>
> >>>>> Hey Hira ,
> >>>>>
> >>>>> Thanks so much for the reply. Much appreciate it.
> >>>>>
> >>>>> Quote:
> >>>>>
> >>>>> Would it be possible to just include a query clause?
> >>>>>   - i.e., instead of just contents:<userQuery>, also add
> >>>>> +id:<idWeCareAbout>
> >>>>>
> >>>>> How can I do that ?
> >>>>>
> >>>>> I see my query as :
> >>>>>
> >>>>> +contents:harvard +contents:business +contents:review
> >>>>>
> >>>>> where the search phrase was: harvard business review
> >>>>>
> >>>>> Now how can I add +id:<idWeCareAbout>?
> >>>>>
> >>>>> This would give me the one exact document I am looking for, for
> >>>>> that id. I don't have to iterate through hits.
> >>>>>
> >>>>> thanks,
> >>>>>
> >>>>> Askar
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>> I'm no expert on this (so please accept the comments in that
> >>>>>> context) but 2 things seem weird to me:
> >>>>>>
> >>>>>> 1.  Iterating over each hit is an expensive proposition.  I've
> >>>>>> often seen people recommending a HitCollector.
> >>>>>>
> >>>>>> 2.  It seems that doBodySearch() is essentially saying, do this
> >>>>>> search and return the score pertinent to this ID (using an
> >>>>>> exhaustive loop).
> >>>>>> Would it be possible to just include a query clause?
> >>>>>>     - i.e., instead of just contents:<userQuery>, also add
> >>>>>> +id:<idWeCareAbout>
> >>>>>>
> >>>>>> In general though, I think your algorithm seems inefficient (if I
> >>>>>> understand it correctly): if I want to search for one term among
> >>>>>> 3 in a "collection" of 300 documents (as defined by some external
> >>>>>> attribute), I will wind up executing 300 x 3 searches, and for
> >>>>>> each search that is executed, I will iterate over every Hit, even
> >>>>>> if I've already found the one that I "care about".
> >>>>>>
> >>>>>> What would break if you:
> >>>>>> 1.  Included "creator" in the Lucene index (or, filtered out the
> >>>>>> Hits using a BitSet or something like it)
> >>>>>> 2.  Executed 1 search
> >>>>>> 3.  Collected the results of the first N Hits (where N is some
> >>>>>> reasonable limit, like 100 or 500)
> >>>>>>
> >>>>>> -h
> >>>>>>
> >>>>>>
> >>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
> >>>>>>
> >>>>>>> Sure.
> >>>>>>>
> >>>>>>>  public float doBodySearch(Searcher searcher, String query, int id) {
> >>>>>>>      try {
> >>>>>>>          score = search(searcher, query, id);
> >>>>>>>      }
> >>>>>>>      catch (IOException io) {}
> >>>>>>>      catch (ParseException pe) {}
> >>>>>>>
> >>>>>>>      return score;
> >>>>>>>  }
> >>>>>>>
> >>>>>>>  private float search(Searcher searcher, String queryString, int id)
> >>>>>>>          throws ParseException, IOException {
> >>>>>>>
> >>>>>>>      // Build a Query object
> >>>>>>>      QueryParser queryParser = new QueryParser("contents", new KeywordAnalyzer());
> >>>>>>>      queryParser.setDefaultOperator(QueryParser.Operator.AND);
> >>>>>>>      Query query = queryParser.parse(queryString);
> >>>>>>>
> >>>>>>>      // Search for the query
> >>>>>>>      Hits hits = searcher.search(query);
> >>>>>>>      Document doc = null;
> >>>>>>>
> >>>>>>>      // Examine the Hits object to see if there were any matches
> >>>>>>>      int hitCount = hits.length();
> >>>>>>>
> >>>>>>>      for (int i = 0; i < hitCount; i++) {
> >>>>>>>          doc = hits.doc(i);
> >>>>>>>          String str = doc.get("item");
> >>>>>>>          int tmp = Integer.parseInt(str);
> >>>>>>>          if (tmp == id)
> >>>>>>>              score = hits.score(i);
> >>>>>>>      }
> >>>>>>>
> >>>>>>>      return score;
> >>>>>>>  }
> >>>>>>>
> >>>>>>> I really need to optimize doBodySearch(...) as this takes the
> >>>>>>> most time.
> >>>>>>>
> >>>>>>> thanks guys,
> >>>>>>> Askar
> >>>>>>>
> >>>>>>>
> >>>>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>>         Could you show us the relevant source from doBodySearch()?
> >>>>>>>
> >>>>>>>         -h
> >>>>>>>
> >>>>>>>         On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
> >>>>>>>> I ran some tests and it seems that the slowness is from Lucene
> >>>>>>>> calls when I do "doBodySearch". If I remove that call, Lucene
> >>>>>>>> gives me results in 5 seconds; otherwise it takes about 50
> >>>>>>>> seconds.
> >>>>>>>>
> >>>>>>>> But I need to do Body search and that field contains lots of
> >>>>>>>> text. The field is <contents>. How can I optimize that?
> >>>>>>>>
> >>>>>>>> thanks,
> >>>>>>>> Askar
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> -------------------------------------------------------------------
> >>>> --
> >>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>>
> >>>>
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> Center for Natural Language Processing
> >> http://www.cnlp.org/tech/lucene.asp
> >>
> >> Read the Lucene Java FAQ at
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >>
> >>
> >>
> >>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>
>
>
>
>
