Instead of refactoring the code, would there be a way to just modify the query in each search routine ?
Such as, "search contents:<text> and item:<itemID>"; This means it would just collect the score of that one document whose itemID field = itemID passed from while(rs.next()). I just need to collect the score of the <itemID> already in the index. Would there be a way to modify the query ? Add a clause ? thanks, Askar On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > So, you really want a single Lucene score (based on the scores of > your 4 fields) for every itemID, correct? And this score consists of > scoring the title, tag, summary and body against some keywords correct? > > Here's what I would do: > > while (rs.next()) > { > doc = getDocument(itemId); // Get your document, including > contents from your database, no need even to put them in Lucene, > although you could > add the doc to a MemoryIndex (see contrib/memory) > Run your 4 searches against that memory index to get your > score. Even better, combine your query into a single query that > searches all 4 fields at once, then Lucene will combine the score for > you > } > > MemoryIndex info can be found at http://lucene.zones.apache.org:8080/ > hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/ > package-summary.html > > -Grant > > On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote: > > > Hi Grant, > > > > Thanks for the response. Heres what I am trying to accomplish: > > > > 1. Iterate over itemID (unique) in the database using one SQL query. > > 2. For every itemID found, run 4 searches on Lucene Index. > > 3. doTagSearch(itemID....) ; collect score > > 4. doTitleSearch(itemID...) ; collect score > > 5. doSummarySearch(itemID...) ; collect score > > 6. doBodySearch(itemID....) ; collect score > > > > These scores are then added and I get a total score for each unique > > item in > > the database. > > > > Lucene Index has: <itemID><tags><title><summary><contents> > > > > So if I am running a body search, I have 92 hits from over 300 > > documents for > > a query. I already know my hit with the <itemID> . > > > > For instance, from step (1) if itemID 16 is passed to all the 4 > > searches, I > > just need to get the score of the document which has itemID field = > > 16. I > > don't have to iterate over all the hits. > > > > I suppose I have to change my query to look for <contents> where > > itemID=16. > > Can you guide me as to how to do it ? > > > > thanks a ton, > > > > Askar > > > > On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > >> > >> Hi Askar, > >> > >> I suggest we take a step back, and ask the question, what are you > >> trying to accomplish? That is, what is your application trying to > >> do? Forget the code, etc. just explain what you want the end result > >> to be and we can work from there. Based on what you have described, > >> I am not sure you need access to the hits. It seems like you just > >> need to make better queries. > >> > >> Is your itemID a unique identifier? If yes, then you shouldn't need > >> to loop over hits at all, as you should only ever have one result IF > >> your query contains a required term. Also, if this is the case, why > >> do you need to do a search at all? Haven't you already identified > >> the items of interest when you did your select query in the > >> database? Or is it that you want to score the item based on some > >> terms as well. If that is the case, there are other ways of doing > >> this and we can discuss them. > >> > >> -Grant > >> > >> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote: > >> > >>> Hey Guys, > >>> > >>> I need to know how I can use the HitCollector class ? I am using > >>> Hits and > >>> looping over all the possible document hits (turns out its 92 times > >>> I am > >>> looping; for 300 searches, its 300*92 !!). Can I avoid this using > >>> HitCollector ? I can't seem to understand how its used. > >>> > >>> thanks a lot, > >>> > >>> Askar > >>> > >>> On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote: > >>>> > >>>> Askar, > >>>> why do you need to add +id:<idWeCareAbout>? > >>>> thanks, > >>>> dt, > >>>> www.ejinz.com > >>>> search engine news forms > >>>> ----- Original Message ----- > >>>> From: "Askar Zaidi" <[EMAIL PROTECTED]> > >>>> To: <java-user@lucene.apache.org>; <[EMAIL PROTECTED]> > >>>> Sent: Wednesday, July 25, 2007 12:39 AM > >>>> Subject: Re: Fine Tuning Lucene implementation > >>>> > >>>> > >>>>> Hey Hira , > >>>>> > >>>>> Thanks so much for the reply. Much appreciate it. > >>>>> > >>>>> Quote: > >>>>> > >>>>> Would it be possible to just include a query clause? > >>>>> - i.e., instead of just contents:<userQuery>, also add > >>>>> +id:<idWeCareAbout> > >>>>> > >>>>> How can I do that ? > >>>>> > >>>>> I see my query as : > >>>>> > >>>>> +contents:harvard +contents:business +contents:review > >>>>> > >>>>> where the search phrase was: harvard business review > >>>>> > >>>>> Now how can I add +id:<idWeCareAbout> ?? > >>>>> > >>>>> This would give me that one exact document I am looking for , for > >>>>> that > >>>> id. > >>>>> I > >>>>> don't have to iterate through hits. > >>>>> > >>>>> thanks, > >>>>> > >>>>> Askar > >>>>> > >>>>> > >>>>> > >>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote: > >>>>>> > >>>>>> I'm no expert on this (so please accept the comments in that > >>>>>> context) > >>>>>> but 2 things seem weird to me: > >>>>>> > >>>>>> 1. Iterating over each hit is an expensive proposition. I've > >>>>>> often > >>>>>> seen people recommending a HitCollector. > >>>>>> > >>>>>> 2. It seems that doBodySearch() is essentially saying, do this > >>>>>> search > >>>>>> and return the score pertinent to this ID (using an exhaustive > >>>>>> loop). > >>>>>> Would it be possible to just include a query clause? > >>>>>> - i.e., instead of just contents:<userQuery>, also add > >>>>>> +id:<idWeCareAbout> > >>>>>> > >>>>>> In general though, I think your algorithm seems inefficient (if I > >>>>>> understand it correctly):-- if I want to search for one term > >>>>>> among 3 in > >>>>>> a "collection" of 300 documents (as defined by some external > >>>> attribute), > >>>>>> I will wind up executing 300 x 3 searches, and for each search > >>>>>> that is > >>>>>> executed, I will iterate over every Hit, even if I've already > >>>>>> found the > >>>>>> one that I "care about". > >>>>>> > >>>>>> What would break if you: > >>>>>> 1. Included "creator" in the Lucene index (or, filtered out the > >>>>>> Hits > >>>>>> using a BitSet or something like it) > >>>>>> 2. Executed 1 search > >>>>>> 3. Collected the results of the first N Hits (where N is some > >>>>>> reasonable limit, like 100 or 500) > >>>>>> > >>>>>> -h > >>>>>> > >>>>>> > >>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote: > >>>>>> > >>>>>>> Sure. > >>>>>>> > >>>>>>> public float doBodySearch(Searcher searcher,String query, int > >>>>>>> id){ > >>>>>>> > >>>>>>> try{ > >>>>>>> score = search(searcher, > >>>>>>> query,id); > >>>>>>> } > >>>>>>> catch(IOException io){} > >>>>>>> catch(ParseException pe){} > >>>>>>> > >>>>>>> return score; > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> private float search(Searcher searcher, String queryString, > >>>>>>> int id) > >>>>>>> throws ParseException, IOException { > >>>>>>> > >>>>>>> // Build a Query object > >>>>>>> > >>>>>>> QueryParser queryParser = new QueryParser("contents", > >>>>>>> new > >>>>>>> KeywordAnalyzer()); > >>>>>>> > >>>>>>> queryParser.setDefaultOperator > >>>>>>> (QueryParser.Operator.AND); > >>>>>>> > >>>>>>> Query query = queryParser.parse(queryString); > >>>>>>> > >>>>>>> // Search for the query > >>>>>>> > >>>>>>> Hits hits = searcher.search(query); > >>>>>>> Document doc = null; > >>>>>>> > >>>>>>> // Examine the Hits object to see if there were any > >>>>>>> matches > >>>>>>> int hitCount = hits.length(); > >>>>>>> > >>>>>>> for(int i=0;i<hitCount;i++){ > >>>>>>> doc = hits.doc(i); > >>>>>>> String str = doc.get("item"); > >>>>>>> int tmp = Integer.parseInt(str); > >>>>>>> if(tmp==id) > >>>>>>> score = hits.score(i); > >>>>>>> } > >>>>>>> > >>>>>>> return score; > >>>>>>> } > >>>>>>> > >>>>>>> I really need to optimize doBodySearch(...) as this takes the > >>>>>>> most > >>>>>>> time. > >>>>>>> > >>>>>>> thanks guys, > >>>>>>> Askar > >>>>>>> > >>>>>>> > >>>>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote: > >>>>>>> > >>>>>>> Could you show us the relevant source from > >>>>>>> doBodySearch()? > >>>>>>> > >>>>>>> -h > >>>>>>> > >>>>>>> On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote: > >>>>>>>> I ran some tests and it seems that the slowness is from > >>>>>>> Lucene calls when I > >>>>>>>> do "doBodySearch", if I remove that call, Lucene gives me > >>>>>>> results in 5 > >>>>>>>> seconds. otherwise it takes about 50 seconds. > >>>>>>>> > >>>>>>>> But I need to do Body search and that field contains lots > >>>> of > >>>>>>> text. The field > >>>>>>>> is <contents>. How can I optimize that ? > >>>>>>>> > >>>>>>>> thanks, > >>>>>>>> Askar > >>>>>>>> > >>>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>>> ------------------------------------------------------------------- > >>>> -- > >>>> To unsubscribe, e-mail: [EMAIL PROTECTED] > >>>> For additional commands, e-mail: [EMAIL PROTECTED] > >>>> > >>>> > >> > >> -------------------------- > >> Grant Ingersoll > >> Center for Natural Language Processing > >> http://www.cnlp.org/tech/lucene.asp > >> > >> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/ > >> LuceneFAQ > >> > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: [EMAIL PROTECTED] > >> For additional commands, e-mail: [EMAIL PROTECTED] > >> > >> > > ------------------------------------------------------ > Grant Ingersoll > http://www.grantingersoll.com/ > http://lucene.grantingersoll.com > http://www.paperoftheweek.com/ > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >