Re: HitCollector or Hits

Otis Gospodnetic Thu, 24 May 2007 11:41:57 -0700

Carlos,
It sounds like you'll have to build logic that knows when the index has been 
reopened and repopulates your cache.  Take a look at Solr; it does this kind of 
thing.
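
A minimal sketch of that logic, with no Lucene dependency. IndexView is a
hypothetical stand-in for whatever the real reader exposes (a version number,
maxDoc(), and a stored-field lookup); StoreIdCache and its method names are
likewise made up for illustration. The cache rebuilds its doc id -> store_id
array only when the version it last saw differs from the current one.

```java
// Hypothetical stand-in for the parts of an index reader this sketch needs.
interface IndexView {
    long version();          // assumed to change whenever the index is reopened
    int maxDoc();            // one greater than the largest doc id
    int storeIdOf(int doc);  // reads the stored store_id field (the slow path)
}

// Caches doc id -> store_id and rebuilds the array when the version changes.
class StoreIdCache {
    private boolean loaded = false;
    private long cachedVersion;
    private int[] storeIds = new int[0];
    private int rebuilds = 0;

    synchronized int[] get(IndexView view) {
        long current = view.version();
        if (!loaded || current != cachedVersion) {
            // Real code would also have to skip deleted documents here.
            int[] fresh = new int[view.maxDoc()];
            for (int doc = 0; doc < fresh.length; doc++) {
                fresh[doc] = view.storeIdOf(doc);
            }
            storeIds = fresh;
            cachedVersion = current;
            loaded = true;
            rebuilds++;
        }
        return storeIds;
    }

    // Number of times the array has been rebuilt (handy for benchmarking).
    synchronized int rebuilds() {
        return rebuilds;
    }
}
```

Callers would ask the cache for the array before each search rather than
holding on to it, so a reopened index is picked up on the next request.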


Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Carlos Pita <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, May 24, 2007 12:50:04 PM
Subject: Re: HitCollector or Hits

Hi Erick,

I don't think that FieldSelector would be that valuable in my case because I
just need to access a few fields, and those are all fields that are in fact
stored (and indexed too). I was thinking of keeping this extra information
in memory, specifically in an array mapping doc ids to the data structure. I
see that this is done for ScoreDocComparator in a Lucene in Action example.
I'm still not sure how to achieve something similar with a HitCollector. I
mean, I could instantiate a maxDoc() size array and index it by the document
ids that are passed to the collector. But that said, I don't know how to
keep this array synchronized with the index. I've opened a new thread for
this subject, "maxDoc and arrays".
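
To make the array idea concrete, here is a rough sketch with no Lucene
dependency. The collect(int doc, float score) method mirrors the HitCollector
callback; TopStoresCollector, ScoredDoc and storeIdByDoc are hypothetical names
used only for illustration. Since the callback is not invoked in score order,
the sketch keeps the best N hits in a min-heap and then looks up each
retained doc id in the in-memory array.

```java
// One retained hit: a doc id and its score.
class ScoredDoc {
    final int doc;
    final float score;

    ScoredDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

// Keeps the `limit` best-scoring hits and reports their distinct store ids.
class TopStoresCollector {
    private final int[] storeIdByDoc;  // in-memory table indexed by doc id
    private final int limit;
    // Min-heap on score, so the weakest retained hit is always at the head.
    private final java.util.PriorityQueue<ScoredDoc> best =
            new java.util.PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopStoresCollector(int[] storeIdByDoc, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.storeIdByDoc = storeIdByDoc;
        this.limit = limit;
    }

    // Called once per matching document, in no particular score order.
    void collect(int doc, float score) {
        if (best.size() < limit) {
            best.add(new ScoredDoc(doc, score));
        } else if (score > best.peek().score) {
            best.poll();
            best.add(new ScoredDoc(doc, score));
        }
    }

    // Distinct store ids among the retained hits, in ascending order.
    java.util.SortedSet<Integer> storeIds() {
        java.util.SortedSet<Integer> stores = new java.util.TreeSet<>();
        for (ScoredDoc hit : best) {
            stores.add(storeIdByDoc[hit.doc]);
        }
        return stores;
    }
}
```

With limit set to 1000, storeIds() would give the values for the store filter
without loading any document, provided the array matches the current index.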

Thank you again.
Cheers,
Carlos

On 5/24/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> You're on the right track. But that said, access to anything that's
> indexed (stored or not) should be pretty quick. Things
> stored, but not indexed, are costlier. This might drive your
> decision on what to index vs. store.
>
> Loading the document means any call such as IndexReader.document() or
> Hits.doc().
>
> Part of the difference is that if you load the document, you get
> all the fields, whether you need them or not.
>
> Also, you can use your own TermEnum/TermDocs lookup for
> this kind of thing if the terms you're interested in are indexed...
>
> I wrote a mail some time ago detailing my experience with my own
> peculiar data set, which you may want to read. See the thread
>
> "Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+"
>
>
> As I mentioned in that message, I suspect that my improvement was
> *highly* dependent upon how the index is structured.....
>
> All that said, your notion of benchmarking is a very good one. It led
> me to using FieldSelector in the first place...
>
> Best
> Erick
>
> On 5/24/07, Carlos Pita <[EMAIL PROTECTED] > wrote:
> >
> > Hi Erick,
> >
> > thank you for your prompt answer. What do you mean by loading the
> > document? Accessing one of the stored fields? In that case I'm afraid
> > I would need to do it. For example, in the aforementioned case of a
> > result of products, I have to look at each product's store_id, which
> > is stored along with the document. Is this a performance killer? Maybe
> > I should keep some tables in memory, for example an array mapping from
> > id to store_id in O(1). I will do some benchmarking first anyway.
> >
> > Cheers,
> > Carlos
> >
> > On 5/24/07, Erick Erickson < [EMAIL PROTECTED]> wrote:
> > >
> > > I know of no way to alter the Hits behavior. I recommend using
> > > a TopDocs/TopDocCollector.
> > >
> > > But be aware that if you load the document for each one, you may incur
> > > a significant penalty, although lazy loading helped me a lot; see
> > > FieldSelector.
> > >
> > > On 5/23/07, Carlos Pita <[EMAIL PROTECTED] > wrote:
> > > >
> > > > Hi folks,
> > > >
> > > > I need to collect some global information from my first 1000 search
> > > > results in order to build up some search refining components
> > > > containing only relevant values (those which correspond to at least
> > > > one of the first 1000 hits). For example, the results are products
> > > > and there is a store filter component that shows only the stores
> > > > that sell a product among the first 1000 hits. So even if the user
> > > > sees just the first 20, I would have to inspect the first 1000. I've
> > > > read that Hits maintains a cache of about 100 or 200 hits. Is this
> > > > configurable? If I could set this cache to 1000 I would then use
> > > > Hits to browse the search results. Otherwise, I should use
> > > > HitCollector. What's your advice?
> > > >
> > > > TIA
> > > > Cheers,
> > > > Carlos
> > > >
> > >
> >
>



