TopDocCollector

Uwe Schindler Wed, 10 Jun 2009 11:01:58 -0700

That looks good, but it puts an inner lookup loop inside the main search
loop: the hit collector fetches the stored fields for every hit. For a few
results this is OK, but if you are collecting thousands of hits from a very
large index that does not fit into memory, collecting gets slow because of
the many disk seeks (even when you filter out some fields with a
FieldSelector, the blocks still have to be read from disk).


To optimize, index the file name as a non-tokenized, indexed term instead of
a stored field. You can then use

String[] arr = FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "FILE");

The returned array contains one entry per document ID. Inside the search
loop, just use arr[docID] to get the file name. Note that on large indexes
the initial field cache loading can take some time.
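
To make the access pattern concrete, here is a minimal sketch in plain Java.
It does not use the Lucene classes: HitCallback and collectFiles are
hypothetical stand-ins, and the String[] parameter plays the role of the
array returned by the field cache. The point it illustrates is that the
collector only does an array lookup per hit and never touches stored fields.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class FieldCacheLookupSketch {

    /** Hypothetical stand-in for the per-hit callback of a hit collector. */
    interface HitCallback {
        void collect(int docID, float score);
    }

    /**
     * Builds the list of files for the matching documents.
     * fileNameByDoc is assumed to hold one entry per document ID,
     * like the array obtained from the field cache.
     */
    static List<File> collectFiles(final String[] fileNameByDoc, int[] matchingDocIDs) {
        final List<File> files = new ArrayList<File>();
        HitCallback callback = new HitCallback() {
            public void collect(int docID, float score) {
                // Plain array lookup: no stored-field read, no disk seek per hit.
                files.add(new File(fileNameByDoc[docID]));
            }
        };
        // Stand-in for the searcher driving the collector over all matches.
        for (int docID : matchingDocIDs) {
            callback.collect(docID, 1.0f);
        }
        return files;
    }
}
```

In the real code the array would be loaded once, before the search starts,
and the same array can be reused for later searches on the same reader.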

In Lucene 2.9 this gets better with the new Collectors, which work directly
on segments. If you want to use 2.9, just ask how the same can be achieved
there. The new collector can be optimized to fetch the FieldCache for each
segment inside Collector.setNextReader().
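
As a rough illustration of the per-segment idea, the following plain-Java
sketch models it without the Lucene classes. Segment, FileNameCollector and
search are hypothetical stand-ins; only the shape of setNextReader (called
once per segment with a doc base) and collect (called with a
segment-relative document number) follows the 2.9 Collector. The collector
swaps in the per-segment array whenever a new segment starts.

```java
import java.util.ArrayList;
import java.util.List;

class PerSegmentCollectorSketch {

    /** Hypothetical stand-in for one index segment and its matching documents. */
    static final class Segment {
        final String[] fileNameByLocalDoc; // plays the role of the per-segment cache array
        final int[] matchingLocalDocs;     // segment-relative document numbers that match

        Segment(String[] fileNameByLocalDoc, int[] matchingLocalDocs) {
            this.fileNameByLocalDoc = fileNameByLocalDoc;
            this.matchingLocalDocs = matchingLocalDocs;
        }
    }

    /** Hypothetical collector that switches its lookup array on every new segment. */
    static final class FileNameCollector {
        final List<String> fileNames = new ArrayList<String>();
        final List<Integer> globalDocIDs = new ArrayList<Integer>();
        private String[] current;
        private int docBase;

        void setNextReader(Segment segment, int docBase) {
            this.current = segment.fileNameByLocalDoc;
            this.docBase = docBase;
        }

        void collect(int localDoc) {
            fileNames.add(current[localDoc]);     // lookup uses the segment-relative number
            globalDocIDs.add(docBase + localDoc); // index-wide ID, only if it is needed
        }
    }

    /** Stand-in for the searcher visiting the segments one after another. */
    static FileNameCollector search(List<Segment> segments) {
        FileNameCollector collector = new FileNameCollector();
        int docBase = 0;
        for (Segment segment : segments) {
            collector.setNextReader(segment, docBase);
            for (int localDoc : segment.matchingLocalDocs) {
                collector.collect(localDoc);
            }
            docBase += segment.fileNameByLocalDoc.length;
        }
        return collector;
    }
}
```

Because each array covers only one segment, reopening the index after a
change only needs to load the arrays for new segments instead of rebuilding
one big array for the whole index.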

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Paul J. Lucas [mailto:p...@lucasmail.org]
> Sent: Wednesday, June 10, 2009 5:26 PM
> To: java-user@lucene.apache.org
> Subject: Re: Migrating from Hit/Hits to TopDocs/TopDocCollector
> 
> On Jun 10, 2009, at 3:17 AM, Uwe Schindler wrote:
> 
> > A HitCollector is the correct way to do this (especially because the
> > order of hits is mostly not interesting when retrieving all hits).
> 
> OK, here's what I came up with:
> 
>      Term t = /* ... */;
>      final Collection<File> files = new LinkedList<File>();
>      final FieldSelector fieldSelector = new FieldSelector() {
>          public FieldSelectorResult accept( String fieldName ) {
>              if ( fieldName.equals( "FILE" ) )
>                  return FieldSelectorResult.LOAD_AND_BREAK;
>              return FieldSelectorResult.NO_LOAD;
>          }
>      };
>      HitCollector hitCollector = new HitCollector() {
>          public void collect( int docID, float score ) {
>              try {
>                  Document doc = searcher.doc( docID, fieldSelector );
>                  files.add( new File( doc.get( "FILE" ) ) );
>              }
>              catch ( Exception e ) {
>                  // ignore
>              }
>          }
>      };
>      searcher.search( new TermQuery( t ), hitCollector );
> 
> How's that?
> 
> - Paul
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



