[ https://issues.apache.org/jira/browse/LUCENE-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914304#action_12914304 ]
Robert Muir commented on LUCENE-851:
------------------------------------
Marvin: is this functionality addressed by LUCENE-2482?
> Pruning
> -------
>
> Key: LUCENE-851
> URL: https://issues.apache.org/jira/browse/LUCENE-851
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index, Search
> Reporter: Marvin Humphrey
> Priority: Minor
>
> Greets,
> A thread on java-dev a couple of months ago drew my attention to a technique
> used by Nutch for cutting down the number of hits that have to be processed:
> if you have an algorithm for ordering documents by importance, and you sort
> them so that the lowest document numbers have the highest rank, then most of
> your high-scoring hits are going to occur early on in the hit-collection
> process. Say you're looking for the top 100 matches -- the odds are pretty
> good that after you've found 1000 hits, you've gotten most of the good stuff.
> It may not be necessary to score the remaining hits -- say, 5,000,000 of them.
> To pull this off in Nutch, they run the index through a post process whereby
> documents are re-ordered by page score using the IndexSorter class.
> Unfortunately, post-processing does not live happily with incremental
> indexing.
> However, if we ensure that document numbers are ordered according to our
> criteria within each segment, that's almost as good.
> Say we're looking for 100 hits, as before; what we do is collect a maximum of
> 1000 hits per segment. If we are dealing with an index made up of 25
> segments, that's 25,000 hits max we'll have to process fully -- the rest we
> can skip over. That's not as quick as processing 1000 hits and then
> stopping, as we could in a fully optimized index, but it's a lot better
> than churning through all 5,000,000 hits.
> A lot of those hits from the smallest segments will be garbage; we'll get
> most of our good hits from a few large segments most of the time. But that's
> fine -- the cost to process any one segment is small.
> Writing a low-level scoring loop which implements pruning per segment is
> straightforward. KinoSearch's version (in C) is below.
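> (As a rough Java rendering of the same loop -- all names here are
> hypothetical, sketched against the old Scorer/HitCollector style rather
> than any real Lucene API:)
>
>     // Collect hits segment by segment, capping each segment at
>     // maxHitsPerSeg; segStarts[i] is segment i's first doc number.
>     void collectPruned(Scorer scorer, HitCollector hc, int[] segStarts,
>                        int maxDoc, int maxHitsPerSeg) throws IOException {
>         if (!scorer.skipTo(0))
>             return;
>         int doc = scorer.doc();
>         for (int seg = 0; seg < segStarts.length; seg++) {
>             int segEnd = (seg + 1 < segStarts.length)
>                 ? segStarts[seg + 1] : maxDoc;
>             if (doc < segStarts[seg]) {   // prune: jump to this seg's start
>                 if (!scorer.skipTo(segStarts[seg]))
>                     return;
>                 doc = scorer.doc();
>             }
>             int hitsThisSeg = 0;
>             while (doc < segEnd && hitsThisSeg < maxHitsPerSeg) {
>                 hc.collect(doc, scorer.score()); // docs pre-sorted by rank
>                 hitsThisSeg++;
>                 if (!scorer.next())
>                     return;
>                 doc = scorer.doc();
>             }
>         }
>     }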
> To control the amount of pruning, we need a high-level
> Searcher.setPruneFactor API, which sets a multiplier; the number of
> hits-per-segment which must be processed is determined by multiplying the
> number of hits you need by pruneFactor. Here's code from KS for deriving
> hits-per-seg:
> # process prune_factor if supplied
> my $seg_starts;
> my $hits_per_seg = 2**31 - 1;
> if ( defined $self->{prune_factor} and defined $args{num_wanted} ) {
>     my $prune_count = $self->{prune_factor} * $args{num_wanted};
>     if ( $prune_count < $hits_per_seg ) {    # don't exceed I32_MAX
>         $hits_per_seg = $prune_count;
>         $seg_starts   = $reader->get_seg_starts;
>     }
> }
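> In Lucene terms, the same derivation might look roughly like this (a
> sketch only -- "pruneFactor", "numWanted", and "getSegStarts" are
> hypothetical names, not an existing API):
>
>     // Derive the per-segment hit budget from a prune factor; clamp the
>     // product so it can't exceed Integer.MAX_VALUE.
>     int hitsPerSeg = Integer.MAX_VALUE;
>     int[] segStarts = null;
>     if (pruneFactor > 0.0f && numWanted > 0) {
>         long pruneCount = (long) ((double) pruneFactor * numWanted);
>         if (pruneCount < Integer.MAX_VALUE) {
>             hitsPerSeg = (int) pruneCount;
>             segStarts  = reader.getSegStarts(); // hypothetical accessor
>         }
>     }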
> What I have not yet written is the index-time mechanism for sorting
> documents.
> In Nutch, they use the norms from a known indexed, non-tokenized field
> ("site"). However, in Lucene and KS, we can't count on any existing fields.
> Document boost isn't stored directly, either. The obvious answer is to start
> storing it, which would suffice for Nutch-like uses. However, it may make
> sense to avoid coupling document ordering to boost, so that pruning can be
> influenced without affecting scores.
> The sort ordering information needs a permanent home in the index, since it
> will be needed whenever segment merging occurs. The fixed-width per-document
> storage in Lucene's .fdx file seems like a good place. If we use one float
> per document, we can simply put it before or after the 64-bit file pointer
> and seek into the file after multiplying the doc num by 12 rather than 8.
> During indexing, we'd keep the ordering info in an array; after all documents
> for a segment have been added, we create an array of sorted document numbers.
> When flushing the postings, their document numbers get remapped using the
> sorted array. Then we rewrite the .fdx file (and also the .tvx file), moving
> the file pointers (and ordering info) to remapped locations. The fact that
> the .fdt file is now "out of order" isn't a problem -- optimizing
> sequential access to that file isn't important.
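> As a sketch of that remapping step (names hypothetical; uses
> java.util.Arrays and java.util.Comparator): sort the doc numbers by
> their ordering values, then invert the permutation to get the
> old-to-new map applied to postings and file pointers:
>
>     // Build an old->new doc number map: a higher ordering value gets
>     // a lower (earlier) new doc number.
>     int[] buildDocMap(final float[] ordering) {
>         int n = ordering.length;
>         Integer[] byRank = new Integer[n];
>         for (int i = 0; i < n; i++)
>             byRank[i] = i;
>         // Sort doc numbers so the highest ordering value comes first.
>         Arrays.sort(byRank, new Comparator<Integer>() {
>             public int compare(Integer a, Integer b) {
>                 return Float.compare(ordering[b], ordering[a]);
>             }
>         });
>         // Invert the permutation: oldToNew[oldDoc] = newDoc.
>         int[] oldToNew = new int[n];
>         for (int newDoc = 0; newDoc < n; newDoc++)
>             oldToNew[byRank[newDoc]] = newDoc;
>         return oldToNew;
>     }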
> This issue is closely tied to LUCENE-843, "Improve how IndexWriter uses RAM
> to buffer added documents", and LUCENE-847, "Factor merge policy out of
> IndexWriter". Michael McCandless, Steven Parks, Ning Li, anybody else...
> comments? Suggestions?
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
> --------------------------------------------------------------------
> void
> Scorer_collect(Scorer *self, HitCollector *hc, u32_t start, u32_t end,
>                u32_t hits_per_seg, VArray *seg_starts)
> {
>     u32_t seg_num        = 0;
>     u32_t doc_num_thresh = 0;
>     u32_t hits_this_seg  = 0;
>     u32_t hits_thresh    = hits_per_seg;
>
>     /* get to first doc */
>     if ( !Scorer_Skip_To(self, start) )
>         return;
>
>     /* execute scoring loop */
>     do {
>         u32_t doc_num = Scorer_Doc(self);
>         Tally *tally;
>
>         if (hits_this_seg >= hits_thresh || doc_num >= doc_num_thresh) {
>             if (doc_num >= end) {
>                 /* bail out of loop if we've hit the user-spec'd end */
>                 return;
>             }
>             else if (seg_starts == NULL || seg_starts->size == 0) {
>                 /* must supply seg_starts to enable pruning */
>                 hits_thresh    = U32_MAX;
>                 doc_num_thresh = end;
>             }
>             else if (seg_num == seg_starts->size) {
>                 /* we've finished the last segment */
>                 return;
>             }
>             else {
>                 /* get start of upcoming segment */
>                 Int *this_start = (Int*)VA_Fetch(seg_starts, seg_num);
>                 Int *next_start = (Int*)VA_Fetch(seg_starts, seg_num + 1);
>                 u32_t this_seg_start = this_start->value;
>                 seg_num++;
>
>                 /* skip docs as appropriate if we're pruning */
>                 if (doc_num < this_seg_start) {
>                     if ( Scorer_Skip_To(self, this_seg_start) )
>                         doc_num = Scorer_Doc(self);
>                     else
>                         return;
>                 }
>
>                 /* set the last doc_num we'll score this upcoming segment */
>                 doc_num_thresh = next_start == NULL
>                     ? end /* last segment */
>                     : next_start->value;
>             }
>
>             /* start over counting hits for the new segment */
>             hits_this_seg = 0;
>         }
>
>         /* this doc is in range, so collect it */
>         tally = Scorer_Tally(self);
>         hc->collect(hc, doc_num, tally->score);
>         hits_this_seg++;
>
>     } while (Scorer_Next(self));
> }