[ https://issues.apache.org/jira/browse/LUCENE-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914304#action_12914304 ]
Robert Muir commented on LUCENE-851:
------------------------------------
Marvin: is this functionality addressed by LUCENE-2482?
> Pruning
> -------
>
> Key: LUCENE-851
> URL: https://issues.apache.org/jira/browse/LUCENE-851
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index, Search
> Reporter: Marvin Humphrey
> Priority: Minor
>
> Greets,
> A thread on java-dev a couple of months ago drew my attention to a technique
> used by Nutch for cutting down the number of hits that have to be processed:
> if you have an algorithm for ordering documents by importance, and you sort
> them so that the lowest document numbers have the highest rank, then most of
> your high-scoring hits are going to occur early on in the hit-collection
> process. Say you're looking for the top 100 matches -- the odds are pretty
> good that after you've found 1000 hits, you've gotten most of the good stuff.
> It may not be necessary to score the remaining hits -- say, 5,000,000 of them.
> To pull this off in Nutch, they run the index through a post process whereby
> documents are re-ordered by page score using the IndexSorter class.
> Unfortunately, post-processing does not live happily with incremental
> indexing.
> However, if we ensure that document numbers are ordered according to our
> criteria within each segment, that's almost as good.
> Say we're looking for 100 hits, as before; what we do is collect a maximum of
> 1000 hits per segment. If we are dealing with an index made up of 25
> segments, that's 25,000 hits max we'll have to process fully -- the rest we
> can skip over. That's not as quick as processing 1000 hits and then
> stopping, as we could in a fully optimized index, but it's a lot better
> than churning through all 5,000,000 hits.
> A lot of those hits from the smallest segments will be garbage; we'll get
> most of our good hits from a few large segments most of the time. But that's
> fine -- the cost to process any one segment is small.
> Writing a low-level scoring loop which implements pruning per segment is
> straightforward. KinoSearch's version (in C) is below.
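> (As a rough Java rendering of the same loop -- all names here are
> hypothetical, sketched against the old Scorer/HitCollector style rather
> than any real Lucene API:)
>
>     // Collect hits segment by segment, capping each segment at
>     // maxHitsPerSeg; segStarts[i] is segment i's first doc number.
>     void collectPruned(Scorer scorer, HitCollector hc, int[] segStarts,
>                        int maxDoc, int maxHitsPerSeg) throws IOException {
>         if (!scorer.skipTo(0))
>             return;
>         int doc = scorer.doc();
>         for (int seg = 0; seg < segStarts.length; seg++) {
>             int segEnd = (seg + 1 < segStarts.length)
>                 ? segStarts[seg + 1] : maxDoc;
>             if (doc < segStarts[seg]) {   // prune: jump to this seg's start
>                 if (!scorer.skipTo(segStarts[seg]))
>                     return;
>                 doc = scorer.doc();
>             }
>             int hitsThisSeg = 0;
>             while (doc < segEnd && hitsThisSeg < maxHitsPerSeg) {
>                 hc.collect(doc, scorer.score()); // docs pre-sorted by rank
>                 hitsThisSeg++;
>                 if (!scorer.next())
>                     return;
>                 doc = scorer.doc();
>             }
>         }
>     }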
> To control the amount of pruning, we need a high-level
> Searcher.setPruneFactor API, which sets a multiplier; the number of
> hits-per-segment which must be processed is determined by multiplying the
> number of hits you need by pruneFactor. Here's code from KS for deriving
> hits-per-seg:
> # process prune_factor if supplied
> my $seg_starts;
> my $hits_per_seg = 2**31 - 1;
> if ( defined $self->{prune_factor} and defined $args{num_wanted} ) {
>     my $prune_count = $self->{prune_factor} * $args{num_wanted};
>     if ( $prune_count < $hits_per_seg ) {    # don't exceed I32_MAX
>         $hits_per_seg = $prune_count;
>         $seg_starts   = $reader->get_seg_starts;
>     }
> }
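> In Lucene terms, the same derivation might look roughly like this (a
> sketch only -- "pruneFactor", "numWanted", and "getSegStarts" are
> hypothetical names, not an existing API):
>
>     // Derive the per-segment hit budget from a prune factor; clamp the
>     // product so it can't exceed Integer.MAX_VALUE.
>     int hitsPerSeg = Integer.MAX_VALUE;
>     int[] segStarts = null;
>     if (pruneFactor > 0.0f && numWanted > 0) {
>         long pruneCount = (long) ((double) pruneFactor * numWanted);
>         if (pruneCount < Integer.MAX_VALUE) {
>             hitsPerSeg = (int) pruneCount;
>             segStarts  = reader.getSegStarts(); // hypothetical accessor
>         }
>     }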
> What I have not yet written is the index-time mechanism for sorting
> documents.
> In Nutch, they use the norms from a known indexed, non-tokenized field
> ("site"). However, in Lucene and KS, we can't count on any existing fields.
> Document boost isn't stored directly, either. The obvious answer is to start
> storing it, which would suffice for Nutch-like uses. However, it may make
> sense to avoid coupling document ordering to boost, so that pruning can be
> influenced without affecting scores.
> The sort ordering information needs a permanent home in the index, since it
> will be needed whenever segment merging occurs. The fixed-width per-document
> storage in Lucene's .fdx file seems like a good place. If we use one float
> per document, we can simply put it before or after the 64-bit file pointer
> and seek into the file after multiplying the doc num by 12 rather than 8.
> During indexing, we'd keep the ordering info in an array; after all documents
> for a segment have been added, we create an array of sorted document numbers.
> When flushing the postings, their document numbers get remapped using the
> sorted array. Then we rewrite the .fdx file (and also the .tvx file), moving
> the file pointers (and ordering info) to remapped locations. The fact that
> the .fdt file is now "out of order" isn't a problem -- optimizing
> sequential access to that file isn't important.
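> As a sketch of that remapping step (names hypothetical; uses
> java.util.Arrays and java.util.Comparator): sort the doc numbers by
> their ordering values, then invert the permutation to get the
> old-to-new map applied to postings and file pointers:
>
>     // Build an old->new doc number map: a higher ordering value gets
>     // a lower (earlier) new doc number.
>     int[] buildDocMap(final float[] ordering) {
>         int n = ordering.length;
>         Integer[] byRank = new Integer[n];
>         for (int i = 0; i < n; i++)
>             byRank[i] = i;
>         // Sort doc numbers so the highest ordering value comes first.
>         Arrays.sort(byRank, new Comparator<Integer>() {
>             public int compare(Integer a, Integer b) {
>                 return Float.compare(ordering[b], ordering[a]);
>             }
>         });
>         // Invert the permutation: oldToNew[oldDoc] = newDoc.
>         int[] oldToNew = new int[n];
>         for (int newDoc = 0; newDoc < n; newDoc++)
>             oldToNew[byRank[newDoc]] = newDoc;
>         return oldToNew;
>     }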
> This issue is closely tied to LUCENE-843, "Improve how IndexWriter uses RAM
> to buffer added documents", and LUCENE-847, "Factor merge policy out of
> IndexWriter". Michael McCandless, Steven Parks, Ning Li, anybody else...
> comments? Suggestions?
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
> --------------------------------------------------------------------
> void
> Scorer_collect(Scorer *self, HitCollector *hc, u32_t start, u32_t end,
>                u32_t hits_per_seg, VArray *seg_starts)
> {
>     u32_t seg_num        = 0;
>     u32_t doc_num_thresh = 0;
>     u32_t hits_this_seg  = 0;
>     u32_t hits_thresh    = hits_per_seg;
>
>     /* get to first doc */
>     if ( !Scorer_Skip_To(self, start) )
>         return;
>
>     /* execute scoring loop */
>     do {
>         u32_t doc_num = Scorer_Doc(self);
>         Tally *tally;
>
>         if (hits_this_seg >= hits_thresh || doc_num >= doc_num_thresh) {
>             if (doc_num >= end) {
>                 /* bail out of loop if we've hit the user-spec'd end */
>                 return;
>             }
>             else if (seg_starts == NULL || seg_starts->size == 0) {
>                 /* must supply seg_starts to enable pruning */
>                 hits_thresh    = U32_MAX;
>                 doc_num_thresh = end;
>             }
>             else if (seg_num == seg_starts->size) {
>                 /* we've finished the last segment */
>                 return;
>             }
>             else {
>                 /* get start of upcoming segment */
>                 Int *this_start = (Int*)VA_Fetch(seg_starts, seg_num);
>                 Int *next_start = (Int*)VA_Fetch(seg_starts, seg_num + 1);
>                 u32_t this_seg_start = this_start->value;
>                 seg_num++;
>
>                 /* skip docs as appropriate if we're pruning */
>                 if (doc_num < this_seg_start) {
>                     if ( Scorer_Skip_To(self, this_seg_start) )
>                         doc_num = Scorer_Doc(self);
>                     else
>                         return;
>                 }
>
>                 /* set the last doc_num we'll score this upcoming segment */
>                 doc_num_thresh = next_start == NULL
>                     ? end /* last segment */
>                     : next_start->value;
>             }
>
>             /* start over counting hits for the new segment */
>             hits_this_seg = 0;
>         }
>
>         /* this doc is in range, so collect it */
>         tally = Scorer_Tally(self);
>         hc->collect(hc, doc_num, tally->score);
>         hits_this_seg++;
>
>     } while (Scorer_Next(self));
> }