[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Kaktu Chakarabati (JIRA) Mon, 27 Dec 2010 10:31:13 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975304#action_12975304
 ]


Kaktu Chakarabati commented on LUCENE-1812:
-------------------------------------------

Hey,
While trying to use this (wonderful!) component I noticed a few things that 
might require some work:

1. The issue says this affects Lucene 2.9 as well, but the code seems to be 
hard-coded for 3.0 (it uses the LUCENE_30 constant, as well as some new APIs 
such as IndexWriterConfig).
     I created a patch that makes it work with 2.9.3 (so I can use it with a 
Solr 1.4.1 deployment), and I can post it if it seems useful, but I suspect 
we might want a more generic solution, as well as a clear definition of the 
supported versions. Personally, I think a backport for 2.9.x would be very 
useful, so that users of the current stable Solr release (1.4.x) can use it.

2. The code does not compile against trunk (Lucene/Solr 4.0). Is this a known 
issue? Is it something we want to solve?

3. When used with the 3.0 branch, it does indeed work. However, when it reads 
an older version of the index and emits a newer one (e.g. reads 2.9.x, writes 
3.x), the pruned index becomes unusable on some platforms (e.g. Solr 1.4.x, as 
mentioned above). Is this something that can be fixed, i.e. by forcing the 
output index to be the same version as the input one? I was going to do some 
work on this myself, but the issue seems a little more delicate and requires a 
deeper understanding of Lucene internals than I have.

-Chak
     

> Static index pruning by in-document term frequency (Carmel pruning)
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1812
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1812
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.9, 3.1
>            Reporter: Andrzej Bialecki 
>         Attachments: pruning.patch, pruning.patch, pruning.patch, 
> pruning.patch
>
>
> This module provides tools to produce a subset of input indexes by removing 
> postings data for those terms where their in-document frequency is below a 
> specified threshold. The net effect of this processing is a much smaller 
> index that for common types of queries returns nearly identical top-N results 
> as compared with the original index, but with increased performance. 
> Optionally, stored values and term vectors can also be removed. This 
> functionality is largely independent, so it can be used without term pruning 
> (when term freq. threshold is set to 1).
> As the threshold value increases, the total size of the index decreases, 
> search performance increases, and recall decreases (i.e. search quality 
> deteriorates). NOTE: especially phrase recall deteriorates significantly at 
> higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit 
> completely in RAM, and store these indexes using 
> IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class 
> will not be sufficient to use the resulting index view for on-the-fly pruning 
> and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then 
> the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve 
> internal document id-s so that they are in sync with the original index. This 
> means that all other auxiliary information not necessary for first-tier 
> processing, such as some stored fields, can also be removed, to be quickly 
> retrieved on-demand from the original index using the same internal document 
> id. 
> Threshold values can be specified globally (for terms in all fields) using 
> the defaultThreshold parameter, and can be overridden using per-field or 
> per-term values supplied in a thresholds map. Keys in this map are either 
> field names, or terms in field:text format. The precedence of these values 
> is the following: first a per-term threshold is used if present, then a 
> per-field threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At the moment 
> it doesn't support all of the functionality available through the API.
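
The threshold precedence described in the quoted issue text (per-term first, then per-field, then default) can be sketched in plain Java as below. This is only an illustration, not code from the attached patch: the class name ThresholdLookup, the methods resolve and keepPosting, and the use of Integer map values are all assumptions made for this example. The keep rule follows the description above, where postings whose in-document frequency is below the threshold are removed, so a threshold of 1 prunes nothing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these names come from the pruning patch.
public class ThresholdLookup {

    // Resolve the threshold for a term: per-term key ("field:text") first,
    // then per-field key ("field"), then the global default.
    public static int resolve(Map<String, Integer> thresholds,
                              int defaultThreshold,
                              String field, String text) {
        Integer perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm;
        }
        Integer perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }

    // A posting is kept when the term's in-document frequency is not below
    // the threshold (assumed reading of "below a specified threshold").
    public static boolean keepPosting(int termFreqInDoc, int threshold) {
        return termFreqInDoc >= threshold;
    }

    public static void main(String[] args) {
        Map<String, Integer> thresholds = new HashMap<String, Integer>();
        thresholds.put("body", 2);         // per-field threshold
        thresholds.put("body:lucene", 5);  // per-term threshold

        System.out.println(resolve(thresholds, 1, "body", "lucene"));  // 5
        System.out.println(resolve(thresholds, 1, "body", "index"));   // 2
        System.out.println(resolve(thresholds, 1, "title", "lucene")); // 1
        System.out.println(keepPosting(1, 2));                         // false
    }
}
```

With a default threshold of 1 and an empty map, keepPosting returns true for every posting, which matches the case where only stored values or term vectors are removed.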

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


