What's the best way to filter based on a function of an indexed term or field value?

lucene user Tue, 05 Jun 2007 13:20:18 -0700

We have a large and growing number of articles (currently < 60k), and we
want to divide the articles from some sources into groups, so that we can
run queries against just the members of one or two groups and not find
articles from publications that are outside those publication groups.


We would like to be able to move things from one group to another, but this
is far less important than blinding speed.

What are the best ways of doing this? What experience do others have with
this sort of thing? Below please find all of my ideas/suspicions, but don't
assume I am correct about any of this. If I knew the answer, I would not be
asking! Feel free to ignore the below and just tell me how you handle this
sort of issue.

Query based on the source name

Each article is tagged with the name of its publication.
We use a big OR query listing all the allowed publications.

Advantages
Which sources belong to which packages can be maintained in our database
and easily changed.

Disadvantages
With hundreds of sources in each package, the queries will be very long.
Slow performance, possible "Too Many Clauses" error.
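
To make the size concern concrete, here is a minimal sketch in plain Java
(no Lucene classes) that builds the OR clause from a list of source names
and compares the clause count with 1024, which, as far as I know, is the
default maximum clause count of Lucene's BooleanQuery. The field name
"source" and the class and method names are hypothetical, chosen only for
illustration.

```java
import java.util.List;

class SourceOrQuery {
    // Assumption: mirrors the default BooleanQuery clause limit in Lucene.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Builds a clause such as: source:"A" OR source:"B"
    static String build(List<String> sources) {
        StringBuilder sb = new StringBuilder();
        for (String s : sources) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            // Escape embedded double quotes so the phrase stays well formed.
            sb.append("source:\"").append(s.replace("\"", "\\\"")).append('"');
        }
        return sb.toString();
    }

    // Rough check only: the user's own query terms add clauses as well.
    static boolean fitsDefaultLimit(List<String> sources) {
        return sources.size() <= DEFAULT_MAX_CLAUSES;
    }
}
```

With a few hundred sources per package the clause count stays under that
default, but several packages combined with the user's own terms could
approach it. The limit can be raised with BooleanQuery.setMaxClauseCount,
at the cost of even longer queries.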

Query based on the package name
Instead of tagging the documents with source names, use the package names.

Advantages
Should be very fast at search time.
Can search on any boolean combination of packages.
Very easy to implement.

Disadvantages
Documents would have to be assigned to packages at the time they are added
to the index.
This assignment could not be changed without deleting and re-adding the
document to the index.
This is not practical, because a huge number of documents would have to be
re-indexed just because one source was moved to a different package.

Use a Filtered Query
A Filtered Query combines any Query with a Filter object, which
independently restricts which documents will appear in the output. A Filter
takes an IndexReader as input and produces a BitSet as output. The output
has one bit for each document in the index; the bits that are 'on'
represent documents that are allowed in the output (provided they are also
selected by the query).

Advantages
This is likely to work. It is the preferred method of doing such things.

Disadvantages
It requires you to make a full pass through the entire index to set all the
bits, rather than operating on documents as they are encountered.
With a large archive, that bitset will take up quite a bit of memory.
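
As a rough illustration of both costs, the following sketch models the
filter with java.util.BitSet alone. It is not Lucene's Filter API: the
array sourceOfDoc is a hypothetical stand-in for reading each document's
source from the index, and all class and method names are made up for the
example.

```java
import java.util.BitSet;
import java.util.Set;

class SourceBitFilter {
    // One full pass: turn on the bit of every document whose source is allowed.
    // sourceOfDoc[i] holds the source name of document i (hypothetical stand-in).
    static BitSet allowedBits(String[] sourceOfDoc, Set<String> allowedSources) {
        BitSet bits = new BitSet(sourceOfDoc.length);
        for (int doc = 0; doc < sourceOfDoc.length; doc++) {
            if (allowedSources.contains(sourceOfDoc[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Keeps only the query hits whose bit is on in the allowed set.
    static BitSet restrict(BitSet queryHits, BitSet allowed) {
        BitSet out = (BitSet) queryHits.clone();
        out.and(allowed);
        return out;
    }

    // One bit per document, rounded up to whole bytes (object overhead ignored).
    static long approxBytes(int numDocs) {
        return (numDocs + 7L) / 8L;
    }
}
```

By this arithmetic a 60,000-document index needs about 7,500 bytes per
bitset, so the memory cost only becomes significant for much larger
archives or for many bitsets held at once. The full pass could also be paid
once per package rather than once per query if the resulting bitset is kept
and reused; Lucene's CachingWrapperFilter appears intended for that, though
I have not measured it here.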

Use FilterIndexReader
A class, FilterIndexReader, is provided in Lucene to make it easy to
override the index reader. One would think that this would allow you to
simply skip documents you don't want whenever they are encountered in the
index.

For my purposes, the FilterIndexReader will contain a HashSet of source
names that will be permitted in the query. As each document is encountered,
we will extract its source name from the index and see if it is present in
the HashSet; if not, we will go on to the next document.

An index reader has 3 nested classes that one can override within the
FilterIndexReader.

termEnum - lists distinct terms in the index, with the number of documents
each appears in.
termDocs - lists the terms in the index, with the id number of each document
they appear in.
termPositions - lists the terms, with their document numbers and the word
position of each occurrence.

I modified termDocs and termPositions to skip to the next document whenever
they would otherwise settle on a document whose source name is not in the
hash set.

But I did not modify termEnum and its document frequency.
The only way to get the correct document frequency would be to iterate
through all the documents and see which ones will be included. This would
be almost as bad as using a Filter.
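
The skip-on-encounter idea, and the document frequency mismatch it leaves
behind, can be modelled in plain Java. This is only a sketch of the logic,
not the FilterIndexReader code itself: the postings array, the sourceOfDoc
array, and every name below are hypothetical stand-ins for what the real
reader would supply.

```java
import java.util.Set;

class SkippingDocs {
    private final int[] postings;        // ascending doc ids that contain one term
    private final String[] sourceOfDoc;  // source name per doc id (stand-in for the index)
    private final Set<String> allowed;   // sources permitted for this query
    private int pos = -1;

    SkippingDocs(int[] postings, String[] sourceOfDoc, Set<String> allowed) {
        this.postings = postings;
        this.sourceOfDoc = sourceOfDoc;
        this.allowed = allowed;
    }

    // Advances to the next posting whose source is allowed; false when exhausted.
    boolean next() {
        while (++pos < postings.length) {
            if (allowed.contains(sourceOfDoc[postings[pos]])) {
                return true;
            }
        }
        return false;
    }

    // Current doc id; only valid after next() has returned true.
    int doc() {
        return postings[pos];
    }

    // What an unmodified term dictionary would report for this term.
    int rawDocFreq() {
        return postings.length;
    }

    // The corrected count; note that it needs a pass over the whole postings list.
    int filteredDocFreq() {
        int n = 0;
        for (int d : postings) {
            if (allowed.contains(sourceOfDoc[d])) {
                n++;
            }
        }
        return n;
    }
}
```

In this model rawDocFreq() and filteredDocFreq() disagree whenever any
posting is skipped, which is the discrepancy described above: iteration can
skip lazily, but a corrected frequency cannot be produced without visiting
every posting for the term.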

Advantages
This method provides a way to include only documents from a query-time
selection of sources.
It is efficient because an in-memory hashtable is used for the lookup.
It skips documents as they are encountered, rather than having to iterate
through the whole index at once.

Disadvantages
It didn't work: although my termDocs and termPositions objects ran and
skipped the unwanted documents, those documents still appeared in the
result set.
