We have a large and growing number of articles (< 60k so far), and we want to divide the sources they come from into groups (called packages below), so that a query can be run against the members of just one or two groups and never return articles from publications outside those groups.
We would like to be able to move things from one group to another, but this is far less important than blinding speed. What are the best ways of doing this? What experience do others have doing this sort of thing?

Below please find all of my ideas/suspicions, but don't assume I am correct about any of this. If I knew the answer, I would not be asking! Feel free to ignore the below and just tell me how you handle this sort of issue.

Query based on the source name

Each article is tagged with the name of its publication. We use a big OR query listing all the allowed publications.

Advantages: Which sources belong to which packages can be maintained in our database and easily changed.
Disadvantages: With hundreds of sources in each package, the queries will be very long. Slow performance, and a possible "Too Many Clauses" error.

Query based on the package name

Instead of tagging the documents with source names, use the package names.

Advantages: Should be very fast at search time. Can search on any Boolean combination of packages. Very easy to implement.
Disadvantages: Documents would have to be assigned to packages at the time they are added to the index. This assignment could not be changed without deleting and re-adding the document to the index. This is not practical, because a huge number of documents would have to be re-indexed just because one source was moved to a different package.

Use a Filtered Query

A Filtered Query combines any Query with a Filter object, which independently restricts which documents may appear in the output. A Filter takes an IndexReader as input and produces a BitSet as output. The BitSet has one bit for each document in the index; the bits that are 'on' represent documents that are allowed in the output (provided they are also selected by the query).

Advantages: This is likely to work. It is the preferred method of doing such things.
Disadvantages: It requires a full pass through the entire index to set all the bits, rather than operating on documents as they are encountered. With a large archive, that BitSet will take up quite a bit of memory.

Use FilterIndexReader

A class, FilterIndexReader, is provided in Lucene to make it easy to override the index reader. One would think that this would allow you to simply skip documents you don't want whenever they are encountered in the index. For my purposes the FilterIndexReader will contain a HashSet of the source names that are permitted in the query. As each document is encountered, we extract its source name from the index and check whether it is present in the HashSet; if not, we go on to the next document.

An index reader has 3 nested classes that one can override within the FilterIndexReader:

termEnum - lists the distinct terms in the index, with the number of documents each appears in.
termDocs - lists the terms in the index, with the id number of each document they appear in.
termPositions - lists the terms, with their document numbers and the word position of each occurrence.

I modified termDocs and termPositions to skip to the next document whenever they would otherwise settle on a document whose source name is not in the HashSet. But I did not modify termEnum and its document frequency. The only way to get the correct document frequency would be to iterate through all the documents and see which ones will be included. This would be almost as bad as using a Filter.

Advantages: This method provides a way to include only documents from a query-time selection of sources. It is efficient because an in-memory hash table is used for the lookup. It skips documents as they are encountered, rather than having to iterate through the whole index at once.
Disadvantages: It didn't work. Although my termDocs and termPositions objects ran and skipped the unwanted documents, those documents still appeared in the result set.
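
To make the Filtered Query option concrete, here is a rough sketch of the bit-setting pass I have in mind. It is plain Java and deliberately does not call any Lucene classes: the String array stands in for reading each document's source name from the index, and the names SourceFilterSketch, bitsForSources and applyFilter are made up for illustration. Treat it as a sketch of the idea under those assumptions, not as working Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Lucene Filter; no Lucene API is used here.
class SourceFilterSketch {

    // sourceOfDoc[i] is assumed to hold the source name of document i.
    // One full pass over all documents turns on the bit of every document
    // whose source belongs to the allowed set.
    static BitSet bitsForSources(String[] sourceOfDoc, Set<String> allowedSources) {
        BitSet bits = new BitSet(sourceOfDoc.length);
        for (int doc = 0; doc < sourceOfDoc.length; doc++) {
            if (allowedSources.contains(sourceOfDoc[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Keeps only those query hits whose bit is on.
    static List<Integer> applyFilter(int[] queryHits, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : queryHits) {
            if (allowed.get(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] sourceOfDoc = {"Times", "Post", "Herald", "Times"};
        Set<String> packageSources = new HashSet<>(Arrays.asList("Times", "Herald"));
        BitSet allowed = bitsForSources(sourceOfDoc, packageSources);
        // Documents 0, 1 and 3 matched the query; document 1 is from "Post".
        System.out.println(applyFilter(new int[] {0, 1, 3}, allowed)); // prints [0, 3]
    }
}
```

If I have the arithmetic right, one bit per document comes to roughly 7.5 KB per BitSet at 60,000 documents, and the full pass would presumably only need to be repeated when a source moves between packages, so keeping one BitSet per package around might be affordable. I have not measured any of this.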