On Friday 06 January 2006 18:04, Beady Geraghty wrote:
> I would like to do queries that are negative. I mean a query with
> only negative terms and phrases. For example, retrieve all
> documents that do not contain the term "apple".
>
> For now, I have a limited set of documents (say, 10000) to index.
> I can create a bitset that represents the search result of hits on "apple".
> Then I complement (XOR) the result.
> Each bit corresponds to a document ID.
> My question is:
> Inside Lucene, are the hits represented in some form of a bitset?
> Can I get at it directly? I saw the BitSet class. (I now use
> Java's BitSet class.)
> Assuming that hits are internally represented as a bitset, for a
> small number of documents the bitset won't be very big,
> but if there are plenty of hits and many, many more documents,
> is the bitset still kept entirely
> in memory as well?

Hits is not a bitset. It is implemented by caching some of the
highest scoring documents; when more documents are needed, the search
is repeated to collect more of them.

The problem with negative queries is that the scores of the results do
not vary, so it is not useful to keep only the highest scoring docs.
This also means that all results will have to be processed further in
some other way.

The easiest way to do that is to use the MatchAllDocsQuery as indicated
earlier, and then use the low level search API with your own
HitCollector. You can then use any data structure in your HitCollector.
A simple and fast collect() implementation just counts the results, and
that can already be quite informative. Setting up a BitSet for the
matching document numbers is also possible.

It's best to avoid accessing the index via the IndexReader inside the
collect() implementation of the HitCollector, because collect() is
called once for every matching document.

Regards,
Paul Elschot
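
As a rough illustration of the two collect() variants mentioned above
(counting, and filling a BitSet), here is a minimal sketch. To keep it
compilable without the Lucene jar, SketchCollector is a hypothetical
stand-in that only mirrors the collect(int doc, float score) method of
Lucene's HitCollector; with Lucene itself the collectors would extend
org.apache.lucene.search.HitCollector and be passed to
Searcher.search(Query, HitCollector). The class names and the
complement() helper are made up for this example.

```java
import java.util.BitSet;

// Hypothetical stand-in for Lucene's HitCollector (assumption: only the
// collect(int doc, float score) callback is mirrored here).
abstract class SketchCollector {
    abstract void collect(int doc, float score);
}

// Just counts the matching documents; the score is ignored.
class CountingCollector extends SketchCollector {
    int count = 0;

    @Override
    void collect(int doc, float score) {
        count++;
    }
}

// Records each matching document number as a set bit.
// Nothing here touches the index, so collect() stays cheap.
class BitSetCollector extends SketchCollector {
    final BitSet bits;

    BitSetCollector(int maxDoc) {
        bits = new BitSet(maxDoc);
    }

    @Override
    void collect(int doc, float score) {
        bits.set(doc);
    }
}

class NegativeQuerySketch {
    // Returns the document numbers in [0, maxDoc) that are NOT set in hits.
    // Assumes hits has no bits set at or beyond maxDoc.
    static BitSet complement(BitSet hits, int maxDoc) {
        BitSet result = (BitSet) hits.clone();
        result.flip(0, maxDoc);
        return result;
    }
}
```

A BitSet needs about one bit per document number, so 10000 documents
take a little over 1 KB, and even a few million documents stay well
under a megabyte. Note that a plain complement of the "apple" bits would
also include the numbers of deleted documents, which is one reason to
prefer combining MatchAllDocsQuery with the negative term.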