I've finished a query-time implementation of a column-stride filter, which implements DocIdSetIterator. It simply builds the filter at process start and reuses it for each subsequent query; the index itself is unchanged.
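For anyone who wants a concrete picture before the numbers, here is a rough standalone sketch of the idea (illustrative names only -- this is not the actual patch, and it isn't wired into Lucene's Filter/DocIdSetIterator classes): the field's values are loaded once into a per-reader int[] indexed by docID, and each range query is answered by iterating that array.

/**
 * Sketch of a query-time column-stride range filter. A single int[]
 * keyed by docID is built once (the one-time "startup" cost) and then
 * reused for every [lo..hi] range. All names here are illustrative.
 */
class ColumnStrideRangeFilter {
  private final int[] valueByDoc; // e.g. the "age" field, one entry per docID

  ColumnStrideRangeFilter(int[] valueByDoc) {
    this.valueByDoc = valueByDoc; // loaded once from the index, FieldCache-style
  }

  /** Iterates docIDs whose value falls in [lo, hi], in the spirit of DocIdSetIterator. */
  RangeDocIterator iterator(int lo, int hi) {
    return new RangeDocIterator(lo, hi);
  }

  class RangeDocIterator {
    private final int lo, hi;
    private int doc = -1;

    RangeDocIterator(int lo, int hi) { this.lo = lo; this.hi = hi; }

    int doc() { return doc; }

    /** Advance to the next docID whose value is inside the range. */
    boolean next() {
      while (++doc < valueByDoc.length) {
        int v = valueByDoc[doc];
        if (v >= lo && v <= hi) {
          return true;
        }
      }
      return false;
    }

    /** Advance to the first matching docID that is >= target. */
    boolean skipTo(int target) {
      doc = target - 1;
      return next();
    }
  }
}

The one-time array load is what shows up as the large "startup" number in the column-stride results below; after that, each query is just array lookups and comparisons.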
The results are very impressive. Here are the results on a 45M document index.

Firstly, without an age constraint, as a baseline:

Query "+name:tim"
startup: 0
Hits: 15089
first query: 1004
100 queries: 132 (1.32 msec per query)

Now with a cached filter. This is ideal from a speed standpoint, but there are too many possible start/end combinations to cache all the filters.

Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on cached RangeFilter)
startup: 3
Hits: 11156
first query: 1830
100 queries: 287 (2.87 msec per query)

Now with an uncached filter. This is awful.

Query "+name:tim age:[18 TO 35]" (uncached ConstantScoreRangeQuery)
startup: 3
Hits: 11156
first query: 1665
100 queries: 51862 (yes, 518 msec per query, 200x slower)

A RangeQuery is slightly better but still bad (and has a different result set):

Query "+name:tim age:[18 TO 35]" (uncached RangeQuery)
startup: 0
Hits: 10147
first query: 1517
100 queries: 27157 (271 msec per query is 100x slower than the filter)

Now with the prebuilt column-stride filter:

Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on prebuilt column-stride filter)
startup: 2811
Hits: 11156
first query: 1395
100 queries: 441 (back down to 4.41 msec per query)

This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query.

Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it.

Tim

On 11/11/08 1:27 PM, "Michael McCandless" <[EMAIL PROTECTED]> wrote:

>
> Paul Elschot wrote:
>
>> On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote:
>>> Also, one nice optimization we could do with the "term number
>>> column-stride array" is do bit packing (borrowing from the PFOR
>>> code) dynamically.
>>>
>>> I.e., since we know there are X unique terms in this segment, when
>>> populating the array that maps docID to term number we could use
>>> exactly the right number of bits. Enumerated fields with not many
>>> unique values (eg, country, state) would take relatively little RAM.
>>> With LUCENE-1231, where the fields are stored column stride on disk,
>>> we could do this packing during indexing such that loading at search
>>> time is very fast.
>>
>> Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410.
>> I think what you're referring to is PDICT, which has frame exceptions
>> for values that occur infrequently.
>
> Yes, let's move the discussion to Jira.
>
> Actually I was referring to simple bit-packing.
>
> For encoding an array of compact enum terms (eg city, state, color, zip)
> I'm guessing the exceptions logic won't buy us much and would hurt
> seeking needed for column-stride fields. But we should certainly test
> it.
>
> Mike
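For reference, the "exactly the right number of bits" packing Mike describes above boils down to something like the following fixed-width encoding. This is only a rough illustration of simple bit-packing (no PFOR/PDICT exception handling, and not code from LUCENE-1231); the names are made up.

/**
 * Minimal fixed-width bit packing: stores one term ordinal per docID
 * using exactly ceil(log2(numUniqueTerms)) bits, backed by a long[].
 * Illustrative sketch only.
 */
class PackedOrds {
  private final long[] blocks;
  private final int bitsPerValue;

  PackedOrds(int numDocs, int numUniqueTerms) {
    // exactly enough bits to represent ordinals 0 .. numUniqueTerms-1
    this.bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(numUniqueTerms - 1));
    this.blocks = new long[(int) (((long) numDocs * bitsPerValue + 63) / 64)];
  }

  /** Write the ordinal for a docID (write-once; blocks start at zero). */
  void set(int docID, int ord) {
    long bitPos = (long) docID * bitsPerValue;
    int idx = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[idx] |= ((long) ord) << shift;
    int spill = shift + bitsPerValue - 64; // bits that cross into the next long
    if (spill > 0) {
      blocks[idx + 1] |= ((long) ord) >>> (bitsPerValue - spill);
    }
  }

  /** Read back the ordinal for a docID. */
  int get(int docID) {
    long bitPos = (long) docID * bitsPerValue;
    int idx = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[idx] >>> shift;
    int spill = shift + bitsPerValue - 64;
    if (spill > 0) {
      value |= blocks[idx + 1] << (bitsPerValue - spill);
    }
    return (int) (value & ((1L << bitsPerValue) - 1));
  }
}

As a ballpark, if an age field had, say, 64 or fewer distinct values, that is 6 bits per document, so a 45M-doc index would need roughly 45M * 6 / 8, or about 34 MB, for the packed array versus about 180 MB for a plain int[].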