Re: Term numbering and range filtering

2008-11-19 Thread Paul Elschot
Tim, On Wednesday 19 November 2008 02:32:40, Tim Sturge wrote: ... > >> This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. > >> Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
> With "Allow Filter as clause to BooleanQuery": > https://issues.apache.org/jira/browse/LUCENE-1345 > one could even skip the ConstantScoreQuery with this. > Unfortunately 1345 is unfinished for now. > That would be interesting; I'd like to see how much performance improves. >> startup: 2811

Re: Term numbering and range filtering

2008-11-18 Thread Paul Elschot
On Wednesday 19 November 2008 00:43:56, Tim Sturge wrote: > I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. > The resul

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index:
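
A minimal sketch of the idea described in this message, written as standalone Java and not as a subclass of Lucene's DocIdSetIterator. It assumes the filter is an in-memory int array holding one term number per document, built once at startup, and that a range query becomes two int comparisons per document. The class name ColumnStrideRangeIterator and the method names are hypothetical and chosen only for illustration.

```java
// Illustrative only: iterates over the documents whose term number
// falls inside [minTerm, maxTerm], using a per-document int array.
final class ColumnStrideRangeIterator {
    private final int[] termInDoc; // docId -> term number, built once at startup
    private final int minTerm;     // inclusive lower bound
    private final int maxTerm;     // inclusive upper bound
    private int doc = -1;          // current position, -1 before the first call

    ColumnStrideRangeIterator(int[] termInDoc, int minTerm, int maxTerm) {
        this.termInDoc = termInDoc;
        this.minTerm = minTerm;
        this.maxTerm = maxTerm;
    }

    /** Current document id; only meaningful after next() or skipTo() returned true. */
    int doc() {
        return doc;
    }

    /** Moves to the next matching document, if any. */
    boolean next() {
        return skipTo(doc + 1);
    }

    /** Moves to the first matching document with id >= target, if any. */
    boolean skipTo(int target) {
        int d = Math.max(target, 0);
        while (d < termInDoc.length) {
            int t = termInDoc[d];
            if (t >= minTerm && t <= maxTerm) {
                doc = d;
                return true;
            }
            d++;
        }
        doc = termInDoc.length; // exhausted
        return false;
    }
}
```

Because the array is shared and read-only, the same array can back any number of range iterators across queries; only the two bounds change per query.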

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
Paul Elschot wrote: On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote: Also, one nice optimization we could do with the "term number column-stride array" is to do bit packing (borrowing from the PFOR code) dynamically. Ie since we know there are X unique terms in this segment, whe

Re: Term numbering and range filtering

2008-11-11 Thread Paul Elschot
On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote: > Also, one nice optimization we could do with the "term number column-stride array" is to do bit packing (borrowing from the PFOR code) dynamically. > Ie since we know there are X unique terms in this segment, when populating t

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
Also, one nice optimization we could do with the "term number column-stride array" is to do bit packing (borrowing from the PFOR code) dynamically. Ie since we know there are X unique terms in this segment, when populating the array that maps docID to term number we could use exactly the r
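
A rough sketch of what such bit packing could look like, assuming the goal is to store each document's term number in just enough bits to distinguish X unique terms (for example 7 bits for 100 terms) instead of a full 32-bit int. This is not the PFOR code mentioned in the message; the class PackedTermNumbers and its layout are assumptions made for illustration.

```java
// Illustrative only: a fixed-width bit-packed array mapping docId -> term number.
final class PackedTermNumbers {
    private final long[] blocks;    // packed storage, 64 bits per element
    private final int bitsPerValue; // bits needed for values 0..uniqueTerms-1
    private final long mask;        // low bitsPerValue bits set

    PackedTermNumbers(int docCount, int uniqueTerms) {
        int maxValue = Math.max(1, uniqueTerms - 1);
        this.bitsPerValue = 32 - Integer.numberOfLeadingZeros(maxValue);
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) docCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    int bitsPerValue() {
        return bitsPerValue;
    }

    void set(int docId, int termNumber) {
        long value = termNumber & mask;
        long bitPos = (long) docId * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        // Clear the slot, then write the low part of the value.
        blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two longs: write its high bits into the next one.
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | (value >>> (bitsPerValue - spill));
        }
    }

    int get(int docId) {
        long bitPos = (long) docId * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // Pull the remaining high bits from the next long.
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (value & mask);
    }
}
```

The trade-off is a few extra shifts and masks per lookup in exchange for a smaller array; whether that pays off depends on how many unique terms a segment actually has.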

Re: Term numbering and range filtering

2008-11-11 Thread Paul Elschot
On Tuesday 11 November 2008 11:29:27, Michael McCandless wrote: > The other part of your proposal was to somehow "number" term text such that term range comparisons can be implemented as fast int comparisons. ... > http://fontoura.org/papers/paramsearch.pdf > However that'd be quite a bit

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
It seems like for many of your examples (age, zip code, country), simply computing & storing the mapping yourself (your first option below) would actually be viable? Also: I think in fact you never need to merge the term numbering for many segments during searching? Ie, the search runs one Inde

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
The other part of your proposal was to somehow "number" term text such that term range comparisons can be implemented as fast int comparisons. I like the idea of building dynamic filters on top of a "column-stride" array of field values. You could extend it to be a real Scorer, too. EG I could imag

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
Reading this I realize how unclear it is, so let me give a concrete example: I want to do a search restricting users by age range. So someone can ask for the users 18-35, 40-60 etc. Here are the options I considered: 1) construct a RangeQuery. This is a 20-40 clause boolean subquery in an otherw

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
I think we've gone around in a loop here. It's exactly due to the inadequacy of cached filters that I'm considering what I'm doing. Here's the section from my first email that is most illuminating: " The reason I have this question is that I am writing a multi-filter for single term fields. My ind

Re: Term numbering and range filtering

2008-11-10 Thread Paul Elschot
On Monday 10 November 2008 22:21:20, Tim Sturge wrote: > Hmmm -- I hadn't thought about that so I took a quick look at the term vector support. > What I'm really looking for is a compact but performant representation of a set of filters on the same (one term) field. > Using term vectors woul

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
Hmmm -- I hadn't thought about that so I took a quick look at the term vector support. What I'm really looking for is a compact but performant representation of a set of filters on the same (one term) field. Using term vectors would mean an algorithm similar to: String myfield; String myterm; Te

Re: Term numbering and range filtering

2008-11-10 Thread Paul Elschot
Tim, I didn't follow all the details, so this may be somewhat off, but did you consider using TermVectors? Regards, Paul Elschot On Monday 10 November 2008 19:18:38, Tim Sturge wrote: > Yes, that is a significant issue. What I'm coming to realize is that either I will end up with something l

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
Yes, that is a significant issue. What I'm coming to realize is that either I will end up with something like class MultiFilter { String field; private int[] termInDoc; Map termToInt; ... } which can be entirely built on the current Lucene APIs but has significantly more overhead (the
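
One possible way to flesh out the MultiFilter outline above, as a sketch only. It assumes a single term per document, numbers the distinct terms in sorted order so that a term range maps to a contiguous range of ints, and takes the per-document terms as a plain String array instead of reading them from a Lucene index. The fields termInDoc and termToInt come from the message; sortedTerms, the class name MultiFilterSketch and all methods are hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Illustrative only: one shared structure answering many term and range filters.
final class MultiFilterSketch {
    private final String[] sortedTerms;                              // term number -> term text
    private final Map<String, Integer> termToInt = new HashMap<>(); // term text -> term number
    private final int[] termInDoc;                                   // docId -> term number

    MultiFilterSketch(String[] termPerDoc) {
        // Number the distinct terms in sorted order.
        sortedTerms = new TreeSet<String>(Arrays.asList(termPerDoc)).toArray(new String[0]);
        for (int i = 0; i < sortedTerms.length; i++) {
            termToInt.put(sortedTerms[i], i);
        }
        termInDoc = new int[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            termInDoc[doc] = termToInt.get(termPerDoc[doc]);
        }
    }

    /** Documents whose term equals the given term exactly. */
    BitSet term(String term) {
        Integer n = termToInt.get(term);
        return n == null ? new BitSet() : between(n, n);
    }

    /** Documents whose term lies in [lower, upper], both inclusive. */
    BitSet range(String lower, String upper) {
        int lo = Arrays.binarySearch(sortedTerms, lower);
        if (lo < 0) lo = -lo - 1; // first term >= lower
        int hi = Arrays.binarySearch(sortedTerms, upper);
        if (hi < 0) hi = -hi - 2; // last term <= upper
        return between(lo, hi);
    }

    private BitSet between(int lo, int hi) {
        BitSet bits = new BitSet(termInDoc.length);
        for (int doc = 0; doc < termInDoc.length; doc++) {
            int t = termInDoc[doc];
            if (t >= lo && t <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

For example, with per-document terms {"25", "18", "60", "35", "40", "18"}, range("18", "35") selects documents 0, 1, 3 and 5. Note that the comparison is on term text, so numeric fields would need fixed-width or otherwise sortable encodings.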

Re: Term numbering and range filtering

2008-11-09 Thread Michael McCandless
Conceivably, TermInfosReader could track the sequence number of each term. A seek/skipTo would know which sequence number it just jumped to, because the index is regular (every 128 terms by default), and then each next() call could increment that. Then retrieving this number would be
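
The arithmetic behind this suggestion can be sketched in standalone Java. The code below does not use TermInfosReader; it assumes a sorted String array as a stand-in for the term dictionary and a sparse in-memory index holding every 128th term. A seek lands on index slot i, which corresponds to sequence number i * 128, and each subsequent step forward adds one. The class name TermOrdinalLookup and its methods are hypothetical.

```java
import java.util.Arrays;

// Illustrative only: derive a term's sequence number from a regular sparse index.
final class TermOrdinalLookup {
    static final int INDEX_INTERVAL = 128;

    private final String[] sortedTerms; // stand-in for the full term dictionary
    private final String[] indexTerms;  // every INDEX_INTERVAL-th term

    TermOrdinalLookup(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
        int slots = (sortedTerms.length + INDEX_INTERVAL - 1) / INDEX_INTERVAL;
        this.indexTerms = new String[slots];
        for (int i = 0; i < slots; i++) {
            indexTerms[i] = sortedTerms[i * INDEX_INTERVAL];
        }
    }

    /** Returns the sequence number of the term, or -1 if it is not present. */
    int ordinalOf(String term) {
        int slot = Arrays.binarySearch(indexTerms, term);
        if (slot < 0) slot = -slot - 2;  // last index entry <= term
        if (slot < 0) return -1;         // term sorts before the first term
        int ord = slot * INDEX_INTERVAL; // known because the index is regular
        // Scan forward as repeated next() calls would, counting as we go.
        while (ord < sortedTerms.length) {
            int cmp = sortedTerms[ord].compareTo(term);
            if (cmp == 0) return ord;
            if (cmp > 0) return -1;
            ord++;
        }
        return -1;
    }
}
```

The scan after the seek visits fewer than 128 terms, so the sequence number comes almost for free once the seek itself has been paid for.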