See below...

On Wed, Mar 26, 2008 at 11:29 AM, Wojtek H <[EMAIL PROTECTED]> wrote:
> Thank you for the reply. What I did not mention before was that for
> iteration we don't care about scoring, so that's not the issue at all.
> Creating a Filter with a BitSet seems a much better idea than keeping a
> HitIterator in memory. Am I right that in such a case, with
> MatchAllDocsQuery, memory usage would be around
> (NUM_OF_DOCS_IN_INDEX / 8) bytes?

Yes, it's very close to that number as far as I can tell.

> We didn't check it yet, but do you think the time for accessing
> documents (reader.doc(i)) is big enough to make iteration in a
> HitCollector (without accessing any objects) over documents
> already returned almost un-noticeable?

I don't quite understand this. Accessing a single document in a
HitCollector via reader.doc(i) is probably unnoticeable. Accessing every
document in the HitCollector is bad, very bad. If you're doing something
like spinning through the HitCollector N times, then returning the Nth
document and breaking, I don't know. You'll just have to experiment, I'm
afraid.

> And another question - if I don't care about scoring, is there a way to
> make Lucene not spend time calculating the score (I don't know if that
> time matters)? HitCollector receives the doc and its score (as far as I
> remember, the difference here is that it is not normalized to a value
> between 0 and 1). Is there a way (and does it make sense) to make
> scoring faster in such a case?

I *think* ConstantScoreQuery is your friend here, although I haven't used
it personally.

> And to make things clear - am I right that if I operate on the same
> searcher over requests for doc chunks, I see neither additions nor
> deletions which happen meanwhile? So if I would like to iterate over a
> point-in-time view, keeping the same searcher open would do.
> Thanks and regards,

You must close and re-open a reader to see any changes since the last
time you opened that reader. So I think your assumption is OK.
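To put a number on the memory question: a minimal sketch, using only
java.util.BitSet (the same structure a Filter hands back in Lucene 2.x;
the index size of 10M docs is a made-up example). One bit per Lucene doc
ID works out to roughly NUM_OF_DOCS_IN_INDEX / 8 bytes:

```java
import java.util.BitSet;

public class FilterMemoryEstimate {
    public static void main(String[] args) {
        // Hypothetical index size; one bit per Lucene doc ID.
        int numDocs = 10000000;

        BitSet bits = new BitSet(numDocs);
        bits.set(0, numDocs); // MatchAllDocsQuery: every doc matches

        // BitSet stores its bits in a long[], so the payload is
        // roughly numDocs / 8 bytes (plus a few words of overhead).
        long approxBytes = (long) bits.size() / 8;
        System.out.println(approxBytes); // prints 1250000, i.e. ~1.2 MB for 10M docs
    }
}
```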
Have fun,
Erick

> wojtek
>
> 2008/3/26, Erick Erickson <[EMAIL PROTECTED]>:
> > Why not keep a Filter in memory? It consists of a single bit per
> > document, and the ordinal position of that bit is the Lucene doc ID.
> > You could create this reasonably quickly for the *first* query that
> > came in, via a HitCollector.
> >
> > Then each time you wanted another chunk, use the filter to know which
> > docs to return. You could either, say, extend the Filter class and
> > add some bookkeeping, or just zero out each bit that you returned to
> > the user.
> >
> > NOTE: you don't get relevance this way, but for the case of returning
> > all docs do you really want it?
> >
> > About updating the index. Remember that there is no "update in
> > place". So you'll only have to check whether any document in the
> > filter has been deleted when you are returning. Then you'd have to do
> > something about looking for any new additions as you returned the
> > last document in the set...
> > But remember that until you close/reopen the searcher, you won't see
> > changes anyway.
> >
> > But you may not need to do any of this. If, each time you return a
> > chunk, you're using a Hits object, then this is the first thing I'd
> > change. A Hits object re-executes the query every 100th element you
> > look at. So, assume you have something like this (bad pseudo-code):
> >
> > for (int idx = 0; idx < firstDocInChunk; ++idx) {
> >     hits.id(idx); // skip past docs already returned; each access
> >                   // past a multiple of 100 re-runs the query
> > }
> > for (int idx = firstDocInChunk; idx < firstDocInChunk + chunkSize; ++idx) {
> >     // assemble hits.doc(idx) for return
> > }
> >
> > If the first doc you want to return is number 1,000, you'll actually
> > be re-executing the query 10 times. Which probably accounts for your
> > quadratic time.
> >
> > So I'd try just using a new HitCollector each time and see if that
> > solves your problems before getting fancy.
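The quadratic blow-up Erick describes is easy to see in a toy model.
This sketch is plain Java with no Lucene; the "re-execute every 100th
access" behavior of Hits is simulated, not taken from the real class. It
counts query executions when each chunk request walks a fresh Hits object
from result 0, versus running the query once into a collector:

```java
public class QuadraticHits {
    public static void main(String[] args) {
        int totalDocs = 1000, chunkSize = 100;

        // Hits-style: every chunk request creates a fresh Hits object
        // and walks from result 0, so earlier results are paid for again.
        int hitsStyleExecutions = 0;
        for (int start = 0; start < totalDocs; start += chunkSize) {
            for (int i = 0; i < start + chunkSize; i++) {
                if (i % 100 == 0) {
                    hitsStyleExecutions++; // query re-run to extend Hits' cache
                }
            }
        }

        // Collect-once style: run the query a single time into a
        // HitCollector-like structure, then serve every chunk from it.
        int collectOnceExecutions = 1;

        System.out.println(hitsStyleExecutions);   // 55: 1+2+...+10, quadratic growth
        System.out.println(collectOnceExecutions); // 1
    }
}
```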
> > There really shouldn't be any noticeable difference between the first
> > and last request unless you're doing something like accessing the
> > documents before you get to the first one you expect to return. And a
> > TopDocs should even preserve scoring.
> >
> > Best,
> >
> > Erick
> >
> > On Wed, Mar 26, 2008 at 5:48 AM, Wojtek H <[EMAIL PROTECTED]> wrote:
> > > Hi all,
> > >
> > > our problem is to choose the best (the fastest) way to iterate over
> > > a huge set of documents (the basic and most important case is to
> > > iterate over all documents in the index). A slow process accesses
> > > the documents, and currently this is done by repeating a query (for
> > > instance MatchAllDocsQuery). It processes the first N docs, then
> > > repeats the query and processes the next N docs, and so on.
> > > Repeating the query means, in fact, quadratic time! So we are
> > > thinking about changing the way docs are accessed.
> > > In the case of a generic query, the only way we see to speed it up
> > > is to keep the HitCollector in memory between requests for docs.
> > > Isn't this approach too memory-consuming?
> > > In the case of iterating over all documents, I was wondering if
> > > there is a way to determine the set of index ids over which we
> > > could iterate (and of course control index changes - if the index
> > > is changed between requests, we should probably invalidate the
> > > 'iterating session').
> > > What is the best solution for this problem?
> > > Thanks and regards,
> > >
> > > wojtek
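Erick's "zero out each bit that you returned" bookkeeping can also be
sketched without Lucene: hold the matching doc IDs in a java.util.BitSet
(as a Filter would give you in Lucene 2.x) and serve chunks with
nextSetBit, clearing each bit as it goes out. The class and method names
here are made up for illustration:

```java
import java.util.BitSet;

public class BitSetChunks {

    // Returns up to chunkSize doc IDs from the remaining set,
    // clearing each bit it hands out so no doc is served twice.
    static int[] nextChunk(BitSet remaining, int chunkSize) {
        int[] ids = new int[chunkSize];
        int n = 0;
        for (int doc = remaining.nextSetBit(0);
             doc >= 0 && n < chunkSize;
             doc = remaining.nextSetBit(doc + 1)) {
            ids[n++] = doc;
            remaining.clear(doc); // bookkeeping: won't be returned again
        }
        int[] out = new int[n];
        System.arraycopy(ids, 0, out, 0, n);
        return out;
    }

    public static void main(String[] args) {
        BitSet remaining = new BitSet();
        remaining.set(0, 10); // pretend docs 0..9 matched the query

        int served = 0;
        while (!remaining.isEmpty()) {
            served += nextChunk(remaining, 4).length; // chunks of 4, 4, 2
        }
        System.out.println(served); // prints 10: each matching doc served once
    }
}
```

Remember Erick's caveat: this snapshot goes stale as soon as you reopen
the reader, so deletions must be re-checked when returning docs.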