Thank you very much!
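
For the record, an equivalent way to apply your fix is to keep the segment's
docBase from setNextReader and rebase each docID to the top-level searcher.
A minimal sketch, assuming the same Lucene 4.x Collector API as in the code
below (untested):

indexSearcher.search(query, queryFilter, new Collector() {
    int docBase;  // start of the current segment within the composite reader

    @Override
    public void setScorer(Scorer scorer) throws IOException {
    }

    @Override
    public void setNextReader(AtomicReaderContext ctx) throws IOException {
        docBase = ctx.docBase;  // remember where this segment begins
    }

    @Override
    public void collect(int docID) throws IOException {
        // docBase + docID is relative to the top-level IndexSearcher
        Document doc = indexSearcher.doc(docBase + docID, loadFields);
        found.found(doc);
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
});
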
> On Nov 14, 2015, at 15:49, Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> This code is buggy! The collect() call of the collector does not get a
> document ID relative to the top-level IndexSearcher; it only gets a document
> ID relative to the reader reported in setNextReader (which is an atomic
> reader responsible for a single Lucene index segment).
>
> In setNextReader, save a reference to the "current" reader, and use this
> "current" reader to get the stored fields:
>
> indexSearcher.search(query, queryFilter, new Collector() {
>     AtomicReader current;
>
>     @Override
>     public void setScorer(Scorer arg0) throws IOException {
>     }
>
>     @Override
>     public void setNextReader(AtomicReaderContext ctx) throws IOException {
>         current = ctx.reader();
>     }
>
>     @Override
>     public void collect(int docID) throws IOException {
>         // docID is relative to "current", so load stored fields from it
>         Document doc = current.document(docID, loadFields);
>         found.found(doc);
>     }
>
>     @Override
>     public boolean acceptsDocsOutOfOrder() {
>         return true;
>     }
> });
>
> Otherwise you get wrong document IDs reported!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: Valentin Popov [mailto:valentin...@gmail.com]
>> Sent: Saturday, November 14, 2015 1:04 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: 500 millions document for loop.
>>
>> Hi, Uwe.
>>
>> Thanks for your advice.
>>
>> After implementing your suggestion, our calculation time dropped from
>> ~20 days to 3.5 hours.
>>
>> /**
>>  * DocumentFound is the callback invoked for each found document.
>>  */
>> public void iterate(SearchOptions options, final DocumentFound found,
>>         final Set<String> loadFields) throws Exception {
>>     Query query = options.getQuery();
>>     Filter queryFilter = options.getQueryFilter();
>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>
>>     indexSearcher.search(query, queryFilter, new Collector() {
>>
>>         @Override
>>         public void setScorer(Scorer arg0) throws IOException {
>>         }
>>
>>         @Override
>>         public void setNextReader(AtomicReaderContext arg0) throws IOException {
>>         }
>>
>>         @Override
>>         public void collect(int docID) throws IOException {
>>             // Note: docID here is segment-relative, not top-level --
>>             // this is the bug Uwe describes above.
>>             Document doc = indexSearcher.doc(docID, loadFields);
>>             found.found(doc);
>>         }
>>
>>         @Override
>>         public boolean acceptsDocsOutOfOrder() {
>>             return true;
>>         }
>>     });
>> }
>>
>>> On Nov 12, 2015, at 21:15, Uwe Schindler <u...@thetaphi.de> wrote:
>>>
>>> Hi,
>>>
>>>>> The big question is: Do you need the results paged at all?
>>>>
>>>> Yup, because if we return all results, we get an OOME.
>>>
>>> You get the OOME because the paging collector cannot handle that, so
>>> this is an XY problem. Would it not be better if your application just
>>> got the results as a stream and processed them one after another? If
>>> that is the case (and most statistics need it like that), you are much
>>> better off NOT using TopDocs! Your requirement is diametrically opposed
>>> to getting top-scoring documents: you want ALL results as a sequence.
>>>
>>>>> Do you need them sorted?
>>>>
>>>> Nope.
>>>
>>> OK, so unsorted streaming is the right approach.
>>>
>>>>> If not, the easiest approach is to use a custom Collector that does
>>>>> no sorting and just consumes the results.
>>>>
>>>> The main bottleneck, as I see it, comes from the next-page search,
>>>> which takes ~2-4 seconds.
>>>
>>> This is because when paging, the collector has to re-execute the whole
>>> query and sort all results again, just with a larger window. So if you
>>> have result pages of 50,000 results and you want the second page, it
>>> will internally sort 100,000 results, because the first page needs to
>>> be calculated, too. As you go forward through the results, the window
>>> gets larger and larger, until it finally collects all results.
>>>
>>> So getting the results as a stream by implementing the Collector API
>>> is the right way to do this.
>>>
>>>>>
>>>>> Uwe
>>>>>
>>>>> -----
>>>>> Uwe Schindler
>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>> http://www.thetaphi.de
>>>>> eMail: u...@thetaphi.de
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Valentin Popov [mailto:valentin...@gmail.com]
>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>> To: java-user@lucene.apache.org
>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>
>>>>>> Toke, thanks!
>>>>>>
>>>>>> We will look at this solution; it looks like just what we need.
>>>>>>
>>>>>>> On Nov 12, 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>>>>>
>>>>>>> Valentin Popov <valentin...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We have ~10 indexes with 500M documents. Each document has an
>>>>>>>> «archive date» and a «to» address, and one of our tasks is to
>>>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>>>> results at 50k records per page. The bottleneck of that approach
>>>>>>>> is that pagination takes too long: even on a powerful server it
>>>>>>>> takes ~20 days to execute.
>>>>>>>
>>>>>>> Lucene does not like deep page requests due to the way the
>>>>>>> internal Priority Queue works.
>>>>>>> Solr has CursorMark, which should be fairly simple to emulate in
>>>>>>> your Lucene handling code:
>>>>>>>
>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>>>
>>>>>>> - Toke Eskildsen
>>>>>>
>>>>>> Regards,
>>>>>> Valentin Popov
>>>>
>>>> Best regards,
>>>> Valentin Popov
>>
>> Best regards,
>> Valentin Popov

Best regards,
Valentin Popov
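
P.S. For the archives: if streaming through a Collector is ever not an option
and page-at-a-time access is really needed, Toke's CursorMark idea can be
approximated in plain Lucene with IndexSearcher.searchAfter, which keeps the
priority queue at one page instead of re-sorting an ever-growing window. A
rough sketch (untested; it reuses query, queryFilter, found and loadFields
from the code above, and the 50k page size is just the figure from this
thread):

final int pageSize = 50000;
ScoreDoc after = null;  // the "cursor": last hit of the previous page
while (true) {
    // searchAfter only collects hits that sort after the given ScoreDoc,
    // so the internal queue never holds more than pageSize entries.
    TopDocs page = indexSearcher.searchAfter(after, query, queryFilter, pageSize);
    if (page.scoreDocs.length == 0) {
        break;  // all results consumed
    }
    for (ScoreDoc hit : page.scoreDocs) {
        // hit.doc is relative to the top-level IndexSearcher here
        found.found(indexSearcher.doc(hit.doc, loadFields));
    }
    after = page.scoreDocs[page.scoreDocs.length - 1];
}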