RE: Language Specific Analyzer

2015-11-14 Thread Uwe Schindler
Hi, you cannot change the behavior of the predefined analyzers! But since Lucene 5 there is no need to write your own subclass to define a custom analyzer. Just use CustomAnalyzer and define via its fluent builder API what your analysis chain should look like (see the example in the javadocs): https://lucene.apache. …
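A minimal sketch of what Uwe describes, matching the requirements in the next message (keep punctuation, lowercase, stem). The factory names here are the standard analysis-common SPI names; "porterstem" assumes English text, so substitute the stemmer for your language:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    Analyzer buildAnalyzer() throws IOException {
      return CustomAnalyzer.builder()
          .withTokenizer("whitespace")   // splits on whitespace only, so punctuation survives
          .addTokenFilter("lowercase")   // lowercase every token
          .addTokenFilter("porterstem")  // English stemmer; e.g. "italianLightStem" for Italian
          .build();
    }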

Language Specific Analyzer

2015-11-14 Thread marco turchi
Dear Users, I need to develop my own language-specific analyzer that: 1) does not remove punctuation 2) lowercases and stems each term in the text. I have tried some of the pre-implemented language analyzers (e.g. the German and Italian analyzers), but they remove punctuation. I'm not sure, but probably …

Re: 500 millions document for loop.

2015-11-14 Thread Valentin Popov
Return "false" for "out of order", save 1 sec for 1M records, at the end it save 500 sec or ~10 minutes! Thank you! > On 14 нояб. 2015 г., at 15:54, Uwe Schindler wrote: > > For performance reasons, I would also return "false" for "out of order" > documents. This allows to access stored fiel

Re: 500 millions document for loop.

2015-11-14 Thread Valentin Popov
Thank you! Will follow your suggestion. > On Nov 14, 2015, at 15:54, Uwe Schindler wrote: > > For performance reasons, I would also return "false" for "out of order" > documents. This allows accessing stored fields more efficiently > (otherwise it seeks too much). For this type of …

RE: 500 millions document for loop.

2015-11-14 Thread Uwe Schindler
For performance reasons, I would also return "false" for "out of order" documents. This allows accessing stored fields more efficiently (otherwise it seeks too much). For this type of collector the IO cost is higher than the small computing-performance increase gained from out-of-order documents …
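For reference, the override being discussed, sketched against the Lucene 4.x Collector API (4.x is an assumption, inferred from the setNextReader() discussion in the messages below):

    @Override
    public boolean acceptsDocsOutOfOrder() {
      // false = docs are collected in increasing doc-ID order, so
      // stored-field reads seek forward only (cheaper IO)
      return false;
    }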

Re: 500 millions document for loop.

2015-11-14 Thread Valentin Popov
Thank you very much! > On Nov 14, 2015, at 15:49, Uwe Schindler wrote: > > Hi, > > This code is buggy! The collect() call of the collector does not get a > document ID relative to the top-level IndexSearcher; it only gets a document > ID relative to the reader reported in setNextReader …

RE: 500 millions document for loop.

2015-11-14 Thread Uwe Schindler
Hi, This code is buggy! The collect() call of the collector does not get a document ID relative to the top-level IndexSearcher; it only gets a document ID relative to the reader reported in setNextReader (which is an atomic reader responsible for a single Lucene index segment). In setNextReader …
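A minimal sketch of a collector that handles segment-relative doc IDs the way Uwe describes, again assuming the Lucene 4.x Collector API; the class name and fields are illustrative, not Valentin's actual code:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class IterateCollector extends Collector {
      private AtomicReaderContext context;  // current segment

      @Override
      public void setScorer(Scorer scorer) {
        // scores are not needed for plain iteration
      }

      @Override
      public void setNextReader(AtomicReaderContext context) {
        // doc IDs passed to collect() are relative to this segment's reader
        this.context = context;
      }

      @Override
      public void collect(int doc) throws IOException {
        // read stored fields from the segment reader directly; do NOT call
        // IndexSearcher.doc(doc) here, because doc is segment-relative
        Document stored = context.reader().document(doc);
        // if a top-level ID is needed: context.docBase + doc
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return false;  // in-order collection: cheaper stored-field IO
      }
    }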

Re: 500 millions document for loop.

2015-11-14 Thread Valentin Popov
Hi, Uwe. Thanks for your advice. After implementing your suggestion, our calculation time dropped from ~20 days to 3.5 hours.

    /**
     * DocumentFound - callback function for each document
     */
    public void iterate(SearchOptions options, final DocumentFound found, final Set loadFields) throws Exception …
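A plausible way to drive such a per-document callback over the whole index, reusing the IterateCollector sketch above (MatchAllDocsQuery stands in for whatever query SearchOptions actually produces; this wiring is an assumption, not code from the thread):

    // run the collector over every document in the index
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.search(new MatchAllDocsQuery(), new IterateCollector());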

RE: debugging growing index size

2015-11-14 Thread Rob Audenaerde
Thank you all, I will investigate and fix things further! On Nov 14, 2015 10:00, "Uwe Schindler" wrote: > I agree. On Linux it is impossible that MMapDirectory is the reason! Only > on Windows is it impossible to delete files that are still open/mapped. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > …

RE: debugging growing index size

2015-11-14 Thread Uwe Schindler
I agree. On Linux it is impossible that MMapDirectory is the reason! Only on Windows is it impossible to delete files that are still open/mapped. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Michael McCandless [mailto: …