forceMerge(1) leads to ~10% perf gains

2023-09-21 Thread qrdl kaggle
After testing on 4800 fairly complex queries, I see a performance gain of 10% after doing indexWriter.forceMerge(1); indexWriter.commit(); going from 209 ms per query to 185 ms per query. The queries are quite complex, often about 30 or so words, of the format OR text: It went from 214 to 14 files on the for
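
As a quick check on the figures quoted above, the following is a minimal sketch that turns the two per-query timings (209 ms and 185 ms) into a relative latency reduction and a speedup factor. The class and method names are hypothetical and exist only for this illustration; nothing here calls Lucene.

```java
public class MergeGain {
    // Fraction by which per-query latency dropped: (before - after) / before.
    static double relativeReduction(double beforeMs, double afterMs) {
        return (beforeMs - afterMs) / beforeMs;
    }

    // How many times faster a query runs after the change: before / after.
    static double speedup(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    public static void main(String[] args) {
        double before = 209.0; // ms per query before forceMerge(1), from the post
        double after = 185.0;  // ms per query after forceMerge(1), from the post
        System.out.printf("latency reduction: %.1f%%%n", 100 * relativeReduction(before, after));
        System.out.printf("speedup factor: %.2f%n", speedup(before, after));
    }
}
```

With these inputs the reduction works out to roughly 11.5% and the speedup to about 1.13, which is in line with the "~10%" in the subject.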

Re: How to retain % sign next to number during tokenization

2023-09-21 Thread Amitesh Kumar
Thank you! I will give it a try and share my findings with you all. Regards, Amitesh On Thu, Sep 21, 2023 at 08:18 Uwe Schindler wrote: > The problem with WhitespaceTokenizer is that it splits only on > whitespace. If you have text like "This is, was some test." then you get > tokens like "is," a

Re: How to retain % sign next to number during tokenization

2023-09-21 Thread Uwe Schindler
The problem with WhitespaceTokenizer is that it splits only on whitespace. If you have text like "This is, was some test." then you get tokens like "is," and "test." including the punctuation. This is the reason why StandardTokenizer is normally used for human-readable text. WhitespaceTokeniz
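
The whitespace-only splitting described above can be imitated in plain Java. This is only a sketch of the behaviour, not Lucene's WhitespaceTokenizer itself: the class and method names are hypothetical, and the second input string is an invented example for the "%" case in the thread subject.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitDemo {
    // Split only on runs of whitespace; punctuation stays attached to the token.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Example sentence from the message: "is," and "test." keep their punctuation.
        System.out.println(splitOnWhitespace("This is, was some test."));
        // Invented example: the "%" stays next to the number, because nothing but
        // whitespace separates tokens.
        System.out.println(splitOnWhitespace("Growth was 5% last year"));
    }
}
```

The same property that keeps "5%" intact is what leaves "is," and "test." with trailing punctuation, which is the trade-off the message points out.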

Re: How to retain % sign next to number during tokenization

2023-09-21 Thread Mikhail Khludnev
Hello, I'm surprised and doubt that this can happen. Would you mind uploading a short test reproducing it? On Wed, Sep 20, 2023 at 11:44 PM Amitesh Kumar wrote: > Thanks Mikhail! > > I have tried all other tokenizers from Lucene 4.4. In case of > WhitespaceTokenizer, it loses romanizing of special ch