Re: standard tokenizer seemingly splitting on dot

2023-05-03 Thread Shawn Heisey
On 5/2/23 15:30, Bill Tantzen wrote:
> This works as I expected: ab00c.tif -- tokenizes as it should, with a value of ab00c.tif. This doesn't work as I expected: ab003.tif -- tokenizes with a result of ab003 and tif.

I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode handling.

Re: standard tokenizer seemingly splitting on dot

2023-05-03 Thread Bill Tantzen
Shawn, No, email addresses are not preserved -- from the docs:

- The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

But the non-split on "test.com" vs. the split on "test7.com" is unexpected! ~~Bill

On Wed, May 3,
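The split-vs-no-split behavior in this thread is consistent with the UAX#29 word-break rules that StandardTokenizer follows: FULL STOP is a MidNumLet character, which is kept inside a token only when both neighbors are letters (or both are digits). The sketch below is a toy model of just that rule, not Lucene's actual implementation, to show why "ab00c.tif" and "test.com" survive intact while "ab003.tif" and "test7.com" split (digit before the dot, letter after it).

```java
import java.util.ArrayList;
import java.util.List;

public class DotSplitSketch {
    // Toy model of the relevant UAX#29 rules: '.' (MidNumLet) joins a token
    // only between letter.letter (WB6/WB7) or digit.digit (WB11/WB12).
    // Illustration only -- not Lucene's StandardTokenizer.
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                cur.append(c);
            } else if (c == '.' && i > 0 && i + 1 < s.length()
                    && dotJoins(s.charAt(i - 1), s.charAt(i + 1))) {
                cur.append(c);  // dot kept inside the token
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    // Dot survives only between letter.letter or digit.digit neighbors.
    static boolean dotJoins(char before, char after) {
        return (Character.isLetter(before) && Character.isLetter(after))
            || (Character.isDigit(before) && Character.isDigit(after));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ab00c.tif")); // [ab00c.tif]  c.t is letter.letter
        System.out.println(tokenize("ab003.tif")); // [ab003, tif] 3.t is digit.letter
        System.out.println(tokenize("test.com"));  // [test.com]
        System.out.println(tokenize("test7.com")); // [test7, com] 7.c is digit.letter
    }
}
```

Under this model the "@" behavior from the docs also falls out naturally: "@" is plain token-splitting punctuation with no joining rule at all.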

Solr logs (hits value) and memory allocation

2023-05-03 Thread Vincenzo D'Amore
Hi all, just asking if there could be some correlation between the amount of memory allocated by a Solr query and the number of *hits* reported in the Solr logs. I haven't found anything in the Solr documentation. Do you know if there is any guidance on the hits value? Thanks, Vincenzo -- Vincenzo D'

Re: Solr logs (hits value) and memory allocation

2023-05-03 Thread Markus Jelsma
Hello Vincenzo, Yes. Last time I checked, an array of ScoreDoc objects is created for each query, sized to the numFound of the local core/replica. This should be clearly visible in VisualVM. It happens in SolrIndexSearcher. Regards, Markus

On Wed, 3 May 2023 at 17:20, Vincenzo D'Amor

Re: Solr logs (hits value) and memory allocation

2023-05-03 Thread Vincenzo D'Amore
Hi Markus, thanks for your explanation. What if I submit the query q=*:*&rows=0 and there are 200M documents in the Solr core? Will Solr allocate a ScoreDoc array that big? On Wed, May 3, 2023 at 5:32 PM Markus Jelsma wrote: > Hello Vincenzo, > > Yes. Last time i checked, an array of
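To put a number on the worst case described above, here is a hedged back-of-envelope sketch. It assumes roughly 32 bytes per ScoreDoc instance (object header plus int doc, float score, int shardIndex, with padding) and a 4-byte compressed reference per array slot on a 64-bit JVM; whether the array is really sized by numFound or by rows depends on the code path, so this only quantifies the scenario in the question.

```java
public class ScoreDocEstimate {
    // Assumed per-ScoreDoc cost: ~32 bytes per object + 4-byte compressed
    // array reference on a 64-bit JVM. Rough estimate only.
    static long estimateBytes(long numFound) {
        long perObject = 32, perRef = 4;
        return numFound * (perObject + perRef);
    }

    public static void main(String[] args) {
        long numFound = 200_000_000L;
        long bytes = estimateBytes(numFound);  // 7_200_000_000 bytes
        System.out.printf("%.1f GiB%n", bytes / (1024.0 * 1024 * 1024)); // ~6.7 GiB
    }
}
```

At 200M hits that is on the order of 6-7 GiB of heap, which is why the question matters.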

Re: Solr logs (hits value) and memory allocation

2023-05-03 Thread Kevin Risden
Here is an example calculation from bytes to the number of entries held in the bitset:

(2864256 - 12 - 12) / 24 = 119343 long objects = 22913856 entries

The above is from a cluster where each query generates a bitset of 2864256 bytes, ~2.8 MB on heap. This is for 22 million results in the results
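One way to reconstruct the figures in the message above is to assume a FixedBitSet-style layout: one bit per entry, packed into 8-byte longs, plus a small, JVM-dependent object/array overhead (taken here as ~24 bytes to match the 12 + 12 in the calculation). This is a sketch of the arithmetic, not a claim about the exact on-heap layout.

```java
public class BitSetSizeEstimate {
    // Assumed layout: ceil(entries / 64) backing longs at 8 bytes each,
    // plus ~24 bytes of object/array overhead (JVM-dependent).
    static long bitsetBytes(long entries) {
        long longs = (entries + 63) / 64;
        return longs * 8 + 24;
    }

    public static void main(String[] args) {
        long entries = 22_913_856L;
        long bytes = bitsetBytes(entries);  // 358_029 longs -> 2_864_256 bytes
        System.out.println(bytes + " bytes, ~2.8 MB on heap");
    }
}
```

Run forward, ~22.9 million entries comes out to exactly the 2864256 bytes reported from the cluster.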

Re: Help regarding solr request timeout because of spellcheck component performance.

2023-05-03 Thread Chris Hostetter
1) timeAllowed does limit spellcheck (at least in all the code paths I can think of that may be "slow") ... have you tried it?

2) What is your configuration for the dictionaries you are using?

3) Be wary of https://github.com/apache/lucene/issues/12077

: Date: Tue, 2 May 2023 00:04:27 +0530
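For point 1, trying timeAllowed is just a matter of adding the parameter (a millisecond budget) to the request alongside the spellcheck parameters. A minimal sketch, where the host, core name, and query string are placeholders, not values from this thread:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TimeAllowedQuery {
    // Builds a Solr select URL with a timeAllowed budget in milliseconds.
    // Host, core, and query below are hypothetical placeholders.
    static String buildUrl(String base, String q, int timeAllowedMs) {
        return base
            + "?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
            + "&spellcheck=true"
            + "&timeAllowed=" + timeAllowedMs;
    }

    public static void main(String[] args) {
        String url = buildUrl("http://localhost:8983/solr/mycore/select",
                              "some misspeled phrase", 2000);
        System.out.println(url);
    }
}
```

Note that timeAllowed makes Solr abandon work past the budget, so results (including spellcheck suggestions) may be partial when the limit is hit.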