Re: standard tokenizer seemingly splitting on dot
On 5/2/23 15:30, Bill Tantzen wrote:
> This works as I expected:
> ab00c.tif -- tokenizes as it should with a value of ab00c.tif
>
> This doesn't work as I expected
> ab003.tif -- tokenizes with a result of ab003 and tif

I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode handling. I am pretty sure ICU4J is IBM's implementation of Unicode. I think StandardTokenizer is using a different implementation.

I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses reference icu4j version 70.1, which is dated Oct 28, 2021 on Maven Central.

Two different Unicode implementations are doing exactly the same thing. Is it a bug, or expected behavior? It does mean filenames are sometimes not being handled in the way you expect.

I ran another check ... I had thought that StandardTokenizer preserved email addresses as a single token ... but I am seeing that t...@test.com is split into two terms. It splits t...@test7.com into three terms.

Thanks,
Shawn
Re: standard tokenizer seemingly splitting on dot
Shawn,

No, email addresses are not preserved -- from the docs:

- The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

But the non-split on "test.com" vs. the split on "test7.com" is unexpected!

~~Bill

--
Bill Tantzen
University of Minnesota Libraries
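For reference, the splits reported in this thread match the Unicode word-break rules (UAX #29) that StandardTokenizer implements: a "." is kept inside a token only when it sits between two letters or between two digits, so "3.t" and "7.c" break while "c.t" and "t.c" do not. The following is a minimal ASCII-only sketch of just those rules, not Lucene's implementation; the function names are hypothetical, and the address `user@test.com` is a made-up stand-in for the elided address in the messages above.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
MID_NUM_LET = {"."}  # U+002E FULL STOP has Word_Break=MidNumLet in UAX #29


def word_class(ch):
    """Classify one ASCII character into a simplified Word_Break class."""
    if ch in LETTERS:
        return "ALetter"
    if ch in DIGITS:
        return "Numeric"
    if ch in MID_NUM_LET:
        return "MidNumLet"
    return "Other"  # includes "@", which always splits


def tokenize(text):
    """Split text using a simplified subset of the UAX #29 word-break rules.

    Letters and digits join freely (WB5, WB8, WB9, WB10). A "." is kept only
    between two letters (WB6/WB7) or between two digits (WB11/WB12).
    """
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        cls = word_class(ch)
        if cls in ("ALetter", "Numeric"):
            current += ch
        elif (
            cls == "MidNumLet"
            and current
            and i + 1 < len(text)
            and word_class(text[i + 1]) in ("ALetter", "Numeric")
            and word_class(text[i - 1]) == word_class(text[i + 1])
        ):
            current += ch  # letter.letter or digit.digit: no break
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


for sample in ("ab00c.tif", "ab003.tif", "user@test.com", "user@test7.com"):
    print(sample, "->", tokenize(sample))
```

Under this reading the behavior is expected rather than a bug, which would also explain why ICUTokenizer agrees. If filenames need to survive intact, a different tokenizer or a string field is probably the safer choice.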
Solr logs (hits value) and memory allocation
Hi all,

Is there a correlation between the amount of memory allocated by a Solr query and the number of *hits* reported in the Solr logs? I haven't found anything in the Solr documentation.

Do you know if there is any advice regarding the hits value?

Thanks,
Vincenzo

--
Vincenzo D'Amore
Re: Solr logs (hits value) and memory allocation
Hello Vincenzo,

Yes. Last time I checked, an array of ScoreDoc objects is created for each query, with the size of the numFound for the local core/replica. This should be clearly visible in VisualVM. This happens in SolrIndexSearcher.

Regards,
Markus
Re: Solr logs (hits value) and memory allocation
Hi Markus,

Thanks for your explanation. What if I submit a query q=*:*&rows=0 and there are 200M documents in the Solr core? Will that allocate an array of ScoreDoc objects that large?

--
Vincenzo D'Amore
Re: Solr logs (hits value) and memory allocation
Here is an example calculation of bytes -> number of entries held, taken from the bitset:

(2864256-12-12)/24 = 119343 long objects = 22913856 entries

The above is from a cluster where each query generates a bitset of 2864256 bytes, ~2.8 MB on heap. This is for 22 million results in the result set. There is some algorithmic logic that decides whether this is a sparse bitset or a fixed bitset; above a certain result size it is always a fixed bitset [1]. It grows based on the number of documents in the result set for the shard.

This is easily viewable with a profiler like async-profiler, where bitsets are created for each query.

I recently looked at this in https://issues.apache.org/jira/browse/SOLR-16555 where filterCache bitsets were being recreated over and over if there were multiple fq clauses. SOLR-16555 drastically reduced heap usage on the cluster I was working on (you can see some of the before/after metrics on the PR).

If you have a shard with 200M documents, I think that could be ~20MB per bitset per query.

[1] https://github.com/apache/solr/blame/main/solr/core/src/java/org/apache/solr/search/DocSetUtil.java#L46

PS - for G1 GC almost all of these big bitsets are humongous allocations (due to G1 region size), and I don't know whether that is a problem or not. It's something I'd like to look at further, but I haven't had time to benchmark or look at other approaches.

Kevin Risden
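The byte and entry figures above can be cross-checked with a short calculation. This is only a sketch: it assumes the bitset stores one bit per document packed into 64-bit longs (as Lucene's FixedBitSet does), and it takes the 24 bytes of header overhead from the 12 + 12 subtracted in the message. The function names are hypothetical.

```python
HEADER_BYTES = 12 + 12  # overhead figure taken from the message above (assumption)


def bitset_payload_bytes(num_bits):
    """Bytes needed for num_bits bits packed into 64-bit longs."""
    words = (num_bits + 63) // 64  # round up to whole longs
    return words * 8


def bits_from_heap_bytes(heap_bytes):
    """Number of bits (entries) held by a bitset of the given on-heap size."""
    return (heap_bytes - HEADER_BYTES) * 8


# The cluster example: a 2864256-byte bitset holds 22913856 entries.
print(bits_from_heap_bytes(2864256))

# A shard with 200M documents: payload size of one fixed bitset, in bytes.
print(bitset_payload_bytes(200_000_000))
```

For 200M documents this gives 25,000,000 bytes, the same order of magnitude as the ~20MB estimate above.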
Re: Help regarding solr request timeout because of spellcheck component performance.
1) timeAllowed does limit spellcheck (at least in all the code paths I can think of that may be "slow") ... have you tried it?

2) What is your configuration for the dictionaries you are using?

3) Be wary of https://github.com/apache/lucene/issues/12077

: Date: Tue, 2 May 2023 00:04:27 +0530
: From: kumar gaurav
: Reply-To: users@solr.apache.org
: To: solr-u...@lucene.apache.org, users@solr.apache.org
: Subject: Re: Help regarding solr request timeout because of spellcheck
:     component performance.
:
: Just a reminder if someone can help here.
:
: On Mon, 24 Apr 2023 at 13:40, kumar gaurav wrote:
:
: > ++ users@solr.apache.org
: >
: > On Mon, 24 Apr 2023 at 13:12, kumar gaurav wrote:
: >
: >> HI Everyone
: >>
: >> I am getting a solr socket timeout exception in the select search query
: >> because of bad spellcheck performance.
: >>
: >> I am using the spellcheck component in solr select request handler.
: >> solrconfig
: >>
: >> edismax
: >> true
: >> 1
: >> AND
: >> 100
: >> true
: >> 25
: >> false
: >> true
: >> true
: >> true
: >> false
: >> 10
: >> 150
: >> 100%
: >> default
: >> wordbreak
: >>
: >> spellcheck
: >>
: >> Do we have any time allowed parameter for spellcheck like query
: >> timeAllowed parameter ?
: >>
: >> how can i identify query timeout because of spellcheck component process ?
: >>
: >> Please help. Thanks in advance.
: >>
: >> --
: >> Thanks & Regards
: >> Kumar Gaurav

-Hoss
http://www.lucidworks.com/
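To try point 1, a request can pass timeAllowed (in milliseconds) alongside spellcheck=true and then inspect the response header. The sketch below only builds the URL and parses a response body; the host, collection name, query text, and helper names are hypothetical. Solr sets partialResults in responseHeader when timeAllowed is exceeded; whether that flag also appears when the limit is hit inside the spellcheck component specifically is an assumption to verify against your own setup.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, time_allowed_ms):
    """Build a /select URL with spellcheck enabled and a time limit."""
    params = {
        "q": query,
        "spellcheck": "true",
        "timeAllowed": str(time_allowed_ms),  # milliseconds
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def hit_time_limit(response_body):
    """Return True if the JSON response header reports partial results."""
    header = json.loads(response_body).get("responseHeader", {})
    return bool(header.get("partialResults", False))


# Hypothetical host and collection, for illustration only.
url = build_select_url("http://localhost:8983/solr", "mycollection", "helo wrld", 2000)
print(url)
```

Keeping timeAllowed well below the client's socket timeout should turn a socket timeout into a partial response that can be logged and inspected.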