Re: standard tokenizer seemingly splitting on dot

2023-05-03 Thread Shawn Heisey

On 5/2/23 15:30, Bill Tantzen wrote:

This works as I expected:
ab00c.tif -- tokenizes as it should with a value of ab00c.tif

This doesn't work as I expected
ab003.tif -- tokenizes with a result of ab003 and tif


I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode 
handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.  I 
think StandardTokenizer is using a different implementation.


I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses 
reference icu4j version 70.1, which is dated Oct 28, 2021 on maven central.


Two different Unicode implementations are doing exactly the same thing. 
Is it a bug, or expected behavior?  It does mean filenames are sometimes 
not being handled in the way you expect.


I ran another check ... I had thought that StandardTokenizer preserved 
email addresses as a single token ... but I am seeing that t...@test.com 
is split into two terms.  It splits t...@test7.com into three terms.


Thanks,
Shawn


Re: standard tokenizer seemingly splitting on dot

2023-05-03 Thread Bill Tantzen
Shawn,
No, email addresses are not preserved. From the docs:

   The "@" character is among the set of token-splitting punctuation, so
   email addresses are not preserved as single tokens.


but the non-split on "test.com" vs the split on "test7.com" is unexpected!
~~Bill
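
For what it's worth, the behavior reported in this thread matches the Unicode word-break rules (UAX #29) that both StandardTokenizer and ICUTokenizer follow: a "." is kept inside a token only when it sits between two letters or between two digits, and "@" always splits. Below is a minimal sketch of just that rule, not Lucene's actual implementation. The function name is hypothetical, and "user@..." stands in for the address that the archive elided.

```python
def split_on_dots_and_at(text: str) -> list[str]:
    """Simplified sketch of the UAX #29 rules relevant to this thread.

    Assumption: only letters, digits, '.', and '@' matter here.
    A '.' stays inside a token only when both neighbours are letters
    (rules WB6/WB7) or both are digits (WB11/WB12). Anything else splits.
    """
    tokens: list[str] = []
    current = ""
    for i, ch in enumerate(text):
        if ch.isalnum():
            current += ch
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        joins = ch == "." and (
            (prev_ch.isalpha() and next_ch.isalpha())
            or (prev_ch.isdigit() and next_ch.isdigit())
        )
        if joins:
            current += ch  # the dot is part of the token
        elif current:
            tokens.append(current)  # the punctuation ends the token
            current = ""
    if current:
        tokens.append(current)
    return tokens


for sample in ("ab00c.tif", "ab003.tif", "user@test.com", "user@test7.com"):
    print(sample, "->", split_on_dots_and_at(sample))
```

In "ab003.tif" the dot has a digit on its left and a letter on its right, so neither rule applies and the token breaks there. The same happens in "test7.com", which is why that address yields three terms rather than two.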




-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp

Bill Tantzen, University of Minnesota Libraries
612-626-9949 (U of M), 612-325-1777 (cell)


Solr logs (hits value) and memory allocation

2023-05-03 Thread Vincenzo D'Amore
Hi all,

Is there any correlation between the amount of memory allocated by a Solr
query and the *hits* value reported for it in the Solr logs?
I haven't found anything about this in the Solr documentation.

Is there any guidance on the hits value?

Thanks,
Vincenzo

-- 
Vincenzo D'Amore


Re: Solr logs (hits value) and memory allocation

2023-05-03 Thread Markus Jelsma
Hello Vincenzo,

Yes. Last time I checked, an array of ScoreDoc objects is created for each
query with the size of the numFound for the local core/replica. This should
be clearly visible in VisualVM. This happens in SolrIndexSearcher.

Regards,
Markus



Re: Solr logs (hits value) and memory allocation

2023-05-03 Thread Vincenzo D'Amore
Hi Markus,

Thanks for your explanation.
What if I submit a query q=*:*&rows=0 and there are 200M documents in
the Solr core? Will an array of ScoreDoc objects that large be allocated?





-- 
Vincenzo D'Amore


Re: Solr logs (hits value) and memory allocation

2023-05-03 Thread Kevin Risden
Here is an example calculation of bytes -> number of entries held in the
bitset.

(2864256-12-12)/24 = 119343 long objects = 22913856 entries

The above is from a cluster where each query generates a bitset of
2864256 bytes - about 2.8 MB on heap. This is for 22 million results in the
result set. Some logic decides whether this is a sparse bitset or a fixed
bitset - above a certain result size it is always a fixed bitset [1]. It
grows with the number of documents in the result set for the shard.

This is easily viewable with a profiler like async-profiler where bitsets
are created for each query. I recently looked at this in
https://issues.apache.org/jira/browse/SOLR-16555 where filtercache bitsets
were being recreated over and over if there were multiple fq clauses.
SOLR-16555 drastically reduced heap usage on the cluster I was working on
(you can see some of the metrics on the PR from before/after)

If you have a shard with 200M documents - I think that bitset could be
~20MB per bitset per query.
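
A minimal sketch of the arithmetic in this message, assuming one bit per document and the 24 bytes (12 + 12) of overhead subtracted in the calculation above. The function names are hypothetical, not Solr APIs. The 22913856 figure is what you get by treating every byte after that overhead as eight one-bit entries.

```python
BITS_PER_BYTE = 8
BITS_PER_WORD = 64        # one Java long
BYTES_PER_WORD = 8
OVERHEAD_BYTES = 12 + 12  # assumption: the two 12-byte headers subtracted above


def entries_from_heap_bytes(heap_bytes: int) -> int:
    """Number of one-bit entries that fit in a bitset of the given heap size."""
    return (heap_bytes - OVERHEAD_BYTES) * BITS_PER_BYTE


def heap_bytes_for_docs(num_docs: int) -> int:
    """Approximate heap bytes for a fixed bitset with one bit per document."""
    words = (num_docs + BITS_PER_WORD - 1) // BITS_PER_WORD  # round up to whole longs
    return words * BYTES_PER_WORD + OVERHEAD_BYTES


print(entries_from_heap_bytes(2_864_256))  # 22913856
print(heap_bytes_for_docs(200_000_000))    # 25000024
```

Under these assumptions a 200M-document shard comes out at about 25 MB per fixed bitset, the same order of magnitude as the rough estimate above.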

[1]
https://github.com/apache/solr/blame/main/solr/core/src/java/org/apache/solr/search/DocSetUtil.java#L46

PS - for G1 GC almost all of these big bitsets are humongous allocations
(due to G1 region size). I don't know whether that is a problem. It's
something I'd like to look at further, but I haven't had time to benchmark
or look at other approaches.
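
To make the humongous-allocation point concrete: G1 treats an object as humongous when it is larger than about half a heap region. The sketch below lists the region sizes (as set with -XX:G1HeapRegionSize) at which a bitset of a given size would be humongous. The helper names are hypothetical, the candidate region sizes are the usual powers of two from 1 MB to 32 MB, and the exact threshold at the boundary may differ between JVM versions.

```python
MIB = 1024 * 1024
REGION_SIZES_MIB = (1, 2, 4, 8, 16, 32)  # assumption: the common G1 region sizes


def is_humongous(object_bytes: int, region_bytes: int) -> bool:
    """True when the object is larger than half a G1 region."""
    return object_bytes > region_bytes // 2


def humongous_region_sizes(object_bytes: int) -> list[int]:
    """Region sizes (in MiB) for which an object of this size is humongous."""
    return [
        mib for mib in REGION_SIZES_MIB
        if is_humongous(object_bytes, mib * MIB)
    ]


print(humongous_region_sizes(2_864_256))   # the ~2.8 MB bitset from above
print(humongous_region_sizes(25_000_000))  # a bitset for roughly 200M documents
```

By this estimate the ~2.8 MB bitset stops being humongous once regions are 8 MB or larger, while a bitset for 200M documents would be humongous at every region size in the list.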

Kevin Risden




Re: Help regarding solr request timeout because of spellcheck component performance.

2023-05-03 Thread Chris Hostetter


1) timeAllowed does limit spellcheck (at least in all the code paths I can
think of that may be "slow") ... have you tried it?

2) what is your configuration for the dictionaries you are using?

3) be wary of https://github.com/apache/lucene/issues/12077
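
For point 1, a minimal sketch of trying timeAllowed on a spellcheck-enabled request. The host, collection name, query text, and the 2000 ms budget are all hypothetical. Solr normally flags time-limited responses with partialResults in the responseHeader, but whether that flag appears when the limit trips inside the spellcheck component is something to verify on your version.

```python
from urllib.parse import urlencode

# Assumption: hypothetical host and collection; adjust to your setup.
BASE_URL = "http://localhost:8983/solr/mycollection/select"


def build_select_url(query: str, time_allowed_ms: int) -> str:
    """Build a /select URL with spellcheck enabled and a time budget in ms."""
    params = {
        "q": query,
        "spellcheck": "true",
        "timeAllowed": time_allowed_ms,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def is_partial(response: dict) -> bool:
    """True when the parsed JSON response is marked as cut short."""
    header = response.get("responseHeader", {})
    return bool(header.get("partialResults", False))


print(build_select_url("some misspeled query", 2000))
```

Comparing QTime and the partialResults flag with and without spellcheck=true on the same query is one way to see whether the spellcheck step is what uses up the budget.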


: Date: Tue, 2 May 2023 00:04:27 +0530
: From: kumar gaurav 
: Reply-To: users@solr.apache.org
: To: solr-u...@lucene.apache.org, users@solr.apache.org
: Subject: Re: Help regarding solr request timeout because of spellcheck
: component performance.
: 
: Just a reminder if someone can help here.
: 
: On Mon, 24 Apr 2023 at 13:40, kumar gaurav  wrote:
: 
: > ++ users@solr.apache.org
: >
: > On Mon, 24 Apr 2023 at 13:12, kumar gaurav  wrote:
: >
: >> HI Everyone
: >>
: >> I am getting a solr socket timeout exception in the select search query
: >> because of bad spellcheck performance.
: >>
: >> I am using the spellcheck component in solr select request handler.
: >> solrconfig
: >>
: >> 
: >>
: >>   
: >> edismax
: >> true
: >> 1
: >> AND
: >> 100
: >> true
: >> 25
: >> false
: >> true
: >> true
: >> true
: >> false
: >> 10
: >> 150
: >> 100%
: >> default
: >> wordbreak
: >>   
: >>   
: >> spellcheck
: >>   
: >> 
: >>
: >>
: >> Is there a timeout parameter for spellcheck, similar to the query
: >> timeAllowed parameter?
: >>
: >> How can I tell whether a query timed out because of the spellcheck component?
: >>
: >> Please help. Thanks in advance.
: >>
: >>
: >>
: >> --
: >> Thanks & Regards
: >> Kumar Gaurav
: >>
: >
: 

-Hoss
http://www.lucidworks.com/