Re: [External] : Re: Querying Solr Locally through Java API without using HttpClient

2022-12-06 Thread Nagarajan Muthupandian
Hi Chris,

Thanks for the details.

Our use case is simple: we have standalone Solr servers, and a custom request 
handler (plugin) is already defined and working as expected.

The ask is to enhance the functionality of the custom request handler by 
adding an update feature to the flow. The update would query all the 
documents and update one or two fields with a particular value.

If we did the query and update from any other process, it would go over HTTP. 
In our use case the query and update are local to the container, so we 
wanted to avoid HTTP.

As per your suggestion, let me explore 
SearchHandler/QueryComponent/SolrIndexSearcher for our use case.

Thanks
Rajan

From: Chris Hostetter 
Date: Tuesday, 6 December 2022 at 12:34 AM
To: users@solr.apache.org 
Subject: Re: [External] : Re: Querying Solr Locally through Java API without 
using HttpClient

: POC would be to add a function in the plugin.. which would query all the
: documents locally (Say 100+ Million Documents) and update 1 or 2 fields
: with a particular value.
:
: As the plugin would be local to this core.. wanted to avoid HTTP calls.

I'm assuming here that you mean you want to write a *Solr* plugin (ie: a
RequestHandler, SearchComponent, etc...) and from that code do a "query"
to find documents.

Under no circumstances would I suggest that using EmbeddedSolrServer, inside
of a real Solr server, is a good idea.

If you need your plugin to run on a single core, and iterate over docs
from all shards, then you're going to need to make some sort of network
call -- this is what things like the SearchHandler/QueryComponent do.

If you are ok with your plugin only handling the "local" docs, then you
can just talk to the SolrIndexSearcher directly -- the way things like
the QueryComponent do in distrib=false mode.

If you are also planning to *update* these docs, then you're going to need to
be very careful in your code to check whether you are running on the leader
core of each shard, so that you don't have multiple replicas trying to make
the same updates. (You'll also need some way to ensure that your plugin
gets "executed" on every leader -- ie: running on every shard leader is a
requirement, not just a limitation.)
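In SolrCloud, that leader check can be sketched as follows (API names as of Solr 8/9; verify `CloudDescriptor.isLeader()` against your version, and note leadership can change, so this is a point-in-time answer):

```java
import org.apache.solr.cloud.CloudDescriptor;
import org.apache.solr.request.SolrQueryRequest;

public class LeaderCheckSketch {
    // True only on the shard leader, so follower replicas can skip the
    // update work and the same change isn't applied multiple times.
    static boolean isLocalReplicaLeader(SolrQueryRequest req) {
        CloudDescriptor cloud =
            req.getCore().getCoreDescriptor().getCloudDescriptor();
        return cloud != null && cloud.isLeader();
    }
}
```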

But ultimately you've asked a very vague question about a very complicated
concept -- and I would urge you to take a step back and describe your actual
use cases (how are the documents selected? what kinds of updates are you
doing? when will this plugin run? etc...) in more detail so that more
useful/specific advice can be given...

https://people.apache.org/~hossman/#xyproblem

XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: 
http://www.perlmonks.org/index.pl?node_id=542341



-Hoss
http://www.lucidworks.com/


RE: Solr 9.1 performance

2022-12-06 Thread Joe Jones (DHCW - Software Development)
Workaround tested; no difference with or without it.  New cloud set up with a 
1200MB heap for each instance and 32GB system RAM on each server.  I'm seeing 
just over 20GB in system cache.  Anti-virus exclusions applied and the system 
doesn't appear to be swapping unnecessarily.

One thing I have noticed is that elapsed query time is often better when 
running without replicas.  I still see slow start-up on random access where no 
activity has taken place for 2 minutes or more.


The cloud was designed some time ago by design architects and under Solr 5.4.1 
has been running perfectly fine.  It is in use 24/7 so I can't test what it is 
like when idle.  I can only assume at the time the 4 nodes per server were to 
leverage the 4 CPUs allocated to each machine.  That setup has 16GB system RAM 
per server, half of what the new cloud has.
This contains confidential information, so we host in our own datacentres and 
therefore cannot make use of Datadog.


-Original Message-
From: Jan Høydahl  
Sent: 02 December 2022 18:47
To: users@solr.apache.org
Subject: Re: Solr 9.1 performance

WARNING: This email originated from outside of NHS Wales. Do not open links or 
attachments unless you know the content is safe.


What I'm saying is that 9.1 includes a workaround for the cache issues, see 
https://github.com/apache/solr/blob/releases/solr/9.1.0/solr/bin/solr#L2246-L2250
You may want to try to disable this workaround to see if it helps with the 
performance of your system. Alternatively try with JDK11, which does not 
trigger the workaround.

But this is just a shot in the dark; your issues may stem from something else, 
and we'd need many more details on your setup: config, physical RAM, heap, etc.

I would like to question the decision of running 4 solr nodes on the same 
server. Have you tried instead to run one solr process per server, keeping 12 
shards and 2 replicas?
If you enable the affinity placement plugin and tag each node with a 
data-center id and hostname, then Solr will place the shards/replicas evenly 
across all 6 servers.
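For reference, a sketch of enabling that plugin via the cluster plugin API in Solr 9 (the endpoint and factory class are from the ref guide; the config keys shown are assumptions to verify for your exact version):

```shell
# Register the affinity placement plugin cluster-wide (Solr 9.x V2 API).
# Nodes are then tagged via system properties such as availability_zone.
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8983/api/cluster/plugin -d '{
    "add": {
      "name": ".placement-plugin",
      "class": "org.apache.solr.cluster.placement.plugins.AffinityPlacementFactory",
      "config": { "minimalFreeDiskGB": 10, "prioritizedFreeDiskGB": 50 }
    }
  }'
```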

Finally, add some observability to your cluster to learn what is actually going 
on; you can e.g. use Datadog or another cloud provider to get started quickly. 
It will help you discover what is happening in your cluster.

PS: Have you disabled all Antivirus software? Made sure your heap size is as 
low as possible? Verified that your system is not swapping?

Jan

> On 2 Dec 2022 at 17:25, Joe Jones (DHCW - Software Development) wrote:
>
> No, out of the box 9.1 doesn't include the patch.  Tried adding it in and no 
> difference.
>
> I've done some testing running the queries with "distrib=false" and can see 
> the query itself runs fine; it's just the call to the instance and the 
> response that is slow.
>
> Something to do with Jetty?
>
> -Original Message-
> From: Jan Høydahl 
> Sent: 02 December 2022 10:14
> To: users@solr.apache.org
> Subject: Re: Solr 9.1 performance
>
>
>
> Could it be related to 
> https://solr.apache.org/news.html#java-17-bug-affecting-solr ? I doubt it, as 
> you don't use much caching, but hotspot optimization of caches is disabled 
> by default in 9.1. You could try to edit the bin/solr script to disable the 
> patch and see if anything is faster - risking a segfault crash instead :)
>
> Jan
>
>> On 2 Dec 2022 at 10:11, Joe Jones (DHCW - Software Development) wrote:
> Rydym yn croesawu derbyn gohebiaeth yng Nghymraeg. Byddwn yn ateb y fath 
> ohebiaeth yng Nghymraeg ac ni fydd hyn yn arwain at oedi.
> We welcome receiving correspondence in Welsh. We will reply to such 
> correspondence in Welsh and this will not lead to a delay.



FieldCache and _version_field

2022-12-06 Thread Dominique Bejean
Hi,

One of my customers has a huge collection (1.5 billion docs across 14
shards).
All fields are correctly configured to enable docValues
except _version_. They are still using the old configuration with
indexed=true instead of docValues, and hence _version_ populates the
FieldCache in the JVM heap (several GB).

They need to reindex for various reasons, including this one, but this can't
be done for several weeks due to the complexity of handling full
reindexing and continuous indexing at the same time.

Why is the FieldCache populated with the _version_ field when it isn't
explicitly used in sorting, faceting, grouping or functions?
I guess Solr needs this internally.

Is there a workaround to avoid the FieldCache being populated by the
_version_ field while waiting to reindex?

Regards

Dominique


Re: FieldCache and _version_field

2022-12-06 Thread Mikhail Khludnev
Hello, Dominique.
I suppose it's used for updates, and specifically in
AtomicUpdateProcessorFactory
and UpdateLog. Presumably, if that cluster can live without atomic updates,
you can try to drop them out of the update chain.
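For when the reindex does happen: the usual fix is the stock schema definition, where _version_ is not indexed and relies on docValues (a sketch; `plong` is the default configset's field type name and may differ in a custom schema):

```xml
<!-- _version_ backed by docValues instead of uninverting an indexed field;
     it is the uninversion of indexed=true that fills the FieldCache -->
<field name="_version_" type="plong" indexed="false" stored="false" docValues="true"/>
```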

On Tue, Dec 6, 2022 at 5:14 PM Dominique Bejean wrote:


-- 
Sincerely yours
Mikhail Khludnev


Re: Solr 9.1 performance

2022-12-06 Thread Shawn Heisey

On 12/6/22 05:08, Joe Jones (DHCW - Software Development) wrote:

Workaround tested; no difference with or without it.  New cloud set up with a 
1200MB heap for each instance and 32GB system RAM on each server.  I'm seeing 
just over 20GB in system cache.  Anti-virus exclusions applied and the system 
doesn't appear to be swapping unnecessarily.


A 1.2GB heap sounds extremely small for dealing with millions of 
documents, even if each node only handles one shard.  I'd be curious 
what your GC logs might reveal.



The cloud was designed some time ago by design architects and under Solr 5.4.1 
has been running perfectly fine.  It is in use 24/7 so I can't test what it is 
like when idle.  I can only assume at the time the 4 nodes per server were to 
leverage the 4 CPUs allocated to each machine.


If Solr were a single-threaded application, this might make some sense.  
But Solr is heavily multithreaded; there is no need to have multiple 
nodes per machine to take full advantage of a multi-CPU system.  With 
multiple nodes, you're actually more likely to have problems with too 
many threads competing for CPU resources, because each node has no 
insight into the other nodes.


Thanks,
Shawn



Re: Solr 9.1 performance

2022-12-06 Thread Shawn Heisey

On 12/6/22 07:53, Shawn Heisey wrote:
If Solr were a single-threaded application, this might make some 
sense.  But Solr is heavily multithreaded; there is no need to have 
multiple nodes per machine to take full advantage of a multi-CPU 
system.  With multiple nodes, you're actually more likely to have 
problems with too many threads competing for CPU resources, because 
each node has no insight into the other nodes.


This is a jconsole session connected to my tiny Solr install -- tiny 
meaning a little over 200K docs and 700MB total index size, with a 1GB 
heap.  It works with a 512MB heap; I just wanted it to have a little 
more room for Java to work in.  It is a single Solr node in cloud mode 
with embedded ZK:


https://www.dropbox.com/s/i7kceedb6qy5es1/jconsole_showing_solr_threads.png?dl=0

The spike in the graph is when I kicked off a complete reindex.

I've seen busy Solr installs with over 1000 threads.  And that's 
standalone mode; SolrCloud will have threads that standalone doesn't.  A 
single Solr node has no trouble using all available CPU resources.


Thanks,
Shawn



Re: FieldCache and _version_field

2022-12-06 Thread Dominique Bejean
Hi Mikhail,

Thank you for the response.
More details. Solr is version 7.7.0 and collection replicas are TLOG.

I will check, but I don't think atomic updates are required.

Regards

Dominique

On Tue, 6 Dec 2022 at 15:43, Mikhail Khludnev wrote:


Re: Weird behavior of maxClauseCount restriction since upgrading to Solr 9.1

2022-12-06 Thread Chris Hostetter


: I'm happy to provide some details as I still do not really understand the
: difference to the situation before.

The main difference is coming from the changes introduced in LUCENE-8811 
(Lucene 9.0) which sought to ensure that the "global" maxClauseCount would 
be honored no matter what kind of nested structure the query might 
involve.

Your situation is an interesting case that I had never considered; more 
details below...

: * I upgraded from 8.11.1 to 9.1. I observed the behavior for a completely
: rebuilt index (Solr version 9.1 / Lucene version 9.3)

Thank you for clarifying.  This confirms that changes introduced 
in LUCENE-8811 (and related Solr issues) are relevant to the change in 
behavior you are seeing (if you had said you upgraded from Solr 9 we'd be 
having a different conversation)

: * maxBooleanClauses is only configured in solrconfig.xml (1024) but not in
: solr.xml.

FYI: If you don't configure it in solr.xml, then the (Lucene) default
IndexSearcher.getMaxClauseCount() is left as is (and that is also 1024)

: * Sorry for the confusion about the field definition. As you already
: assumed correctly: 'categoryId' is also a 'p_long_dv'

Meaning that it has both points and docValues configured, which it turns 
out is significant to why it behaves differently from a string field.


: * Stacktrace for String field ("id"). For better readability I replaced the
: original query by "1 2 ... 1025":

Snipping down to the key lines of code from the root cause...

: Caused by: org.apache.lucene.search.IndexSearcher$TooManyClauses:
: maxClauseCount is set to 1024
: at
: org.apache.lucene.search.BooleanQuery$Builder.add(BooleanQuery.java:116)
: at
: org.apache.lucene.search.BooleanQuery$Builder.add(BooleanQuery.java:130)
: at
: 
org.apache.solr.parser.SolrQueryParserBase.rawToNormal(SolrQueryParserBase.java:1065)

...so in this case, as the query parser is building up a boolean query (of 
many strings), it hits the limit because the (top-level) boolean 
query is asked to add one more item than 
IndexSearcher.getMaxClauseCount() == 1024


: * Stacktrace for Point field ("categoryId") with 1 2 ... 513:

Again, snipping down to just the key lines of code.  (Note also the 
difference in the exception message: "too many nested clauses") ..

: org.apache.lucene.search.IndexSearcher$TooManyNestedClauses: Query contains
: too many nested clauses; maxClauseCount is set to 1024
: at
: org.apache.lucene.search.IndexSearcher$3.visitLeaf(IndexSearcher.java:801)
: at
: 
org.apache.lucene.document.SortedNumericDocValuesRangeQuery.visit(SortedNumericDocValuesRangeQuery.java:73)
: at
: 
org.apache.lucene.search.IndexOrDocValuesQuery.visit(IndexOrDocValuesQuery.java:121)
: at
: org.apache.lucene.search.BooleanQuery.visit(BooleanQuery.java:575)
: at
: org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:769)

...here the exception is happening during the actual search -- meaning the 
query parser had no problem building up the BooleanQuery of 512 clauses

But what matters is that each of those 512 clauses is no longer a simple 
exact term query (or a simple exact point query, or a simple exact 
docvalue query) ... because this fieldType is configured to support both 
points and docvalues, those 512 clauses are IndexOrDocValuesQuery queries 
-- which each contain 2 sub-clauses

(the purpose of this class is to provide the most efficient impl based on 
where/how this clause is used, which can depend on term stats, other 
clauses in the parent query, etc...)

So to summarize:

1) the reason you're seeing this behavior in 9x but didn't in 8x is 
because 9x added more checks of the safety valve

2) the reason you're seeing the 1024 limit hit for some (but not all) 
fields, even with fewer than 1024 "original user query clauses", is 
because for some (but not all) field types, 1 original query clause can 
become N internal clauses.
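Point (2) can be checked with simple arithmetic; the per-clause leaf count of 2 comes from the IndexOrDocValuesQuery structure described above:

```java
// Why 513 user clauses trip the 1024 limit when the field has both points
// and docValues: each clause is an IndexOrDocValuesQuery wrapping two leaf
// queries, so the rewrite-time visitor sees double the clause count.
public class ClauseCountDemo {
    public static void main(String[] args) {
        int maxClauseCount = 1024;   // Lucene's default IndexSearcher limit
        int userClauses = 513;       // clauses written by the user
        int leavesPerClause = 2;     // one points leaf + one docValues leaf
        int visitedLeaves = userClauses * leavesPerClause;
        System.out.println(visitedLeaves);                  // 1026
        System.out.println(visitedLeaves > maxClauseCount); // true
    }
}
```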


-Hoss
http://www.lucidworks.com/


Re: Weird behavior of maxClauseCount restriction since upgrading to Solr 9.1

2022-12-06 Thread michael dürr
Hi Hoss,

This is a really helpful explanation!
Even though I have already shifted to using the {!terms} query for such
large boolean-clause queries, it feels a lot better to know how and why
things behave differently compared to the 8.x Solr version.
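For reference, the terms query parser form Michael mentions builds a single set-membership query instead of N boolean clauses, so it avoids the maxBooleanClauses limit (field name taken from the thread; values are placeholders):

```
q={!terms f=categoryId}1,2,3,4,5
```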

Thanks!
Michael

On Tue, Dec 6, 2022 at 7:32 PM Chris Hostetter wrote:

>
> : I'm happy to provide some details as I still do not really understand the
> : difference to the situation before.
>
> The main difference is coming from the changes introduced in LUCENE-8811
> (Lucene 9.0) which sought to ensure that the "global" maxClauseCount would
> be honored no matter what kind of nested structure the query might
> involve.
>
> You're situation is an interesting case that i had never considered, more
> detais below...
>
> : * I upgraded from 8.11.1 to 9.1. I observed the behavior for a completely
> : rebuild index (solr version 9.1 / lucene version 9.3)
>
> thank you for clarifing.  This confirms that changes introduced
> in LUCENE-8811 (and related solr issues) are relavant to the change in
> behavior you are seeing (if you had said you upgraded from Solr 9 we'd be
> having a different conversation)
>
> : * maxBooleanClauses is only configured in solrconfig.xml (1024) but not
> in
> : solr.xml.
>
> FYI: If you don't configure in solr.xml, then the (Lucene) default
> IndexSearcher.getMaxClauseCount() is left as is (and that is also 1024)
>
> : * Sorry for the confusion about the field definition. As you already
> : assumed correctly: 'categoryId' is also a 'p_long_dv'
>
> Meaning that it has both points nad docvalues configured, which it turns
> out is significant to why it behaves differently from a string field.
>
>
> : * Stacktrace for String field ("id"). For better readability I replaced
> the
> : original query by "1 2 ... 1025":
>
> Snipping down to the key lines of code from the root cause...
>
> : Caused by: org.apache.lucene.search.IndexSearcher$TooManyClauses:
> : maxClauseCount is set to 1024
> : at
> : org.apache.lucene.search.BooleanQuery$Builder.add(BooleanQuery.java:116)
> : at
> : org.apache.lucene.search.BooleanQuery$Builder.add(BooleanQuery.java:130)
> : at
> :
> org.apache.solr.parser.SolrQueryParserBase.rawToNormal(SolrQueryParserBase.java:1065)
>
> ...so in this case, as the query parser is building up a boolean query (of
> many strings), it is hitting the limit because the (top level) boolean
> query is being asked to add one more item then
> IndexSearcher.getMaxClauseCount() == 1024
>
>
> : * Stacktrace for Point field ("categoryId") with 1 2 ... 513:
>
> Again, snipping down to just the key lines of code.  (Note also the
> difference in the exception message: "too many nested clauses") ..
>
> : org.apache.lucene.search.IndexSearcher$TooManyNestedClauses: Query
> contains
> : too many nested clauses; maxClauseCount is set to 1024
> : at
> :
> org.apache.lucene.search.IndexSearcher$3.visitLeaf(IndexSearcher.java:801)
> : at
> :
> org.apache.lucene.document.SortedNumericDocValuesRangeQuery.visit(SortedNumericDocValuesRangeQuery.java:73)
> : at
> :
> org.apache.lucene.search.IndexOrDocValuesQuery.visit(IndexOrDocValuesQuery.java:121)
> : at
> : org.apache.lucene.search.BooleanQuery.visit(BooleanQuery.java:575)
> : at
> : org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:769)
>
> ...here the exception is happening during the actual search -- meaning the
> query parser had no problem building up the BooleanQuery of 512 clauses
>
> But what matters is that each of those 512 clauses is no longer a simple
> exact term query (or a simple exact point query, or a simple exact
> docvalue query) ... because this fieldType is configured to support both
> points and docvalues, those 512 clauses are IndexOrDocValuesQuery queries
> -- which each contain 2 sub-clauses
>
> (the purpose of this class is to provide teh most efficient impl based on
> where/how this clause is used, which can depend on term stats, other
> clauses in the parent query, etc...)
>
> So to sumarize:
>
> 1) the reason you're seeing this behavior in 9x but didnt' in 8x is
> because 9x added more checks of the safety valve
>
> 2) the reason you're seeing the 1024 limit hit for some (but not all)
> fields, even with with less then 1024 "original user query clauses" is
> because for some (but not all) field types, 1 original query clause can
> become N internal clauses.
>
>
> -Hoss
> http://www.lucidworks.com/
>


EventListeners lib path

2022-12-06 Thread Eashwar Natarajan
Hi all,

We want to register an EventListener (the EventListener interface from the
Solr 7.3.0 API) with DIH in data-config.xml to listen for events.

We would like to know the directory path in which we should place
the event listener jar.
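For context, a DIH event listener is registered on the <document> element of data-config.xml; a sketch (the listener class name and entity query are hypothetical placeholders, not from the original message):

```xml
<!-- data-config.xml sketch: onImportStart/onImportEnd name an EventListener
     implementation; com.example.MyImportListener is a placeholder -->
<dataConfig>
  <document onImportStart="com.example.MyImportListener"
            onImportEnd="com.example.MyImportListener">
    <entity name="item" query="select id, name from item">
      <field column="id" name="id"/>
    </entity>
  </document>
</dataConfig>
```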


Regards,

Eashwar