[jira] [Commented] (SOLR-16305) MODIFYCOLLECTION with 'property.*' changes can't change values used in config file variables (even though they can be set during collection CREATE)
[ https://issues.apache.org/jira/browse/SOLR-16305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17615955#comment-17615955 ] Andrzej Bialecki commented on SOLR-16305: - AFAIK the propagation of `property.*` values to cores is accidental; the original purpose (again, AFAIK) was to be able to set aux properties in the `state.json`, to keep additional per-collection state that could be used by other components. The advantage of this is that they would automatically appear in DocCollection at the API level (unlike the COLLECTIONPROP API, which is incomplete because only the "write" part is supported but not "read", so the values can't be retrieved without going directly to ZK. AFAIK the COLLECTIONPROP was added because routed aliases needed some place to keep additional state, potentially too large / inconvenient to stick into state.json.) However, even using these `property.*` values is half-broken, as I recently discovered - it's supported in MODIFYCOLLECTION but not in CREATE, due to `ClusterStateMutator.createCollection()` copying only the predefined properties and ignoring anything else. This should be fixed in some way - I'm inclined to say in both ways ;) that is, the COLLECTIONPROP API should be completed so that it includes the reading part, and CREATE should be fixed to accept `property.*`. And I don't see the purpose of propagating these collection-level props to individual cores, so this part could be removed until it's needed. > MODIFYCOLLECTION with 'property.*' changes can't change values used in config > file variables (even though they can be set during collection CREATE) > --- > > Key: SOLR-16305 > URL: https://issues.apache.org/jira/browse/SOLR-16305 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-16305_test.patch > > > Consider a configset with a {{solrconfig.xml}} that includes a snippet like > this... 
> {code:java} > ${custom.prop:customDefVal} > {code} > ...this {{custom.prop}} can be set when doing a {{CREATE}} command for a > collection that uses this configset, using the {{property.*}} prefix as noted > in the ref-guide... > {quote}{{property.{_}name{_}={_}value{_}}} > |Optional|Default: none| > Set core property _name_ to {_}value{_}. See the section [Core > Discovery|https://solr.apache.org/guide/solr/latest/configuration-guide/core-discovery.html] > for details on supported properties and values. > {quote} > ...BUT > These values can *not* be changed by using the {{MODIFYCOLLECTION}} command, > in spite of the ref-guide indicating that it can be used to modify custom > {{property.*}} attributes... > {quote}The attributes that can be modified are: > * {{replicationFactor}} > * {{collection.configName}} > * {{readOnly}} > * other custom properties that use a {{property.}} prefix > See the [CREATE > action|https://solr.apache.org/guide/solr/latest/deployment-guide/collection-management.html#create] > section above for details on these attributes. > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
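The `${custom.prop:customDefVal}` snippet above relies on Solr's `${name:default}` variable substitution in config files. The following is a minimal, self-contained sketch of that substitution rule for illustration only - it is not Solr's actual implementation, and the class and method names are hypothetical:

```java
import java.util.Map;

// Simplified sketch of ${name:default} substitution as seen in solrconfig.xml.
// If a core property with the given name is set (e.g. via property.* at CREATE
// time), its value is used; otherwise the default after ':' applies.
public class PropertySubstitution {
    public static String resolve(String template, Map<String, String> coreProps) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            int start = template.indexOf("${", i);
            if (start < 0) { out.append(template.substring(i)); break; }
            int end = template.indexOf('}', start);
            if (end < 0) { out.append(template.substring(i)); break; }
            out.append(template, i, start);
            String body = template.substring(start + 2, end);
            int colon = body.indexOf(':');
            String name = colon < 0 ? body : body.substring(0, colon);
            String def = colon < 0 ? "" : body.substring(colon + 1);
            out.append(coreProps.getOrDefault(name, def));
            i = end + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // No core property set: the default wins.
        System.out.println(resolve("${custom.prop:customDefVal}", Map.of()));
        // Property set at CREATE time (property.custom.prop=xyz) overrides it.
        System.out.println(resolve("${custom.prop:customDefVal}", Map.of("custom.prop", "xyz")));
    }
}
```

This makes the bug's impact concrete: whichever value is bound to `custom.prop` at core creation stays baked in, because MODIFYCOLLECTION does not propagate changed `property.*` values down to the cores.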
[jira] [Commented] (SOLR-16305) MODIFYCOLLECTION with 'property.*' changes can't change values used in config file variables (even though they can be set during collection CREATE)
[ https://issues.apache.org/jira/browse/SOLR-16305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617242#comment-17617242 ] Andrzej Bialecki commented on SOLR-16305: - {quote}I think you mean the exact opposite of what you just said? {quote} No, I meant that they cannot be set as DocCollection properties, they are silently skipped there (while they are indeed propagated to cores). If you want to set a DocCollection property you have to use MODIFYCOLLECTION, and while this works for setting `property.*` in DocCollection it indeed does not propagate these custom props to cores. Whichever way you look at it, it's a mess.
[jira] [Commented] (SOLR-16305) MODIFYCOLLECTION with 'property.*' changes can't change values used in config file variables (even though they can be set during collection CREATE)
[ https://issues.apache.org/jira/browse/SOLR-16305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618812#comment-17618812 ] Andrzej Bialecki commented on SOLR-16305: - {quote}Should this Jira have a linked "converse" issue: "CREATE collection with property.* doesn't set values in DocCollection (even though MODIFYCOLLECTION can change them)" ? {quote} I think so. {quote}WTF the {{COLLECTIONPROP}} command's purpose / expected usage is? {quote} AFAIK they are currently used only for maintaining routed aliases. We could extend it to cover a use case of "I want to maintain arbitrary props per collection" but then we would have to add the reading API and document it. And probably do some other work too, because this API is isolated from the main DocCollection model. (For me, one reason for ab-using DocCollection to keep properties was that there's currently no connection between props that you can set with COLLECTIONPROP and the replica placement API model, which purposely uses an API disconnected from Solr internals. So if I want to mark some collection as having this or other replica placement properties, SolrCollection.getCustomProperty ONLY returns props set in DocCollection and not those set with COLLECTIONPROP. Of course, I can always keep these special props in a config file specific to the placement plugin ... but this complicates the lifecycle of these properties as you create / delete collections, so keeping them in DocCollection is convenient.)
[jira] [Commented] (SOLR-15616) Allow thread metrics to be cached
[ https://issues.apache.org/jira/browse/SOLR-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677899#comment-17677899 ] Andrzej Bialecki commented on SOLR-15616: - LGTM, thanks for seeing this through, Ishan! One minor suggestion: since the interval is expressed in seconds (whereas often other intervals are expressed in millis) maybe we should use `threadIntervalSec` or something like that? I leave it up to you - the docs say it's in seconds but if it's in the name then it's self-explanatory. > Allow thread metrics to be cached > - > > Key: SOLR-15616 > URL: https://issues.apache.org/jira/browse/SOLR-15616 > Project: Solr > Issue Type: Improvement > Components: metrics >Reporter: Ishan Chattopadhyaya >Assignee: Ishan Chattopadhyaya >Priority: Major > Attachments: SOLR-15616-2.patch, SOLR-15616-9x.patch, > SOLR-15616.patch, SOLR-15616.patch > > > Computing JVM metrics for threads can be expensive, and we should provide an > option to users to avoid doing so on every call to the metrics API > (group=jvm). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
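The caching scheme discussed in SOLR-15616 (recompute expensive thread metrics at most once per interval, expressed in seconds - hence the suggested `threadIntervalSec` name) can be sketched as a small time-bounded cache. This is a hypothetical helper, not Solr's actual code:

```java
import java.util.function.Supplier;

// Sketch of interval-based caching for an expensive metrics computation.
// The delegate (e.g. a full JVM thread dump) is invoked at most once per
// configured interval; callers within the interval get the cached value.
public class CachedMetricSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long intervalNanos;
    private T cached;
    private long lastComputedNanos;

    public CachedMetricSupplier(Supplier<T> delegate, long intervalSec) {
        this.delegate = delegate;
        this.intervalNanos = intervalSec * 1_000_000_000L;
    }

    @Override
    public synchronized T get() {
        long now = System.nanoTime();
        if (cached == null || now - lastComputedNanos >= intervalNanos) {
            cached = delegate.get();   // the expensive call
            lastComputedNanos = now;
        }
        return cached;
    }
}
```

With a 60-second interval, repeated metrics API calls within a minute reuse the cached snapshot instead of re-walking all JVM threads.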
[jira] [Created] (SOLR-16649) Http2SolrClient.processErrorsAndResponse uses wrong instance of ResponseParser
Andrzej Bialecki created SOLR-16649: --- Summary: Http2SolrClient.processErrorsAndResponse uses wrong instance of ResponseParser Key: SOLR-16649 URL: https://issues.apache.org/jira/browse/SOLR-16649 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: clients - java Affects Versions: 9.1.1, main (10.0) Reporter: Andrzej Bialecki `Http2SolrClient:800` calls the `wantStream(...)` method but passes the wrong argument to it - instead of passing the local `processor` arg it uses the instance field `parser`. Throughout this class there's a repeated pattern that easily leads to this confusion - in many methods a local var `parser` is created that overshadows the instance field, and then this local `parser` is passed around as an argument to various operations. However, in this method the argument passed from the caller is named differently (`processor`) and thus does not overshadow the instance field, which leads to this mistake. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Updated] (SOLR-16649) Http2SolrClient.processErrorsAndResponse uses wrong instance of ResponseParser
[ https://issues.apache.org/jira/browse/SOLR-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-16649: Description: {{Http2SolrClient:800}} calls the {{wantStream(...)}} method but passes the wrong argument to it - instead of passing the local {{processor}} arg it uses the instance field {{parser}}. Throughout this class there's a repeated pattern that easily leads to this confusion - in many methods a local var {{parser}} is created that overshadows the instance field, and then this local {{parser}} is passed around as an argument to various operations. However, in this particular method the argument passed from the caller is named differently ({{processor}}) and thus does not overshadow the instance field, which leads to this mistake. was: `Http2SolrClient:800` calls `wantStream(...)` method but passes the wrong argument to it - instead of passing the local `processor` arg it uses the instance field `parser`. Throughout this class there's a repeated pattern that easily leads to this confusion - in many methods a local var `parser` is created that overshadows the instance field, and then this local `parser` is passed around as argument to various operations. However, in this method the argument passed from the caller is named differently (`processor`) and thus does not overshadow the instance field, which leads to this mistake.
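The shadowing pattern described above can be distilled into a few lines. Class, field, and method names below are illustrative, not Solr's actual code; the point is only how an unqualified name silently binds to the instance field when no local shadows it:

```java
// Distilled sketch of the shadowing pattern behind SOLR-16649.
public class ShadowingDemo {
    private final String parser = "instanceField";

    // The common pattern in this style of class: a local variable named
    // 'parser' shadows the instance field, so an unqualified 'parser'
    // refers to the per-request local value - which is what was intended.
    String usesLocal(String responseParser) {
        String parser = responseParser; // shadows the field
        return wantStream(parser);      // correct: per-request value
    }

    // The buggy variant: the argument arrives under a different name
    // ('processor'), nothing shadows the field, and an unqualified
    // 'parser' silently picks up the instance field instead.
    String usesField(String processor) {
        return wantStream(parser);      // bug: should be wantStream(processor)
    }

    String wantStream(String p) {
        return p;
    }
}
```

The compiler accepts both methods without complaint, which is exactly why the mistake is easy to miss in review.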
[jira] [Updated] (SOLR-16649) Http2SolrClient.processErrorsAndResponse uses wrong instance of ResponseParser
[ https://issues.apache.org/jira/browse/SOLR-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-16649: Attachment: SOLR-16649.patch
[jira] [Commented] (SOLR-16649) Http2SolrClient.processErrorsAndResponse uses wrong instance of ResponseParser
[ https://issues.apache.org/jira/browse/SOLR-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685416#comment-17685416 ] Andrzej Bialecki commented on SOLR-16649: - Simple patch with a test case - it fails with stock code, succeeds with the fix.
[jira] [Updated] (SOLR-16649) Http2SolrClient.processErrorsAndResponse uses wrong instance of ResponseParser
[ https://issues.apache.org/jira/browse/SOLR-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-16649: Attachment: SOLR-16649-1.patch
[jira] [Commented] (SOLR-16649) Http2SolrClient.processErrorsAndResponse uses wrong instance of ResponseParser
[ https://issues.apache.org/jira/browse/SOLR-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685747#comment-17685747 ] Andrzej Bialecki commented on SOLR-16649: - Oops, right - I attached the new patch.
[jira] [Commented] (SOLR-16507) Remove NodeStateProvider & Snitch
[ https://issues.apache.org/jira/browse/SOLR-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703331#comment-17703331 ] Andrzej Bialecki commented on SOLR-16507: - Thanks [~dsmiley] for bringing this to my attention. The reason we used the existing NodeStateProvider abstraction in the replica placement code was that retrieving per-node metrics is messy and quirky, all of which is hidden in SolrClientNodeStateProvider. The internal structure (snitches and co) can and should be refactored and simplified because these concepts are not used anywhere else anymore; they are legacy abstractions from the time when they were used for the collection rules DSL. However, IMHO something like NodeStateProvider still has its place. No matter what you replace it with, the complexity of retrieving per-node attributes will still be present somewhere - and hiding it in a NodeStateProvider (or similar concept) as a high-level API at least gives us a possibility of reuse. If we were to put all this nasty code into AttributeFetcherImpl then we would pretty much limit its usefulness only to the placement code. SolrCloudManager is perhaps no longer useful and could be factored out, but IMHO something equivalent to NodeStateProvider is still needed. Re. "snitchSession" - this is now used only in `ImplicitSnitch` for caching the node roles, in order to avoid loading this data from ZK for every node. > Remove NodeStateProvider & Snitch > - > > Key: SOLR-16507 > URL: https://issues.apache.org/jira/browse/SOLR-16507 > Project: Solr > Issue Type: Task >Reporter: David Smiley >Priority: Major > Labels: newdev > > The NodeStateProvider is a relic relating to the old autoscaling framework > that was removed in Solr 9. The only remaining usage of it is for > SplitShardCmd to check the disk space. For this, it could use the metrics > api. 
> I think we'll observe that Snitch and other classes in > org.apache.solr.common.cloud.rule can be removed as well, as it's related to > NodeStateProvider. > Only > org.apache.solr.cluster.placement.impl.AttributeFetcherImpl#getMetricSnitchTag > and org.apache.solr.cluster.placement.impl.NodeMetricImpl refer to some > constants in the code to be removed. Those constants could move out, > consolidated somewhere we think is appropriate. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-16507) Remove NodeStateProvider & Snitch
[ https://issues.apache.org/jira/browse/SOLR-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703746#comment-17703746 ] Andrzej Bialecki commented on SOLR-16507: - bq. Do you think there's a point to SplitShardCmd using NodeStateProvider vs just going to the metrics API? You're making my point for me ;) you could do that, but SplitShardCmd would become very complex with a lot of non-reusable code - because you would still have to bake in making HTTP requests to other nodes and parsing / extracting metrics values. So it's better to hide this complexity in a high-level utility API. IMHO NodeStateProvider is a good abstraction, just its implementation needs to be cleaned up.
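The argument above - keep a small high-level API that hides per-node metric fetching so callers like SplitShardCmd stay simple - can be sketched as a minimal interface. The interface name, method signature, and the "freedisk" key below are illustrative assumptions, not Solr's actual NodeStateProvider API:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Sketch of a reusable per-node attribute facade. A real implementation
// would issue HTTP requests to each node's metrics API and parse the
// responses; callers see none of that plumbing.
interface NodeStateView {
    /** Return the requested metric values for one node, e.g. free disk space. */
    Map<String, Object> getNodeValues(String nodeName, Collection<String> metricKeys);
}

public class NodeStateDemo {
    public static void main(String[] args) {
        // Stub implementation standing in for the HTTP-backed one.
        NodeStateView view = (node, keys) -> Map.of("freedisk", 1024L);
        // A shard-split check can now ask one simple question:
        Object freeDisk = view.getNodeValues("node1:8983_solr", List.of("freedisk")).get("freedisk");
        System.out.println(freeDisk);
    }
}
```

The design point is reuse: the nasty fetching code lives behind the interface once, instead of being baked into every command that needs a node attribute.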
[jira] [Created] (SOLR-17138) Support other QueryTimeout criteria
Andrzej Bialecki created SOLR-17138: --- Summary: Support other QueryTimeout criteria Key: SOLR-17138 URL: https://issues.apache.org/jira/browse/SOLR-17138 Project: Solr Issue Type: New Feature Security Level: Public (Default Security Level. Issues are Public) Components: Query Budget Reporter: Andrzej Bialecki Complex Solr queries can consume significant memory and CPU while being processed. When OOM or CPU saturation is reached Solr becomes unresponsive, which further compounds the problem. Often such “killer queries” are not written to logs, which makes them difficult to diagnose. This happens even with best practices in place. It should be possible to set limits in Solr that cannot be exceeded by individual queries. This mechanism would monitor an accumulating “cost” of a query while it’s being executed and compare it to the configured maximum cost (budget), expressed in terms of CPU and/or memory usage that can be attributed to this query. Should these limits be exceeded the individual query execution should be terminated, without affecting other concurrently executing queries. The CircuitBreakers functionality doesn't distinguish the source of the load and can't protect other query executions from a particular runaway query. We need a more fine-grained mechanism. The existing `QueryTimeout` API enables such termination of individual queries. However, the existing implementation (`SolrQueryTimeoutImpl` used with `timeAllowed` query param) only uses elapsed wall-clock time as the termination criterion. This is insufficient - in case of resource contention the wall-clock time doesn’t represent correctly the actual CPU cost of executing a particular query. A query may produce results after a long time not because of its complexity or bad behavior but because of the general resource contention caused by other concurrently executing queries. 
OTOH a single runaway query may consume all resources and cause all other valid queries to fail if they exceed the wall-clock `timeAllowed`. I propose adding two additional criteria for limiting the maximum "query budget": * per-thread CPU time: using `getThreadCpuTime` to periodically check (`QueryTimeout.shouldExit()`) the current CPU consumption since the start of the query execution. * per-thread memory allocation: using `getThreadAllocatedBytes`. I ran some JMH microbenchmarks to ensure that these two methods are available on modern OS/JVM combinations and their cost is negligible (less than 0.5 us/call). This means that the initial implementation may call these methods directly for every `shouldExit()` call without undue burden. If we decide that this still adds too much overhead we can change this to periodic updates in a background thread. These two "query budget" constraints can be implemented as subclasses of `QueryTimeout`. Initially we can use a similar configuration mechanism as with `timeAllowed`, i.e. pass the max value as a query param, or add it to the search handler's invariants. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
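The per-thread CPU budget proposed above (and its `shouldExit()` check, which SOLR-17141 would implement) can be sketched with the standard `ThreadMXBean` API. The class name and structure are illustrative - Solr's `QueryTimeout` interface is not reproduced here - and the sketch assumes the JVM supports thread CPU time measurement (`getCurrentThreadCpuTime` returns -1 otherwise):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch of a per-thread CPU "query budget": shouldExit() compares the
// CPU time accumulated by the current thread since construction against
// a fixed budget, analogous to the proposed CPU-based QueryTimeout.
public class CpuBudget {
    private static final ThreadMXBean THREAD_MX = ManagementFactory.getThreadMXBean();

    private final long startCpuNanos;
    private final long budgetNanos;

    public CpuBudget(long budgetMillis) {
        // Snapshot the thread's CPU clock at the start of query execution.
        this.startCpuNanos = THREAD_MX.getCurrentThreadCpuTime();
        this.budgetNanos = budgetMillis * 1_000_000L;
    }

    /** Analogous to QueryTimeout.shouldExit(): true once the CPU budget is spent. */
    public boolean shouldExit() {
        long usedNanos = THREAD_MX.getCurrentThreadCpuTime() - startCpuNanos;
        return usedNanos > budgetNanos;
    }
}
```

Unlike a wall-clock check, a query thread that merely waits under contention accrues almost no CPU time here, so only genuinely expensive queries trip the limit.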
[jira] [Updated] (SOLR-17138) Support other QueryTimeout criteria
[ https://issues.apache.org/jira/browse/SOLR-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17138: Description: Complex Solr queries can consume significant memory and CPU while being processed. When OOM or CPU saturation is reached Solr becomes unresponsive, which further compounds the problem. Often such “killer queries” are not written to logs, which makes them difficult to diagnose. This happens even with best practices in place. It should be possible to set limits in Solr that cannot be exceeded by individual queries. This mechanism would monitor an accumulating “cost” of a query while it’s being executed and compare it to the configured maximum cost (budget), expressed in terms of CPU and/or memory usage that can be attributed to this query. Should these limits be exceeded the individual query execution should be terminated, without affecting other concurrently executing queries. The CircuitBreakers functionality doesn't distinguish the source of the load and can't protect other query executions from a particular runaway query. We need a more fine-grained mechanism. The existing `QueryTimeout` API enables such termination of individual queries. However, the existing implementation (`SolrQueryTimeoutImpl` used with `timeAllowed` query param) only uses elapsed wall-clock time as the termination criterion. This is insufficient - in case of resource contention the wall-clock time doesn’t represent correctly the actual CPU cost of executing a particular query. A query may produce results after a long time not because of its complexity or bad behavior but because of the general resource contention caused by other concurrently executing queries. OTOH a single runaway query may consume all resources and cause all other valid queries to fail if they exceed the wall-clock `timeAllowed`. 
I propose adding two additional criteria for limiting the maximum "query budget": * per-thread CPU time: using `getThreadCpuTime` to periodically check (`QueryTimeout.shouldExit()`) the current CPU consumption since the start of the query execution. * per-thread memory allocation: using `getThreadAllocatedBytes`. I ran some JMH microbenchmarks to ensure that these two methods are available on modern OS/JVM combinations and their cost is negligible (less than 0.5 us/call). This means that the initial implementation may call these methods directly for every `shouldExit()` call without undue burden. If we decide that this still adds too much overhead we can change this to periodic updates in a background thread. These two "query budget" constraints can be implemented as subclasses of `QueryTimeout`. Initially we can use a similar configuration mechanism as with `timeAllowed`, i.e. pass the max value as a query param, or add it to the search handler's invariants.
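The per-thread CPU criterion proposed above can be illustrated with a minimal sketch using the standard {{ThreadMXBean}}. The class name, constructor, and the wall-clock safety cap in {{main}} are invented for illustration; this is not Solr's actual implementation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch of a per-thread CPU "query budget" check in the
// spirit of QueryTimeout.shouldExit(); not Solr's actual API.
public class CpuBudgetSketch {
    private static final ThreadMXBean TMX = ManagementFactory.getThreadMXBean();

    private final long limitNanos; // CPU budget for this query
    private final long startNanos; // CPU time consumed by this thread at query start

    public CpuBudgetSketch(long limitMillis) {
        if (!TMX.isCurrentThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException("thread CPU time not supported");
        }
        this.limitNanos = limitMillis * 1_000_000L;
        this.startNanos = TMX.getCurrentThreadCpuTime();
    }

    /** Analogous to QueryTimeout.shouldExit(): true once the CPU budget is spent. */
    public boolean shouldExit() {
        return TMX.getCurrentThreadCpuTime() - startNanos > limitNanos;
    }

    public static void main(String[] args) {
        CpuBudgetSketch budget = new CpuBudgetSketch(50); // 50 ms of CPU
        long deadline = System.nanoTime() + 2_000_000_000L; // wall-clock safety cap
        long sink = 0;
        while (!budget.shouldExit() && System.nanoTime() < deadline) {
            sink = sink * 31 + 1; // burn CPU until the budget trips
        }
        System.out.println("budget exhausted (sink=" + sink + ")");
    }
}
```

Since the cost of the check is well under a microsecond per call, calling it on every {{shouldExit()}} invocation, as proposed, seems affordable.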
[jira] [Created] (SOLR-17140) Refactor SolrQueryTimeoutImpl to support other implementations
Andrzej Bialecki created SOLR-17140: --- Summary: Refactor SolrQueryTimeoutImpl to support other implementations Key: SOLR-17140 URL: https://issues.apache.org/jira/browse/SOLR-17140 Project: Solr Issue Type: Sub-task Security Level: Public (Default Security Level. Issues are Public) Reporter: Andrzej Bialecki -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Created] (SOLR-17141) Create CpuQueryTimeout implementation
Andrzej Bialecki created SOLR-17141: --- Summary: Create CpuQueryTimeout implementation Key: SOLR-17141 URL: https://issues.apache.org/jira/browse/SOLR-17141 Project: Solr Issue Type: Sub-task Security Level: Public (Default Security Level. Issues are Public) Reporter: Andrzej Bialecki This class will use `getThreadCpuTime` to determine when to signal `shouldExit`.
[jira] [Assigned] (SOLR-17141) Create CpuQueryTimeout implementation
[ https://issues.apache.org/jira/browse/SOLR-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned SOLR-17141: --- Assignee: Andrzej Bialecki > Create CpuQueryTimeout implementation > - > > Key: SOLR-17141 > URL: https://issues.apache.org/jira/browse/SOLR-17141 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > > This class will use `getThreadCpuTime` to determine when to signal > `shouldExit`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Created] (SOLR-17150) Create MemQueryLimit implementation
Andrzej Bialecki created SOLR-17150: --- Summary: Create MemQueryLimit implementation Key: SOLR-17150 URL: https://issues.apache.org/jira/browse/SOLR-17150 Project: Solr Issue Type: Sub-task Security Level: Public (Default Security Level. Issues are Public) Components: Query Budget Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki An implementation of {{QueryTimeout}} that terminates misbehaving queries that allocate too much memory for their execution. This is a bit more complicated than {{CpuQueryLimits}} because the first time a query is submitted it may legitimately allocate many sizeable objects (caches, field values, etc.). So we want to catch and terminate queries that either exceed any reasonable threshold (e.g. 2 GB), or significantly exceed a time-weighted percentile of the recent queries.
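The absolute-threshold half of this idea can be sketched with {{getThreadAllocatedBytes}} from the {{com.sun.management.ThreadMXBean}} extension (available on HotSpot-based JVMs). The class name and constructor below are invented for illustration:

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an absolute per-thread allocation budget;
// not Solr's actual MemQueryLimit implementation.
public class MemBudgetSketch {
    private static final com.sun.management.ThreadMXBean TMX =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    private final long limitBytes;
    private final long startBytes;

    public MemBudgetSketch(long limitBytes) {
        this.limitBytes = limitBytes;
        this.startBytes = TMX.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    /** True once this thread has allocated more than the budget since query start. */
    public boolean shouldExit() {
        long allocated =
                TMX.getThreadAllocatedBytes(Thread.currentThread().getId()) - startBytes;
        return allocated > limitBytes;
    }

    public static void main(String[] args) {
        MemBudgetSketch budget = new MemBudgetSketch(10L * 1024 * 1024); // 10 MB
        List<byte[]> hog = new ArrayList<>();
        // Allocate 64 KB chunks until the budget trips (bounded for safety).
        while (!budget.shouldExit() && hog.size() < 10_000) {
            hog.add(new byte[64 * 1024]);
        }
        System.out.println("tripped after ~" + (hog.size() * 64) + " KB allocated");
    }
}
```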
[jira] [Updated] (SOLR-17141) Create CpuQueryLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17141: Summary: Create CpuQueryLimit implementation (was: Create CpuQueryTimeout implementation) > Create CpuQueryLimit implementation > --- > > Key: SOLR-17141 > URL: https://issues.apache.org/jira/browse/SOLR-17141 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > > This class will use `getThreadCpuTime` to determine when to signal > `shouldExit`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Updated] (SOLR-17138) Support other QueryTimeout criteria
[ https://issues.apache.org/jira/browse/SOLR-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17138: Description: Complex Solr queries can consume significant memory and CPU while being processed. When OOM or CPU saturation is reached Solr becomes unresponsive, which further compounds the problem. Often such “killer queries” are not written to logs, which makes them difficult to diagnose. This happens even with best practices in place. It should be possible to set limits in Solr that cannot be exceeded by individual queries. This mechanism would monitor an accumulating “cost” of a query while it’s being executed and compare it to the configured maximum cost (budget), expressed in terms of CPU and/or memory usage that can be attributed to this query. Should these limits be exceeded the individual query execution should be terminated, without affecting other concurrently executing queries. The CircuitBreakers functionality doesn't distinguish the source of the load and can't protect other query executions from a particular runaway query. We need a more fine-grained mechanism. The existing {{QueryTimeout}} API enables such termination of individual queries. However, the existing implementation ({{SolrQueryTimeoutImpl}} used with {{timeAllowed}} query param) only uses elapsed wall-clock time as the termination criterion. This is insufficient - in case of resource contention the wall-clock time doesn’t represent correctly the actual CPU cost of executing a particular query. A query may produce results after a long time not because of its complexity or bad behavior but because of the general resource contention caused by other concurrently executing queries. OTOH a single runaway query may consume all resources and cause all other valid queries to fail if they exceed the wall-clock {{timeAllowed}}. 
I propose adding two additional criteria for limiting the maximum "query budget": * per-thread CPU time: using {{getThreadCpuTime}} to periodically check ({{QueryTimeout.shouldExit()}}) the current CPU consumption since the start of the query execution. * per-thread memory allocation: using {{getThreadAllocatedBytes}}. I ran some JMH microbenchmarks to ensure that these two methods are available on modern OS/JVM combinations and their cost is negligible (less than 0.5 us/call). This means that the initial implementation may call these methods directly for every {{shouldExit()}} call without undue burden. If we decide that this still adds too much overhead we can change this to periodic updates in a background thread. These two "query budget" constraints can be implemented as subclasses of {{QueryTimeout}}. Initially we can use a similar configuration mechanism as with {{timeAllowed}}, i.e. pass the max value as a query param, or add it to the search handler's invariants.
[jira] [Created] (SOLR-17151) Review current usage of QueryLimits to ensure complete coverage
Andrzej Bialecki created SOLR-17151: --- Summary: Review current usage of QueryLimits to ensure complete coverage Key: SOLR-17151 URL: https://issues.apache.org/jira/browse/SOLR-17151 Project: Solr Issue Type: Sub-task Security Level: Public (Default Security Level. Issues are Public) Components: Query Budget Reporter: Andrzej Bialecki Resource usage by a query is not limited to the actual search within {{QueryComponent}}. Other components invoked by {{SearchHandler}} may significantly contribute to this usage, either before or after the {{QueryComponent}}. Those components that already use {{QueryTimeout}} either directly or indirectly will properly observe the limits and terminate if needed. However, other components may be expensive or misbehaving but fail to observe the limits imposed on the end-to-end query processing. One such obvious place where we could add this check is where the {{SearchHandler}} loops over {{SearchComponent}}-s - it should explicitly call {{QueryLimits.shouldExit()}} to ensure that even if a previously executed component ignored the limits, they will still be enforced at the {{SearchHandler}} level. There may be other places like this, too.
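The {{SearchHandler}}-level check described above might look roughly like this; {{Runnable}} and {{BooleanSupplier}} stand in for {{SearchComponent.process()}} and {{QueryLimits.shouldExit()}}, so none of these signatures are Solr's actual ones:

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative sketch of enforcing query limits between components at the
// handler level, independent of whether each component checks them itself.
public class HandlerLoopSketch {
    /** Runs components in order; returns the index after which the limits tripped, or -1. */
    public static int runComponents(List<Runnable> components, BooleanSupplier limits) {
        for (int i = 0; i < components.size(); i++) {
            components.get(i).run();
            // Check after every component: even if the component ignored the
            // limits internally, the handler stops further processing here.
            if (limits.getAsBoolean()) {
                return i;
            }
        }
        return -1;
    }
}
```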
[jira] [Commented] (SOLR-17150) Create MemQueryLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815673#comment-17815673 ] Andrzej Bialecki commented on SOLR-17150: - Here's the proposed approach to implement two thresholds: * an absolute max limit to terminate any query that exceeds this allocation * a relative dynamic limit to terminate queries that exceed "typical" allocation For the absolute limit: as with other implementations, {{memAllowed}} would set the absolute limit per query (float value in megabytes?). In order to accommodate initial queries this should be set to a relatively high value, which isn't optimal later for typical queries - this higher limit will eventually catch runaway queries but not before they consume significant memory. For the dynamic limit: a histogram would be added to the metrics to track the recent memory usage per query (using an exponentially decaying reservoir). The life-cycle of the histogram could be tied either to SolrCore or to SolrIndexSearcher (the latter seems more appropriate because of the warmup queries that would skew the longer-term stats in SolrCore's life-cycle). After collecting a sufficient number of data points (e.g. {{N = 100}}) the component could start enforcing a dynamic limit based on a formula that takes into account the "typical" recent queries. For example: {{dynamicThreshold = X * p99}}, where {{X = 2.0}} by default. Open issues: * does the dynamic threshold make sense? does the formula make sense? * I think that both the static and dynamic limits should be optional, i.e. some combination of query params should allow the user to skip the enforcement of either / both. * since the dynamic limit involves parameters (at least N and X above) that determine long-term tracking it can no longer be expressed just as short-lived query params, it needs a configuration with a life-cycle of SolrCore or longer. Where should we put this configuration? 
> Create MemQueryLimit implementation > --- > > Key: SOLR-17150 > URL: https://issues.apache.org/jira/browse/SOLR-17150 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > An implementation of {{QueryTimeout}} that terminates misbehaving queries > that allocate too much memory for their execution. > This is a bit more complicated than {{CpuQueryLimits}} because the first time > a query is submitted it may legitimately allocate many sizeable objects > (caches, field values, etc). So we want to catch and terminate queries that > either exceed any reasonable threshold (eg. 2GB), or significantly exceed a > time-weighted percentile of the recent queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
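The dynamic limit proposed in the comment above can be reduced to a small sketch. A plain percentile over a fixed sample array replaces the exponentially decaying reservoir, and {{N}} and {{X}} carry the placeholder values from the comment; all of this is illustrative, not an actual Solr design:

```java
import java.util.Arrays;

// Sketch of the proposed dynamic memory limit: enforce nothing until N
// samples are collected, then trip queries above X * p99 of recent
// per-query allocations. A simple sorted-array percentile stands in for
// the time-decayed histogram mentioned in the comment.
public class DynamicLimitSketch {
    static final int N = 100;    // minimum samples before enforcing
    static final double X = 2.0; // headroom multiplier over the p99

    /** 99th percentile of the given samples (nearest-rank method). */
    public static double p99(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.99 * sorted.length);
        return sorted[rank - 1];
    }

    /** True if this query's allocation exceeds the dynamic threshold. */
    public static boolean exceedsDynamicLimit(long[] recent, long allocatedBytes) {
        if (recent.length < N) {
            return false; // not enough history; only the absolute limit applies
        }
        return allocatedBytes > X * p99(recent);
    }
}
```

With 100 samples of 1..100 bytes, the threshold would be 2.0 * 99 = 198 bytes, so a 200-byte query trips the limit while anything at or below 198 does not.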
[jira] [Created] (SOLR-17158) Terminate distributed processing quickly when query limit is reached
Andrzej Bialecki created SOLR-17158: --- Summary: Terminate distributed processing quickly when query limit is reached Key: SOLR-17158 URL: https://issues.apache.org/jira/browse/SOLR-17158 Project: Solr Issue Type: Sub-task Security Level: Public (Default Security Level. Issues are Public) Components: Query Limits Reporter: Andrzej Bialecki Solr should make sure that when query limits are reached and partial results are not needed (and not wanted) then both the processing in shards and in the query coordinator should be terminated as quickly as possible, and Solr should minimize wasted resources spent on e.g. returning data from the remaining shards, merging responses in the coordinator, or returning any data back to the user.
[jira] [Commented] (SOLR-17138) Support other QueryTimeout criteria
[ https://issues.apache.org/jira/browse/SOLR-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816106#comment-17816106 ] Andrzej Bialecki commented on SOLR-17138: - Here are results from a set of simple JMH benchmarks where the code would call the respective method for 100 threads. Results are in nanoseconds / call / thread. Both methods are supported and enabled by default on all tested JVMs. * Results on Macbook M1 Max, MacOS Sonoma. ||*Java version*||*getThreadAllocatedBytes*||*getThreadCpuTime*|| |Azul Zulu 11|95|757| |OpenJDK 17|72|730| |OpenJDK 21|83|819| * Results on a Linux VM (on a Kubernetes cluster) running Ubuntu 22.04. ||*Java version*||*getThreadAllocatedBytes*||*getThreadCpuTime*|| |OpenJDK 11|40|238| |OpenJDK 17|36|239| |OpenJDK 21|41|236| * Results on a Windows VM (on a Kubernetes cluster) running Windows Server Core 10. ||*Java version*||*getThreadAllocatedBytes*||*getThreadCpuTime*|| |OpenJDK 11|108|440| |Oracle Java 17|103|426| |Oracle Java 21|105|447| > Support other QueryTimeout criteria > --- > > Key: SOLR-17138 > URL: https://issues.apache.org/jira/browse/SOLR-17138 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: Query Limits >Reporter: Andrzej Bialecki >Priority: Major > > Complex Solr queries can consume significant memory and CPU while being > processed. When OOM or CPU saturation is reached Solr becomes unresponsive, > which further compounds the problem. Often such “killer queries” are not > written to logs, which makes them difficult to diagnose. This happens even > with best practices in place. > It should be possible to set limits in Solr that cannot be exceeded by > individual queries. 
This mechanism would monitor an accumulating “cost” of a > query while it’s being executed and compare it to the configured maximum cost > (budget), expressed in terms of CPU and/or memory usage that can be > attributed to this query. Should these limits be exceeded the individual > query execution should be terminated, without affecting other concurrently > executing queries. > The CircuitBreakers functionality doesn't distinguish the source of the load > and can't protect other query executions from a particular runaway query. We > need a more fine-grained mechanism. > The existing {{QueryTimeout}} API enables such termination of individual > queries. However, the existing implementation ({{SolrQueryTimeoutImpl}} used > with {{timeAllowed}} query param) only uses elapsed wall-clock time as the > termination criterion. This is insufficient - in case of resource contention > the wall-clock time doesn’t represent correctly the actual CPU cost of > executing a particular query. A query may produce results after a long time > not because of its complexity or bad behavior but because of the general > resource contention caused by other concurrently executing queries. OTOH a > single runaway query may consume all resources and cause all other valid > queries to fail if they exceed the wall-clock {{timeAllowed}}. > I propose adding two additional criteria for limiting the maximum "query > budget": > * per-thread CPU time: using {{getThreadCpuTime}} to periodically check > ({{QueryTimeout.shouldExit()}}) the current CPU consumption since the start > of the query execution. > * per-thread memory allocation: using {{getThreadAllocatedBytes}}. > I ran some JMH microbenchmarks to ensure that these two methods are available > on modern OS/JVM combinations and their cost is negligible (less than 0.5 > us/call). This means that the initial implementation may call these methods > directly for every {{shouldExit()}} call without undue burden. 
If we decide > that this still adds too much overhead we can change this to periodic updates > in a background thread. > These two "query budget" constraints can be implemented as subclasses of > {{QueryTimeout}}. Initially we can use a similar configuration mechanism as > with {{timeAllowed}}, i.e. pass the max value as a query param, or add it to > the search handler's invariants. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
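A rough single-threaded version of the measurement behind the numbers above can be sketched as follows. JMH remains the authoritative tool; this only illustrates the method being timed and the order of magnitude:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Back-of-the-envelope cost estimate for getCurrentThreadCpuTime: time a
// tight loop of calls and report nanoseconds per call. No warmup or
// statistical rigor; use JMH for real numbers.
public class CallCostSketch {
    public static long nanosPerCall(int calls) {
        ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
        long sink = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += tmx.getCurrentThreadCpuTime();
        }
        long perCall = (System.nanoTime() - t0) / calls;
        if (sink == 42) System.out.println(sink); // keep the loop from being optimized away
        return perCall;
    }

    public static void main(String[] args) {
        System.out.println(nanosPerCall(100_000) + " ns/call");
    }
}
```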
[jira] [Assigned] (SOLR-17158) Terminate distributed processing quickly when query limit is reached
[ https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned SOLR-17158: --- Assignee: Andrzej Bialecki > Terminate distributed processing quickly when query limit is reached > > > Key: SOLR-17158 > URL: https://issues.apache.org/jira/browse/SOLR-17158 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > > Solr should make sure that when query limits are reached and partial results > are not needed (and not wanted) then both the processing in shards and in the > query coordinator should be terminated as quickly as possible, and Solr > should minimize wasted resources spent on eg. returning data from the > remaining shards, merging responses in the coordinator, or returning any data > back to the user. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-16986) Measure and aggregate thread CPU time in distributed search
[ https://issues.apache.org/jira/browse/SOLR-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816424#comment-17816424 ] Andrzej Bialecki commented on SOLR-16986: - Ok. In any case, this has a bug in that it ignores all but the first time measure when there are nested requests. [~gus] and I will look into reusing {{ThreadStats}} if possible and fixing this in SOLR-17140. > Measure and aggregate thread CPU time in distributed search > --- > > Key: SOLR-16986 > URL: https://issues.apache.org/jira/browse/SOLR-16986 > Project: Solr > Issue Type: New Feature >Reporter: David Smiley >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > > Solr responses include "QTime", which in retrospect might have been better > named "elapsedTime". We propose adding here a "cpuTime" to return the amount > of time consumed by > ManagementFactory.getThreadMXBean().[getThreadCpuTime|https://docs.oracle.com/en/java/javase/11/docs/api/java.management/java/lang/management/ThreadMXBean.html](). > Unlike QTime, this will need to be aggregated across distributed requests. > This work item will only do the aggregation work for distributed search, > although it could be extended for other scenarios in future work items. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Comment Edited] (SOLR-16986) Measure and aggregate thread CPU time in distributed search
[ https://issues.apache.org/jira/browse/SOLR-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816424#comment-17816424 ] Andrzej Bialecki edited comment on SOLR-16986 at 2/11/24 2:16 PM: -- Ok. In any case, this has a bug in that it ignores all but the first time measure when there are nested requests. [~gus] and I will look into reusing {{ThreadStats}} if possible and fixing this in SOLR-17140 so that the CPU time logged and the CPU time limit enforced by {{CpuQueryTimeLimit}} are consistent. was (Author: ab): Ok. In any case, this has a bug in that it ignores all but the first time measure when there are nested requests. [~gus] and I will look into reusing {{ThreadStats}} if possible and fixing this in SOLR-17140. > Measure and aggregate thread CPU time in distributed search > --- > > Key: SOLR-16986 > URL: https://issues.apache.org/jira/browse/SOLR-16986 > Project: Solr > Issue Type: New Feature >Reporter: David Smiley >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > > Solr responses include "QTime", which in retrospect might have been better > named "elapsedTime". We propose adding here a "cpuTime" to return the amount > of time consumed by > ManagementFactory.getThreadMXBean().[getThreadCpuTime|https://docs.oracle.com/en/java/javase/11/docs/api/java.management/java/lang/management/ThreadMXBean.html](). > Unlike QTime, this will need to be aggregated across distributed requests. > This work item will only do the aggregation work for distributed search, > although it could be extended for other scenarios in future work items. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17141) Create CpuQueryLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816644#comment-17816644 ] Andrzej Bialecki commented on SOLR-17141: - [~gus] and I discussed this issue - the way {{ThreadStats}} is used in SOLR-16986 gives incomplete results because it ignores nested queries (which use the stack in {{{}SolrRequestInfo{}}}. We would like to fix this as part of the SOLR-17138 refactoring, and to avoid potential confusion when logged CPU time is different than the CPU time limit set here. This can be done when both the {{CpuQueryTimeLimit}} and {{ThreadStats}} use the same starting point but keep track of nested requests. > Create CpuQueryLimit implementation > --- > > Key: SOLR-17141 > URL: https://issues.apache.org/jira/browse/SOLR-17141 > Project: Solr > Issue Type: Sub-task >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > This class will use `getThreadCpuTime` to determine when to signal > `shouldExit`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Comment Edited] (SOLR-17141) Create CpuQueryLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816644#comment-17816644 ] Andrzej Bialecki edited comment on SOLR-17141 at 2/12/24 3:29 PM: -- [~gus] and I discussed this issue - the way {{ThreadStats}} is used in SOLR-16986 gives incomplete results because it ignores nested queries (which use the stack in {{SolrRequestInfo}}). We would like to fix this as part of the SOLR-17138 refactoring, and to avoid potential confusion when logged CPU time is different than the CPU time limit set here. This can be done when both the {{CpuQueryTimeLimit}} and {{ThreadStats}} use the same starting point but keep track of nested requests. was (Author: ab): [~gus] and I discussed this issue - the way {{ThreadStats}} is used in SOLR-16986 gives incomplete results because it ignores nested queries (which use the stack in {{{}SolrRequestInfo{}}}. We would like to fix this as part of the SOLR-17138 refactoring, and to avoid potential confusion when logged CPU time is different than the CPU time limit set here. This can be done when both the {{CpuQueryTimeLimit}} and {{ThreadStats}} use the same starting point but keep track of nested requests. > Create CpuQueryLimit implementation > --- > > Key: SOLR-17141 > URL: https://issues.apache.org/jira/browse/SOLR-17141 > Project: Solr > Issue Type: Sub-task >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > This class will use `getThreadCpuTime` to determine when to signal > `shouldExit`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Updated] (SOLR-17141) Create CpuAllowedLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17141: Summary: Create CpuAllowedLimit implementation (was: Create CpuQueryLimit implementation) > Create CpuAllowedLimit implementation > - > > Key: SOLR-17141 > URL: https://issues.apache.org/jira/browse/SOLR-17141 > Project: Solr > Issue Type: Sub-task >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Time Spent: 3h 40m > Remaining Estimate: 0h > > This class will use `getThreadCpuTime` to determine when to signal > `shouldExit`.
[jira] [Updated] (SOLR-17141) Create CpuAllowedLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17141: Fix Version/s: 9.6.0 > Create CpuAllowedLimit implementation > - > > Key: SOLR-17141 > URL: https://issues.apache.org/jira/browse/SOLR-17141 > Project: Solr > Issue Type: Sub-task >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 9.6.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > This class will use `getThreadCpuTime` to determine when to signal > `shouldExit`.
[jira] [Resolved] (SOLR-17141) Create CpuAllowedLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-17141. - Resolution: Fixed > Create CpuAllowedLimit implementation > - > > Key: SOLR-17141 > URL: https://issues.apache.org/jira/browse/SOLR-17141 > Project: Solr > Issue Type: Sub-task >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 9.6.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > This class will use `getThreadCpuTime` to determine when to signal > `shouldExit`.
[jira] [Created] (SOLR-17172) Add QueryLimits termination to existing heavy SearchComponent-s
Andrzej Bialecki created SOLR-17172: --- Summary: Add QueryLimits termination to existing heavy SearchComponent-s Key: SOLR-17172 URL: https://issues.apache.org/jira/browse/SOLR-17172 Project: Solr Issue Type: Sub-task Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki The purpose of this ticket is to review the existing {{SearchComponent}}-s that perform intensive tasks to see if they could be modified to check the {{QueryLimits.shouldExit()}} inside their execution. This is not meant to be included in tight loops but to prevent individual components from completing multiple stages of costly work that will be discarded anyway on the exit from the component due to the exceeded limits (SOLR-17151).
[jira] [Commented] (SOLR-17151) Review current usage of QueryLimits to ensure complete coverage
[ https://issues.apache.org/jira/browse/SOLR-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819299#comment-17819299 ] Andrzej Bialecki commented on SOLR-17151: - Let's focus here on improving the checking between components as opposed on SOLR-17172. > Review current usage of QueryLimits to ensure complete coverage > --- > > Key: SOLR-17151 > URL: https://issues.apache.org/jira/browse/SOLR-17151 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Gus Heck >Priority: Major > > Resource usage by a query is not limited to the actual search within > {{QueryComponent}}. Other components invoked by {{SearchHandler}} may > significantly contribute to this usage, either before or after the > {{QueryComponent}}. > Those components that already use {{QueryTimeout}} either directly or > indirectly will properly observe the limits and terminate if needed. However, > other components may be expensive or misbehaving but fail to observe the > limits imposed on the end-to-end query processing. > One such obvious place where we could add this check is where the > {{SearchHandler}} loops over {{SearchComponent}-s - it should call explicitly > {{QueryLimits.shouldExit()}} to ensure that even if previously executed > component ignored the limits they will be still enforced at the > {{SearchHandler}} level. There may be other places like this, too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Comment Edited] (SOLR-17151) Review current usage of QueryLimits to ensure complete coverage
[ https://issues.apache.org/jira/browse/SOLR-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819299#comment-17819299 ] Andrzej Bialecki edited comment on SOLR-17151 at 2/21/24 3:49 PM: -- Let's focus here on improving the checking between components as opposed to SOLR-17172. was (Author: ab): Let's focus here on improving the checking between components as opposed on SOLR-17172. > Review current usage of QueryLimits to ensure complete coverage > --- > > Key: SOLR-17151 > URL: https://issues.apache.org/jira/browse/SOLR-17151 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Gus Heck >Priority: Major > > Resource usage by a query is not limited to the actual search within > {{QueryComponent}}. Other components invoked by {{SearchHandler}} may > significantly contribute to this usage, either before or after the > {{QueryComponent}}. > Those components that already use {{QueryTimeout}} either directly or > indirectly will properly observe the limits and terminate if needed. However, > other components may be expensive or misbehaving but fail to observe the > limits imposed on the end-to-end query processing. > One such obvious place where we could add this check is where the > {{SearchHandler}} loops over {{SearchComponent}-s - it should call explicitly > {{QueryLimits.shouldExit()}} to ensure that even if previously executed > component ignored the limits they will be still enforced at the > {{SearchHandler}} level. There may be other places like this, too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17158) Terminate distributed processing quickly when query limit is reached
[ https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819319#comment-17819319 ] Andrzej Bialecki commented on SOLR-17158: - Adding some observations from reading the code in {{SolrIndexSearcher}} and {{HttpShardHandler}}. It appears that currently when {{timeAllowed}} is reached it doesn’t cause termination of all other pending shard requests. I found this section in {{SolrIndexSearcher:284}}:
{code:java}
try {
  super.search(query, collector);
} catch (TimeLimitingCollector.TimeExceededException
    | ExitableDirectoryReader.ExitingReaderException
    | CancellableCollector.QueryCancelledException x) {
  log.warn("Query: [{}]; ", query, x);
  qr.setPartialResults(true);
{code}
When it reaches the {{timeAllowed}} limit (and our new {{QueryLimits}}, too) it simply sets {{partialResults=true}} and does NOT throw any exception, so all the layers above think that the result is a success. I suspect the reason for this was that when {{timeAllowed}} was set we still wanted to retrieve partial results when the limit was hit, and throwing an exception here would prevent that. OTOH, if we had a request param saying “discard everything when you reach a limit and cancel any ongoing requests” then we could throw an exception here, and {{ShardHandler}} would recognize this as an error and cancel all other shard requests that are still pending, so that replicas could avoid sending back their results that would be discarded anyway. 
> Terminate distributed processing quickly when query limit is reached > > > Key: SOLR-17158 > URL: https://issues.apache.org/jira/browse/SOLR-17158 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Gus Heck >Priority: Major > > Solr should make sure that when query limits are reached and partial results > are not needed (and not wanted) then both the processing in shards and in the > query coordinator should be terminated as quickly as possible, and Solr > should minimize wasted resources spent on eg. returning data from the > remaining shards, merging responses in the coordinator, or returning any data > back to the user.
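The two behaviors discussed in the comment above (flagging partial results vs. throwing so that the remaining shard requests can be cancelled) reduce to a small sketch. The class, method, and parameter names are hypothetical, not Solr's actual API:

```java
// Illustrative sketch of the two behaviors discussed above. The class,
// method, and parameter names are hypothetical, not Solr's actual API.
class LimitedSearch {
    static class QueryResult { boolean partialResults; }
    static class LimitExceededException extends RuntimeException {}

    /** @param allowPartial mimics a request param such as partialResults=true/false */
    static QueryResult search(boolean limitReached, boolean allowPartial) {
        QueryResult qr = new QueryResult();
        if (limitReached) {
            if (allowPartial) {
                qr.partialResults = true; // success + flag: the current behavior
            } else {
                // throwing lets the shard handler treat this as an error and
                // cancel the remaining pending shard requests
                throw new LimitExceededException();
            }
        }
        return qr;
    }
}
```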
[jira] [Commented] (SOLR-17158) Terminate distributed processing quickly when query limit is reached
[ https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820030#comment-17820030 ] Andrzej Bialecki commented on SOLR-17158: - FYI, it was necessary to add this parameter in SOLR-17172, I used {{partialResults=true}} to mean that we should stop processing and return partial results with "success" code and "partialResults" flag in the response, and {{partialResults=false}} to mean that we should throw an exception and discard any partial results. > Terminate distributed processing quickly when query limit is reached > > > Key: SOLR-17158 > URL: https://issues.apache.org/jira/browse/SOLR-17158 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Gus Heck >Priority: Major > > Solr should make sure that when query limits are reached and partial results > are not needed (and not wanted) then both the processing in shards and in the > query coordinator should be terminated as quickly as possible, and Solr > should minimize wasted resources spent on eg. returning data from the > remaining shards, merging responses in the coordinator, or returning any data > back to the user. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17158) Terminate distributed processing quickly when query limit is reached
[ https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17820781#comment-17820781 ] Andrzej Bialecki commented on SOLR-17158: - I'm not convinced we need a sysprop here... why shouldn't we use request handler's {{defaults}} and {{invariants}} sections in {{solrconfig.xml}} ? Using a sysprop effectively enforces the same default behavior for all replicas of all collections managed by this Solr node. > Terminate distributed processing quickly when query limit is reached > > > Key: SOLR-17158 > URL: https://issues.apache.org/jira/browse/SOLR-17158 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Gus Heck >Priority: Major > > Solr should make sure that when query limits are reached and partial results > are not needed (and not wanted) then both the processing in shards and in the > query coordinator should be terminated as quickly as possible, and Solr > should minimize wasted resources spent on eg. returning data from the > remaining shards, merging responses in the coordinator, or returning any data > back to the user. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
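As a sketch of that suggestion, a per-handler setting could look roughly like this in {{solrconfig.xml}}. The parameter name {{partialResults}} follows the naming used elsewhere in this thread; treat the exact name and placement as assumptions, not the final design:

```
<!-- solrconfig.xml: hypothetical per-handler default instead of a node-wide sysprop -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="partialResults">false</str>
  </lst>
  <!-- or put it under "invariants" to prevent clients from overriding it -->
</requestHandler>
```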
[jira] [Updated] (SOLR-17172) Add QueryLimits termination to existing heavy SearchComponent-s
[ https://issues.apache.org/jira/browse/SOLR-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17172: Fix Version/s: 9.6.0 > Add QueryLimits termination to existing heavy SearchComponent-s > --- > > Key: SOLR-17172 > URL: https://issues.apache.org/jira/browse/SOLR-17172 > Project: Solr > Issue Type: Sub-task >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 9.6.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > The purpose of this ticket is to review the existing {{SearchComponent}}-s > that perform intensive tasks to see if they could be modified to check the > {{QueryLimits.shouldExit()}} inside their execution. > This is not meant to be included in tight loops but to prevent individual > components from completing multiple stages of costly work that will be > discarded anyway on the exit from the component due to the exceeded limits > (SOLR-17151). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Updated] (SOLR-17172) Add QueryLimits termination to existing heavy SearchComponent-s
[ https://issues.apache.org/jira/browse/SOLR-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17172: Component/s: Query Limits > Add QueryLimits termination to existing heavy SearchComponent-s > --- > > Key: SOLR-17172 > URL: https://issues.apache.org/jira/browse/SOLR-17172 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 9.6.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > The purpose of this ticket is to review the existing {{SearchComponent}}-s > that perform intensive tasks to see if they could be modified to check the > {{QueryLimits.shouldExit()}} inside their execution. > This is not meant to be included in tight loops but to prevent individual > components from completing multiple stages of costly work that will be > discarded anyway on the exit from the component due to the exceeded limits > (SOLR-17151). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Resolved] (SOLR-17172) Add QueryLimits termination to existing heavy SearchComponent-s
[ https://issues.apache.org/jira/browse/SOLR-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-17172. - Resolution: Fixed > Add QueryLimits termination to existing heavy SearchComponent-s > --- > > Key: SOLR-17172 > URL: https://issues.apache.org/jira/browse/SOLR-17172 > Project: Solr > Issue Type: Sub-task >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 9.6.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > The purpose of this ticket is to review the existing {{SearchComponent}}-s > that perform intensive tasks to see if they could be modified to check the > {{QueryLimits.shouldExit()}} inside their execution. > This is not meant to be included in tight loops but to prevent individual > components from completing multiple stages of costly work that will be > discarded anyway on the exit from the component due to the exceeded limits > (SOLR-17151). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Updated] (SOLR-17182) Eliminate the need for 'solr.useExitableDirectoryReader' sysprop
[ https://issues.apache.org/jira/browse/SOLR-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-17182: Component/s: Query Limits > Eliminate the need for 'solr.useExitableDirectoryReader' sysprop > > > Key: SOLR-17182 > URL: https://issues.apache.org/jira/browse/SOLR-17182 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Chris M. Hostetter >Priority: Major > > As the {{QueryLimit}} functionality in Solr gets beefed up, and supports > multiple types of limits, it would be nice if we could find a way to > eliminate the need for the {{solr.useExitableDirectoryReader}} sysprop, and > instead just have codepaths that use the underlying IndexReader (like > faceting, spellcheck, etc...) automatically get a reader that enforces the > limits if/when limits are in use. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Created] (SOLR-17199) EnvUtils in solr-solrj is missing EnvToSyspropMappings.properties from solr-core
Andrzej Bialecki created SOLR-17199: --- Summary: EnvUtils in solr-solrj is missing EnvToSyspropMappings.properties from solr-core Key: SOLR-17199 URL: https://issues.apache.org/jira/browse/SOLR-17199 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Andrzej Bialecki Initially in SOLR-15960 {{EnvUtils}} was located in solr-core, together with its configuration resource {{EnvToSyspropMappings.properties}}. It was then moved from solr-core to solr-solrj, but the configuration resource was left in solr-core. This unfortunately means that {{EnvUtils}} cannot be used without a dependency on solr-core, unless the user adds their own copy of the configuration resource to the classpath. Right now trying to use it (or using {{PropertiesUtil}} for property substitution) results in an exception from the static initializer: {code} Caused by: java.lang.NullPointerException at java.base/java.util.Objects.requireNonNull(Objects.java:209) at org.apache.solr.common.util.EnvUtils.<clinit>(EnvUtils.java:51) {code}
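The failure mode above boils down to calling {{Objects.requireNonNull}} on a classpath resource lookup inside a static initializer, which surfaces as an opaque NPE. A guarded loader along these lines (class and method names hypothetical, not Solr's code) would fail with an actionable message instead:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Sketch of a guarded classpath-resource loader; names are hypothetical.
// A missing resource produces a clear message naming the resource, instead
// of an NPE from requireNonNull in a static initializer.
class ResourceProps {
    static Properties load(String resource) {
        InputStream in = ResourceProps.class.getClassLoader().getResourceAsStream(resource);
        if (in == null) {
            throw new IllegalStateException(
                "Missing classpath resource: " + resource
                + " (is the jar that provides it on the classpath?)");
        }
        try (in) {
            Properties p = new Properties();
            p.load(in);
            return p;
        } catch (IOException e) {
            throw new IllegalStateException("Failed to read " + resource, e);
        }
    }
}
```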
[jira] [Commented] (SOLR-17199) EnvUtils in solr-solrj is missing EnvToSyspropMappings.properties from solr-core
[ https://issues.apache.org/jira/browse/SOLR-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824384#comment-17824384 ] Andrzej Bialecki commented on SOLR-17199: - I didn't see it - thanks for fixing it! > EnvUtils in solr-solrj is missing EnvToSyspropMappings.properties from > solr-core > > > Key: SOLR-17199 > URL: https://issues.apache.org/jira/browse/SOLR-17199 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 9.5.0 >Reporter: Andrzej Bialecki >Assignee: Jan Høydahl >Priority: Major > Fix For: 9.6.0 > > > Initially in SOLR-15960 {{EnvUtils}} was located in solr-core, together with > its configuration resource {{EnvToSyspropMappings.properties}}. It was then > moved from solr-core to solr-solrj, but the configuration resource was > left in solr-core. > This unfortunately means that {{EnvUtils}} cannot be used without a dependency > on solr-core, unless the user adds their own copy of the configuration resource > to the classpath. Right now trying to use it (or using {{PropertiesUtil}} for > property substitution) results in an exception from the static initializer: > {code} > Caused by: java.lang.NullPointerException > at java.base/java.util.Objects.requireNonNull(Objects.java:209) > at org.apache.solr.common.util.EnvUtils.<clinit>(EnvUtils.java:51) > {code}
[jira] [Commented] (SOLR-17158) Terminate distributed processing quickly when query limit is reached
[ https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833944#comment-17833944 ] Andrzej Bialecki commented on SOLR-17158: - [~dsmiley] these are not exactly equivalent - when a limit is reached it doesn't have to be related in any way to per-shard processing. > Terminate distributed processing quickly when query limit is reached > > > Key: SOLR-17158 > URL: https://issues.apache.org/jira/browse/SOLR-17158 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Gus Heck >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > Solr should make sure that when query limits are reached and partial results > are not needed (and not wanted) then both the processing in shards and in the > query coordinator should be terminated as quickly as possible, and Solr > should minimize wasted resources spent on eg. returning data from the > remaining shards, merging responses in the coordinator, or returning any data > back to the user. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17150) Create MemQueryLimit implementation
[ https://issues.apache.org/jira/browse/SOLR-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842339#comment-17842339 ] Andrzej Bialecki commented on SOLR-17150: - After discussing this with other people it looks like the dynamic limits would be tricky to properly set and the interaction between the occasional legitimate heavier query traffic, updates (which would trigger searcher re-open and a mem usage spike) and other factors could cause too many failures. Still, having support for a hard limit to prevent a total run-away that would result in OOM seems useful. I'll prepare another patch that contains just the hard limit. > Create MemQueryLimit implementation > --- > > Key: SOLR-17150 > URL: https://issues.apache.org/jira/browse/SOLR-17150 > Project: Solr > Issue Type: Sub-task > Components: Query Limits >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > An implementation of {{QueryTimeout}} that terminates misbehaving queries > that allocate too much memory for their execution. > This is a bit more complicated than {{CpuQueryLimits}} because the first time > a query is submitted it may legitimately allocate many sizeable objects > (caches, field values, etc). So we want to catch and terminate queries that > either exceed any reasonable threshold (eg. 2GB), or significantly exceed a > time-weighted percentile of the recent queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
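A hard limit of the kind proposed above could be sketched as follows. The class name is illustrative and the allocation counter is injected so the check is testable; in practice the counter might come from {{com.sun.management.ThreadMXBean#getThreadAllocatedBytes}}, and the real implementation would plug into Solr's {{QueryTimeout}}:

```java
import java.util.function.LongSupplier;

// Sketch of a hard per-query memory cap (names illustrative): terminate only
// when allocations since query start exceed an absolute threshold, which is
// the "prevent a total run-away / OOM" case described above.
class HardMemLimit {
    private final long maxBytes;
    private final LongSupplier allocatedBytes;
    private final long startBytes;

    HardMemLimit(long maxBytes, LongSupplier allocatedBytes) {
        this.maxBytes = maxBytes;
        this.allocatedBytes = allocatedBytes;
        this.startBytes = allocatedBytes.getAsLong(); // baseline at query start
    }

    /** Mirrors QueryTimeout#shouldExit: true once the hard cap is exceeded. */
    boolean shouldExit() {
        return allocatedBytes.getAsLong() - startBytes > maxBytes;
    }
}
```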
[jira] [Commented] (SOLR-13350) Explore collector managers for multi-threaded search
[ https://issues.apache.org/jira/browse/SOLR-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845258#comment-17845258 ] Andrzej Bialecki commented on SOLR-13350: - This is caused by breaking the end-to-end tracking of request context in {{SolrRequestInfo}}, which uses a thread-local deque to provide the same context for both the main and all sub-requests. This tracking is needed to set up the correct query timeout instance on the searcher ({{QueryLimits}}) for time-limited searches in {{SolrIndexSearcher:727}}. However, now that this method is executed in a separate "searcherCollector" thread the {{SolrRequestInfo}} instance it obtains is empty because it doesn't match the original thread that set it. > Explore collector managers for multi-threaded search > > > Key: SOLR-13350 > URL: https://issues.apache.org/jira/browse/SOLR-13350 > Project: Solr > Issue Type: New Feature >Reporter: Ishan Chattopadhyaya >Assignee: Ishan Chattopadhyaya >Priority: Major > Attachments: SOLR-13350.patch, SOLR-13350.patch, SOLR-13350.patch > > Time Spent: 11h 20m > Remaining Estimate: 0h > > AFAICT, SolrIndexSearcher can be used only to search all the segments of an > index in series. However, using CollectorManagers, segments can be searched > concurrently and result in reduced latency. Opening this issue to explore the > effectiveness of using CollectorManagers in SolrIndexSearcher from latency > and throughput perspective.
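The underlying problem and the usual remedy can be illustrated with a plain {{ThreadLocal}}: a value set in the request thread is invisible in a pooled worker thread, so the context has to be captured in the submitting thread and re-installed inside the task. {{REQUEST_CONTEXT}} below is a stand-in for {{SolrRequestInfo}}, not Solr's actual implementation:

```java
import java.util.concurrent.Callable;

// Demonstrates explicit thread-local context propagation: capture in the
// submitting thread, re-install inside the task, clean up afterwards.
// REQUEST_CONTEXT is a stand-in for SolrRequestInfo's thread-local state.
class ContextPropagation {
    static final ThreadLocal<String> REQUEST_CONTEXT = new ThreadLocal<>();

    /** Wraps a task so it runs with the submitter's context installed. */
    static <T> Callable<T> withCurrentContext(Callable<T> task) {
        String captured = REQUEST_CONTEXT.get(); // runs in the submitting thread
        return () -> {
            REQUEST_CONTEXT.set(captured);
            try {
                return task.call();
            } finally {
                REQUEST_CONTEXT.remove(); // don't leak into the pooled thread
            }
        };
    }
}
```

Without such a wrapper, the "searcherCollector" thread described above reads an empty {{SolrRequestInfo}} simply because nothing ever installed the parent context on it.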
[jira] [Commented] (SOLR-13350) Explore collector managers for multi-threaded search
[ https://issues.apache.org/jira/browse/SOLR-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848668#comment-17848668 ] Andrzej Bialecki commented on SOLR-13350: - {quote}As of now, the timeAllowed requests are anyway executed without multithreading {quote} This is based on a {{QueryCommand.timeAllowed}} flag that is set only from the {{timeAllowed}} param. However, this concept was extended in SOLR-17138 to {{QueryLimits}}, which is now initialized from other params as well. There is indeed some inconsistency here that's a left-over from that change, in the sense that {{QueryCommand.timeAllowed}} should have been either removed completely or replaced with something like {{queryLimits}} that checks the current {{SolrRequestInfo}} for {{QueryLimits}}. In any case, the minimal workaround for this could be to check {{QueryLimits.getCurrentLimits().isLimitsEnabled()}} instead of {{QueryCommand.timeAllowed}}. But a better fix would be to properly unbreak the tracking of the parent {{SolrRequestInfo}} in MT search. > Explore collector managers for multi-threaded search > > > Key: SOLR-13350 > URL: https://issues.apache.org/jira/browse/SOLR-13350 > Project: Solr > Issue Type: New Feature >Reporter: Ishan Chattopadhyaya >Assignee: Ishan Chattopadhyaya >Priority: Major > Attachments: SOLR-13350.patch, SOLR-13350.patch, SOLR-13350.patch > > Time Spent: 11.5h > Remaining Estimate: 0h > > AFAICT, SolrIndexSearcher can be used only to search all the segments of an > index in series. However, using CollectorManagers, segments can be searched > concurrently and result in reduced latency. Opening this issue to explore the > effectiveness of using CollectorManagers in SolrIndexSearcher from latency > and throughput perspective.
[jira] [Commented] (SOLR-17416) Streaming Expressions: Exception swallowed and not propagated back to the client leading to inconsistent results
[ https://issues.apache.org/jira/browse/SOLR-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878554#comment-17878554 ] Andrzej Bialecki commented on SOLR-17416: - +1 for the proposed immediate fix. > Streaming Expressions: Exception swallowed and not propagated back to the > client leading to inconsistent results > - > > Key: SOLR-17416 > URL: https://issues.apache.org/jira/browse/SOLR-17416 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Export Writer, streaming expressions >Reporter: Lamine >Priority: Major > Attachments: SOLR-17416.patch > > > There appears to be a bug in the _ExportWriter/ExportBuffers_ implementation > within the Streaming Expressions plugin. Specifically, when an > InterruptedException occurs due to an ExportBuffers timeout, the exception is > swallowed and not propagated back to the client (still logged on the server > side though). > As a result, the client receives an EOF marker, thinking that it has received > the full set of results, when in fact it has only received partial results. > This leads to inconsistent search results, as the client is unaware that the > export process was interrupted and terminated prematurely. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17430) Redesign ExportWriter / ExportBuffers to work better with large batchSizes and slow consumption
[ https://issues.apache.org/jira/browse/SOLR-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878556#comment-17878556 ] Andrzej Bialecki commented on SOLR-17430: - Originally this design was an evolution of a single buffer-based older design, where the "filler" and "writer" phases ran sequentially in the same thread. I agree that something we initially thought would be a simple extension ended up quite complicated :) [~jbernste] and I ran several benchmarks using the old and the current design, which showed big performance improvements in the current design. I think that these speedups benefited from the bulk (buffer-based) operations for both read and write sides of the process. Using a queue definitely simplifies the design but I'm worried we may lose some of these performance gains when processing is done item-by-item and not in bulk. OTOH this may not be such a huge factor overall, and if it allows us to simplify the code and better control the flow, then it may be worth it even with some performance penalty. > Redesign ExportWriter / ExportBuffers to work better with large batchSizes > and slow consumption > --- > > Key: SOLR-17430 > URL: https://issues.apache.org/jira/browse/SOLR-17430 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Priority: Major > > As mentioned in SOLR-17416, the design of the {{ExportBuffers}} class used by > the {{ExportHandler}} is brittle and the absolute time limit on how long > the buffer swapping threads will wait for each other isn't suitable for very > long-running streaming expressions... > {quote}The problem however is that this 600 second timeout may not be enough > to account for really slow downstream consumption of the data. 
With really > large collections, and really complicated streaming expressions, this can > happen even with well-behaved clients that are actively trying to consume > data. > {quote} > ...but another sub-optimal aspect of this buffer swapping design is that the > "writer" thread is initially completely blocked, and can't write out a single > document, until the "filler" thread has read the full {{batchSize}} of > documents into its buffer and opted to swap. Likewise, after buffer > swapping has occurred at least once, any document in the {{outputBuffer}} that > the writer has already processed hangs around, taking up RAM, until the next > swap, while one of the threads is idle. If {{batchSize=3}}, and the > "filler" thread is ready to go with a full {{fillBuffer}} while the "writer" > has only been able to emit 2 of the documents in its {{outputBuffer}} > before being blocked and forced to wait (due to the downstream > consumer of the output bytes) before it can emit the last document in its > batch – that means both the "writer" thread and the "filler" thread are > stalled, taking up 2x the batchSize of RAM, even though half of that is data > that is no longer needed. > The bigger the {{batchSize}} the worse the initial delay (and steady-state > wasted RAM) is.
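The queue-based alternative weighed in the comment above can be sketched with a bounded {{BlockingQueue}}: the writer starts as soon as the first document arrives and buffered memory is capped at the queue capacity, at the cost of item-by-item handoff. This is an illustration of the trade-off only, not ExportWriter's actual design; all names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded-queue handoff between a "filler" thread and a "writer" loop.
// POISON marks end-of-stream. Names and structure are illustrative only.
class QueueExport {
    static final Object POISON = new Object();

    static List<Object> run(List<Object> docs, int capacity) throws InterruptedException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(capacity);
        Thread filler = new Thread(() -> {
            try {
                for (Object d : docs) queue.put(d); // blocks when the writer lags
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        filler.start();
        List<Object> written = new ArrayList<>();
        for (Object d = queue.take(); d != POISON; d = queue.take()) {
            written.add(d); // "write" each doc as soon as it is available
        }
        filler.join();
        return written;
    }
}
```

Compared to buffer swapping, no document sits in an already-processed half of a batch, but each document pays a per-item synchronization cost, which is the bulk-vs-item performance concern raised in the comment.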
[jira] [Created] (SOLR-15272) Solr Admin UI uses non-standard unit for the number of docs
Andrzej Bialecki created SOLR-15272: --- Summary: Solr Admin UI uses non-standard unit for the number of docs Key: SOLR-15272 URL: https://issues.apache.org/jira/browse/SOLR-15272 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Affects Versions: main (9.0) Reporter: Andrzej Bialecki I just noticed the following in the Admin UI / Cloud / Nodes section: {quote}gettingstarted_s1r2 (1.9mn docs) {quote} AFAIK there's no widely recognized "mn" unit :) It should be "mln" or perhaps "M" (for the "mega" prefix).
[jira] [Created] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers
Andrzej Bialecki created SOLR-15300: --- Summary: Shard "state" flag is confusing and of limited value to outside consumers Key: SOLR-15300 URL: https://issues.apache.org/jira/browse/SOLR-15300 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Solr API (and consequently the metric reporters, which are often used for Solr monitoring) report the shard as being in ACTIVE state even when in reality its functionality is severely compromised (eg. no replicas, all replicas down, or no leader). This reported state is technically correct because it is used only for tracking of the SPLITSHARD operations, as defined in {{Slice.State}}. However, this may be misleading and more often unhelpful than not - for constant monitoring a flag that actually reports impaired functionality of a shard would be more useful than a flag that reports a relatively uncommon SPLITSHARD operation. We could either redefine the meaning of the existing flag (and change its state according to some of the criteria I listed above), or add another flag to represent the "health" status of a shard. The value of this flag would then provide an easy way to monitor and to alert external systems of dangerous function impairment, without monitoring the state of all replicas of a collection. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
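The "health" status proposed above can be sketched as a value derived on the fly from replica states rather than a stored flag. The enum names and criteria below are illustrative only, not an actual Solr API:

```java
import java.util.List;

// Sketch of a derived shard health status computed from replica states;
// enum names and criteria are illustrative, not Solr's actual model.
class ShardHealth {
    enum ReplicaState { ACTIVE, DOWN, RECOVERING }
    enum Health { HEALTHY, DEGRADED, NO_LEADER, DOWN }

    static Health of(List<ReplicaState> replicas, boolean hasLeader) {
        if (replicas.isEmpty() || replicas.stream().noneMatch(r -> r == ReplicaState.ACTIVE)) {
            return Health.DOWN;       // no replica can serve requests
        }
        if (!hasLeader) {
            return Health.NO_LEADER;  // reads may work but updates will fail
        }
        if (replicas.stream().anyMatch(r -> r != ReplicaState.ACTIVE)) {
            return Health.DEGRADED;   // some replicas down or recovering
        }
        return Health.HEALTHY;
    }
}
```

Because the value is computed from state that already exists, it could be reported via metrics or the cluster-state API without persisting anything new, which is the direction the later comment in this thread argues for.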
[jira] [Updated] (SOLR-15232) Add replica(s) as a part of node startup
[ https://issues.apache.org/jira/browse/SOLR-15232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15232: Fix Version/s: main (9.0) > Add replica(s) as a part of node startup > > > Key: SOLR-15232 > URL: https://issues.apache.org/jira/browse/SOLR-15232 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: main (9.0) > > Time Spent: 10m > Remaining Estimate: 0h > > In containerized environments it would make sense to be able to initialize a > new node (pod) and designate it immediately to hold newly created replica(s) > of specified collection/shard(s) once it's up and running. > Currently this is not easy to do, it requires the intervention of an external > agent that additionally has to first check if the node is up, all of which > makes the process needlessly complicated. > This functionality could be as simple as adding a command-line switch to > {{bin/solr start}}, which would cause it to invoke appropriate ADDREPLICA > commands once it verifies the node is up. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-15232) Add replica(s) as a part of node startup
[ https://issues.apache.org/jira/browse/SOLR-15232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316328#comment-17316328 ] Andrzej Bialecki commented on SOLR-15232: - Ad 1. Solr autoscaling is not aware of how the pods are managed. When you add a new pod and want to populate it, this always requires additional actions orchestrated by an outside agent. In Solr 8.x you could automate part of it (adding replicas to new empty nodes) by using {{nodeAdded}} triggers, but this is gone in 9.x. Ad 2. AFAIK there's no specific mechanism in k8s that would allow you to "customize" a particular pod instance on startup, i.e. to specify during pod creation that it should host a specific replica. Furthermore, external agents need to first check that the pod is up before proceeding, which complicates their design. This PR fills that gap: you don't need any external agent to orchestrate the creation of replicas on new pods - you can just pass system properties telling the pod which replicas it should host, and it adds them automatically as soon as the Solr CoreContainer is up (and shuts itself down if it fails to initialize). This provides an easier way to auto-scale using the k8s autoscaler without modifications. Ad 3. This is again a multi-step process that requires an external agent to coordinate it. I'm not saying it can't be done (obviously), but the approach I propose could simplify the process.
[jira] [Commented] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers
[ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316366#comment-17316366 ] Andrzej Bialecki commented on SOLR-15300: - The {{replicationFactor}} is ill-defined, at least the way it's used. It doesn't reflect anything other than the initial setup - you are free to add / remove replicas and then it no longer holds true. It doesn't reflect per shard replication either. I would go even further - we should remove it from collection state because it's misleading. Another question is "what is the intended replication factor and how to measure it"? This is not obvious either because it may depend on circumstances (eg. adding replicas during search traffic spikes and removing them afterwards). This may be a task for some external agent to figure out. I think it's much easier to focus in this issue on clearly reporting the most common abnormal states - eg. shard has replicas down/recovering, shard has no replicas, shard has no leader. Also, at the Java level you can already get all this information, so I think the scope of this issue is only what to do about the external reporting / monitoring, either via metrics or via ClusterState / Slice. As such, I think that we don't have to explicitly store this state anywhere, we can construct it on the fly for the purpose of reporting. > Shard "state" flag is confusing and of limited value to outside consumers > - > > Key: SOLR-15300 > URL: https://issues.apache.org/jira/browse/SOLR-15300 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > > Solr API (and consequently the metric reporters, which are often used for > Solr monitoring) report the shard as being in ACTIVE state even when in > reality its functionality is severely compromised (eg. no replicas, all > replicas down, or no leader). 
> This reported state is technically correct because it is used only for > tracking of the SPLITSHARD operations, as defined in {{Slice.State}}. > However, this may be misleading and more often unhelpful than not - for > constant monitoring a flag that actually reports impaired functionality of a > shard would be more useful than a flag that reports a relatively uncommon > SPLITSHARD operation. > We could either redefine the meaning of the existing flag (and change its > state according to some of the criteria I listed above), or add another flag > to represent the "health" status of a shard. The value of this flag would > then provide an easy way to monitor and to alert external systems of > dangerous function impairment, without monitoring the state of all replicas > of a collection.
[jira] [Commented] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers
[ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316419#comment-17316419 ] Andrzej Bialecki commented on SOLR-15300: - So maybe it could be as simple as adding in the CLUSTERSTATUS response a "status" property for each shard, calculated on the fly.
[jira] [Commented] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers
[ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317022#comment-17317022 ] Andrzej Bialecki commented on SOLR-15300: - bq. Well, the intended replicationFactor for a given shard is the number of replicas currently registered with CLUSTERSTATUS That would make sense, indeed - though this has no relation whatsoever to the actual value of the {{replicationFactor}} property. bq. should either be in its own sub-tree next to "collections" or clearly marked as "_live-state" or similar Agreed. I would prefer to put it into each collection's props, perhaps under a less awkward name like "liveState"? After all, we already report other calculated data here that doesn't come from state.json, such as aliases and roles.
[jira] [Commented] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers
[ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319504#comment-17319504 ] Andrzej Bialecki commented on SOLR-15300: - Based on the Slack discussions, I propose to add the following information to the output of the CLUSTERSTATUS command:
* add a calculated (not stored in DocCollection) "health" property at the level of each shard and each collection.
* use the following symbolic names for the health state:
** GREEN: all replicas up, leader exists,
** YELLOW: some replicas down, leader exists,
** ORANGE: many replicas down, leader exists,
** RED: most replicas down, or no leader.
* use 66% and 33% of active replicas as the thresholds between yellow/orange/red.
* the collection-level health status will be reported as the worst status of any shard.
The notion of having a flag for a "read only" collection (when there's no leader, or only PULL replicas) needs further thought, because there's already a "readOnly" flag that users can explicitly set using MODIFYCOLLECTION (this flag is also used in REINDEXCOLLECTION).
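The proposed calculation can be sketched in a few lines. This is a hypothetical illustration of the proposal as stated, not Solr's actual implementation: the class and method names are invented, and the exact inclusive/exclusive semantics of the 66%/33% boundaries are an assumption.

```java
// Hypothetical sketch of the proposed on-the-fly shard "health" calculation.
// ShardHealth, healthOf() and worst() are invented names; the 66%/33%
// boundary semantics are an assumption, not Solr's actual code.
public class ShardHealth {
    public enum State { GREEN, YELLOW, ORANGE, RED }

    /** GREEN: all replicas up + leader exists; RED: most replicas down or no leader. */
    public static State healthOf(int activeReplicas, int totalReplicas, boolean hasLeader) {
        if (!hasLeader || totalReplicas == 0 || activeReplicas == 0) {
            return State.RED;
        }
        double active = (double) activeReplicas / totalReplicas;
        if (active >= 1.0) return State.GREEN;   // all replicas up
        if (active > 0.66) return State.YELLOW;  // some replicas down
        if (active > 0.33) return State.ORANGE;  // many replicas down
        return State.RED;                        // most replicas down
    }

    /** Collection-level health: the worst status of any of its shards. */
    public static State worst(State a, State b) {
        return a.ordinal() >= b.ordinal() ? a : b;
    }
}
```

Under this reading, a collection's health would be folded over its shards with {{worst()}}, so a single leaderless shard turns the whole collection RED.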
[jira] [Commented] (SOLR-15341) Lucene has removed CodecReader#ramBytesUsed
[ https://issues.apache.org/jira/browse/SOLR-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322213#comment-17322213 ] Andrzej Bialecki commented on SOLR-15341: - bq. I'm not 100% sure where this ram info was used in Solr It was purely informative, there should be no hard dependencies on it in Solr. > Lucene has removed CodecReader#ramBytesUsed > --- > > Key: SOLR-15341 > URL: https://issues.apache.org/jira/browse/SOLR-15341 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Jan Høydahl >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Due to LUCENE-9387 Solr no longer compiles. Accountability of CodecReader RAM > usage is removed.
[jira] [Commented] (SOLR-15019) Replica placement API needs a way to fetch existing replica metrics
[ https://issues.apache.org/jira/browse/SOLR-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326713#comment-17326713 ] Andrzej Bialecki commented on SOLR-15019: - That's a good point, which I didn't consider. I can revert this part of the change - [~ilan] WDYT? > Replica placement API needs a way to fetch existing replica metrics > --- > > Key: SOLR-15019 > URL: https://issues.apache.org/jira/browse/SOLR-15019 > Project: Solr > Issue Type: Improvement >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: main (9.0) > > Time Spent: 9h 20m > Remaining Estimate: 0h > > The replica placement API was introduced in SOLR-14613. It offers a few sample > (and simple) implementations of placement plugins. > However, this API doesn't offer support for retrieving per-replica metrics, > which are required for calculating more realistic placements. For example, > when calculating placements for ADDREPLICA on an already existing collection, > the plugin should know the size of a replica in order to avoid placing > large replicas on nodes with insufficient free disk space. > After discussing this with [~ilan] we propose the following additions to the > API: > * use the existing {{AttributeFetcher}} interface as a facade for retrieving > per-replica values (currently it only retrieves per-node values) > * add a {{ShardValues}} interface to represent a strongly-typed API for key > metrics, such as replica size, number of docs, number of update and search > requests. > Plugins could then use this API like this: > {code}
> AttributeFetcher attributeFetcher = ...
> SolrCollection solrCollection = ...
> Set<String> metricNames = ...
> attributeFetcher.requestCollectionMetrics(solrCollection, solrCollection.getShardNames(), metricNames);
> AttributeValues attributeValues = attributeFetcher.fetchAttributes();
> ShardValues shardValues = attributeValues.getShardMetrics(solrCollection.getName(), shardName);
> int sizeInGB = shardValues.getSizeInGB(); // retrieves shard leader metrics
> int replicaSizeInGB = shardValues.getSizeInGB(replica);
> {code}
[jira] [Reopened] (SOLR-15019) Replica placement API needs a way to fetch existing replica metrics
[ https://issues.apache.org/jira/browse/SOLR-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened SOLR-15019: -
[jira] [Commented] (SOLR-15019) Replica placement API needs a way to fetch existing replica metrics
[ https://issues.apache.org/jira/browse/SOLR-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327241#comment-17327241 ] Andrzej Bialecki commented on SOLR-15019: - Ok, I'll remove this until we actually need it.
[jira] [Resolved] (SOLR-15019) Replica placement API needs a way to fetch existing replica metrics
[ https://issues.apache.org/jira/browse/SOLR-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-15019. - Resolution: Fixed
[jira] [Created] (SOLR-15379) Fix API incompatibility after LUCENE-9905
Andrzej Bialecki created SOLR-15379: --- Summary: Fix API incompatibility after LUCENE-9905 Key: SOLR-15379 URL: https://issues.apache.org/jira/browse/SOLR-15379 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki
[jira] [Updated] (SOLR-15379) Fix API incompatibility after LUCENE-9905
[ https://issues.apache.org/jira/browse/SOLR-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15379: Attachment: SOLR-15379.patch > Fix API incompatibility after LUCENE-9905 > - > > Key: SOLR-15379 > URL: https://issues.apache.org/jira/browse/SOLR-15379 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Minor > Attachments: SOLR-15379.patch
[jira] [Created] (SOLR-15395) Report collection / shard "health" status in the Admin UI
Andrzej Bialecki created SOLR-15395: --- Summary: Report collection / shard "health" status in the Admin UI Key: SOLR-15395 URL: https://issues.apache.org/jira/browse/SOLR-15395 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: Admin UI Affects Versions: main (9.0) Reporter: Andrzej Bialecki SOLR-15300 added a "health" status report to the output of the CLUSTERSTATUS command. This should also be shown in the UI to allow users to visually check this status.
[jira] [Created] (SOLR-15396) Expose collection / shard "health" state in Prometheus exporter
Andrzej Bialecki created SOLR-15396: --- Summary: Expose collection / shard "health" state in Prometheus exporter Key: SOLR-15396 URL: https://issues.apache.org/jira/browse/SOLR-15396 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: metrics Affects Versions: main (9.0) Reporter: Andrzej Bialecki SOLR-15300 added a "health" status for collections and shards. This should also be exposed via the Prometheus exporter.
[jira] [Commented] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers
[ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340137#comment-17340137 ] Andrzej Bialecki commented on SOLR-15300: - [~janhoy] I created SOLR-15395 and SOLR-15396. Cluster-level "health" status is somewhat different because it should probably consider not only the state of the collections but also of the nodes. Let's discuss this in a separate Jira.
[jira] [Resolved] (SOLR-15232) Add replica(s) as a part of node startup
[ https://issues.apache.org/jira/browse/SOLR-15232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-15232. - Resolution: Won't Do Closing as Won't Do (didn't know this was an option :) ). As noted in the PR comments, this mechanism would be fragile, and there are better ways to do this in Kubernetes.
[jira] [Resolved] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers
[ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-15300. - Fix Version/s: 8.9 Resolution: Fixed
[jira] [Commented] (SOLR-14245) Validate Replica / ReplicaInfo on creation
[ https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346758#comment-17346758 ] Andrzej Bialecki commented on SOLR-14245: - I strongly disagree - let's not revert, instead fix the bug that caused the invalid state! {{Replica}} is a critical piece of information, if it's invalid then something seriously wrong already happened. That's the whole point of validation, to quickly catch errors that can cause long-term subtle corruption. > Validate Replica / ReplicaInfo on creation > -- > > Key: SOLR-14245 > URL: https://issues.apache.org/jira/browse/SOLR-14245 > Project: Solr > Issue Type: Improvement > Components: SolrCloud >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 8.5 > > > Replica / ReplicaInfo should be immutable and their fields should be > validated on creation. > Some users reported that very rarely during a failed collection CREATE or > DELETE, or when the Overseer task queue becomes corrupted, Solr may write to > ZK incomplete replica infos (eg. node_name = null). > This problem is difficult to reproduce but we should add safeguards anyway to > prevent writing such corrupted replica info to ZK.
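The fail-fast idea behind validate-on-creation can be sketched minimally. The class and field set below are invented for illustration; the real {{Replica}} / {{ReplicaInfo}} classes have more fields and richer validation.

```java
import java.util.Objects;

// Minimal illustration of validate-on-creation: reject incomplete replica
// data (e.g. node_name = null) at construction time, before it can be
// persisted to ZooKeeper. Invented sketch, not Solr's actual Replica class.
public final class ReplicaSketch {
    private final String name;
    private final String nodeName;
    private final String coreName;

    public ReplicaSketch(String name, String nodeName, String coreName) {
        this.name = Objects.requireNonNull(name, "replica name must not be null");
        this.nodeName = Objects.requireNonNull(nodeName, "node_name must not be null");
        this.coreName = Objects.requireNonNull(coreName, "core must not be null");
    }

    public String getNodeName() { return nodeName; }
}
```

With eager checks like these, a corrupted state document fails loudly at the point it is created rather than surfacing later as subtle cluster-state corruption.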
[jira] [Comment Edited] (SOLR-14245) Validate Replica / ReplicaInfo on creation
[ https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346758#comment-17346758 ] Andrzej Bialecki edited comment on SOLR-14245 at 5/18/21, 9:58 AM: --- I strongly disagree - let's not revert, instead fix the bug that caused the invalid state! {{Replica}} is a critical piece of information, if it's invalid then something seriously wrong already happened. That's the whole point of validation, to quickly catch errors that can cause long-term subtle corruption. If the validation logic is somehow faulty and there's an edge case that it should accept, then we can fix it - but I'm against removing it.
[jira] [Commented] (SOLR-14245) Validate Replica / ReplicaInfo on creation
[ https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346826#comment-17346826 ] Andrzej Bialecki commented on SOLR-14245: - bq. how can deal with our attitude of not caring about users who may be affected by bugs or changes we introduce [~ichattopadhyaya] and [~noble.paul]: I'm totally fed up with your ad hominem attacks. Please discuss this in a civil manner. If I were so inclined I could also point fingers to many areas of code where you both rammed through totally buggy and sloppy code, and start implying you're ignorant and careless. I could also find many significant changes you guys did without PRs or with a PR opened and committed within a couple hours, without any review. But I hope that ultimately we all have good intentions and we should fight the problem and not each other. On second thought, I agree with Noble that the validation should be more lenient (incidentally, the bug that causes {{node_name: null}} is likely related to a buggy roundtrip conversion between ReplicaInfo <-> Replica... and guess who added ReplicaInfo?). I still object to removing the validation completely, but we can make it non-fatal - as I said above, admins should be aware when Solr is using corrupted data, because we really can't be sure what other long-term consequences it may cause.
[jira] [Commented] (SOLR-14245) Validate Replica / ReplicaInfo on creation
[ https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346834#comment-17346834 ] Andrzej Bialecki commented on SOLR-14245: - Jira is for tracking technical issues, and not discussing our attitudes, whether real or implied. "We are callous" is a blank statement to which I don't subscribe. This issue is more than a year old - I think that instead of reopening it a separate Jira should be created to discuss the proper fix. I'm unwilling to simply revert it because (as I explained above) the purpose of this change is still valid - it's important to be aware that a piece of a critical Solr state is corrupted. Let's open a new Jira and discuss the fix.
[jira] [Resolved] (SOLR-14245) Validate Replica / ReplicaInfo on creation
[ https://issues.apache.org/jira/browse/SOLR-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-14245. - I'm closing this issue. Please create a separate Jira to discuss the bug and the fix - this issue is already 3 releases and 15 months old.
[jira] [Commented] (SOLR-15348) revisit MetricsHistoryHandler's "could not obtain overseer" WARNings
[ https://issues.apache.org/jira/browse/SOLR-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347091#comment-17347091 ] Andrzej Bialecki commented on SOLR-15348: - With the removal of autoscaling the usefulness of this handler is questionable - perhaps we should simply remove it (and the whole metrics history collection in Solr). I'll create a separate Jira for this. > revisit MetricsHistoryHandler's "could not obtain overseer" WARNings > > > Key: SOLR-15348 > URL: https://issues.apache.org/jira/browse/SOLR-15348 > Project: Solr > Issue Type: Task >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/handler/admin/MetricsHistoryHandler.java#L339
[jira] [Created] (SOLR-15416) Consider removing metrics history collection (and MetricsHistoryHandler)
Andrzej Bialecki created SOLR-15416: --- Summary: Consider removing metrics history collection (and MetricsHistoryHandler) Key: SOLR-15416 URL: https://issues.apache.org/jira/browse/SOLR-15416 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: metrics Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Originally this functionality was meant to one day support more intelligent decisions in the autoscaling triggers that would react to the dynamics of the metrics changes. For this reason it was useful to keep track of the changes in the key metrics over time, without depending on any external systems. With the removal of autoscaling the usefulness of this handler (and the collection of metrics history inside Solr) is questionable. I propose to remove it in 9.0.
[jira] [Comment Edited] (SOLR-15348) revisit MetricsHistoryHandler's "could not obtain overseer" WARNings
[ https://issues.apache.org/jira/browse/SOLR-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347091#comment-17347091 ] Andrzej Bialecki edited comment on SOLR-15348 at 5/18/21, 5:48 PM: --- With the removal of autoscaling the usefulness of this handler is questionable - perhaps we should simply remove it (and the whole metrics history collection in Solr). I'll create a separate Jira for this. Edit: SOLR-15416 was (Author: ab): With the removal of autoscaling the usefulness of this handler is questionable - perhaps we should simply remove it (and the whole metrics history collection in Solr). I'll create a separate Jira for this.
[jira] [Updated] (SOLR-15416) Consider removing metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15416: Fix Version/s: main (9.0)
[jira] [Updated] (SOLR-15416) Consider removing metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15416: Attachment: SOLR-15416.patch
[jira] [Commented] (SOLR-15416) Consider removing metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348578#comment-17348578 ] Andrzej Bialecki commented on SOLR-15416: - This patch removes MetricsHistoryHandler, MetricsCollectorHandler, SolrCluster / SolrShardReporter and support for Solr-backed RRD database (with rrd4j dependency). If there are no objections I'll commit this shortly.
[jira] [Updated] (SOLR-15416) Remove metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15416: Summary: Remove metrics history collection (and MetricsHistoryHandler) (was: Consider removing metrics history collection (and MetricsHistoryHandler))
[jira] [Created] (SOLR-15425) Upgrade to Metrics 4.2.0
Andrzej Bialecki created SOLR-15425: --- Summary: Upgrade to Metrics 4.2.0 Key: SOLR-15425 URL: https://issues.apache.org/jira/browse/SOLR-15425 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: metrics Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki In addition to many fixes and compatibility with new Java versions this release adds a {{LockFreeExponentiallyDecayingReservoir}} which substantially reduces the cost of collecting histograms, especially for multi-threaded updates.
[jira] [Updated] (SOLR-15425) Upgrade to Metrics 4.2.0
[ https://issues.apache.org/jira/browse/SOLR-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15425: Description: In addition to many fixes and compatibility with new Java versions this release adds a {{LockFreeExponentiallyDecayingReservoir}} which substantially reduces the cost of collecting histograms, especially for multi-threaded updates. It also provides a no-op implementation of MetricRegistry, which would further reduce the already small overheads when metrics collection is turned off. (was: In addition to many fixes and compatibility with new Java versions this release adds a {{LockFreeExponentiallyDecayingReservoir}} which substantially reduces the cost of collecting histograms, especially for multi-threaded updates.)
[jira] [Resolved] (SOLR-15379) Fix API incompatibility after LUCENE-9905
[ https://issues.apache.org/jira/browse/SOLR-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-15379. - Fix Version/s: main (9.0) Resolution: Fixed > Fix API incompatibility after LUCENE-9905 > - > > Key: SOLR-15379 > URL: https://issues.apache.org/jira/browse/SOLR-15379 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: main (9.0) > > Attachments: SOLR-15379.patch
[jira] [Resolved] (SOLR-14749) Provide a clean API for cluster-level event processing
[ https://issues.apache.org/jira/browse/SOLR-14749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-14749. - Resolution: Fixed > Provide a clean API for cluster-level event processing > -- > > Key: SOLR-14749 > URL: https://issues.apache.org/jira/browse/SOLR-14749 > Project: Solr > Issue Type: Improvement > Components: AutoScaling >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Labels: clean-api > Fix For: main (9.0) > > Time Spent: 22h > Remaining Estimate: 0h > > This is a companion issue to SOLR-14613 and it aims at providing a clean, > strongly typed API for the functionality formerly known as "triggers" - that > is, a component for generating cluster-level events corresponding to changes > in the cluster state, and a pluggable API for processing these events. > The 8x triggers have been removed so this functionality is currently missing > in 9.0. However, this functionality is crucial for implementing the automatic > collection repair and re-balancing as the cluster state changes (nodes going > down / up, becoming overloaded / unused / decommissioned, etc). > For this reason we need this API and a default implementation of triggers > that at least can perform automatic collection repair (maintaining the > desired replication factor in presence of live node changes). > As before, the actual changes to the collections will be executed using > existing CollectionAdmin API, which in turn may use the placement plugins > from SOLR-14613. > h3. Division of responsibility > * built-in Solr components (non-pluggable): > ** cluster state monitoring and event generation, > ** simple scheduler to periodically generate scheduled events > * plugins: > ** automatic collection repair on {{nodeLost}} events (provided by default) > ** re-balancing of replicas (periodic or on {{nodeAdded}} events) > ** reporting (eg. requesting additional node provisioning) > ** scheduled maintenance (eg. removing inactive shards after split) > h3. Other considerations > These plugins (unlike the placement plugins) need to execute on one > designated node in the cluster. Currently the easiest way to implement this > is to run them on the Overseer leader node.
[jira] [Commented] (SOLR-15416) Remove metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348805#comment-17348805 ] Andrzej Bialecki commented on SOLR-15416: - Good point - yes, we need to deprecate it. I'll prepare a patch for this too.
[jira] [Commented] (SOLR-15428) Integrate the OpenJDK JMH micro benchmark framework for micro benchmarks and performance comparisons and investigation.
[ https://issues.apache.org/jira/browse/SOLR-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349635#comment-17349635 ] Andrzej Bialecki commented on SOLR-15428: - +1! Excellent idea. > Integrate the OpenJDK JMH micro benchmark framework for micro benchmarks and > performance comparisons and investigation. > --- > > Key: SOLR-15428 > URL: https://issues.apache.org/jira/browse/SOLR-15428 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Mark Robert Miller >Priority: Major > > I’ve spent a fair amount of time over the years on work around integrating > Lucene’s benchmark framework into Solr and while I’ve used this with > additional local work off and on, JMH has become somewhat of a standard for > micro benchmarks on the JVM. I have some work that provides an initial > integration, allowing for more targeted micro benchmarks as well as more > integration type benchmarking using JettySolrRunner.
[jira] [Updated] (SOLR-15416) Remove metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15416: Attachment: SOLR-15416-8x.patch
[jira] [Commented] (SOLR-15416) Remove metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350424#comment-17350424 ] Andrzej Bialecki commented on SOLR-15416: - The other patch contains deprecations for 8x. I'll commit this shortly.
[jira] [Resolved] (SOLR-15416) Remove metrics history collection (and MetricsHistoryHandler)
[ https://issues.apache.org/jira/browse/SOLR-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-15416. - Resolution: Fixed
[jira] [Commented] (SOLR-11882) SolrMetric registries retain references to SolrCores when closed
[ https://issues.apache.org/jira/browse/SOLR-11882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351771#comment-17351771 ] Andrzej Bialecki commented on SOLR-11882: - [~dsmiley] I think we can do even better - move all the logic related to metrics to the CoreContainer.load(). After all, if we fail to init CC the node and its metrics are unusable anyway. And when we close the CoreContainer the metrics are not available either, so we can equally well do the cleanup in CC.shutdown(). > SolrMetric registries retain references to SolrCores when closed > > > Key: SOLR-11882 > URL: https://issues.apache.org/jira/browse/SOLR-11882 > Project: Solr > Issue Type: Bug > Components: metrics, Server >Affects Versions: 7.1 >Reporter: Eros Taborelli >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 7.4, 8.0 > > Attachments: SOLR-11882-7x.patch, SOLR-11882.patch, SOLR-11882.patch, > SOLR-11882.patch, SOLR-11882.patch, SOLR-11882.patch, SOLR-11882.patch, > create-cores.zip, solr-dump-full_Leak_Suspects.zip, solr.config.zip > > > *Description:* > Our setup involves using a lot of small cores (possibly hundred thousand), > but working only on a few of them at any given time. > We already followed all recommendations in this guide: > [https://wiki.apache.org/solr/LotsOfCores] > We noticed that after creating/loading around 1000-2000 empty cores, with no > documents inside, the heap consumption went through the roof despite having > set transientCacheSize to only 64 (heap size set to 12G). > All cores are correctly set to loadOnStartup=false and transient=true, and we > have verified via logs that the cores in excess are actually being closed. > However, a reference remains in the > org.apache.solr.metrics.SolrMetricManager#registries that is never removed > until a core is fully unloaded. > Restarting the JVM loads all cores in the admin UI, but doesn't populate the > ConcurrentHashMap until a core is actually fully loaded. > I reproduced the issue on a smaller scale (transientCacheSize = 5, heap size > = 512m) and made a report (attached) using eclipse MAT. > *Desired outcome:* > When a transient core is closed, the references in the SolrMetricManager > should be removed, in the same fashion the reporters for the core are also > closed and removed. > Alternatively, an unloadOnClose=true|false flag could be implemented to fully > unload a transient core when closed due to the cache size. > *Note:* > The documentation mentions everywhere that the unused cores will be unloaded, > but it's misleading as the cores are never fully unloaded.
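The desired outcome described in the report - dropping a closed transient core's registry entry instead of retaining it until a full unload - can be sketched with a stand-in map. This is only an illustration: `registries` and `onCoreClose` are hypothetical names, not `SolrMetricManager`'s actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MetricRegistryCleanupSketch {
    // Stand-in for the per-core registry map held by the metric manager
    // (the real field is org.apache.solr.metrics.SolrMetricManager#registries).
    static final Map<String, Object> registries = new ConcurrentHashMap<>();

    // When a transient core is closed, remove its registry entry so the heap
    // reference does not linger - mirroring how the core's reporters are
    // already closed and removed.
    static void onCoreClose(String registryName) {
        registries.remove(registryName);
    }

    public static void main(String[] args) {
        registries.put("solr.core.collection1", new Object());
        onCoreClose("solr.core.collection1");
        System.out.println(registries.isEmpty()); // true
    }
}
```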
[jira] [Created] (SOLR-15858) ConfigSetsHandler requires DIR entries in the uploaded ZIPs
Andrzej Bialecki created SOLR-15858: --- Summary: ConfigSetsHandler requires DIR entries in the uploaded ZIPs Key: SOLR-15858 URL: https://issues.apache.org/jira/browse/SOLR-15858 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: configset-api Affects Versions: 8.11.1 Reporter: Andrzej Bialecki If you try uploading a configset zip that contains resources in sub-folders - but doesn't contain explicit DIR entries in the zip file - the upload will fail with `NoNodeException`. This is caused by `ConfigSetsHandler.createZkNodeIfNotExistsAndSetData` which assumes the entry path doesn't contain sub-path elements. If the corresponding DIR entries are present (and they occur earlier in the zip than their child resource entries!) the handler will work properly because it recognizes DIR entries and creates ZK paths as needed. The fix would be to always check for the presence of `/` characters in the entry name and make sure the ZK path already exists.
[jira] [Updated] (SOLR-15858) ConfigSetsHandler requires DIR entries in the uploaded ZIPs
[ https://issues.apache.org/jira/browse/SOLR-15858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15858: Description: If you try uploading a configset zip that contains resources in sub-folders - but doesn't contain explicit DIR entries in the zip file - the upload will fail with {{{}NoNodeException{}}}. This is caused by {{ConfigSetsHandler.createZkNodeIfNotExistsAndSetData}} which assumes the entry path doesn't contain sub-path elements. If the corresponding DIR entries are present (and they occur earlier in the zip than their child resource entries!) the handler will work properly because it recognizes DIR entries and creates ZK paths as needed. The fix would be to always check for the presence of `/` characters in the entry name and make sure the ZK path already exists. was: If you try uploading a configset zip that contains resources in sub-folders - but doesn't contain explicit DIR entries in the zip file - the upload will fail with `NoNodeException`. This is caused by `ConfigSetsHandler.createZkNodeIfNotExistsAndSetData` which assumes the entry path doesn't contain sub-path elements. If the corresponding DIR entries are present (and they occur earlier in the zip than their child resource entries!) the handler will work properly because it recognizes DIR entries and creates ZK paths as needed. The fix would be to always check for the presence of `/` characters in the entry name and make sure the ZK path already exists. > ConfigSetsHandler requires DIR entries in the uploaded ZIPs > --- > > Key: SOLR-15858 > URL: https://issues.apache.org/jira/browse/SOLR-15858 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) > Components: configset-api >Affects Versions: 8.11.1 >Reporter: Andrzej Bialecki >Priority: Major > > If you try uploading a configset zip that contains resources in sub-folders - > but doesn't contain explicit DIR entries in the zip file - the upload will > fail with {{{}NoNodeException{}}}. > This is caused by {{ConfigSetsHandler.createZkNodeIfNotExistsAndSetData}} > which assumes the entry path doesn't contain sub-path elements. If the > corresponding DIR entries are present (and they occur earlier in the zip than > their child resource entries!) the handler will work properly because it > recognizes DIR entries and creates ZK paths as needed. > The fix would be to always check for the presence of `/` characters in the > entry name and make sure the ZK path already exists.
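The fix described above boils down to deriving, from each zip entry name, the parent paths that must already exist in ZK before the file node can be written. A minimal sketch of that path computation (the class and method names here are hypothetical, not Solr's; a real patch would create each parent node via something like SolrZkClient.makePath):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: for a zip entry name like "conf/lang/stopwords_en.txt",
// list every parent path that must exist in ZK, in creation order, regardless
// of whether the zip contains explicit DIR entries.
public class ZkPathHelper {
    public static List<String> parentPaths(String entryName) {
        List<String> parents = new ArrayList<>();
        int idx = entryName.indexOf('/');
        while (idx > 0) {
            // everything before each '/' is a directory level that needs a ZK node
            parents.add(entryName.substring(0, idx));
            idx = entryName.indexOf('/', idx + 1);
        }
        return parents;
    }

    public static void main(String[] args) {
        System.out.println(parentPaths("conf/lang/stopwords_en.txt")); // [conf, conf/lang]
        System.out.println(parentPaths("solrconfig.xml"));             // []
    }
}
```

Because the parents are computed from the entry name itself, the handler no longer depends on DIR entries being present, or on them appearing in the zip before their child resources.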
[jira] [Commented] (SOLR-16013) Overseer gives up election node before closing - inflight commands can be processed twice
[ https://issues.apache.org/jira/browse/SOLR-16013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493279#comment-17493279 ] Andrzej Bialecki commented on SOLR-16013: - Additionally, `OverseerElectionContext.close()` has this implementation: {code:java} @Override public synchronized void close() { this.isClosed = true; overseer.close(); } {code} So it marks itself as closed before the Overseer is closed; I agree it should do it the other way around, and then simply check in `runLeaderProcess:76` that the Overseer is not closed. > Overseer gives up election node before closing - inflight commands can be > processed twice > - > > Key: SOLR-16013 > URL: https://issues.apache.org/jira/browse/SOLR-16013 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Priority: Major > > {{ZkController}} shutdown currently has these two lines (in this order)... > {code:java} > customThreadPool.submit(() -> > IOUtils.closeQuietly(overseerElector.getContext())); > customThreadPool.submit(() -> IOUtils.closeQuietly(overseer)); > {code} > AFAICT this means that the overseer nodeX will give up its > election node (via overseerElector), allowing some other nodeY to be elected as a > new overseer, **BEFORE** Overseer nodeX shuts down its {{Overseer}} object, > which waits for the {{OverseerThread}} to finish processing any in-flight tasks. > In practice, this seems to make it possible for a single command in the > overseer queue to get processed twice.
[jira] [Comment Edited] (SOLR-16013) Overseer gives up election node before closing - inflight commands can be processed twice
[ https://issues.apache.org/jira/browse/SOLR-16013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493279#comment-17493279 ] Andrzej Bialecki edited comment on SOLR-16013 at 2/16/22, 3:30 PM: --- Additionally, `OverseerElectionContext.close()` has this implementation: {code:java} @Override public synchronized void close() { this.isClosed = true; overseer.close(); } {code} So it marks itself as closed before the Overseer is closed; I agree it should do it the other way around, and then simply check in `runLeaderProcess:76` that the Overseer is not closed. Edit: I think the idea in `OverseerElectionContext` was primarily to avoid re-electing this Overseer and then to wait until all its tasks are completed. But this allows another Overseer to be elected, which then processes the in-flight tasks as new. was (Author: ab): Additionally, `OverseerElectionContext.close()` has this implementation: {code:java} @Override public synchronized void close() { this.isClosed = true; overseer.close(); } {code} So it marks itself as closed before the Overseer is closed, and I agree that it seems to me it should do it the other way around, and then simply check in `runLeaderProcess:76` if the Overseer is not closed. > Overseer gives up election node before closing - inflight commands can be > processed twice > - > > Key: SOLR-16013 > URL: https://issues.apache.org/jira/browse/SOLR-16013 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Chris M. Hostetter >Priority: Major > > {{ZkController}} shutdown currently has these two lines (in this order)...
> {code:java} > customThreadPool.submit(() -> > IOUtils.closeQuietly(overseerElector.getContext())); > customThreadPool.submit(() -> IOUtils.closeQuietly(overseer)); > {code} > AFAICT this means that the overseer nodeX will give up its > election node (via overseerElector), allowing some other nodeY to be elected as a > new overseer, **BEFORE** Overseer nodeX shuts down its {{Overseer}} object, > which waits for the {{OverseerThread}} to finish processing any in-flight tasks. > In practice, this seems to make it possible for a single command in the > overseer queue to get processed twice.
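The proposed reordering can be sketched with simplified stand-ins (these are toy classes modeling the discussed close() order, not the real Solr implementations): first wait for the Overseer to drain its in-flight work, and only then mark the election context closed and give up the election node.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed fix: overseer.close() runs BEFORE isClosed is set,
// so no other node is elected while in-flight commands are still being processed.
public class CloseOrderSketch {
    static final List<String> events = new ArrayList<>();

    static class Overseer {
        void close() {
            // in the real code this would wait for the OverseerThread to finish
            // any tasks currently in process
            events.add("overseer-closed");
        }
    }

    static class OverseerElectionContext {
        private final Overseer overseer;
        private volatile boolean isClosed;

        OverseerElectionContext(Overseer overseer) { this.overseer = overseer; }

        synchronized void close() {
            overseer.close();      // first: let in-flight commands finish
            this.isClosed = true;  // then: release the election node
            events.add("context-closed");
        }

        boolean isClosed() { return isClosed; }
    }

    public static void main(String[] args) {
        new OverseerElectionContext(new Overseer()).close();
        System.out.println(events); // [overseer-closed, context-closed]
    }
}
```

With this ordering, `runLeaderProcess` can simply refuse to restart leadership duties once the Overseer reports itself closed.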
[jira] [Commented] (SOLR-16073) totalTime metric should be milliseconds (not nano)
[ https://issues.apache.org/jira/browse/SOLR-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503586#comment-17503586 ] Andrzej Bialecki commented on SOLR-16073: - Removing the conversion may have been a mistake; we should consistently report time intervals using the same units. Currently we report the intervals inside histograms in milliseconds, but the elapsed times of Timers in nanoseconds. Changing the units may have some back-compat consequences; I'm not sure how to address them. Also, I can't say whether this metric is useful enough to be included by default in the exporter - generally speaking, since exporting metrics via the Prometheus exporter is a relatively heavyweight process, IMHO we should attempt to cut down the number of exported metrics to a bare minimum (whatever that means ;) ). > totalTime metric should be milliseconds (not nano) > -- > > Key: SOLR-16073 > URL: https://issues.apache.org/jira/browse/SOLR-16073 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: metrics >Reporter: David Smiley >Priority: Minor > > I observed that the "totalTime" metric has been a nanosecond number in recent > years, yet once upon a time it was milliseconds. This change was very likely > inadvertent. Our prometheus solr-exporter-config.xml shows that it thinks > it's milliseconds. It's not; RequestHandlerBase increments this counter by > "elapsed", the response of timer.stop() -- nanoseconds. Years ago it had > invoked {{MetricUtils.nsToMs}} but it appears [~ab] removed this as a part of > other changes in 2017 sometime -- > https://github.com/apache/solr/commit/d8df9f8c9963c2fc1718fd471316bf5d964125ba > Also, I question the value/purpose of this metric. Is it so useful that it > deserves to be among our relatively few metrics exported in our default > prometheus exporter config? It's been there since the initial config but I > wonder why anyone wants it.
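The conversion the issue argues for is a single unit normalization: timer.stop() yields nanoseconds, so the handler should convert to milliseconds before incrementing the totalTime counter. A minimal sketch (class and method names are illustrative, mirroring what {{MetricUtils.nsToMs}} used to do; `java.util.concurrent.TimeUnit` performs the actual conversion):

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch: normalize a Timer's elapsed nanoseconds to milliseconds
// before feeding them into a millisecond-based counter such as totalTime.
public class TotalTimeUnits {
    public static long nsToMs(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    public static void main(String[] args) {
        long elapsedNs = 1_500_000_000L; // e.g. a request that took 1.5 s
        System.out.println(nsToMs(elapsedNs)); // 1500
    }
}
```

Keeping every reported interval in milliseconds would make the histogram values and the Timer totals directly comparable, which is the consistency the comment asks for.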
[jira] [Commented] (SOLR-15502) MetricsCollectorHandler deprecated warning (missing documentation)
[ https://issues.apache.org/jira/browse/SOLR-15502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374793#comment-17374793 ] Andrzej Bialecki commented on SOLR-15502: - Yes, we can remove the annotation (which will avoid the warning in the logs); it should be enough to keep the javadoc @deprecated tag (and the RefGuide notice). I'll fix this in 8x. [~bwahlen] as Cassandra said, it's safe to ignore this warning - the warning will be removed in the next 8.x release, and the component is gone in 9.x. > MetricsCollectorHandler deprecated warning (missing documentation) > -- > > Key: SOLR-15502 > URL: https://issues.apache.org/jira/browse/SOLR-15502 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Affects Versions: 8.9 >Reporter: Bernd Wahlen >Priority: Minor > > after upgrading from 8.8.2 to 8.9.0 I got the following warning: > MetricsCollectorHandler > Solr loaded a deprecated plugin/analysis class > [org.apache.solr.handler.admin.MetricsCollectorHandler]. Please consult > documentation how to replace it accordingly. > I found the corresponding change: > https://solr.apache.org/docs/8_9_0/changes/Changes.html#v8.9.0.other_changes > SOLR-15416 > but not how to solve it (the documentation mentioned in the warning is > missing). > I also think the link to the documentation in the release notes has changed/is > broken: > https://github.com/apache/lucene-solr/blob/master/solr/solr-ref-guide/src/solr-upgrade-notes.adoc > => > https://gitbox.apache.org/repos/asf?p=solr.git;a=blob;f=solr/solr-ref-guide/src/solr-upgrade-notes.adoc > but cannot find how to solve that warning there either. > I grepped my configs but can't find anything related.
[jira] [Resolved] (SOLR-15502) MetricsCollectorHandler deprecated warning (missing documentation)
[ https://issues.apache.org/jira/browse/SOLR-15502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved SOLR-15502. - Assignee: Andrzej Bialecki Resolution: Fixed I removed this annotation in branch_8x. Thanks Bernd for reporting this! > MetricsCollectorHandler deprecated warning (missing documentation) > -- > > Key: SOLR-15502 > URL: https://issues.apache.org/jira/browse/SOLR-15502 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Affects Versions: 8.9 >Reporter: Bernd Wahlen >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 8.10 > > > after upgrading from 8.8.2 to 8.9.0 I got the following warning: > MetricsCollectorHandler > Solr loaded a deprecated plugin/analysis class > [org.apache.solr.handler.admin.MetricsCollectorHandler]. Please consult > documentation how to replace it accordingly. > I found the corresponding change: > https://solr.apache.org/docs/8_9_0/changes/Changes.html#v8.9.0.other_changes > SOLR-15416 > but not how to solve it (the documentation mentioned in the warning is > missing). > I also think the link to the documentation in the release notes has changed/is > broken: > https://github.com/apache/lucene-solr/blob/master/solr/solr-ref-guide/src/solr-upgrade-notes.adoc > => > https://gitbox.apache.org/repos/asf?p=solr.git;a=blob;f=solr/solr-ref-guide/src/solr-upgrade-notes.adoc > but cannot find how to solve that warning there either. > I grepped my configs but can't find anything related.
[jira] [Updated] (SOLR-15502) MetricsCollectorHandler deprecated warning (missing documentation)
[ https://issues.apache.org/jira/browse/SOLR-15502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-15502: Fix Version/s: 8.10 > MetricsCollectorHandler deprecated warning (missing documentation) > -- > > Key: SOLR-15502 > URL: https://issues.apache.org/jira/browse/SOLR-15502 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Affects Versions: 8.9 >Reporter: Bernd Wahlen >Priority: Minor > Fix For: 8.10 > > > after upgrading from 8.8.2 to 8.9.0 I got the following warning: > MetricsCollectorHandler > Solr loaded a deprecated plugin/analysis class > [org.apache.solr.handler.admin.MetricsCollectorHandler]. Please consult > documentation how to replace it accordingly. > I found the corresponding change: > https://solr.apache.org/docs/8_9_0/changes/Changes.html#v8.9.0.other_changes > SOLR-15416 > but not how to solve it (the documentation mentioned in the warning is > missing). > I also think the link to the documentation in the release notes has changed/is > broken: > https://github.com/apache/lucene-solr/blob/master/solr/solr-ref-guide/src/solr-upgrade-notes.adoc > => > https://gitbox.apache.org/repos/asf?p=solr.git;a=blob;f=solr/solr-ref-guide/src/solr-upgrade-notes.adoc > but cannot find how to solve that warning there either. > I grepped my configs but can't find anything related.