Rafał Harabień created SOLR-17275:
-------------------------------------

             Summary: Major performance regression of CloudSolrClient in Solr 9.6.0 when using aliases
                 Key: SOLR-17275
                 URL: https://issues.apache.org/jira/browse/SOLR-17275
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrJ
    Affects Versions: 9.6.0
         Environment: SolrJ 9.6.0, Ubuntu 22.04, Java 17
            Reporter: Rafał Harabień
         Attachments: image-2024-05-06-17-23-42-236.png

I observe worse performance of CloudSolrClient after upgrading from SolrJ 9.5.0 to 9.6.0, especially at p99:

p99 jumped from ~25 ms to ~400 ms
p90 jumped from ~9.9 ms to ~22 ms
p75 jumped from ~7 ms to ~11 ms
p50 jumped from ~4.5 ms to ~7.5 ms

Screenshot from Grafana (the new version was deployed at ~14:30):

!image-2024-05-06-17-23-42-236.png!

I took a thread dump and can see many threads waiting in [ZkStateReader.forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503]:
{noformat}
Thread info:
"suggest-solrThreadPool-thread-52" prio=5 Id=600 BLOCKED on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d owned by "suggest-solrThreadPool-thread-34" Id=582
	at app//org.apache.solr.common.cloud.ZkStateReader.forceUpdateCollection(ZkStateReader.java:506)
	-  blocked on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d
	at app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getState(ZkClientClusterStateProvider.java:155)
	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1207)
	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1099)
	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:892)
	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:820)
	at app//org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:255)
	at app//org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:927)
	...

	Number of locked synchronizers = 1
	- java.util.concurrent.ThreadPoolExecutor$Worker@1beb7ed3
{noformat}

At the same time, the qTime reported by Solr hasn't changed, so I'm pretty sure this is a client-side regression.

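For anyone who wants to check for the same symptom without reading a full thread dump, here is a minimal sketch that counts BLOCKED threads per monitor using only the JDK's `java.lang.management` API. The class name `BlockedThreadSummary` and the method `blockedByLock` are made up for this illustration and are not part of Solr or SolrJ. Many threads reported against a single `org.apache.solr.common.cloud.ZkStateReader@...` key would match the dump above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper, not part of Solr or SolrJ.
public class BlockedThreadSummary {

    /**
     * Returns how many threads are currently BLOCKED on each monitor.
     * Keys look like "some.Class@1a2b3c", the same form a thread dump prints.
     */
    public static Map<String, Integer> blockedByLock() {
        Map<String, Integer> counts = new TreeMap<>();
        ThreadInfo[] infos =
            ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            // Only threads waiting to enter a synchronized block or method.
            if (info.getThreadState() == Thread.State.BLOCKED
                    && info.getLockName() != null) {
                counts.merge(info.getLockName(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one line per contended monitor.
        blockedByLock().forEach((lock, n) ->
            System.out.println(n + " thread(s) BLOCKED on " + lock));
    }
}
```

Calling `blockedByLock()` periodically from inside the affected application (for example from a debug endpoint) should be enough to see whether one monitor dominates.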
I've tried reproducing it locally and I can see the [forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503] function being called for every request in my application. [This|https://github.com/apache/solr/commit/8cf552aa3642be473c6a08ce44feceb9cbe396d7] commit changed the logic in ZkClientClusterStateProvider.getState so that the function is called whenever clusterState.getCollectionRef [returns null|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/client/solrj/impl/ZkClientClusterStateProvider.java#L151]. That was not the case in 9.5.0 (forceUpdateCollection was not called at this point).

I can see in the debugger that getCollectionRef only supports collections, not aliases (the collectionStates map contains only collections). In my application all collections are referenced through aliases, so I guess that's why I see the regression in Solr response time.

I am not familiar enough with the code to prepare a PR, but I hope this insight is enough to fix the issue.
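To make the suspected mechanism concrete, here is a toy model of it. This is not Solr code: the class `AliasMissSketch`, its methods, and the names `products` / `products_v2` are all invented for illustration, and the real refresh logic is reduced to a counter. It only assumes what is described above: the state map holds collections but not aliases, and a null lookup triggers a forced, synchronized refresh.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only; every name here is hypothetical and not part of SolrJ.
public class AliasMissSketch {
    // Stand-in for the collectionStates map: real collections only, never aliases.
    private final Map<String, String> collectionStates = new ConcurrentHashMap<>();
    private final AtomicInteger forcedRefreshes = new AtomicInteger();

    public AliasMissSketch(String... collections) {
        for (String c : collections) {
            collectionStates.put(c, "state-of-" + c);
        }
    }

    // Stand-in for a forced refresh. It is serialized on one monitor, like the
    // BLOCKED frames in the thread dump. A real refresh would also pay for a
    // ZooKeeper round trip; for an alias it cannot add a map entry.
    private synchronized void forceUpdate(String name) {
        forcedRefreshes.incrementAndGet();
    }

    /** Plain lookup with no refresh on a miss (the 9.5.0 behaviour as described above). */
    public String lookup(String name) {
        return collectionStates.get(name);
    }

    /** Lookup that forces a refresh on every miss (the 9.6.0 behaviour as described above). */
    public String lookupWithRefreshOnMiss(String name) {
        String state = collectionStates.get(name);
        if (state == null) {
            forceUpdate(name);
            state = collectionStates.get(name); // still null for an alias
        }
        return state;
    }

    public int forcedRefreshes() {
        return forcedRefreshes.get();
    }

    public static void main(String[] args) {
        AliasMissSketch sketch = new AliasMissSketch("products_v2");
        for (int i = 0; i < 1000; i++) {
            sketch.lookupWithRefreshOnMiss("products"); // "products" plays the alias
        }
        System.out.println("forced refreshes via alias: " + sketch.forcedRefreshes());
    }
}
```

In this model a lookup by collection name never triggers a refresh, while every lookup by alias does, so an application that only uses aliases would pay the synchronized refresh on each request.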