Re: Load on Solr Nodes due to High GC
Can you please tell me about the hardware details (server type, CPU speed and type, disk speed and type) and the GC configuration? Also, please post the output of top and iotop if you can.

Deepak

On Thu, Jun 20, 2024 at 11:24 AM Oleksandr Tkachuk wrote:
> Use tlog+pull replicas, they will improve the situation significantly.
>
> On Thu, Jun 20, 2024, 07:27 Saksham Gupta wrote:
> >
> > Hi All,
> >
> > We have been facing extra load incidents due to higher GC count and GC time, causing higher response times and timeouts.
> >
> > Solr Cloud Cluster Details
> >
> > We use Solr Cloud v8.10 [with Java 8 and G1 GC] with 8 shards, where each shard is present on a single VM with 16 cores and 50 GB RAM. The size of each shard is ~28 GB and the Solr heap is 16 GB [heap is used only for the filter, document, and queryResults caches, each of size 512].
> >
> > Problem Details
> >
> > We pause indexing at 11 AM during peak searching hours. Normally the system remains stable during the peak hours, but when the document update count on Solr is higher before peak hours [between 5.30 AM and 11 AM], we face multiple load issues. The GC count and GC time increase and CPU is consumed by GC itself, thereby increasing the load and response time of the system. To mitigate this, we recently increased the RAM on the servers [to 50 GB from 42 GB previously] to reduce the I/O wait caused by repeatedly reading the Solr index from disk into memory. Taking a step further, we also increased the Solr heap from 12 to 16 GB [we also tried other combinations like 14 GB, 15 GB, 18 GB]. Although we saw some reduction in load issues due to lower I/O wait, the issue still recurs when heavier indexing is done.
> >
> > We have explored a few options like expunge deletes, which may help reduce the percentage of deleted documents, but that cannot be executed close to peak hours, as it increases I/O wait, which further spikes the load and response time of Solr significantly.
> >
> > 1. Apart from changing the expunge deletes timing, is there another option we can try to mitigate this problem?
> >
> > 2. Approximately 60 million documents are updated each day, i.e. ~30% of the complete Solr index is modified each day, while serving ~20 million search requests. We would appreciate any knowledge on how to handle such high indexing + searching traffic during peak hours.
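For reference, a minimal sketch of the tlog+pull suggestion above, using the Collections API to add a PULL replica; the collection and shard names are illustrative, not taken from the thread. PULL replicas only copy finished segments from the leader, so query traffic served from them is insulated from indexing-time GC churn:

# add a PULL replica of shard1 (repeat per shard); route search traffic to these replicas
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&type=pull"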
Re: Load on Solr Nodes due to High GC
Are you having iowait, GC pauses, or something else? Do you commit often or in one big batch?
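If the answer to the commit question above is "often", the auto-commit settings are one knob worth checking. A minimal sketch using the Config API, assuming a collection named "mycollection" and that these properties are among the editable ones on your Solr version; the timings are illustrative, not a recommendation from the thread:

# hard-commit at most once a minute without opening a searcher, and open new
# searchers (soft commit) at most every five minutes
curl http://localhost:8983/solr/mycollection/config -H 'Content-type:application/json' -d '{
  "set-property": {
    "updateHandler.autoCommit.maxTime": 60000,
    "updateHandler.autoCommit.openSearcher": false,
    "updateHandler.autoSoftCommit.maxTime": 300000
  }
}'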
Zookeeper KeeperErrorCode = NodeExists
Hi All. I am facing a weird issue while upgrading Solr 8.11 to Solr 9. I have everything up and running, passing all kinds of tests, unit and integration, on my current CD process. I have a cluster of 3 machines on SolrCloud and it's all good and working.

The problem happens when machines are restarted. Either 1 or 2 servers of the cluster can't connect to ZooKeeper, even though ZooKeeper reports as healthy and stable. If I restart Solr, the server can connect back to the cluster and gets healthy. I checked the logs and everything seems normal, except that the servers that try to connect to the cluster fail on start with the error below. I tried delaying the start of Solr a bit just in case, but no luck. Any help much appreciated.

Sergio

2024-06-20 12:56:42.944 INFO (main) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (2)
2024-06-20 12:56:43.003 INFO (main) [ ] o.a.s.c.DistributedClusterStateUpdater Creating DistributedClusterStateUpdater with useDistributedStateUpdate=false. Solr will be using Overseer based cluster state updates.
2024-06-20 12:56:43.056 INFO (main) [ ] o.a.s.c.ZkController Publish node=server03:8983_solr as DOWN
2024-06-20 12:56:43.088 INFO (main) [ ] o.a.s.c.ZkController Register node as live in ZooKeeper:/live_nodes/server03:8983_solr
2024-06-20 12:56:43.111 ERROR (main) [ ] o.a.s.c.ZkController => org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:125) ~[?:?]
	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1778) ~[?:?]
	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1650) ~[?:?]
	at org.apache.solr.common.cloud.SolrZkClient.lambda$multi$12(SolrZkClient.java:781) ~[?:?]
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:70) ~[?:?]
	at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:781) ~[?:?]
	at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:1211) ~[?:?]
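The failing call in the trace is the registration of the ephemeral /live_nodes entry, so a likely culprit is a stale ephemeral node left over from the previous ZooKeeper session that has not yet expired when the restarted node comes back. A quick way to check, sketched below, assuming the ensemble is reachable at zk1:2181 (include Solr's chroot in the address if you use one):

# while the failing node is down (or right after the failed start), list the live nodes
# ZooKeeper still holds; a lingering server03 entry here points at session-expiry timing
bin/solr zk ls /live_nodes -z zk1:2181

# compare Solr's zkClientTimeout with how quickly the machine restarts; if Solr comes back
# before the old session expires, a NodeExists on /live_nodes is exactly what you would see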
Solr replication delays in IndexFetcher
Hi, I'm using a traditional master/replica Solr (8.11) setup and I'm trying to tune the autoCommit and autoSoftCommit maxTime values on the Solr master and the pollInterval on the replicas to achieve better overall indexing throughput while still maintaining an acceptably low indexing latency on the replicas. The indexing latencies on the replicas are much longer than I would expect and I don't understand why, so I'm hoping someone here might have insights into the possible cause and what can be done about it.

On a test environment with a large amount of test data already indexed and replicated, I make one small update which causes a couple of documents in 3 Solr cores to be updated (one update request per core sent to Solr's API). The Solr master log file shows all three /update requests coming in at 13:10:30. The 3 indexing requests are all done WITHOUT explicitly specifying "commit=true" or "softCommit=true", i.e. only the auto commit max times specified in solrconfig.xml should affect when commits take place. Currently the autoCommit maxTime is set to 20000 and the autoSoftCommit maxTime is 2000, but I have also tried higher autoCommit maxTime values with similarly confusing results. I have a pollInterval of 00:00:10 on the replica.

When making the above index updates and issuing search queries against the replica, it takes several minutes before I get a corresponding search hit from the replica. In some cases 3-4 minutes, sometimes a bit less. I see the following strange behavior in the logs of the Solr replica.

The replica seems to notice something has changed after 25-26 seconds (OK, assuming autoCommit maxTime is 20 seconds and pollInterval is 10 seconds):

2024-06-20 13:10:56.768 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)
... most files being skipped, "Fetched and wrote" 15 files
2024-06-20 13:10:56.841 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=225681) : 0 secs (null bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)

So far so good, but this is only one of the three cores that was updated at 13:10:30. The second core is processed much later:

2024-06-20 13:11:12.370 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
2024-06-20 13:11:12.409 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=281548) : 15 secs (18769 bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)

and the third one even later:

2024-06-20 13:11:35.468 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
2024-06-20 13:11:35.498 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=221332) : 26 secs (8512 bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)

How can I get all updated cores to be replicated within 1 autoCommit maxTime + 1 pollInterval, or at the very least 2 autoCommit maxTime + 1 pollInterval? Right now it looks like only one core is replicated, then there are 15-25 seconds of nothing, then another core is replicated, then 15-25 seconds of nothing, and so on.

Kind regards,
Marcus
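One way to narrow down where the extra time goes, sketched below, is to poll the replication handler on each follower core by hand; the host name is an assumption, and the core name is reused from the log excerpts above:

# reports leader vs. follower index generation/version and the last replication times,
# which shows whether the follower simply has not seen a new hard commit yet
curl "http://replica-host:8983/solr/defcore/replication?command=details&wt=json"

# forces an immediate poll on this core, independent of pollInterval; if this picks up
# the update right away, the delay is in poll scheduling rather than in commit timing
curl "http://replica-host:8983/solr/defcore/replication?command=fetchindex"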
are bots DoS'ing anyone else's Solr?
Hi all,

the latest mole in the eternal whack-a-mole game with web crawlers (GPTBot) DoS'ed our Solr again and I took a closer look at the logs. Here's what it looks like is happening:

- the bot is hitting a URL backed by Solr search and starts following all permutations of facets and "next page"s at a rate of 60+ hits/second.
- Solr is not returning the results fast enough and the bot is dropping connections.
- An INFO message is logged: jetty is "unable to write response, client closed connection or we are shutting down" -- IOException on the OutputStream: Closed.

These go on for a while until:

java.nio.file.FileSystemException: $PATH_TO\server\solr\preview_shard1_replica_n2\data\tlog\buffer.tlog.800034318988100: The process cannot access the file because it is being used by another process.
-- Different file suffix # on every one of those

And eventually an update comes in and fails with:

ERROR (qtp173791568-23140) [c:preview s:shard1 r:core_node4 x:preview_shard1_replica_n2] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error logging add => org.apache.solr.common.SolrException: Error logging add
	at org.apache.solr.update.TransactionLog.write(TransactionLog.java:420)
org.apache.solr.common.SolrException: Error logging add
Caused by: java.io.IOException: There is not enough space on the disk
...

At this point Solr is hosed. The admin page shows "no collections available" but does respond to queries; all queries from the website client (.NET) are failing.

This is Solr 8.11.2 on Windows Server 2022 / Corretto JVM 11.

So, questions: has anyone else seen this?

Who is "buffer.tlog.xyz", do they have a size/file-count cap, and are they not getting GC'ed fast enough under this kind of load?

The 400 GB disk is normally ~90% empty, so "not enough space on the disk" does not sound right. The logs do pile up when this happens and Java starts dumping gigabytes of stack traces, but they add up to a few hundred MB at most. There certainly was *some* free space when I got to it, and it's back to 99% free after a Solr restart.

Any suggestions as to how to deal with this?

(Obviously, I added "Disallow: /" to robots.txt for GPTBot, but that's only good until the next bot comes along.)

TIA
Dima
Re: are bots DoS'ing anyone else's Solr?
Solr allows you to go to page=1000 or whatever, and bots will follow it, but there is rarely any business value in going that deep. You can come up with a scheme for cursorMarks + caching (faster than deep paging) or just stop showing results past page 5-10.
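For reference, a minimal sketch of the cursorMark approach mentioned above, assuming the uniqueKey field is "id" and reusing the "preview" collection name from the original post; the other parameter values are illustrative:

# cursor-based paging: sort must end on the uniqueKey field, and the first request uses cursorMark=*
curl -G "http://localhost:8983/solr/preview/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'rows=50' \
  --data-urlencode 'sort=score desc, id asc' \
  --data-urlencode 'cursorMark=*'
# each response includes a nextCursorMark; pass it back as cursorMark= to fetch the next page,
# which avoids the deep start= offsets that make bot-driven deep paging expensive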
Re: are bots DoS'ing anyone else's Solr?
I work in a library, so yes, we have a similar problem. Our Solr is used indirectly by a web application running on another server.

We use fail2ban (https://wiki.archlinux.org/title/fail2ban) to block IPs which exceed a given number of requests per minute.
Re: are bots DoS'ing anyone else's Solr?
+1 for fail2ban.

@Dmitri Maziuk, if your Solr is behind Apache httpd then you may be interested in mod_evasive, which worked well for XMLRPC attacks against WordPress. You can combine it with fail2ban: https://ejectdisc.org/2015/08/08/admin-a-wordpress-site-running-on-debian-linux-learn-how-to-protect-it-from-dos-xmlrpc-attacks-and-similar/

It sounds like your Solr is publicly exposed to the web. Yikes. An alternative is to change the port that it's running on to something non-standard and random; these bots scan for well-known ports. That's "security through obscurity" though, and you should ideally be running Solr behind some kind of web application firewall.
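A minimal sketch of operating the fail2ban side once a jail watching the front-end access log exists; the jail name "solr-bots" and the IP address are made up for illustration:

# show how many IPs the jail has banned and which filter/log it watches
fail2ban-client status solr-bots

# ban a crawler IP by hand while tuning the jail's maxretry/findtime thresholds
fail2ban-client set solr-bots banip 203.0.113.42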
Re: are bots DoS'ing anyone else's Solr?
On 6/20/24 11:17, Imran Chaudhry wrote:
> ...

If I were running on Linux I'd have them blocked with iptables-recent too... and if I were running on bare metal I'd put it on an SSD-cached ZVOL and likely not see Java choke on NIO under load. But I am not. :(

> It sounds like your Solr is publicly exposed to the web.

No:

>> - the bot is hitting a URL backed by Solr search and starts following all permutations of facets and "next page"s at a rate of 60+ hits/second.

By "URL backed by Solr search" I meant a page on the website. But anyway, it looks like it's not just us, it's a Solr feature. Good to know.

Thanks all,
Dima
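A minimal sketch of the iptables "recent" approach mentioned above, for the Linux case; the port, list name, and thresholds are illustrative, and a real deployment would usually whitelist known-good clients first:

# remember every new connection to the front end, then drop sources that open
# more than 20 new connections within 5 seconds
iptables -A INPUT -p tcp --dport 443 -m state --state NEW -m recent --name webbots --set
iptables -A INPUT -p tcp --dport 443 -m state --state NEW -m recent --name webbots --update --seconds 5 --hitcount 20 -j DROP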
Re: 150x+ performance hit when number of rows <= 50 in a simple query
I've been unable to reproduce anything like this behavior. If you're really getting queryResultCache hits for these, then the field type/etc. of the field you're querying on shouldn't make a difference; the type/etc. of the return field (product_id) would be more likely to matter. I wonder what would happen if you fully bypassed the query cache (i.e., `q={!cache=false}product_type:"1"`)?

I recall that previously you had a very large number of dynamic fields. Is that the case here as well? And if so, are the dynamic fields mostly stored? docValues?

On Fri, Jun 14, 2024 at 7:29 AM Oleksandr Tkachuk wrote:
>
> Initial data:
> Doc count: 1793026
> Field: "product_type", point int, indexed true, stored false, docValues true. Values:
> "facet_fields":{
>   "product_type":["3",1069282,"2",710042,"1",13702]
> },
> Single shard, single instance.
>
> # ./hey_linux_amd64 -n 1 -c 10 -T "application/json" 'http://localhost:8983/solr/XXX/select?fl=product_id&wt=json&q=product_type:"1"&start=0&rows=51'
> Summary:
>   Total:        0.6374 secs
>   Slowest:      0.0043 secs
>   Fastest:      0.0003 secs
>   Average:      0.0006 secs
>   Requests/sec: 15688.5755
>
> # ./hey_linux_amd64 -n 1 -c 10 -T "application/json" 'http://localhost:8983/solr/XXX/select?fl=product_id&wt=json&q=product_type:"1"&start=0&rows=50'
> Summary:
>   Total:        101.3246 secs
>   Slowest:      0.2048 secs
>   Fastest:      0.0564 secs
>   Average:      0.1007 secs
>   Requests/sec: 98.6927
>
> 1) I've already played with queryResultWindowSize and queryResultMaxDocsCached by setting different high and low values, and this is probably not what I'm looking for since it gave only a milliseconds-level difference in query performance.
> 2) Checked on different versions of Solr (9.6.1 and 8.7.0) - no significant changes.
> 3) Tried changing the field type to string - zero performance changes.
> 4) In both cases I see successful lookups in queryResultCache.
> 5) Enabling documentCache solves the problem in this case (rows<=50), but introduces many other performance issues, so it doesn't seem like a viable option.
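For reference, a concrete form of the cache-bypass suggestion above, reusing the collection name and parameters from the quoted test commands; whether it changes anything is exactly what the question is probing:

# same query as the slow rows=50 case, but with the queryResultCache bypassed for this request
curl -G 'http://localhost:8983/solr/XXX/select' \
  --data-urlencode 'q={!cache=false}product_type:"1"' \
  --data-urlencode 'fl=product_id' \
  --data-urlencode 'start=0' \
  --data-urlencode 'rows=50' \
  --data-urlencode 'wt=json'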
Re: How to bind embedded zookeeper to specific interface/ip?
For some historic reasons, Solr has always explicitly overridden the `clientPortAddress`, but as of a few versions ago there is a Solr setting (SOLR_ZK_EMBEDDED_HOST) that can be used to override Solr's override: https://solr.apache.org/guide/solr/latest/deployment-guide/taking-solr-to-production.html#security-considerations

If you're familiar with Java code, the code Solr uses when instructed to run an embedded ZK server (and the logic for how that server is configured) can be found in SolrZkServer: https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/cloud/SolrZkServer.java#L85-L122

-Hoss
http://www.lucidworks.com/
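A minimal sketch of using that setting, assuming the interface you want the embedded ZooKeeper to bind to is 192.168.1.10; the value would normally go in solr.in.sh (or solr.in.cmd on Windows):

# bind the embedded ZooKeeper's client port to a specific address instead of Solr's default override
SOLR_ZK_EMBEDDED_HOST=192.168.1.10
bin/solr start -c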
Re: 150x+ performance hit when number of rows <= 50 in a simple query
FYI: there is a solution in the last paragraph, but I still ran your tests, since the solution was found by "cut and try" and there is no deep understanding.

> I wonder what would happen if you fully bypassed the query cache (i.e., `q={!cache=false}product_type:"1"`)?

It does not help; there is not even one millisecond of difference in either case.

> I recall that previously you had a very large number of dynamic fields. Is that the case here as well? And if so, are the dynamic fields mostly stored? docValues?

This is another collection; I'll get to the one with many, many fields later :)) If this is the ~correct way to count the number of fields, then this collection has the following number of fields:

curl -s "http://localhost:8983/solr/XXX/admin/luke?numTerms=0" | grep '"type"' | wc -l
121

Of these, 88 have docValues enabled and 33 are stored. As for the two fields used in the query, here's how they are defined in the schema. Changing fl= to something like a string field with stored=true and without docValues results in zero changes. I also tried this simple query on string-type fields (copying the field) and got the same result. I also tried it on fields with different cardinality - the spread was not 150 times, but it was still often noticeable.

In addition, I still do not fully understand the logic of this behavior ("product_type":["3",1069282,"2",710042,"1",13702]) if I do:

1) q=product_type:"1" rows=50 - qtime 150ms
2) q=product_type:"1" rows=51 - qtime 0ms
3) q=product_type:"2" rows=50 - qtime 3ms
4) q=product_type:"2" rows=51 - qtime 0ms
5) q=product_type:"3" rows=50 - qtime 1ms
6) q=product_type:"3" rows=51 - qtime 0ms

I checked other fields and get the same behavior - the fewer documents that contain a given value, the slower the query becomes. If I can provide any more information, I will be glad to.

The problem was solved by turning off enableLazyFieldLoading. I am very surprised that this functionality is still active when the documentCache is disabled; I thought that parameter was intended only for it. In addition, we got an improvement in avg and 95th-percentile latency on many other types of queries, as well as some reduction in CPU load. Are there any consequences or disadvantages to this decision? If not, then perhaps it is worth paying attention to this problem.
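For reference, a minimal sketch of persisting that change through the Config API rather than hand-editing solrconfig.xml, assuming query.enableLazyFieldLoading is among the editable common properties on the Solr version in use and reusing the collection name from the thread:

# disable lazy field loading for the collection; the change is written to configoverlay.json
curl http://localhost:8983/solr/XXX/config -H 'Content-type:application/json' \
  -d '{"set-property": {"query.enableLazyFieldLoading": false}}'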