On 1/23/23 07:38, Dominique Bejean wrote:
On a SolrCloud 7.7 environment with 14 servers, we have one collection with
1 billion documents.
Sharding is 7 shards x 2 replicas (TLOG)
Each solr server hosts one replica.

Indexing and searching are permanent.

No idea what "permanent" could mean here.

Suddenly one of the servers has CPU usage growing for 30 minutes.
Sometimes, for a few minutes, the CPU usage decreases on this node and
increases on other nodes.
Here is a screenshot of CPU monitoring
https://drive.google.com/file/d/1Fp9oiZ8Sl7hb97utN2JRIm7dJKh0St3H/view?usp=share_link

What CPU characteristic does each of those colors represent? Especially the dark purple. The image doesn't have that info.

WARN logs do not provide any relevant information
Customer did not generate thread dump.

How about ERROR logs? Or any other severity? Have you looked through the solr.log to see what requests were being handled at the time the problem started and/or ended? Is there software other than Solr on the same machine? Did you get a look at process performance info on the machine while it was happening ... something like top for *NIX, or resource monitor on Windows?
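
For looking through solr.log around the time the problem started, a small filter like the following may help. It is only a sketch: it assumes each log entry begins with a "yyyy-MM-dd HH:mm:ss.SSS" timestamp, as in Solr's stock log4j2 pattern, and the function name lines_in_window is made up for illustration. Adjust the format if your logging config differs.

```python
from datetime import datetime

# Assumption: entries start with "yyyy-MM-dd HH:mm:ss.SSS" (stock Solr pattern).
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LEN = 23  # length of a timestamp such as "2023-01-23 10:40:00.000"


def lines_in_window(lines, start, end):
    """Yield log lines whose timestamp falls within [start, end].

    Lines without a leading timestamp (stack traces, wrapped messages)
    are kept only when the entry they follow was kept.
    """
    keep = False
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LEN], TS_FORMAT)
            keep = start <= ts <= end
        except ValueError:
            pass  # continuation line: inherit the previous decision
        if keep:
            yield line
```

Feeding it an open solr.log file object plus two datetime values bracketing the CPU spike gives you every request, warning, and error logged in that window, including stack traces.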

Any idea of what tasks can generate this kind of CPU behaviour?

A huge merge on a shard leader wouldn't take that long, and only one node
would have to synchronize, not all of them.

Have you asked them what they started doing between 10:40 and 10:50? Do you have other performance graphs like number of queries per second, number of update requests per second, disk utilization, Java memory characteristics, and so on?

It's difficult to say what the problem might be from just a CPU graph.

Does the problem recur? If not, and that CPU graph is all you have from the event, it might not be possible to get to the root cause.
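
If it does recur, a thread dump taken during the spike, together with per-thread CPU numbers (for example from top -H on Linux), would usually narrow things down. The sketch below maps the decimal thread IDs that top shows to Java thread names in a dump. It assumes HotSpot-style jstack output where each thread header carries a hexadecimal nid=0x... field (the format on the Java 8 and 11 versions that Solr 7.7 typically runs on); the helper name threads_by_os_id is hypothetical.

```python
import re

# Assumption: HotSpot-style thread header lines, for example
#   "qtp123-45" #45 prio=5 os_prio=0 tid=0x00007f... nid=0x1a2b runnable [...]
THREAD_LINE = re.compile(r'^"(?P<name>.*)".*\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def threads_by_os_id(jstack_text):
    """Map OS thread id (decimal, as top -H shows it) to Java thread name."""
    result = {}
    for line in jstack_text.splitlines():
        m = THREAD_LINE.match(line)
        if m:
            # nid is hexadecimal in the dump; top reports the same id in decimal
            result[int(m.group("nid"), 16)] = m.group("name")
    return result
```

Looking up the busiest thread IDs from top in that mapping tells you whether the CPU is going to query threads, merge threads, garbage collection, or something else.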

Thanks,
Shawn
