On Mon, Nov 21, 2022 at 7:02 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 11/21/22 15:01, gnandre wrote:
> > I am using Solr 8.5.2 in legacy mode (non-cloud).
> >
> > Some of the Solr nodes are automatically getting restarted after a few
> > days. There is no clear pattern to the rebooting time. Also, no pattern in
> > number of incoming queries or nature of those queries. No
> > particular pattern in errors found in Solr logs.
> >
> > I am going to turn on the debug logs to see what is happening in Solr when
> > it goes down. I am not able to reproduce the issue in one of our non-prod
> > performance testing environments. I am recreating the same traffic as prod
> > using access logs.
> >
> > Any other ideas about how I should go about debugging or reproducing this
> > issue? TIA.
>
> As shipped, if Solr dies, it will NOT restart automatically. So that
> must have been something you added.
>
> What OS do you have it running on?
>
> If everything is sized correctly, Solr will never crash. Java programs
> are VERY stable if written correctly and run with plenty of system
> resources.
>
> On non-windows systems, Solr starts with an option that will cause it to
> commit suicide if Java's OutOfMemoryError exception is thrown. There
> are several resource depletions that can cause OOME, and some of them
> are NOT related to memory. This capability will not exist on Windows
> until Solr 9.2.0 is released.
>
> https://issues.apache.org/jira/browse/SOLR-8803
>
> Most operating systems have a process that is called an "out of memory
> killer" ... if available memory gets too low, this will find a program
> on the system that is using a lot of memory and terminate it. On most
> installs, the process using the most memory will be Solr.
>
> I strongly recommend NOT restarting Solr automatically if it ever dies.
> Chances are that the reason it died is because the system needs some
> attention, and restarting it is simply going to result in it dying
> again, over and over.
>
> Thanks,
> Shawn
>
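I strongly agree here -- check your system logs for OOM-killer activity and for restarts issued by the operating system or systemd (or whatever service manager you use). On a lark, also check the uptime of the entire box: if it resets around the same time Solr does, the whole machine is rebooting, not just Solr.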
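
If it helps, here is a rough sketch of that check in Python, assuming a Linux box. The function names are mine, and the regex only covers the usual kernel wording ("Out of memory: Kill process ..." on older kernels, "Killed process ..." on newer ones), so treat a clean result as a hint rather than proof.

```python
import re
import sys

# Assumption: the kernel log uses the common OOM-killer wording.
# Older kernels print "Kill process", newer ones "Killed process".
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) for every OOM-killer victim in the log lines."""
    hits = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


def uptime_days(proc_uptime_text):
    """Convert /proc/uptime contents (first field is seconds since boot) to days."""
    return float(proc_uptime_text.split()[0]) / 86400.0


def main():
    # Reads kernel log text from stdin and the box uptime from /proc/uptime.
    for pid, name in find_oom_kills(sys.stdin):
        print(f"OOM killer terminated pid {pid} ({name})")
    with open("/proc/uptime") as handle:
        print(f"box uptime: {uptime_days(handle.read()):.1f} days")
```

To use it as a script, add a call to main() at the bottom and feed it the kernel log, for example the output of `journalctl -k --no-pager` or `dmesg`. Note that dmesg only covers the current boot, so if the box itself rebooted you will need logs that survive a restart. A "java" victim shortly before a Solr restart points at the OS killer; an uptime shorter than expected points at the whole machine going down.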