On 10/21/2023 2:31 AM, Ing. Andrea Vettori wrote:
Hello, we’re using two SOLR servers (same hw, same version of solr and java, same solr config). The SOLR version is 9.3 and the JVM is Adoptium JDK 17.0.8.1 on Linux. Both have been running fine for a couple of years (we upgraded from SOLR 8 to 9 with full reindexing some time ago).
DEB and RPM-based distros typically make it very easy to install OpenJDK out of the box, so there is no need to download something like Adoptium:
  sudo apt -y install openjdk-17-jdk
or
  sudo yum install java-17-openjdk
Yesterday one of the servers died with a JVM crash with the following reason (I have the full JVM trace if needed). Once restarted, the server ran fine, received data updates every 15 minutes, and responded to queries during the day. Today the server died around the same time with the same JVM trace. Both crashes happened early in the morning, when we upload a lot of data. During the day the updates are smaller. One strange thing is that only one of the servers died; the other one is running fine and is receiving the same data. Another thing to note is that in solrconfig we still had the “old” caches of SOLR 8 configured. Two days ago we changed the configuration to use CaffeineCache on one of the four cores (the biggest one). Not sure if it’s related but the timing is suspicious… but why would it crash on only one of the servers, since they’re both identical in configuration, version and hardware? Anyway, I replaced solrconfig with the old configuration to see what happens tomorrow.
Sig11 crashes that are confined to one system usually indicate bad hardware. It could be a bad DIMM, a bad motherboard, or a bad CPU ... in that order, with the DIMM being the most likely problem.
Sig11 can be caused by badly written software, or a corrupted software install, but most systems these days have mechanisms to protect against software getting corrupted during install. If the on-disk binaries get corrupted after install, that also points to bad hardware.
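If you want to compare the two crashes before swapping hardware, a small shell sketch like the one below can pull the signal line and the "Problematic frame" entry out of each HotSpot crash log. It assumes the logs follow the usual hs_err_pid*.log header layout; the function name crash_summary and the directory in the usage line are made up for illustration.

```shell
#!/bin/sh
# Sketch only: summarize one HotSpot fatal error log (hs_err_pid<pid>.log).
# crash_summary is a hypothetical helper name, not part of Java or Solr.
crash_summary() {
    log="$1"
    # The header contains a line such as:
    #   "#  SIGSEGV (0xb) at pc=..., pid=..., tid=..."
    grep -E '^#[[:space:]]+SIG[A-Z0-9]+ \(' "$log" | head -n 1
    # The frame itself is on the line right after "# Problematic frame:"
    awk '/^# Problematic frame:/ { getline; print; exit }' "$log"
}
```

Usage would be something like: for f in /path/to/crash/logs/hs_err_pid*.log; do crash_summary "$f"; done. If the problematic frame is different from one crash to the next, I would lean even more toward hardware than toward one specific software bug.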
Solr 9.3 includes the workaround for the caffeine-related Java crash, and the version of Java that you are running doesn't have that bug anyway.
If there is a software problem here, it is most likely either in Java or Linux, and I would bet on Java more than Linux. I would recommend that you remove Adoptium, install OpenJDK, and see if that helps at all. You could also try wiping the system and reinstalling from scratch ... but if a new Java install doesn't fix it, I would bet on bad hardware.
I have Solr running with OpenJDK 17 on an Ubuntu 22 instance in AWS. It has never had a problem that wasn't user error or a Solr bug.
Thanks, Shawn