Hi All,

We have a cluster of 30 nodes, and each node holds 750 GB of data. There are 420 shards. Shards and data are evenly distributed across all nodes.

JVM settings:
JDK: Amazon.com Inc. OpenJDK 64-Bit Server VM 17.0.1 17.0.1+12-LTS
Processors: 48
JVM args:
  -DSTOP.KEY=solrrocks
  -DSTOP.PORT=7983
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.local.only=false
  -Dcom.sun.management.jmxremote.port=8986
  -Dcom.sun.management.jmxremote.ssl=false
  -Denable.packages=true
  -Denable.runtime.lib=true
  -Djava.net.preferIPv4Stack=true
  -Djetty.home=/prod/solrCI/8.11.1-191/solr-8.11.1/server
  -Djetty.port=8983
  -Djute.maxbuffer=10000000
  -Dsolr.data.home=
  -Dsolr.data.home=/prod/solr_data/inst1
  -Dsolr.default.confdir=/prod/solrCI/8.11.1-191/solr-8.11.1/server/solr/configsets/_default/conf
  -Dsolr.environment=prod,label=PROD2+PRODUCTION,color=#c9fdd6
  -Dsolr.install.dir=/prod/solrCI/8.11.1-191/solr-8.11.1
  -Dsolr.jetty.inetaccess.excludes=
  -Dsolr.jetty.inetaccess.includes=
  -Dsolr.log.dir=/prod/solrCI/8.11.1-191/solr-8.11.1/server/logs
  -Dsolr.solr.home=/prod/solr_home/inst1
  -Duser.timezone=UTC
  -DzkClientTimeout=30000
  -DzkHost=<zookeeper_string>
  -XX:+UseNUMA
  -XX:+UseZGC
  -XX:-OmitStackTraceInFastThrow
  -XX:CompileCommand=exclude,com.github.benmanes.caffeine.cache.BoundedLocalCache::put
  -XX:OnOutOfMemoryError=/prod/solrCI/8.11.1-191/solr-8.11.1/bin/oom_solr.sh 8983 /prod/solrCI/8.11.1-191/solr-8.11.1/server/logs
  -XX:SoftMaxHeapSize=64g
  -Xlog:gc*:file=/prod/solrCI/8.11.1-191/solr-8.11.1/server/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
  -Xms88g
  -Xmx88g
  -Xss256k

What we observe is that only one node shows high heap usage, while the other nodes stay well below the threshold, as shown in the attached image (image001.png).

Even if we bounce that node or the entire cluster, the same issue comes back, and it is always the same node that reports high heap usage. We also tried reloading the collection, but that did not help. It is also odd that only this one node is affected, and sometimes it just dies. We compared that machine with all the other machines and made sure nothing is different.
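
One thing that could be checked here is whether the affected node hosts more replicas or shard leaders than the others. Below is a minimal sketch, assuming the JSON response of the Collections API CLUSTERSTATUS action (/solr/admin/collections?action=CLUSTERSTATUS&wt=json) has already been fetched and parsed into a Python dict. The function name count_per_node is hypothetical and not part of Solr.

```python
from collections import Counter


def count_per_node(cluster_status):
    """Count replicas and shard leaders per node.

    `cluster_status` is assumed to be the parsed JSON body of a
    CLUSTERSTATUS response. Returns (replicas, leaders), two Counters
    keyed by the replica's node_name.
    """
    replicas = Counter()
    leaders = Counter()
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for collection in collections.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                node = replica.get("node_name", "unknown")
                replicas[node] += 1
                # Solr reports the leader flag as the string "true";
                # non-leader replicas usually omit the key entirely.
                if str(replica.get("leader", "false")).lower() == "true":
                    leaders[node] += 1
    return replicas, leaders
```

Sorting the two counters (for example with leaders.most_common()) would show whether one node stands out. An even replica count with a skewed leader count might be worth a closer look, though it would not by itself explain the heap usage.
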
If anyone has any pointers, they would be greatly appreciated. Please let me know if you need more information.

Thanks,
Jigar Gajjar