In the logs I can see this WARN:

2021-08-30 13:15:52.301 WARN (zkCallback-12-thread-3) [c:quoteStore s:shard1 r:core_node6 x:quoteStore_shard1_replica_n5] o.a.s.c.RecoveryStrategy Stopping recovery for core=[quoteStore_shard1_replica_n5] coreNodeName=[core_node6]
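The bracketed fields in that line are Solr's logging context: c is the collection, s the shard, r the replica's core node name, and x the core. A minimal Python sketch for pulling those fields out of a log line, so WARN/ERROR lines can be grouped by replica; the function name and regular expressions are illustrative assumptions, not anything Solr ships:

```python
import re

# Matches the space-separated "key:value" pairs inside the bracketed context.
MDC_PAIR = re.compile(r"(\w+):(\S+)")

def parse_solr_context(line):
    """Return the key/value pairs from the first [...] block of a Solr log line."""
    match = re.search(r"\[([^\]]*)\]", line)
    if match is None:
        return {}
    return dict(MDC_PAIR.findall(match.group(1)))

# The WARN line quoted above, split only for readability.
line = (
    "2021-08-30 13:15:52.301 WARN (zkCallback-12-thread-3) "
    "[c:quoteStore s:shard1 r:core_node6 x:quoteStore_shard1_replica_n5] "
    "o.a.s.c.RecoveryStrategy Stopping recovery for "
    "core=[quoteStore_shard1_replica_n5] coreNodeName=[core_node6]"
)

context = parse_solr_context(line)
print(context["c"], context["s"], context["r"], context["x"])
```

Lines logged outside any core (the ones showing an empty `[ ]` context) come back as an empty dict, so they can be counted separately from replica-specific messages.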
On Mon, Aug 30, 2021 at 6:43 PM HariBabu kuruva <hari2708.kur...@gmail.com> wrote:

> Hi Zisis,
>
> Thanks for your email.
>
> We are suspecting the issue with one particular solr collection(or store). Wherever the replicas of that store are present that nodes are going down.
>
> Also now that shard is in recovery mode and Leader is not elected. Could you please suggest something to bring up this store.
>
> On Mon, Aug 30, 2021 at 1:55 PM Zisis Tachtsidis <zist...@runbox.com> wrote:
>
>> My guess is that the Solr/Zookeeper communication issues are due to GC pauses. You are saying that you end up with OOM problems. High memory usage puts pressure on GC. Long GC pauses lead to timeouts in Solr/Zookeeper communication. We've seen that happening.
>>
>> First thing I'd do is to get a heap dump once the OOM is triggered and analyze that to see what is occupying the memory. Otherwise we are blind here. Is it due to heavy indexing? Heavy querying? Both of them? Do you have customizations in the analysis chain that might generate more objects than usual?
>>
>> Zisis
>>
>> On Mon, 30 Aug 2021 03:44:13 -0400, Dave <hastings.recurs...@gmail.com> wrote:
>>
>> > I can’t help beyond such, I don’t like solr cloud nor zookeeper, I will always, if I can help it, stick to standalone solr instance.
>> >
>> > > On Aug 30, 2021, at 3:23 AM, HariBabu kuruva <hari2708.kur...@gmail.com> wrote:
>> > >
>> > > Hi Dave
>> > >
>> > > We tried setting the memory as per your suggestions.
>> > >
>> > > But still I see that the solr is going down in a couple of minutes with an OOM error. Also in the solr logs it says below connectivity issue between solr and zookeeper. Please advise.
>> > >
>> > > Zookeeper is running fine.
>> > >
>> > >
>> > > 2021-08-30 06:24:13.070 WARN (main-SendThread(lxeisprdas06.corp.equinix.com:2181)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 65584ms for session id 0x1000019354b021b
>> > > 2021-08-30 06:24:13.071 WARN (main-SendThread(lxeisprdas06.corp.equinix.com:2181)) [ ] o.a.z.ClientCnxn Session 0x1000019354b021b for sever lxeisprdas06.corp.equinix.com/10.**.*.*:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. =>
>> > > org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 65584ms for session id 0x1000019354b021b
>> > > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243)
>> > > org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 65584ms for session id 0x1000019354b021b
>> > > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) ~[zookeeper-3.6.2.jar:3.6.2]
>> > > 2021-08-30 06:24:26.182 ERROR (qtp1198197478-540) [ ] o.a.s.s.PKIAuthenticationPlugin Invalid key request timestamp: 1630304577209 , received timestamp: 1630304666181 , TTL: 15000
>> > > 2021-08-30 06:24:26.182 ERROR (qtp1198197478-531) [ ] o.a.s.s.PKIAuthenticationPlugin Invalid key request timestamp: 1630304527726 , received timestamp: 1630304600766 , TTL: 15000
>> > > 2021-08-30 06:26:36.014 WARN (zkConnectionManagerCallback-13-thread-1) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@e31302e name: ZooKeeperConnection Watcher:zookeeper2.corp.equinix.com:2181,zookeeper1.corp.equinix.com:2182,zookeeper3.corp.equinix.com:2183,zookeeper4.corp.equinix.com:2184,zookeeper5.corp.equinix.com:2185 got event WatchedEvent state:Disconnected type:None path:null path: null type: None
>> > > 2021-08-30 06:26:36.014 WARN (zkConnectionManagerCallback-13-thread-1) [ ] o.a.s.c.c.ConnectionManager zkClient has disconnected
>> > > 2021-08-30 07:06:32.484 WARN (main-SendThread(zookeeper5.corp.equ.com:2185)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 1851316ms for session id 0x1000019354b021b
>> > >
>> > >> On Sun, Aug 29, 2021 at 11:38 PM Dave <hastings.recurs...@gmail.com> wrote:
>> > >>
>> > >> Yes. Don’t set those memory restrictions, just xms and xmx, both to 31 gigs. Java has problems past that line and will make the gc go into a bad loop. I can send you a link as to why
>> > >>
>> > >> https://community.datastax.com/questions/3661/why-is-a-32-gb-heap-allocation-not-recommended.html
>> > >>
>> > >> But this is almost like a protected secret
>> > >>
>> > >>>> On Aug 29, 2021, at 1:52 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>> > >>>
>> > >>> On 8/29/2021 2:38 AM, HariBabu kuruva wrote:
>> > >>>> Is it required to define both the parameters SOLR_HEAP and SOLR_JAVA_MEM.
>> > >>>> or can i comment SOLR_HEAP and only define SOLR_JAVA_MEM.
>> > >>>> Also what highest value of Xmx value i can go if i receive OOM with 31gb.
>> > >>>> I have only solr running on that node.
>> > >>>
>> > >>> If both are defined, I do not know which one will actually take effect. Figuring that out would require looking at the startup script and doing some experiments to see what Java actually does.
>> > >>>
>> > >>> I would personally remove SOLR_JAVA_MEM and only go with SOLR_HEAP. Then you can do something very simple like the following, and the Solr startup script will set both -Xms and -Xmx java options to that value:
>> > >>>
>> > >>> SOLR_HEAP=4g
>> > >>>
>> > >>>> And could you please let me know the reason to disable swap memory.
>> > >>>
>> > >>> If a system starts actively swapping, its performance in general will be extremely low. If that happens, it is an indication that there is not enough physical memory and the system needs more, or that configurations need to be adjusted to require less memory.
>> > >>>
>> > >>> Disabling swap makes it impossible for the OS to try and use disk space as memory. In situations where programs are asking for too much memory and you have swap completely disabled, either Java or the OS will simply kill the process that's asking for too much memory, rather than letting it run and destroy overall performance.
>> > >>>
>> > >>> ---
>> > >>>
>> > >>> Responding to something in the OP:
>> > >>>
>> > >>> It is completely normal to see 100 percent memory utilization on just about any server, whether it's running Solr or not. The OS will use all available memory for caching purposes, to speed everything up. The only time you won't see 100 percent memory usage is when you have far more memory than the system actually needs. For instance, if you had 512GB of memory on a system that only handles megabytes of data.
>> > >>>
>> > >>> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems
>> > >>>
>> > >>> (disclaimer: I wrote the wiki page linked here. Any errors are mine.)
>> > >>>
>> > >>> Thanks,
>> > >>> Shawn
>> > >>
>> > >
>> > >
>> > > --
>> > >
>> > > Thanks and Regards,
>> > > Hari
>> > > Mobile:9790756568
>>
>>
>
> --
>
> Thanks and Regards,
> Hari
> Mobile:9790756568
>

--
Thanks and Regards,
Hari
Mobile:9790756568