This deserves a JIRA ticket, please. (I assume the sending host is randomly choosing the bad IP and blocking on it for some period of time, causing other tasks to pile up, but it should be investigated as a regression.)
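For anyone hitting the same symptom, here is a minimal sketch of the workaround Gil describes below (drop the stale address from the seed list on every node, then reload seeds). The cassandra.yaml path, the plain-IP seed format, and port 7000 (taken from the logs further down) are assumptions and may differ per install:

    # Check each configured seed for reachability on the storage port before a rolling restart.
    SEEDS=$(grep -Eo 'seeds: *"[^"]*"' /etc/cassandra/cassandra.yaml | sed 's/.*"\(.*\)"/\1/')
    for seed in ${SEEDS//,/ }; do
        ip=${seed%%:*}                       # strip an optional :port suffix
        nc -z -w 3 "$ip" 7000 && echo "$ip reachable" || echo "$ip UNREACHABLE - stale seed?"
    done
    # After removing the stale entry from cassandra.yaml on all nodes:
    nodetool reloadseeds                     # 4.0+; re-reads the seed list without a restart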
On Tue, Jun 7, 2022 at 7:52 AM Gil Ganz <gilg...@gmail.com> wrote:

> Yes, I know the issue with the peers table, we had it in different clusters; in this case it appears the cause of the problem was indeed a bad IP in the seed list.
> After removing it from all nodes and reloading seeds, running a rolling restart does not cause any gossip issues, and in general the number of gossip pending tasks is 0 all the time, vs jumping to 2-5 pending tasks every once in a while before this change.
>
> Interesting that this bad IP didn't cause an issue in 3.11.9; I guess something in the way gossip works in C* 4 made it cause a real issue after the upgrade.
>
> On Tue, Jun 7, 2022 at 12:04 PM Bowen Song <bo...@bso.ng> wrote:
>
>> Regarding the "ghost IP", you may want to check the system.peers_v2 table by doing "select * from system.peers_v2 where peer = '123.456.789.012';"
>>
>> I've seen this (non-)issue many times, and I had to do "delete from system.peers_v2 where peer=..." to fix it, as on our client side the Python cassandra-driver reads the token ring information from this table and uses it for routing requests.
>>
>> On 07/06/2022 05:22, Gil Ganz wrote:
>>
>> The only errors I see in the logs prior to the gossip pending issue are things like this:
>>
>> INFO [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833 NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel] failed to connect
>> io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: /Y:7000
>> Caused by: java.net.ConnectException: finishConnect(..) failed: No route to host
>>     at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
>>     at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
>>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
>>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
>>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
>>     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
>>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>     at java.lang.Thread.run(Thread.java:748)
>>
>> The remote IP mentioned here is one that appears in the seed list (there are 20 other valid IP addresses in the seed clause), but it is no longer valid; it is an old IP of an existing server (it's not in the peers table). I will try to reproduce the issue with this IP removed from the seed list.
>>
>> On Mon, Jun 6, 2022 at 9:39 PM C. Scott Andreas <sc...@paradoxica.net> wrote:
>>
>>> Hi Gil, thanks for reaching out.
>>>
>>> Can you check Cassandra's logs to see if any uncaught exceptions are being thrown? What you described suggests the possibility of an uncaught exception being thrown in the Gossiper thread, preventing further tasks from making progress; however, I'm not aware of any open issues in 4.0.4 that would result in this.
>>>
>>> Would be eager to investigate immediately if so.
>>>
>>> – Scott
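A rough sketch of the log check Scott asks for above; the log path is the packaged default, and "Uncaught exception" as a search string is an assumption about the log wording, so adjust both for your install:

    # Look for dead threads and other errors around the time the gossip backlog starts.
    grep -E 'ERROR|Uncaught exception' /var/log/cassandra/system.log | tail -n 100
    # Gossip-related log lines, if any:
    grep -iE 'GossipStage|Gossiper' /var/log/cassandra/system.log | tail -n 50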
>>> On Jun 6, 2022, at 11:04 AM, Gil Ganz <gilg...@gmail.com> wrote:
>>>
>>> Hey,
>>> We have a big cluster (>500 nodes, on-prem, multiple datacenters, most with vnodes=32, but some with 128) that was recently upgraded from 3.11.9 to 4.0.4. Servers are all CentOS 7.
>>>
>>> We have been dealing with a few issues related to gossip since then:
>>> 1 - The moment the last node in the cluster was up with 4.0.4 and all nodes were on the same version, gossip pending tasks started to climb to very high numbers (>1M) on all nodes in the cluster, and the cluster quickly became practically down. It took us a few hours of stopping/starting nodes, and adding more nodes to the seed list, to finally get the cluster back up.
>>> 2 - We notice that pending gossip tasks go up to very high numbers (50k) on random nodes in the cluster, without any meaningful event having happened, and it doesn't look like they will go down on their own. After a few hours we restart those nodes and the count goes back to 0.
>>> 3 - Doing a rolling restart of a list of servers is now an issue; more often than not, one of the nodes we restart comes up with gossip issues, and we need a second restart to get the gossip pending tasks back to 0.
>>>
>>> Is there a known issue related to gossip in big clusters in recent versions?
>>> Is there any tuning that can be done?
>>>
>>> Just to give a sense of how big the gossip information in this cluster is: "nodetool gossipinfo" output size is ~300kb.
>>>
>>> gil
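As a closing note, a small sketch of how the numbers Gil quotes can be watched with stock nodetool; nothing here is specific to this cluster, and the grep patterns are assumptions about the tpstats output layout:

    # GossipStage "Pending" is the gossip pending tasks figure discussed in this thread.
    nodetool tpstats | grep -iE 'Pool Name|Gossip'
    # Rough size of the serialized gossip state (Gil reports ~300kb here):
    nodetool gossipinfo | wc -c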