If you're correct that the issue you linked to is the bug you're hitting, then it was fixed in 3.11.3, and you may have no choice but to upgrade. From the discussion, it doesn't read as if any tuning tweaks avoided the issue; only the patch fixed it.
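As a quick sanity check before and after any upgrade, nodetool version reports the release each node is actually running. A minimal sketch, assuming nodetool is on the path of the Cassandra user; the output line is illustrative only:

    nodetool version
    # e.g. ReleaseVersion: 3.11.0  (the reply above notes the fix landed in 3.11.3)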
If you do, I'd suggest going to at least 3.11.5. Note that usable memory for a heap setting above 31 GB may not be what you think. At 32 GB you cross a boundary that triggers object pointers to double in size. The only way you really win is when an app has only a modest number of objects, but some of those objects have large non-object-granularity allocations, e.g. a few huge byte arrays. C* does use some large buffers, but it also generates a lot of small objects.

I'd consider TCP tunings a likely red herring here, if you are correct about the leak. That doesn't mean you can't have better settings per the suggestions made, just that it seems like a case of refining behavior on the periphery of the problem, not anything directly addressing it.

From: Surbhi Gupta <surbhi.gupt...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Saturday, May 9, 2020 at 11:51 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Bootstraping is failing

I tried changing the heap size from 31 GB to 62 GB on the bootstrapping node because I noticed that when it reached the midway point of bootstrapping, the heap reached around 90% or more and the node just froze. But the behavior is still the same: it again reached the midway point, the heap again hit 90% or more, the node froze, none of the nodetool commands returned output, and the other nodes removed this node from joining because they were unable to gossip with it.

We are on 3.11.0. I took a heap dump when the node was at 90%+ utilization of the 62 GB heap and opened the leak report. It found three leak suspects, and two of them were as below:

1. The thread io.netty.util.concurrent.FastThreadLocalThread @ 0x7fbe9533bf98 StreamReceiveTask:26 keeps local variables with total size 16,898,023,552 (31.10%) bytes. The memory is accumulated in one instance of "io.netty.util.Recycler$DefaultHandle[]" loaded by "sun.misc.Launcher$AppClassLoader @ 0x7fb917c76dc8".

2. The thread io.netty.util.concurrent.FastThreadLocalThread @ 0x7fbb846fb800 StreamReceiveTask:29 keeps local variables with total size 11,696,214,424 (21.53%) bytes. The memory is accumulated in one instance of "io.netty.util.Recycler$DefaultHandle[]" loaded by "sun.misc.Launcher$AppClassLoader @ 0x7fb917c76dc8".

Am I getting hit by https://issues.apache.org/jira/browse/CASSANDRA-13929?

I haven't changed the tcp settings. My tcp settings are already above the recommended values. What I wanted to understand is how tcp settings can affect the bootstrapping process.

Thanks
Surbhi

On Thu, 7 May 2020 at 17:01, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

When we start the node, it starts bootstrap automatically and restreams all the data again. It does not resume.
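On the 32 GB boundary mentioned at the top of this reply: a quick way to check whether a given -Xmx still gets compressed object pointers on a HotSpot 8 JVM (the stack trace below shows 1.8.0_242) is to print the resolved flag. A minimal sketch, assuming the same java binary Cassandra runs with; the outputs are typical, not guaranteed:

    java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
    # usually reports true: 31g stays under the boundary, so object pointers remain compressed (4 bytes)
    java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
    # usually reports false: at 32g the JVM disables compressed oops and pointers double to 8 bytes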
On Thu, May 7, 2020 at 4:47 PM Adam Scott <adam.c.sc...@gmail.com> wrote:

I think you want to run `nodetool bootstrap resume` (https://cassandra.apache.org/doc/latest/tools/nodetool/bootstrap.html) to pick up where it last left off. Sorry for the late reply.

On Thu, May 7, 2020 at 2:22 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

So after a failed bootstrap, if we start Cassandra again on the new node, will it resume the bootstrap or will it start over?

On Thu, 7 May 2020 at 13:32, Adam Scott <adam.c.sc...@gmail.com> wrote:

I recommend it on all nodes. This will eliminate that as a source of trouble further on down the road.

On Thu, May 7, 2020 at 1:30 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

streaming_socket_timeout_in_ms is 24 hours. So should the tcp settings be changed on the new bootstrapping node or on all nodes?

On Thu, 7 May 2020 at 13:23, Adam Scott <adam.c.sc...@gmail.com> wrote:

Edit /etc/sysctl.conf:

net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10

then run sysctl -p to cause the kernel to reload the settings. 5 minutes (300 seconds) is probably too long. (A consolidated sketch of these keepalive and streaming settings follows the stack trace at the end of this thread.)

On Thu, May 7, 2020 at 1:09 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

[root@abc cassandra]# cat /proc/sys/net/ipv4/tcp_keepalive_time
300
[root@abc cassandra]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
30
[root@abc cassandra]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

On Thu, 7 May 2020 at 12:32, Adam Scott <adam.c.sc...@gmail.com> wrote:

Maybe a firewall is killing a connection? What does the following show?

cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes

On Thu, May 7, 2020 at 10:31 AM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

Hi,

We are trying to expand a datacenter and add nodes, but when a node is bootstrapping, it gets about halfway through and then fails with the error below. We increased the stream throughput from 200 to 400 when we tried the second time, but it still failed. We are on 3.11.0, using G1GC with a 31 GB heap.
ERROR [MessagingService-Incoming-/10.X.X.X] 2020-05-07 09:42:38,933 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.X.X.X,main]
java.io.IOError: java.io.EOFException: Stream ended prematurely
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:227) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:215) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:839) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:814) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.0.jar:3.11.0]
Caused by: java.io.EOFException: Stream ended prematurely
        at net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:218) ~[lz4-1.3.0.jar:na]
        at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:150) ~[lz4-1.3.0.jar:na]
        at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:117) ~[lz4-1.3.0.jar:na]
        at java.io.DataInputStream.readFully(DataInputStream.java:195) ~[na:1.8.0_242]
        at java.io.DataInputStream.readFully(DataInputStream.java:169) ~[na:1.8.0_242]
        at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:402) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:437) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:245) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.readComplexColumn(UnfilteredSerializer.java:665) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:606) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1242) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1197) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.Columns.apply(Columns.java:377) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:600) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeOne(UnfilteredSerializer.java:475) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:431) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer$1.computeNext(UnfilteredRowIteratorSerializer.java:222) ~[apache-cassandra-3.11.0.jar:3.11.0]
        ... 11 common frames omitted

Thanks
Surbhi
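The consolidated sketch referenced above, pulling together the networking and streaming settings discussed earlier in the thread. These are only the values already mentioned by Adam and Surbhi, shown in one place; per the reply at the top of the thread they are tuning on the periphery, not a fix for the suspected leak. The 86400000 figure is simply 24 hours expressed in milliseconds, and file locations assume a standard install:

    # /etc/sysctl.conf -- TCP keepalive values Adam suggested, applied on all nodes
    net.ipv4.tcp_keepalive_time=60
    net.ipv4.tcp_keepalive_probes=3
    net.ipv4.tcp_keepalive_intvl=10
    # reload kernel settings without a reboot
    sysctl -p

    # cassandra.yaml -- streaming socket timeout already set to 24 hours (24 * 60 * 60 * 1000 ms)
    streaming_socket_timeout_in_ms: 86400000

    # stream throughput was raised from 200 to 400 for the second attempt;
    # the same change can be applied at runtime on a node with:
    nodetool setstreamthroughput 400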