1% packet loss can definitely lead to drops. At higher speeds, that's enough to limit TCP throughput to the point that cross-node communication can't keep up. The BBR congestion control algorithm (TCP BBR) will do better than loss-based strategies at maintaining high throughput despite single-digit packet loss, but you'll also want to track down the actual cause.
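To put a rough number on that, here is a back-of-the-envelope sketch using the well-known Mathis et al. approximation for loss-based TCP, throughput <= (MSS / RTT) * (C / sqrt(p)). The MSS of 1460 bytes and the 1 ms round-trip time are assumptions for illustration, not measurements from your cluster, and the model describes Reno-style congestion control only (not BBR), so treat the result as an order-of-magnitude estimate.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on a single loss-based TCP flow, in bits/s.

    Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p)),
    with C ~= 1.22 for Reno-style congestion control.
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

# Assumed values for illustration only: 1460-byte MSS, 1 ms RTT.
for loss in (0.0001, 0.001, 0.01):
    mbps = mathis_throughput_bps(1460, 0.001, loss) / 1e6
    print(f"loss={loss:.2%}  approx. ceiling per flow: {mbps:,.0f} Mbit/s")
```

Under those assumptions a single connection at 1% loss tops out at roughly 143 Mbit/s, regardless of how fast the link itself is.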
I'd be a bit hesitant to tune the transport threads any further until you've solved the packet loss problem.

On Wed, Jan 13, 2021 at 8:53 AM MyWorld <timeplus.1...@gmail.com> wrote:

> Hi,
>
> We are currently using apache cassandra 3.11.6 in our production
> environment with single DC of 4 nodes.
>
> 2 nodes have configuration : Ssd 24 cores, 64gb ram, 20gb heap size
>
> Other 2 nodes have: Ssd 32cores, 64gb ram, 20gb heap size
>
> I have several questions around this.
>
> 1. Does different configuration nodes(cores) in single dc have any impact ?
>
> 2. Can we have different heap size in single DC on different nodes?
>
> 3. Which is better : single partition disk or multiple partition disk?
>
> 4. Currently we have 200 writes and around 5000 reads per sec per node (In
> 4 node cluster). How to determine max node capacity?
>
> 5. We are getting read/write operation timeout intermittently. There is no
> GC issue. However we have observed 1% packet loss between nodes. Can this
> be the cause of timeout issue?
>
> 6. Currently we are getting 1100 established connections from client side.
> Shall we increase native_transport_max_threads to 1000+? Currently we have
> increased it from default 128 to 512 after finding pending NTR requests
> during timeout issue.
>
> 7. Have found below h/w production recommendation from dse site. How much
> this is helpful for apache cassandra ?
>
> net.ipv4.tcp_keepalive_time=60
> net.ipv4.tcp_keepalive_probes=3
> net.ipv4.tcp_keepalive_intvl=10
> net.core.rmem_max=16777216
> net.core.wmem_max=16777216
> net.core.rmem_default=16777216
> net.core.wmem_default=16777216
> net.core.optmem_max=40960
> net.ipv4.tcp_rmem=4096 87380 16777216
> net.ipv4.tcp_wmem=4096 65536 16777216