Hi daemeon

We have checked the network and it is OK; in fact, the nodes communicate with each other over a dedicated network.
From: daemeon reiydelle [mailto:daeme...@gmail.com]
Sent: Monday, 4 April 2016 18:42
To: user@cassandra.apache.org
Subject: Re: all the nost are not reacheable when running massive deletes

Network issues. Could be jumbo frames not consistent or other.

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872

On Apr 4, 2016 5:34 AM, "Paco Trujillo" <f.truji...@genetwister.nl> wrote:

Hi everyone

We are having problems with our cluster (7 nodes, version 2.0.17) when running "massive deletes" on one of the nodes (via the cql command line). At the beginning everything is fine, but after a while we start getting constant NoHostAvailableException errors from the DataStax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

All the nodes are running:

UN  172.31.7.244  152.21 GB  256  14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256  14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256  13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256  14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256  14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256  13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256  15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1

but two of them have a high CPU load, especially 172.31.7.232, because I am running a lot of deletes through cqlsh on that node. I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal for all the hosts to become unreachable.

We have a replication factor of 3, and for the deletes I am not specifying a consistency level (so they use the default, ONE). I checked the nodes with high CPU (near 96%) and GC activity remains at 1.6% (using only 3 GB of the 10 GB assigned). But looking at the thread pool stats, the pending count for the mutation stage grows without stopping; could that be the problem?

I cannot find what is causing the timeouts. I have already increased the timeouts, but I do not think that is the solution, because the timeouts point to another type of error. Does anyone have a tip to help determine where the problem is?

Thanks in advance
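
On the driver side, the "Timeout while trying to acquire available connection" message can be worked around by raising the connection pool limits, although that only treats the client symptom if the cluster itself is overloaded. A minimal sketch with the DataStax Java driver 2.x follows; the contact point and keyspace name are only placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.Session;

public class PoolTuning {
    public static void main(String[] args) {
        // Allow more simultaneous connections per host than the defaults.
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

        Cluster cluster = Cluster.builder()
                .addContactPoint("172.31.7.243")   // any reachable node
                .withPoolingOptions(pooling)
                .build();
        Session session = cluster.connect("my_keyspace");  // placeholder keyspace

        // ... run queries as usual through the session ...

        cluster.close();
    }
}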
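
Since the deletes are currently fired from cqlsh at consistency ONE, another thing worth trying is to run them through the driver with an explicit consistency level and some crude pacing, so the coordinator's mutation stage is not flooded. This is only a sketch under assumed names (my_keyspace.my_table, the id column and the key list are placeholders, not the real schema):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.util.Arrays;
import java.util.List;

public class ThrottledDeletes {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("172.31.7.243").build();
        Session session = cluster.connect();

        // Placeholder schema: adjust keyspace, table and key column to your model.
        PreparedStatement delete = session.prepare(
                "DELETE FROM my_keyspace.my_table WHERE id = ?");
        delete.setConsistencyLevel(ConsistencyLevel.QUORUM);

        List<String> idsToDelete = Arrays.asList("k1", "k2", "k3");  // placeholder keys
        for (String id : idsToDelete) {
            session.execute(delete.bind(id));
            Thread.sleep(5);  // crude pacing so mutations do not pile up on the coordinator
        }

        cluster.close();
    }
}

Whether QUORUM and pacing actually help depends on where the bottleneck is; they mainly make the delete workload less bursty and its failures more visible than fire-and-forget deletes at ONE.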