Hello! I recently migrated Cassandra from 3.11.x to 4.0 and started seeing strange freezes under heavy load. Some nodes in a DC stop responding and are marked DN. On an affected node I cannot check the status with `nodetool status`, and I cannot even restart Cassandra with `systemctl restart cassandra`. The only thing that works is to `kill -9` the hung process and start Cassandra again. On 3.11.x there were no such problems.
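Since the hung process has to be killed with `kill -9`, any evidence of what the JVM was doing is lost at that point. A minimal sketch of how a thread dump could be saved first, assuming the JDK tools (`jcmd`) are installed on the node and the command is run as the same user as the Cassandra process. The function name `capture_thread_dump` and the output location are hypothetical, and a fully frozen JVM may not answer either method:

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to save a thread dump of a hung JVM before killing it.
# Usage: capture_thread_dump <pid> [output_dir]
capture_thread_dump() {
    local pid="$1"
    local out_dir="${2:-/tmp}"

    if [ -z "$pid" ]; then
        echo "usage: capture_thread_dump <pid> [output_dir]" >&2
        return 2
    fi
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process (or no permission): $pid" >&2
        return 1
    fi

    local out_file
    out_file="${out_dir}/cassandra-threads-${pid}-$(date +%Y%m%d-%H%M%S).txt"

    # jcmd must run as the JVM's user; the timeout guards against the
    # attach itself hanging on an unresponsive JVM.
    if command -v jcmd >/dev/null 2>&1 \
        && timeout 30 jcmd "$pid" Thread.print > "$out_file" 2>&1; then
        echo "thread dump written to $out_file"
    else
        # Fallback: SIGQUIT asks the JVM to print a thread dump to its own stdout.
        kill -3 "$pid"
        echo "sent SIGQUIT to $pid; look for the dump in the JVM's stdout or journal"
    fi
}
```

It would be called as `capture_thread_dump <pid-of-cassandra>` right before the `kill -9`, so that the dump can be attached to a report.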
I have 2 DCs with 8 nodes each, deployed on good hardware servers running CentOS 7 and Java 11, with settings slightly changed from the defaults and inherited from the 3.11.x installation. The problem appears randomly and I can't determine its source: in system.log and debug.log, the last event I could trace has nothing to do with the hang. The service just stops responding and freezes. I can reproduce this with a heavy load using dsbulk, but I still can't determine the cause. Has anyone encountered a similar problem, and is there any way to fix it other than rolling back to the previous version?

Here is my config:

cluster_name: 'mycluster'
num_tokens: '256'
allocate_tokens_for_local_replication_factor: 3
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
hints_directory: /data/cassandra/hints
hints_flush_period_in_ms: 10000
max_hints_file_size_in_mb: 128
batchlog_replay_throttle_in_kb: 1024
authenticator: PasswordAuthenticator
authorizer: AllowAllAuthorizer
role_manager: CassandraRoleManager
network_authorizer: AllowAllNetworkAuthorizer
roles_validity_in_ms: 2000
permissions_validity_in_ms: 2000
credentials_validity_in_ms: 2000
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
data_file_directories:
    - /data/cassandra/data
commitlog_directory: /data/cassandra/commitlog
cdc_enabled: false
disk_failure_policy: stop
commit_failure_policy: stop
prepared_statements_cache_size_mb:
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
counter_cache_size_in_mb:
counter_cache_save_period: 7200
saved_caches_directory: /data/cassandra/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: 'node1-1,node1-4,node2-1,node2-4'
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
concurrent_materialized_view_writes: 32
file_cache_size_in_mb: '1024'
memtable_allocation_type: heap_buffers
index_summary_capacity_in_mb:
index_summary_resize_interval_in_minutes: 60
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address:
start_native_transport: true
native_transport_port: 9042
native_transport_allow_older_protocols: true
rpc_address:
rpc_keepalive: true
incremental_backups: false
snapshot_before_compaction: false
auto_snapshot: true
snapshot_links_per_second: 0
column_index_size_in_kb: 64
column_index_cache_size_in_kb: 2
concurrent_compactors: 5
concurrent_materialized_view_builders: 1
compaction_throughput_mb_per_sec: 200
sstable_preemptive_open_interval_in_mb: 50
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
slow_query_log_timeout_in_ms: 500
endpoint_snitch: GossipingPropertyFileSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 1.0
server_encryption_options:
    internode_encryption: none
    enable_legacy_ssl_storage_port: false
    keystore: conf/.keystore
    keystore_password: cassandra
    require_client_auth: false
    truststore: conf/.truststore
    truststore_password: cassandra
    require_endpoint_verification: false
client_encryption_options:
    enabled: false
    keystore: conf/.keystore
    keystore_password: cassandra
    require_client_auth: false
internode_compression: dc
inter_dc_tcp_nodelay: false
tracetype_query_ttl: 86400
tracetype_repair_ttl: 604800
enable_user_defined_functions: false
enable_scripted_user_defined_functions: false
windows_timer_interval: 1
transparent_data_encryption_options:
    enabled: false
    chunk_length_kb: 64
    cipher: AES/CBC/PKCS5Padding
    key_alias: testing:1
    key_provider:
        - class_name: org.apache.cassandra.security.JKSKeyProvider
          parameters:
              - keystore: conf/.keystore
                keystore_password: cassandra
                store_type: JCEKS
                key_password: cassandra
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
replica_filtering_protection:
    cached_rows_warn_threshold: 2000
    cached_rows_fail_threshold: 32000
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50
unlogged_batch_across_partitions_warn_threshold: 10
compaction_large_partition_warning_threshold_mb: 100
audit_logging_options:
    enabled: true
    logger:
        - class_name: BinAuditLogger
    excluded_categories: DML,QUERY,PREPARE
    max_log_size: 1073741824
diagnostic_events_enabled: false
repaired_data_tracking_for_range_reads_enabled: false
repaired_data_tracking_for_partition_reads_enabled: false
report_unconfirmed_repaired_data_mismatches: false
enable_materialized_views: true
enable_sasi_indexes: false
enable_transient_replication: false
enable_drop_compact_storage: false

--
From Siberia with Love!