> I can reproduce this with a huge load using dsbulk, but still can't determine the cause of the problem.
Can you get a thread dump (jstack <pid>) when the system freezes? It might help determine the cause of the freeze. Also, can you reproduce this in a simpler environment (ccm + dsbulk)?

On Fri, 25 Feb 2022 at 07:03, Bowen Song <bo...@bso.ng> wrote:

> Okay, that ruled it out. Anything interesting in the GC logs? Was
> Cassandra stuck at a GC safepoint? You may need to enable the detailed
> GC logs to see these.
>
> On 25/02/2022 10:02, Azamat Hackimov wrote:
> > Hello!
> >
> > No, I have a directly attached NVMe disk, and there are no IO or
> > network issues.
> >
> > On Fri, 25 Feb 2022 at 12:50, Bowen Song <bo...@bso.ng> wrote:
> >> Do you have any network based mountpoint, such as NFS or samba? I have
> >> seen similar behaviour on other Java based applications at GC safepoint
> >> when the network based filesystem loses its connection and reconnects.
> >>
> >> On 25/02/2022 06:09, Azamat Hackimov wrote:
> >>> Hello!
> >>>
> >>> I recently migrated Cassandra from 3.11.x to 4.0 and got strange
> >>> freezes on heavy load. It looks like some nodes in DC stopped
> >>> responding and got DN status.
> >>> I cannot check status directly on node via nodetool status or even
> >>> restart Cassandra with `systemctl restart cassandra` command. Only
> >>> viable method is to `kill -9` hanging process and restart Cassandra
> >>> again. On 3.11.x there are no such problems.
> >>>
> >>> I have 2 DC with 8 nodes each deployed on good hardware servers, on
> >>> CentOS 7 and Java 11 environments with slightly changed default
> >>> settings inherited from 3.11.x installation.
> >>>
> >>> The problem shows randomly, I can't determine its source, in the
> >>> system.log and debug.log the last event that I could trace does not
> >>> have anything to do with the hang. The service just stops responding
> >>> and freezes. I can reproduce this with a huge load using dsbulk, but
> >>> still can't determine the cause of the problem.
> >>>
> >>> Has anyone encountered a similar problem, and is there any way other
> >>> than rolling back to the previous version?
> >>>
> >>> Here is my config:
> >>>
> >>> cluster_name: 'mycluster'
> >>> num_tokens: '256'
> >>> allocate_tokens_for_local_replication_factor: 3
> >>> hinted_handoff_enabled: true
> >>> max_hint_window_in_ms: 10800000
> >>> hinted_handoff_throttle_in_kb: 1024
> >>> max_hints_delivery_threads: 2
> >>> hints_directory: /data/cassandra/hints
> >>> hints_flush_period_in_ms: 10000
> >>> max_hints_file_size_in_mb: 128
> >>> batchlog_replay_throttle_in_kb: 1024
> >>> authenticator: PasswordAuthenticator
> >>> authorizer: AllowAllAuthorizer
> >>> role_manager: CassandraRoleManager
> >>> network_authorizer: AllowAllNetworkAuthorizer
> >>> roles_validity_in_ms: 2000
> >>> permissions_validity_in_ms: 2000
> >>> credentials_validity_in_ms: 2000
> >>> partitioner: org.apache.cassandra.dht.Murmur3Partitioner
> >>> data_file_directories:
> >>>   - /data/cassandra/data
> >>> commitlog_directory: /data/cassandra/commitlog
> >>> cdc_enabled: false
> >>> disk_failure_policy: stop
> >>> commit_failure_policy: stop
> >>> prepared_statements_cache_size_mb:
> >>> key_cache_size_in_mb:
> >>> key_cache_save_period: 14400
> >>> row_cache_size_in_mb: 0
> >>> row_cache_save_period: 0
> >>> counter_cache_size_in_mb:
> >>> counter_cache_save_period: 7200
> >>> saved_caches_directory: /data/cassandra/saved_caches
> >>> commitlog_sync: periodic
> >>> commitlog_sync_period_in_ms: 10000
> >>> commitlog_segment_size_in_mb: 32
> >>> seed_provider:
> >>>   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
> >>>     parameters:
> >>>       - seeds: 'node1-1,node1-4,node2-1,node2-4'
> >>> concurrent_reads: 32
> >>> concurrent_writes: 32
> >>> concurrent_counter_writes: 32
> >>> concurrent_materialized_view_writes: 32
> >>> file_cache_size_in_mb: '1024'
> >>> memtable_allocation_type: heap_buffers
> >>> index_summary_capacity_in_mb:
> >>> index_summary_resize_interval_in_minutes: 60
> >>> trickle_fsync: false
> >>> trickle_fsync_interval_in_kb: 10240
> >>> storage_port: 7000
> >>> ssl_storage_port: 7001
> >>> listen_address:
> >>> start_native_transport: true
> >>> native_transport_port: 9042
> >>> native_transport_allow_older_protocols: true
> >>> rpc_address:
> >>> rpc_keepalive: true
> >>> incremental_backups: false
> >>> snapshot_before_compaction: false
> >>> auto_snapshot: true
> >>> snapshot_links_per_second: 0
> >>> column_index_size_in_kb: 64
> >>> column_index_cache_size_in_kb: 2
> >>> concurrent_compactors: 5
> >>> concurrent_materialized_view_builders: 1
> >>> compaction_throughput_mb_per_sec: 200
> >>> sstable_preemptive_open_interval_in_mb: 50
> >>> read_request_timeout_in_ms: 5000
> >>> range_request_timeout_in_ms: 10000
> >>> write_request_timeout_in_ms: 2000
> >>> counter_write_request_timeout_in_ms: 5000
> >>> cas_contention_timeout_in_ms: 1000
> >>> truncate_request_timeout_in_ms: 60000
> >>> request_timeout_in_ms: 10000
> >>> slow_query_log_timeout_in_ms: 500
> >>> endpoint_snitch: GossipingPropertyFileSnitch
> >>> dynamic_snitch_update_interval_in_ms: 100
> >>> dynamic_snitch_reset_interval_in_ms: 600000
> >>> dynamic_snitch_badness_threshold: 1.0
> >>> server_encryption_options:
> >>>   internode_encryption: none
> >>>   enable_legacy_ssl_storage_port: false
> >>>   keystore: conf/.keystore
> >>>   keystore_password: cassandra
> >>>   require_client_auth: false
> >>>   truststore: conf/.truststore
> >>>   truststore_password: cassandra
> >>>   require_endpoint_verification: false
> >>> client_encryption_options:
> >>>   enabled: false
> >>>   keystore: conf/.keystore
> >>>   keystore_password: cassandra
> >>>   require_client_auth: false
> >>> internode_compression: dc
> >>> inter_dc_tcp_nodelay: false
> >>> tracetype_query_ttl: 86400
> >>> tracetype_repair_ttl: 604800
> >>> enable_user_defined_functions: false
> >>> enable_scripted_user_defined_functions: false
> >>> windows_timer_interval: 1
> >>> transparent_data_encryption_options:
> >>>   enabled: false
> >>>   chunk_length_kb: 64
> >>>   cipher: AES/CBC/PKCS5Padding
> >>>   key_alias: testing:1
> >>>   key_provider:
> >>>     - class_name: org.apache.cassandra.security.JKSKeyProvider
> >>>       parameters:
> >>>         - keystore: conf/.keystore
> >>>           keystore_password: cassandra
> >>>           store_type: JCEKS
> >>>           key_password: cassandra
> >>> tombstone_warn_threshold: 1000
> >>> tombstone_failure_threshold: 100000
> >>> replica_filtering_protection:
> >>>   cached_rows_warn_threshold: 2000
> >>>   cached_rows_fail_threshold: 32000
> >>> batch_size_warn_threshold_in_kb: 5
> >>> batch_size_fail_threshold_in_kb: 50
> >>> unlogged_batch_across_partitions_warn_threshold: 10
> >>> compaction_large_partition_warning_threshold_mb: 100
> >>>
> >>> audit_logging_options:
> >>>   enabled: true
> >>>   logger:
> >>>     - class_name: BinAuditLogger
> >>>   excluded_categories: DML,QUERY,PREPARE
> >>>   max_log_size: 1073741824
> >>>
> >>> diagnostic_events_enabled: false
> >>> repaired_data_tracking_for_range_reads_enabled: false
> >>> repaired_data_tracking_for_partition_reads_enabled: false
> >>> report_unconfirmed_repaired_data_mismatches: false
> >>>
> >>> enable_materialized_views: true
> >>> enable_sasi_indexes: false
> >>> enable_transient_replication: false
> >>> enable_drop_compact_storage: false
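
For the thread dump requested at the top of this message, a single jstack call is often less useful than a few dumps taken some seconds apart, since comparing them shows which threads never move. A minimal sketch follows. The function name capture_thread_dumps, the JSTACK_BIN variable and the output file naming are my own illustrative choices, not part of Cassandra or the JDK. It assumes jstack from the same JDK that runs Cassandra is on the PATH and that it is run as the same user as the Cassandra process.

```shell
#!/bin/sh
# Sketch only: take several thread dumps from one JVM, a few seconds apart.
# JSTACK_BIN and capture_thread_dumps are hypothetical names for illustration.

JSTACK_BIN="${JSTACK_BIN:-jstack}"

# usage: capture_thread_dumps <pid> [count] [interval_seconds] [output_dir]
# The body runs in a subshell so its variables do not leak into the caller.
capture_thread_dumps() (
    pid="$1"
    count="${2:-3}"
    interval="${3:-5}"
    outdir="${4:-.}"

    mkdir -p "$outdir" || exit 1

    i=1
    while [ "$i" -le "$count" ]; do
        out="$outdir/threaddump-$pid-$i.txt"
        # -l also prints lock information; stderr goes to the same file so
        # an attach failure on a frozen JVM is recorded rather than lost.
        "$JSTACK_BIN" -l "$pid" > "$out" 2>&1 \
            || echo "dump $i failed, see $out" >&2
        # Wait between dumps, but not after the last one.
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
)
```

With the function loaded, something like `capture_thread_dumps <pid> 3 5 /tmp/dumps` would write three files five seconds apart. If the JVM is frozen hard enough that jstack cannot attach, the files will contain the attach error instead of a dump; in that case `jhsdb jstack --pid <pid>` from the same JDK may still be able to read the process, though I have not verified that against this particular hang.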