Nevermind, we resolved the issue....  JVM heap settings were misconfigured

Martin

On Fri, Mar 23, 2018 at 1:18 PM, Martin Mačura <m.mac...@gmail.com> wrote:
> Hi all,
>
> We have a cluster of 3 nodes with RF 3 that ran fine until we upgraded
> it to 3.11.2.
>
> Each node has 32 GB RAM, 8 GB Cassandra heap size.
>
> After the upgrade, clients started reporting connection issues:
>
> cassandra | [ERROR] Closing established connection pool to host
> <redacted> because of the following error: Read error 'connection
> reset by peer' (src/pool.cpp:384)
> cassandra | [ERROR] Unable to establish a control connection to host
> <redacted> because of the following error: Error: 'Request timed out'
> (0x0100000E) (src/control_connection.cpp:263)
>
>
> Cassandra logs are full of garbage collection warnings:
>
> WARN  [Service Thread] 2018-03-23 05:04:17,780 GCInspector.java:282 -
> ConcurrentMarkSweep GC in 7858ms.  Par Eden Space: 6871908352 ->
> 1774446288; Par Survivor Space: 858980344 -> 0
> INFO  [Service Thread] 2018-03-23 05:04:17,780 StatusLogger.java:47 -
> Pool Name                    Active   Pending      Completed   Blocked
>  All Time Blocked
> INFO  [Service Thread] 2018-03-23 05:04:17,784 StatusLogger.java:51 -
> MutationStage                    10         9        2526002         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,784 StatusLogger.java:51 -
> ViewMutationStage                 0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,784 StatusLogger.java:51 -
> ReadStage                         2         2         943544         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,784 StatusLogger.java:51 -
> RequestResponseStage              0         0        1666876         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,785 StatusLogger.java:51 -
> ReadRepairStage                   0         0          10362         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,785 StatusLogger.java:51 -
> CounterMutationStage              0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,785 StatusLogger.java:51 -
> MiscStage                         0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,785 StatusLogger.java:51 -
> CompactionExecutor                0         0           3076         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,785 StatusLogger.java:51 -
> MemtableReclaimMemory             0         0             44         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,786 StatusLogger.java:51 -
> PendingRangeCalculator            0         0              4         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,786 StatusLogger.java:51 -
> GossipStage                       0         0          14287         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,786 StatusLogger.java:51 -
> SecondaryIndexManagement          0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,786 StatusLogger.java:51 -
> HintsDispatcher                   0         0              1         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,804 StatusLogger.java:51 -
> PerDiskMemtableFlushWriter_1         0         0             37
>  0                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,805 StatusLogger.java:51 -
> PerDiskMemtableFlushWriter_2         0         0             37
>  0                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,805 StatusLogger.java:51 -
> MigrationStage                    0         0              2         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,806 StatusLogger.java:51 -
> MemtablePostFlush                 0         0             72         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,806 StatusLogger.java:51 -
> PerDiskMemtableFlushWriter_0         0         0             44
>  0                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,806 StatusLogger.java:51 -
> ValidationExecutor                0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,806 StatusLogger.java:51 -
> Sampler                           0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,807 StatusLogger.java:51 -
> MemtableFlushWriter               0         0             44         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,807 StatusLogger.java:51 -
> PerDiskMemtableFlushWriter_5         0         0             37
>  0                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,807 StatusLogger.java:51 -
> InternalResponseStage             0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,819 StatusLogger.java:51 -
> PerDiskMemtableFlushWriter_3         0         0             37
>  0                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,819 StatusLogger.java:51 -
> PerDiskMemtableFlushWriter_4         0         0             37
>  0                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,820 StatusLogger.java:51 -
> AntiEntropyStage                  0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,820 StatusLogger.java:51 -
> CacheCleanupExecutor              0         0              0         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,820 StatusLogger.java:51 -
> Native-Transport-Requests        67         0        3627740         0
>            625945
> INFO  [Service Thread] 2018-03-23 05:04:17,820 StatusLogger.java:51 -
> UserDefinedFunctions              2        36      295998373         0
>                 0
> INFO  [Service Thread] 2018-03-23 05:04:17,820 StatusLogger.java:61 -
> CompactionManager                 0         0
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:73 -
> MessagingService                n/a       1/1
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:83 -
> Cache Type                     Size                 Capacity
>     KeysToSave
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:85 -
> KeyCache                    9066584                104857600
>            all
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:91 -
> RowCache                          0                        0
>            all
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:98 -
> Table                       Memtable ops,data
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:101 -
> system_distributed.parent_repair_history                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:101 -
> system_distributed.repair_history                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:101 -
> system_distributed.view_build_status                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:101 -
> system.compaction_history             18,5064
> INFO  [Service Thread] 2018-03-23 05:04:17,821 StatusLogger.java:101 -
> system.schema_aggregates                  0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.schema_triggers                    0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.size_estimates            27018,940475
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.paxos                              0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.views_builds_in_progress                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.batches                            0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.schema_keyspaces                   0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.sstable_activity        118152,2241249
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.batchlog                           0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.schema_columns                     0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.hints                              0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,822 StatusLogger.java:101 -
> system.IndexInfo                          0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.schema_columnfamilies                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.schema_functions                   0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.built_views                        0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.peer_events                        0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.range_xfers                        0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.peers                              0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.transferred_ranges                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.schema_usertypes                   0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.local                              0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.available_ranges                   0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,823 StatusLogger.java:101 -
> system.prepared_statements            67,13574
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.columns                     0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.types                       0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.indexes                     0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.keyspaces                   0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.dropped_columns                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.aggregates                  0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.triggers                    0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,824 StatusLogger.java:101 -
> system_schema.tables                      0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,825 StatusLogger.java:101 -
> system_schema.views                       0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,825 StatusLogger.java:101 -
> system_schema.functions                   0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,825 StatusLogger.java:101 -
> <redacted>.event_postmortem          2855,140051
> INFO  [Service Thread] 2018-03-23 05:04:17,825 StatusLogger.java:101 -
> <redacted>.group_meta                        0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,825 StatusLogger.java:101 -
> <redacted>.group_postmortem          5715,381267
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.event_source           49336,24272521
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.event_alert                1592,78328
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.event_group            59896,29310038
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.group_alert               3185,212161
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.event_child             17319,9226867
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.event_output         123016,242316519
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.event                 137939,66815911
> INFO  [Service Thread] 2018-03-23 05:04:17,826 StatusLogger.java:101 -
> <redacted>.event_meta                       1,57
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> <redacted>.postmortem               4929,4353986
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> <redacted>.audit_type                   8003,2624883
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> <redacted>.audit                        8005,2934624
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> system_auth.roles                         0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> system_auth.role_members                  0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> system_auth.resource_role_permissons_index                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> system_auth.role_permissions                 0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> system_traces.sessions                    0,0
> INFO  [Service Thread] 2018-03-23 05:04:17,827 StatusLogger.java:101 -
> system_traces.events                      0,0
>
> Also, there are many warnings about UDFs executing too slow:
>
> WARN  [Native-Transport-Requests-80] 2018-03-23 05:08:41,327
> UDFunction.java:436 - User defined function asm_log.ttime_state :
> (frozen<tuple<bigint, bigint>>, int, int, int) -> frozen<tuple<bigint,
> bigint>> ran longer than 500ms
>
> And sometimes the node crashes with the following:
> ERROR [NonPeriodicTasks:1] 2018-03-23 05:04:18,027
> JVMStabilityInspector.java:162 - JVM state determined to be unstable.
> Exiting forcefully due to:
> java.util.concurrent.TimeoutException: User defined function
> <redacted>.checks_state : (frozen<tuple<bigint>>, int, int) ->
> frozen<tuple<bigint>> ran longer than 1500ms - will stop Cassandra VM
>
> There is no substantial increase in load average.
>
> We increased heap size to 12 GB, that did not help.  Is there anything
> else we can try, besides downgrading?
>
>
>
> Thanks,
>
> Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org

Reply via email to