Andrey N. Gura created IGNITE-12523:
---------------------------------------

Summary: Continuously generated thread dumps in failure processor slow down the whole system
Key: IGNITE-12523
URL: https://issues.apache.org/jira/browse/IGNITE-12523
Project: Ignite
Issue Type: Improvement
Reporter: Andrey N. Gura
Assignee: Andrey N. Gura
Fix For: 2.9

Hundreds of threads build indexes. The checkpoint-thread tries to acquire the write lock but cannot, because some threads hold the read lock. Moreover, other threads are trying to acquire the read lock as well. The failure types SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT are ignored.

The checkpoint-thread is treated as a blocked critical system worker, so the failure processor generates a thread dump. Threads waiting on the read lock report SYSTEM_CRITICAL_OPERATION_TIMEOUT and also trigger a thread dump. Generating a thread dump takes 500 to 1000 ms. All this activity leads to a stop-the-world pause and triggers other timeouts. This can go on for a long time, because many threads are active and half of the time is spent generating thread dumps.

The root cause here is the checkpoint read-write lock. After a discussion with [~agoncharuk] (Alexey Goncharuk), it seems that only implementing a fuzzy checkpoint could solve the problem, but that requires a big effort.

*Solution*

Andrey Gura, December 20, 2019, 3:18 PM (edited)

Final solution and implementation:
- A new system property, IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT, has been added. Its default value is the failure detection timeout.
- Each call to the FailureProcessor#process(FailureContext, FailureHandler) method checks the throttling timeout before generating a thread dump.
- There is no need to check whether the failure type is ignored. Throttling is useful in all cases where the context is not invalidated (FailureProcessor.failureCtx != null).
- For a throttled thread dump, an info message is logged: "Thread dump is hidden due to throttling settings. Set IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT property to 0 to see all thread dumps".
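
The throttling check described in the solution could be sketched roughly as follows. This is a minimal illustration and not the code committed to Ignite: the class name ThreadDumpThrottle, the method tryAcquire and the injected clock are all hypothetical. It assumes the timeout is given in milliseconds and, as the log message in the solution states, that a value of 0 disables throttling.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Ignite) that allows at most one
 * thread dump per throttling interval.
 */
public class ThreadDumpThrottle {
    /** Sentinel meaning "no dump has been generated yet". */
    private static final long NEVER = Long.MIN_VALUE;

    /** Throttling timeout in milliseconds; 0 or less disables throttling. */
    private final long timeoutMs;

    /** Time source in milliseconds, injected so the logic can be tested. */
    private final LongSupplier clock;

    /** Timestamp of the last permitted dump. */
    private final AtomicLong lastDumpTs = new AtomicLong(NEVER);

    public ThreadDumpThrottle(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
    }

    /**
     * Returns true if the caller may generate a thread dump now,
     * false if the dump should be skipped (throttled).
     */
    public boolean tryAcquire() {
        if (timeoutMs <= 0)
            return true; // throttling disabled: every dump is allowed

        long now = clock.getAsLong();
        long last = lastDumpTs.get();

        if (last != NEVER && now - last < timeoutMs)
            return false; // a dump was generated too recently

        // Only one of several concurrent callers wins the right to dump.
        return lastDumpTs.compareAndSet(last, now);
    }
}
```

A caller would construct it with something like new ThreadDumpThrottle(timeoutMs, System::currentTimeMillis), generate the dump only when tryAcquire() returns true, and otherwise log the "Thread dump is hidden due to throttling settings" message. The compare-and-set keeps the check cheap when hundreds of threads report failures at once, which is the situation described in the issue.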