[
https://issues.apache.org/jira/browse/IGNITE-19239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ilya Shishkov updated IGNITE-19239:
-----------------------------------
Description:
Error messages about checkpoint read lock acquisition timeouts and blocked
system-critical threads may appear during the snapshot restore process (just
after caches start):
{quote}
[2023-04-06T10:55:46,561][ERROR]\[ttl-cleanup-worker-#475%node%][CheckpointTimeoutLock]
Checkpoint read lock acquisition has been timed out.
{quote}
{quote}
[2023-04-06T10:55:47,487][ERROR]\[tcp-disco-msg-worker-[crd]-#23%node%-#446%node%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour \[workerName=db-checkpoint-thread,
threadName=db-checkpoint-thread-#457%snapshot.BlockingThreadsOnSnapshotRestoreReproducerTest0%,
{color:red}blockedFor=100s{color}]
{quote}
An exchange process is also active at that time. It finishes with the following
timings (the total time is approximately equal to the time the threads were
blocked):
{quote}
[2023-04-06T10:55:52,211][INFO
]\[exchange-worker-#450%node%][GridDhtPartitionsExchangeFuture] Exchange
timings [startVer=AffinityTopologyVersion [topVer=1, minorTopVer=5],
resVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], stage="Waiting in
exchange queue" (0 ms), ..., stage="Restore partition states"
({color:red}100163 ms{color}), ..., stage="Total time" ({color:red}100334
ms{color})]
{quote}
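To compare the exchange stage duration with the `blockedFor` value, the stage timings can be pulled out of the log line. Below is a minimal sketch, assuming the log line has been joined into a single line and the Jira `{color}` markup has been stripped. `ExchangeTimings` and `stageMillis` are hypothetical names for illustration and are not part of Ignite.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an Ignite class): reads stage durations from an "Exchange timings" log line. */
final class ExchangeTimings {
    private ExchangeTimings() {
    }

    /** Returns the duration in milliseconds logged for the given stage, or empty if the stage is absent. */
    static OptionalLong stageMillis(String logLine, String stage) {
        // Matches fragments such as: stage="Restore partition states" (100163 ms)
        Pattern p = Pattern.compile("stage=\"" + Pattern.quote(stage) + "\"\\s*\\((\\d+) ms\\)");
        Matcher m = p.matcher(logLine);

        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Shortened copy of the log line above, with markup and line wraps removed.
        String line = "Exchange timings [stage=\"Waiting in exchange queue\" (0 ms), "
            + "stage=\"Restore partition states\" (100163 ms), stage=\"Total time\" (100334 ms)]";

        long restore = stageMillis(line, "Restore partition states").orElse(-1);
        long total = stageMillis(line, "Total time").orElse(-1);

        System.out.println("restore=" + restore + " ms, total=" + total + " ms");
    }
}
```

With the values from the log above, the "Restore partition states" stage accounts for almost all of the total exchange time, which matches the 100 s reported for the blocked checkpoint thread.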
How to reproduce:
# Set the checkpoint frequency lower than the failure detection timeout.
# Ensure that restoring cache group partition states takes longer than the
failure detection timeout, i.e. the issue is relevant to sufficiently large caches.
Reproducer: [^BlockingThreadsOnSnapshotRestoreReproducerTest.patch]
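The two conditions above can be written down as a simple check. This is only a sketch of the stated conditions, not the logic of the attached reproducer. The class name, method name, and the 1000 ms and 10000 ms values are assumptions chosen for illustration; only the 100163 ms restore duration comes from the log above.

```java
/** Hypothetical helper (not an Ignite class): encodes the two reproduction conditions from the list above. */
final class ReproConditions {
    private ReproConditions() {
    }

    /**
     * Returns true when both listed conditions hold:
     * 1. the checkpoint frequency is lower than the failure detection timeout;
     * 2. restoring partition states takes longer than the failure detection timeout.
     */
    static boolean conditionsMet(long checkpointFrequencyMs, long failureDetectionTimeoutMs, long restoreStatesMs) {
        return checkpointFrequencyMs < failureDetectionTimeoutMs
            && restoreStatesMs > failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        long checkpointFrequencyMs = 1_000;       // Assumed value for illustration.
        long failureDetectionTimeoutMs = 10_000;  // Assumed value for illustration.
        long restoreStatesMs = 100_163;           // "Restore partition states" duration from the log above.

        System.out.println(conditionsMet(checkpointFrequencyMs, failureDetectionTimeoutMs, restoreStatesMs));
    }
}
```

Meeting both conditions only indicates that the setup matches the description. Whether the error messages actually appear still depends on the node and the caches being restored.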
was:
Error messages about checkpoint read lock acquisition timeouts and blocked
system-critical threads may appear during the snapshot restore process (just
after caches start):
{quote}
[2023-04-06T10:55:46,561][ERROR]\[ttl-cleanup-worker-#475%node%][CheckpointTimeoutLock]
Checkpoint read lock acquisition has been timed out.
{quote}
{quote}
[2023-04-06T10:55:47,487][ERROR]\[tcp-disco-msg-worker-[crd]-#23%node%-#446%node%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=db-checkpoint-thread,
threadName=db-checkpoint-thread-#457%snapshot.BlockingThreadsOnSnapshotRestoreReproducerTest0%,
{color:red}blockedFor=100s{color}]
{quote}
An exchange process is also active at that time. It finishes with the following
timings (the total time is approximately equal to the time the threads were
blocked):
{quote}
[2023-04-06T10:55:52,211][INFO
]\[exchange-worker-#450%node%][GridDhtPartitionsExchangeFuture] Exchange
timings [startVer=AffinityTopologyVersion [topVer=1, minorTopVer=5],
resVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], stage="Waiting in
exchange queue" (0 ms), ..., stage="Restore partition states"
({color:red}100163 ms{color}), ..., stage="Total time" ({color:red}100334
ms{color})]
{quote}
How to reproduce:
# Set the checkpoint frequency lower than the failure detection timeout.
# Ensure that restoring cache group partition states takes longer than the
failure detection timeout, i.e. the issue is relevant to sufficiently large caches.
Reproducer: [^BlockingThreadsOnSnapshotRestoreReproducerTest.patch]
> Checkpoint read lock acquisition timeouts during snapshot restore
> -----------------------------------------------------------------
>
> Key: IGNITE-19239
> URL: https://issues.apache.org/jira/browse/IGNITE-19239
> Project: Ignite
> Issue Type: Bug
> Reporter: Ilya Shishkov
> Priority: Minor
> Labels: iep-43, ise
> Attachments: BlockingThreadsOnSnapshotRestoreReproducerTest.patch