Re: Checkpoint fail due to timeout

2021-03-30 Thread Alexey Trenikhun
wojski Sent: Tuesday, March 23, 2021 5:31 AM To: Alexey Trenikhun Cc: Arvid Heise ; ChangZhuo Chen (陳昌倬) ; ro...@apache.org ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hi Alexey, You should definitely investigate why the job is stuck. 1. First of all, is it completely s

Re: Checkpoint fail due to timeout

2021-03-30 Thread Alexey Trenikhun
uring next performance run. Thanks, Alexey From: Roman Khachatryan Sent: Tuesday, March 23, 2021 12:17 AM To: Alexey Trenikhun Cc: ChangZhuo Chen (陳昌倬) ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Unfortunately, the lock can't be chang

Re: Checkpoint fail due to timeout

2021-03-23 Thread Piotr Nowojski
...@apache.org < > ro...@apache.org>; Flink User Mail List > *Subject:* Re: Checkpoint fail due to timeout > > Hi Alexey, > > rescaling from unaligned checkpoints will be supported with the upcoming > 1.13 release (expected at the end of April). > > Best, > >

Re: Checkpoint fail due to timeout

2021-03-23 Thread Roman Khachatryan
> Thanks, > Alexey > ____ > From: Roman Khachatryan > Sent: Monday, March 22, 2021 1:36 AM > To: ChangZhuo Chen (陳昌倬) > Cc: Alexey Trenikhun ; Flink User Mail List > > Subject: Re: Checkpoint fail due to timeout > > Thanks for sharin

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
hread.run(SourceStreamTask.java:263) Thanks, Alexey From: Roman Khachatryan Sent: Monday, March 22, 2021 1:36 AM To: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Thanks for sharing the thread dump. It

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Thanks for sharing the thread dump. It shows that the source thread is indeed back-pressured (checkpoint lock is held by a thread which is trying to emit but unable to acquire any free buffers

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
checkpoint still times out after 3hr. From: Arvid Heise Sent: Monday, March 22, 2021 6:58:20 AM To: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; ro...@apache.org ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hi Alexey, rescaling from

Re: Checkpoint fail due to timeout

2021-03-22 Thread Arvid Heise
Hi Alexey, rescaling from unaligned checkpoints will be supported with the upcoming 1.13 release (expected at the end of April). Best, Arvid On Wed, Mar 17, 2021 at 8:29 AM ChangZhuo Chen (陳昌倬) wrote: > On Wed, Mar 17, 2021 at 05:45:38AM +, Alexey Trenikhun wrote: > > In my opinion looks

Re: Checkpoint fail due to timeout

2021-03-22 Thread Roman Khachatryan
Thanks for sharing the thread dump. It shows that the source thread is indeed back-pressured (the checkpoint lock is held by a thread which is trying to emit but is unable to acquire any free buffers). The lock is per task, so there can be several locks per TM. @ChangZhuo Chen (陳昌倬), in the thread you
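The situation described above (a source thread holding the per-task checkpoint lock while it waits for a free buffer, so the checkpoint cannot take the lock) can be modelled roughly in plain Java. This is only a sketch under stated assumptions: `CheckpointLockSketch`, `checkpointCompletes`, the one-slot queue standing in for output buffers, and the `ReentrantLock` standing in for the lock object are all hypothetical names for illustration and are not Flink classes or APIs.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative model only: none of these names are Flink classes or APIs.
public class CheckpointLockSketch {

    /**
     * Models one task. Returns true if the "checkpoint" side manages to take
     * the task's lock within timeoutMillis, false if it times out.
     */
    public static boolean checkpointCompletes(boolean backPressured, long timeoutMillis) {
        // One lock per task (assumption: a ReentrantLock stands in for the lock object).
        ReentrantLock checkpointLock = new ReentrantLock();
        // A one-slot queue stands in for the task's output buffers.
        BlockingQueue<Integer> outputBuffers = new ArrayBlockingQueue<>(1);
        if (backPressured) {
            outputBuffers.offer(0); // downstream is not draining: no free buffer left
        }
        CountDownLatch sourceHoldsLock = new CountDownLatch(1);

        Thread source = new Thread(() -> {
            checkpointLock.lock();
            try {
                sourceHoldsLock.countDown();
                outputBuffers.put(1); // blocks, still holding the lock, if no buffer is free
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                checkpointLock.unlock();
            }
        }, "legacy-source-thread");
        source.start();

        try {
            sourceHoldsLock.await();
            // The checkpoint needs the same lock; it waits at most timeoutMillis.
            boolean acquired = checkpointLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                checkpointLock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the lock", e);
        } finally {
            // Clean up: unblock the source thread if it is still stuck in put().
            source.interrupt();
            try {
                source.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("back-pressured, checkpoint completes: " + checkpointCompletes(true, 200));
        System.out.println("not back-pressured, checkpoint completes: " + checkpointCompletes(false, 2000));
    }
}
```

In this model the back-pressured call times out because the source never releases the lock, while the other call succeeds as soon as the source has emitted. Since each call creates its own lock, several such tasks in one process would block independently, which matches the "several locks per TM" remark.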

Re: Checkpoint fail due to timeout

2021-03-17 Thread Alexey Trenikhun
("hdfs:///checkpoints-data/")); Difference to Savepoints ci.apache.org From: ChangZhuo Chen (陳昌倬) Sent: Wednesday, March 17, 2021 12:29 AM To: Alexey Trenikhun Cc: ro...@apache.org; Flink User Mail List Subject: Re: Checkpoint fail due to timeout On Wed, Ma

Re: Checkpoint fail due to timeout

2021-03-17 Thread 陳昌倬
On Wed, Mar 17, 2021 at 05:45:38AM +, Alexey Trenikhun wrote: > In my opinion looks similar. Were you able to tune-up Flink to make it work? > I'm stuck with it, I wanted to scale up hoping to reduce backpressure, but to > rescale I need to take savepoint, which never completes (at least take

Re: Checkpoint fail due to timeout

2021-03-16 Thread Alexey Trenikhun
From: ChangZhuo Chen (陳昌倬) Sent: Tuesday, March 16, 2021 6:59 AM To: Alexey Trenikhun Cc: ro...@apache.org; Flink User Mail List Subject: Re: Checkpoint fail due to timeout On Tue, Mar 16, 2021 at 02:32:54AM +, Alexey Trenikhun wrote: > Hi Roman, > I took thread dump: > "Source:

Re: Checkpoint fail due to timeout

2021-03-16 Thread 陳昌倬
On Tue, Mar 16, 2021 at 02:32:54AM +, Alexey Trenikhun wrote: > Hi Roman, > I took thread dump: > "Source: digital-itx-eastus2 -> Filter (6/6)#0" Id=200 BLOCKED on > java.lang.Object@5366a0e2 owned by "Legacy Source Thread - Source: > digital-itx-eastus2 -> Filter (6/6)#0" Id=202 > at >

Re: Checkpoint fail due to timeout

2021-03-15 Thread Alexey Trenikhun
k or per TM? I see multiple threads in SynchronizedStreamTaskActionExecutor.runThrowing blocked on different Objects. Thanks, Alexey From: Roman Khachatryan Sent: Monday, March 15, 2021 2:16 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hello Alexey,

Re: Checkpoint fail due to timeout

2021-03-15 Thread Roman Khachatryan
2.2 with same results > > Thanks, > Alexey > > From: Roman Khachatryan > Sent: Thursday, March 11, 2021 11:49 PM > To: Alexey Trenikhun > Cc: Flink User Mail List > Subject: Re: Checkpoint fail due to timeout > > Hello, > >

Re: Checkpoint fail due to timeout

2021-03-11 Thread Roman Khachatryan
Hello,
This can be caused by several reasons such as back-pressure, large snapshots or bugs. Could you please share:
- the stats of the previous (successful) checkpoints
- back-pressure metrics for sources
- which Flink version do you use?
Regards,
Roman
On Thu, Mar 11, 2021 at 7:03 AM Alexey