Re: Checkpoint failures due to other subtasks sharing the ChannelState file (Caused the Job to Stall)

2024-08-02 Thread Dhruv Patel
We have also enabled unaligned checkpoints. Could it be because of that? We were experience slowness and intermittent packet loss when this issue occurred. On Wed, Jul 31, 2024 at 7:43 PM Dhruv Patel wrote: > Hi Everyone, > > We are observing an interesting issue with continuous checkpoint >

Re: checkpoint upload thread

2024-08-01 Thread Yanfei Lei
d but with applied probability? > > Thanks. > > > -- 原始邮件 -- > 发件人: "Yanfei Lei" ; > 发送时间: 2024年7月30日(星期二) 下午5:15 > 收件人: "Enric Ott"<243816...@qq.com>; > 抄送: "user"; > 主题: Re: checkpoint upload thread > > Hi Enric

Re: checkpoint upload thread

2024-07-30 Thread Yanfei Lei
Hi Enric, If I understand correctly, one subtask would use one `asyncOperationsThreadPool`[1,2], it is possible to use the same connection for an operator chain. [1] https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask

Re: Checkpoint RMM

2023-11-27 Thread xiangyu feng
Hi Oscar, > but we don't understand why this incremental checkpoint keeps increasing AFAIK, when performing incremental checkpoint, the RocksDBStateBackend will upload the new created SST files to remote storage. The total size of these files is the incremental checkpoint size. However, the new c

Re: Checkpoint jitter?

2023-09-13 Thread Őrhidi Mátyás
tate_backends/#enabling-changelog > > > > > > *From:* Őrhidi Mátyás > *Sent:* Wednesday, September 13, 2023 2:47 PM > *To:* Gyula Fóra > *Cc:* Hangxiang Yu ; user@flink.apache.org > *Subject:* Re: Checkpoint jitter? > > > > Correct, thanks for the clarif

RE: Checkpoint jitter?

2023-09-13 Thread Schwalbe Matthias
: Wednesday, September 13, 2023 2:47 PM To: Gyula Fóra Cc: Hangxiang Yu ; user@flink.apache.org Subject: Re: Checkpoint jitter? Correct, thanks for the clarification Gyula! On Wed, Sep 13, 2023 at 1:39 AM Gyula Fóra mailto:gyula.f...@gmail.com>> wrote: No, I think what he means is to trigg

Re: Checkpoint jitter?

2023-09-13 Thread Őrhidi Mátyás
Correct, thanks for the clarification Gyula! On Wed, Sep 13, 2023 at 1:39 AM Gyula Fóra wrote: > No, I think what he means is to trigger the checkpoint at slightly > different times at the different sources so the different parts of the > pipeline would not checkpoint at the same time. > > Gyula

Re: Checkpoint jitter?

2023-09-13 Thread Gyula Fóra
No, I think what he means is to trigger the checkpoint at slightly different times at the different sources so the different parts of the pipeline would not checkpoint at the same time. Gyula On Wed, Sep 13, 2023 at 10:32 AM Hangxiang Yu wrote: > Hi, Matyas. > Do you mean something like adjusti

Re: Checkpoint jitter?

2023-09-12 Thread Hangxiang Yu
Hi, Matyas. Do you mean something like adjusting checkpoint intervals dynamically or frequency of uploading files according to the pressure of the durable storage ? On Wed, Sep 13, 2023 at 9:12 AM Őrhidi Mátyás wrote: > Hey folks, > > Is it possible to add some sort of jitter to the checkpointin

RE: Checkpoint/savepoint _metadata

2023-08-29 Thread Schwalbe Matthias
Hi Frederic, I’ve once (upon a time 😊) had a similar situation when we changed from Flink 1.8 to Flink 1.13 … It took me a long time to figure out. Some hints where to start to look: * _metadata file is used for * Job manager state * Smallish keyed state (in order to avoid too

Re: Checkpoint size smaller than Savepoint size

2023-07-19 Thread Shammon FY
Hi Neha, The HOP window will increase the size of the checkpoint and I'm sorry that I'm not very familiar with the HOP window. If the configurations are all right, and you want to confirm if it's a HOP window issue, I think you can submit a flink job without HOP window but with regular agg operat

Re: Checkpoint size smaller than Savepoint size

2023-07-18 Thread Neha . via user
Hi Shammon, These configs exist in Flink WebUI. We have set exeEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime); Do you think it can create some issues for the HOP(proctime, some interval, some interval) and not releasing the state for checkpoints? I am really confused about why s

Re: Checkpoint size smaller than Savepoint size

2023-07-17 Thread Shammon FY
Hi Neha, I think you can first check whether the options `state.backend` and `state.backend.incremental` you mentioned above exist in `JobManager`->`Configuration` in Flink webui. If they do not exist, you may be using the wrong conf file. Best, Shammon FY On Mon, Jul 17, 2023 at 5:04 PM Neha .

Re: Checkpoint size smaller than Savepoint size

2023-07-17 Thread Neha . via user
Hi Shammon, state.backend: rocksdb state.backend.incremental: true This is already set in the Flink-conf. Anything else that should be taken care of for the incremental checkpointing? Is there any related bug in Flink 1.16.1? we have recently migrated to Flink 1.16.1 from Flink 1.13.6. What can b

Re: Checkpoint size smaller than Savepoint size

2023-07-16 Thread Shammon FY
Hi Neha, I noticed that the `Checkpointed Data Size` is always equals to `Full Checkpoint Data Size`, I think the job is using full checkpoint instead of incremental checkpoint, you can check it Best, Shammon FY On Mon, Jul 17, 2023 at 10:25 AM Neha . wrote: > Hello Shammon, > > Thank you for

Re: Checkpoint size smaller than Savepoint size

2023-07-16 Thread Neha . via user
Hello Shammon, Thank you for your assistance. I have already enabled the incremental checkpointing, Attaching the screenshot. Can you please elaborate on what makes you think it is not enabled, It might hint towards the issue. The problem is checkpoint size is not going down and keeps on increasin

Re: Checkpoint size smaller than Savepoint size

2023-07-16 Thread Shammon FY
Hi Neha, I think it is normal for the data size of a savepoint to be smaller than the full data of a checkpoint. Flink uses rocksdb to store checkpointed data, which is an LSM structured storage where the same key will have multiple version records, while savepoint will traverse all keys and store

RE: Checkpoint directories not cleared as TaskManagers run

2022-05-19 Thread Schwalbe Matthias
es Sandys-Lumsdaine Sent: Wednesday, May 18, 2022 2:53 PM To: Schwalbe Matthias Cc: Hangxiang Yu ; user@flink.apache.org Subject: Re: Checkpoint directories not cleared as TaskManagers run Hello Matthias, Thanks for your reply. Yes indeed your are correct. My /tmp path is private so you hav

Re: Checkpoint directories not cleared as TaskManagers run

2022-05-18 Thread James Sandys-Lumsdaine
in order to work - as your jobmanager can not access the checkpoint files of it can also not clean-up those files Hope that helps Regards Thias From: James Sandys-Lumsdaine Sent: Tuesday, May 17, 2022 3:55 PM To: Hangxiang Yu ; user@flink.apache.org Subject: Re: Checkpoint directories not

RE: Checkpoint directories not cleared as TaskManagers run

2022-05-17 Thread Schwalbe Matthias
jobmanager can not access the checkpoint files of it can also not clean-up those files Hope that helps Regards Thias From: James Sandys-Lumsdaine Sent: Tuesday, May 17, 2022 3:55 PM To: Hangxiang Yu ; user@flink.apache.org Subject: Re: Checkpoint directories not cleared as TaskManagers run

Re: Checkpoint directories not cleared as TaskManagers run

2022-05-17 Thread James Sandys-Lumsdaine
___ From: Hangxiang Yu Sent: 17 May 2022 14:38 To: James Sandys-Lumsdaine ; user@flink.apache.org Subject: Re: Checkpoint directories not cleared as TaskManagers run Hi, James. I may not get what the problem is. All checkpoints will store in the address as you set. IIUC, TMs will write som

Re: Checkpoint directories not cleared as TaskManagers run

2022-05-17 Thread Hangxiang Yu
Hi, James. I may not get what the problem is. All checkpoints will store in the address as you set. IIUC, TMs will write some checkpoint info in their local dir and then upload them to the address and then delete local one. JM will write some metas of checkpoint to the address and also do the entir

Re: Checkpoint directories not cleared as TaskManagers run

2022-05-17 Thread James Sandys-Lumsdaine
Some further Googling says on a StackOverflow posting it is the jobmanager that does the deletion and not the taskmanagers. Currently my taskmanagers are writing their checkpoints to their own private disks (/tmp) rather than a share - so my suspicion is the jobmanager can't access the folder o

Re: Checkpoint Timeout Troubleshooting

2022-05-05 Thread Sam Ch
Thank you for the help. To follow up, the issue went away when we reverted back to flink 1.13. May be related to flink-27481. Before reverting, we tested unaligned checkpoints with a timeout of 10 minutes, which timed out. Thanks. On Thu, Apr 28, 2022, 5:38 PM Guowei Ma wrote: > Hi Sam > > I thi

Re: Checkpoint Timeout Troubleshooting

2022-04-28 Thread Guowei Ma
Hi Sam I think the first step is to see which part of your Flink APP is blocking the completion of Checkpoint. Specifically, you can refer to the "Checkpoint Details" section of the document [1]. Using these methods, you should be able to observe where the checkpoint is blocked, for example, it ma

Re: Checkpoint failures without exceptions

2021-10-30 Thread Arvid Heise
Hi Patrick, do you even have so much backpressure that unaligned checkpoints are necessary? You seem to have only one network exchange where unaligned checkpoint helps. The Flink 1.11 implementation of unaligned checkpoint was still experimental and it might cause unexpected side-effects. Afaik, w

Re: Checkpoint failures without exceptions

2021-10-27 Thread Yun Gao
Hi Patrick, Could you also have a look at the stack of the tasks of the second function to see what the main thread and netty thread is doing during the checkpoint period ? Best, Yun --Original Mail -- Sender: Send Date:Wed Oct 27 22:05:40 2021 Recipients:Fli

Re: Checkpoint size increasing even i enable increasemental checkpoint

2021-10-12 Thread Vijay Bhaskar
Since state size is small, you can try FileState Backend, rather than RocksDB. You can check once. Thumb rule is if FileStateBackend Performs worse, RocksDB is good. Regards Bhasakar On Tue, Oct 12, 2021 at 1:47 PM Yun Tang wrote: > Hi Lei, > > RocksDB state-backend's checkpoint is composited b

Re: Checkpoint size increasing even i enable increasemental checkpoint

2021-10-12 Thread Yun Tang
Hi Lei, RocksDB state-backend's checkpoint is composited by RocksDB's own files (unmodified compressed SST format files) and incremental checkpoints means Flink does not upload files which were uploaded before. As you can see, incremental checkpoints highly depend on the RocksDB's own mechanism

Re: Checkpoint size increasing even i enable increasemental checkpoint

2021-10-11 Thread Caizhi Weng
Hi! Checkpoint sizes are highly related to your job. Incremental checkpointing will help only when the values in the state are converging (for example a distinct count aggregation). If possible, could you provide your user code or explain what jobs are you running? Lei Wang 于2021年10月11日周一 下午4:1

Re: Checkpoint is timing out - inspecting state

2021-06-16 Thread Dan Hill
Hi Yun. The UI was not useful for this case. I had a feeling before hand about what the issue was. We refactored the state and now the checkpoint is 10x faster. On Mon, Jun 14, 2021 at 5:47 AM Yun Gao wrote: > Hi Dan, > > Flink should already have integrate a tool in the web UI to monitor > t

Re: Checkpoint loading failure

2021-06-16 Thread Guowei Ma
Hi Padarn Will there be these errors if the jobgraph is not modified? In addition, is this error stack all? Is it possible that other errors caused the stream to be closed? Best, Guowei On Tue, Jun 15, 2021 at 9:54 PM Padarn Wilson wrote: > Hi all, > > We have a job that has a medium size state

Re: Checkpoint is timing out - inspecting state

2021-06-14 Thread Yun Gao
Hi Dan, Flink should already have integrate a tool in the web UI to monitor the detailed statistics of the checkpoint [1]. It would show the time consumed in each part and each task, thus it could be used to debug the checkpoint timeout. Best, Yun [1] https://ci.apache.org/projects/flink/fli

Re: Checkpoint error - "The job has failed"

2021-04-28 Thread Dan Hill
3 not 1.11.1. > > [1] https://issues.apache.org/jira/browse/FLINK-16753 > > Best > Yun Tang > -- > *From:* Dan Hill > *Sent:* Tuesday, April 27, 2021 7:50 > *To:* Yun Tang > *Cc:* Robert Metzger ; user > *Subject:* Re: Checkpoint error - "The jo

Re: Checkpoint error - "The job has failed"

2021-04-28 Thread Yun Tang
n Tang Cc: Robert Metzger ; user Subject: Re: Checkpoint error - "The job has failed" Hey Yun and Robert, I'm using Flink v1.11.1. Robert, I'll send you a separate email with the logs. On Mon, Apr 26, 2021 at 12:46 AM Yun Tang mailto:myas...@live.com>> wrote: Hi Dan,

Re: Checkpoint error - "The job has failed"

2021-04-26 Thread Dan Hill
Flink-1.10.3. > > > [1] https://issues.apache.org/jira/browse/FLINK-16753 > > Best > Yun Tang > -- > *From:* Robert Metzger > *Sent:* Monday, April 26, 2021 14:46 > *To:* Dan Hill > *Cc:* user > *Subject:* Re: Checkpoint error - "

Re: Checkpoint error - "The job has failed"

2021-04-26 Thread Yun Tang
Hill Cc: user Subject: Re: Checkpoint error - "The job has failed" Hi Dan, can you provide me with the JobManager logs to take a look as well? (This will also tell me which Flink version you are using) On Mon, Apr 26, 2021 at 7:20 AM Dan Hill mailto:quietgol...@gmail.com>>

Re: Checkpoint error - "The job has failed"

2021-04-25 Thread Robert Metzger
Hi Dan, can you provide me with the JobManager logs to take a look as well? (This will also tell me which Flink version you are using) On Mon, Apr 26, 2021 at 7:20 AM Dan Hill wrote: > My Flink job failed to checkpoint with a "The job has failed" error. The > logs contained no other recent e

Re: Checkpoint timeouts at times of high load

2021-04-06 Thread Robert Metzger
has been configured with 4GB of memory, > there is a sliding window of 10 seconds with a slide of 1 second, and the > cluster setup is using flink native. > > > Any hints would be much appreciated! > > > Regards, > > M. > > > -- >

Re: Checkpoint timeouts at times of high load

2021-04-05 Thread Geldenhuys, Morgan Karl
appreciated! Regards, M. ________ From: Guowei Ma Sent: 01 April 2021 14:19 To: Geldenhuys, Morgan Karl Cc: user Subject: Re: Checkpoint timeouts at times of high load Hi, I think there are many reasons that could lead to the checkpoint timeout. Would you like to shar

Re: Checkpoint timeouts at times of high load

2021-04-01 Thread Guowei Ma
Hi, I think there are many reasons that could lead to the checkpoint timeout. Would you like to share some detailed information of checkpoint? For example, the detailed checkpoint information from the web.[1] And which Flink version do you use? [1] https://ci.apache.org/projects/flink/flink-docs-

Re: Checkpoint fail due to timeout

2021-03-30 Thread Alexey Trenikhun
wojski Sent: Tuesday, March 23, 2021 5:31 AM To: Alexey Trenikhun Cc: Arvid Heise ; ChangZhuo Chen (陳昌倬) ; ro...@apache.org ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hi Alexey, You should definitely investigate why the job is stuck. 1. First of all, is it completely s

Re: Checkpoint fail due to timeout

2021-03-30 Thread Alexey Trenikhun
uring next performance run. Thanks, Alexey From: Roman Khachatryan Sent: Tuesday, March 23, 2021 12:17 AM To: Alexey Trenikhun Cc: ChangZhuo Chen (陳昌倬) ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Unfortunately, the lock can't be chang

Re: Checkpoint fail due to timeout

2021-03-23 Thread Piotr Nowojski
kubernetes metrics report very little CPU usage by container, but unaligned > checkpoint still times out after 3hr. > > -- > *From:* Arvid Heise > *Sent:* Monday, March 22, 2021 6:58:20 AM > *To:* ChangZhuo Chen (陳昌倬) > *Cc:* Alexey Trenikhun ; ro

Re: Checkpoint fail due to timeout

2021-03-23 Thread Roman Khachatryan
> Thanks, > Alexey > > From: Roman Khachatryan > Sent: Monday, March 22, 2021 1:36 AM > To: ChangZhuo Chen (陳昌倬) > Cc: Alexey Trenikhun ; Flink User Mail List > > Subject: Re: Checkpoint fail due to timeout > > Thanks for sharin

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
hread.run(SourceStreamTask.java:263) Thanks, Alexey From: Roman Khachatryan Sent: Monday, March 22, 2021 1:36 AM To: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Thanks for sharing the thread dump. It

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Thanks for sharing the thread dump. It shows that the source thread is indeed back-pressured (checkpoint lock is held by a thread which is trying to emit but unable to acquire any free buffers

Re: Checkpoint fail due to timeout

2021-03-22 Thread Alexey Trenikhun
checkpoint still times out after 3hr. From: Arvid Heise Sent: Monday, March 22, 2021 6:58:20 AM To: ChangZhuo Chen (陳昌倬) Cc: Alexey Trenikhun ; ro...@apache.org ; Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hi Alexey, rescaling from

Re: Checkpoint fail due to timeout

2021-03-22 Thread Arvid Heise
Hi Alexey, rescaling from unaligned checkpoints will be supported with the upcoming 1.13 release (expected at the end of April). Best, Arvid On Wed, Mar 17, 2021 at 8:29 AM ChangZhuo Chen (陳昌倬) wrote: > On Wed, Mar 17, 2021 at 05:45:38AM +, Alexey Trenikhun wrote: > > In my opinion looks

Re: Checkpoint fail due to timeout

2021-03-22 Thread Roman Khachatryan
Thanks for sharing the thread dump. It shows that the source thread is indeed back-pressured (checkpoint lock is held by a thread which is trying to emit but unable to acquire any free buffers). The lock is per task, so there can be several locks per TM. @ChangZhuo Chen (陳昌倬) , in the thread you

Re: Checkpoint fail due to timeout

2021-03-17 Thread Alexey Trenikhun
("hdfs:///checkpoints-data/")); Difference to Savepoints ci.apache.org From: ChangZhuo Chen (陳昌倬) Sent: Wednesday, March 17, 2021 12:29 AM To: Alexey Trenikhun Cc: ro...@apache.org; Flink User Mail List Subject: Re: Checkpoint fail due to timeout On Wed, Ma

Re: Checkpoint fail due to timeout

2021-03-17 Thread 陳昌倬
On Wed, Mar 17, 2021 at 05:45:38AM +, Alexey Trenikhun wrote: > In my opinion looks similar. Were you able to tune-up Flink to make it work? > I'm stuck with it, I wanted to scale up hoping to reduce backpressure, but to > rescale I need to take savepoint, which never completes (at least take

Re: Checkpoint fail due to timeout

2021-03-16 Thread Alexey Trenikhun
From: ChangZhuo Chen (陳昌倬) Sent: Tuesday, March 16, 2021 6:59 AM To: Alexey Trenikhun Cc: ro...@apache.org; Flink User Mail List Subject: Re: Checkpoint fail due to timeout On Tue, Mar 16, 2021 at 02:32:54AM +, Alexey Trenikhun wrote: > Hi Roman, > I took thread dump: > "Source:

Re: Checkpoint fail due to timeout

2021-03-16 Thread 陳昌倬
On Tue, Mar 16, 2021 at 02:32:54AM +, Alexey Trenikhun wrote: > Hi Roman, > I took thread dump: > "Source: digital-itx-eastus2 -> Filter (6/6)#0" Id=200 BLOCKED on > java.lang.Object@5366a0e2 owned by "Legacy Source Thread - Source: > digital-itx-eastus2 -> Filter (6/6)#0" Id=202 > at >

Re: Checkpoint fail due to timeout

2021-03-15 Thread Alexey Trenikhun
k or per TM? I see multiple threads in SynchronizedStreamTaskActionExecutor.runThrowing blocked on different Objects. Thanks, Alexey From: Roman Khachatryan Sent: Monday, March 15, 2021 2:16 AM To: Alexey Trenikhun Cc: Flink User Mail List Subject: Re: Checkpoint fail due to timeout Hello Alexey,

Re: Checkpoint fail due to timeout

2021-03-15 Thread Roman Khachatryan
2.2 with same results > > Thanks, > Alexey > > From: Roman Khachatryan > Sent: Thursday, March 11, 2021 11:49 PM > To: Alexey Trenikhun > Cc: Flink User Mail List > Subject: Re: Checkpoint fail due to timeout > > Hello, > >

Re: Checkpoint fail due to timeout

2021-03-11 Thread Roman Khachatryan
Hello, This can be caused by several reasons such as back-pressure, large snapshots or bugs. Could you please share: - the stats of the previous (successful) checkpoints - back-pressure metrics for sources - which Flink version do you use? Regards, Roman On Thu, Mar 11, 2021 at 7:03 AM Alexey

Re: Re: Re: Checkpoint Error

2021-03-10 Thread Till Rohrmann
og? > > Also, have you enabled concurrent checkpoint? > > Best, > Yun > > > --Original Mail -- > *Sender:*Navneeth Krishnan > *Send Date:*Mon Mar 8 13:10:46 2021 > *Recipients:*Yun Gao > *CC:*user > *Subject:*Re: Re: Checkpoint

Re: Re: Re: Checkpoint Error

2021-03-08 Thread Yun Gao
:46 2021 Recipients:Yun Gao CC:user Subject:Re: Re: Checkpoint Error Hi Yun, Thanks for the response. I checked the mounts and only the JM's and TM's are mounted with this EFS. Not sure how to debug this. Thanks On Sun, Mar 7, 2021 at 8:29 PM Yun Gao wrote: Hi Navneeth, It seem

Re: Re: Checkpoint Error

2021-03-07 Thread Navneeth Krishnan
*Navneeth Krishnan > *Send Date:*Sun Mar 7 15:44:59 2021 > *Recipients:*user > *Subject:*Re: Checkpoint Error > >> Hi All, >> >> Any suggestions? >> >> Thanks >> >> On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan < >> reachnavnee.

Re: Re: Checkpoint Error

2021-03-07 Thread Yun Gao
Hi Navneeth, It seems from the stack that the exception is caused by the underlying EFS problems ? Have you checked if there are errors reported for EFS, or if there might be duplicate mounting for the same EFS and others have ever deleted the directory? Best, Yun --Original

Re: Checkpoint Error

2021-03-06 Thread Navneeth Krishnan
Hi All, Any suggestions? Thanks On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan wrote: > Hi All, > > We are running our streaming job on flink 1.7.2 and we are noticing the > below error. Not sure what's causing it, any pointers would help. We have > 10 TM's checkpointing to AWS EFS. > > Asy

Re: Checkpoint problem in 1.12.0

2021-02-03 Thread Till Rohrmann
Thanks for reaching out to the Flink community. I will respond on the JIRA ticket. Cheers, Till On Wed, Feb 3, 2021 at 1:59 PM simpleusr wrote: > Hi > > I am trying to upgrade from 1.5.5 to 1.12 and checkpointing mechanism seems > to be broken in our kafka connector sourced datastream jobs. > >

Re: Re: checkpoint delay consume message

2021-01-05 Thread Arvid Heise
t;‪ >>> yungao...@aliyun.com‬‏>:‬ >>> >>>> Hi nick, >>>> >>>>Sorry I initially think that the data is also write into Kafka with >>>> flink . So it could be ensured that there is no delay in the write side, >>>> rig

Re: Re: checkpoint delay consume message

2020-12-26 Thread nick toker
Hi nick, >>> >>>Sorry I initially think that the data is also write into Kafka with >>> flink . So it could be ensured that there is no delay in the write side, >>> right ? Does the delay in the read side keeps existing ? >>> >>> Best,

Re: Re: checkpoint delay consume message

2020-12-23 Thread lec ssmi
write side, >> right ? Does the delay in the read side keeps existing ? >> >> Best, >> Yun >> >> >> >> --Original Mail ------ >> *Sender:*nick toker >> *Send Date:*Tue Dec 22 01:43:50 2020 >> *Recipients:*Yun

Re: Re: checkpoint delay consume message

2020-12-22 Thread nick toker
delay in the write side, > right ? Does the delay in the read side keeps existing ? > > Best, > Yun > > > > --Original Mail -- > *Sender:*nick toker > *Send Date:*Tue Dec 22 01:43:50 2020 > *Recipients:*Yun Gao > *CC:*user > *Su

Re: Re: checkpoint delay consume message

2020-12-21 Thread Yun Gao
Hi nick, Sorry I initially think that the data is also write into Kafka with flink . So it could be ensured that there is no delay in the write side, right ? Does the delay in the read side keeps existing ? Best, Yun --Original Mail -- Sender:nick toker

Re: checkpoint delay consume message

2020-12-21 Thread nick toker
hi i am confused the delay in in the source when reading message not on the sink nick ‫בתאריך יום ב׳, 21 בדצמ׳ 2020 ב-18:12 מאת ‪Yun Gao‬‏ <‪yungao...@aliyun.com ‬‏>:‬ > Hi Nick, > > Are you using EXACTLY_ONCE semantics ? If so the sink would use > transactions, and only commit the transa

Re: checkpoint delay consume message

2020-12-21 Thread Yun Gao
Hi Nick, Are you using EXACTLY_ONCE semantics ? If so the sink would use transactions, and only commit the transaction on checkpoint complete to ensure end-to-end exactly-once. A detailed description could be find in [1] Best, Yun [1] https://flink.apache.org/features/2018/03/01/end-t

Re: checkpoint interval and hdfs file capacity

2020-11-11 Thread Congxian Qiu
Hi Currently, checkpoint discard logic was executed in Executor[1], maybe it will not be deleted so quickly [1] https://github.com/apache/flink/blob/91404f435f20c5cd6714ee18bf4ccf95c81fb73e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointsCleaner.java#L45 Best, Congx

Re: Checkpoint growth

2020-11-10 Thread Akshay Aggarwal
Hi Rex, As per my understanding there are multiple levels of compactions (with RocksDB), and files which are not compacted recently would remain in older checkpoint directories, and there will be references to those files in the current checkpoint. There is no clear way of identifying these refere

Re: checkpoint interval and hdfs file capacity

2020-11-09 Thread lec ssmi
Thanks. I have some jobs with the checkpoint interval 1000ms. And the HDFS files grow too large to work normally . What I am curious about is, are writing and deleting performed synchronously? Is it possible to add too fast to delete old files? Congxian Qiu 于2020年11月10日周二 下午2:16写道: > Hi >

Re: checkpoint interval and hdfs file capacity

2020-11-09 Thread Congxian Qiu
Hi No matter what interval you set, Flink will take care of the checkpoints(remove the useless checkpoint when it can), but when you set a very small checkpoint interval, there may be much high pressure for the storage system(here is RPC pressure of HDFS NN). Best, Congxian lec ssmi 于2020年1

Re: checkpoint fail

2020-10-10 Thread Yun Tang
Hi Song Flink-1.4.2 is a bit too old, and I think this error is caused by FLINK-8876 [1][2] which should be fixed after Flink-1.5, please consider to upgrade Flink version. [1] https://issues.apache.org/jira/browse/FLINK-8876 [2] https://issues.apache.org/jira/browse/FLINK-8836 Best Yun Tang

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-10-02 Thread Eleanore Jin
Thanks a lot for the confirmation. Eleanore On Fri, Oct 2, 2020 at 2:42 AM Chesnay Schepler wrote: > Yes, the patch call only triggers the cancellation. > You can check whether it is complete by polling the job status via > jobs/ and checking whether state is CANCELED. > > On 9/27/2020 7:02 PM,

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-10-02 Thread Chesnay Schepler
Yes, the patch call only triggers the cancellation. You can check whether it is complete by polling the job status via jobs/ and checking whether state is CANCELED. On 9/27/2020 7:02 PM, Eleanore Jin wrote: I have noticed this: if I have Thread.sleep(1500); after the patch call returned 202, t

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-29 Thread Till Rohrmann
Great, thanks Klou! Cheers, Till On Mon, Sep 28, 2020 at 5:07 PM Kostas Kloudas wrote: > Hi all, > > I will have a look. > > Kostas > > On Mon, Sep 28, 2020 at 3:56 PM Till Rohrmann > wrote: > > > > Hi Cristian, > > > > thanks for reporting this issue. It looks indeed like a very critical > pr

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-28 Thread Kostas Kloudas
Hi all, I will have a look. Kostas On Mon, Sep 28, 2020 at 3:56 PM Till Rohrmann wrote: > > Hi Cristian, > > thanks for reporting this issue. It looks indeed like a very critical problem. > > The problem seems to be that the ApplicationDispatcherBootstrap class > produces an exception (that th

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-28 Thread Till Rohrmann
Hi Cristian, thanks for reporting this issue. It looks indeed like a very critical problem. The problem seems to be that the ApplicationDispatcherBootstrap class produces an exception (that the request job can no longer be found because of a lost ZooKeeper connection) which will be interpreted as

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-09-27 Thread Eleanore Jin
I have noticed this: if I have Thread.sleep(1500); after the patch call returned 202, then the directory gets cleaned up, in the meanwhile, it shows the job-manager pod is in completed state before getting terminated: see screenshot: https://ibb.co/3F8HsvG So the patch call is async to terminate t

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-09-27 Thread Eleanore Jin
Hi Congxian, I am making rest call to get the checkpoint config: curl -X GET \ http://localhost:8081/jobs/d2c91a44f23efa2b6a0a89b9f1ca5a3d/checkpoints/config and here is the response: { "mode": "at_least_once", "interval": 3000, "timeout": 1, "min_pause": 1000, "max_concur

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-09-26 Thread Congxian Qiu
Hi Eleanore What the `CheckpointRetentionPolicy`[1] did you set for your job? if `ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION` is set, then the checkpoint will be kept when canceling a job. PS the image did not show [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/stat

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-09 Thread Yang Wang
> The job sub directory will be cleaned up when the job finished/canceled/failed. Since we could submit multiple jobs into a Flink session, what i mean is when a job reached to the terminal state, the sub node(e.g. /flink/application_/running_job_registry/4d255397c7aeb5327adb567238c983c1) on th

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Cristian
> The job sub directory will be cleaned up when the job > finished/canceled/failed. What does this mean? Also, to clarify: I'm a very sloppy developer. My jobs crash ALL the time... and yet, the jobs would ALWAYS resume from the last checkpoint. The only cases where I expect Flink to clean u

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Yang Wang
AFAIK, the HA data, including Zookeeper meta data and real data on DFS, will only be cleaned up when the Flink cluster reached terminated state. So if you are using a session cluster, the root cluster node on Zk will be cleaned up after you manually stop the session cluster. The job sub directory

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Cristian
I'm using the standalone script to start the cluster. As far as I can tell, it's not easy to reproduce. We found that zookeeper lost a node around the time this happened, but all of our other 75 Flink jobs which use the same setup, version and zookeeper, didn't have any issues. They didn't eve

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Robert Metzger
Thanks a lot for reporting this problem here Cristian! I am not super familiar with the involved components, but the behavior you are describing doesn't sound right to me. Which entrypoint are you using? This is logged at the beginning, like this: "2020-09-08 14:45:32,807 INFO org.apache.flink.ru

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-07 Thread Husky Zeng
Hi Cristian, I don't know if it was designed to be like this deliberately. So I have already submitted an issue ,and wait for somebody to response. https://issues.apache.org/jira/browse/FLINK-19154 -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-07 Thread Cristian
That's an excellent question. I can't explain that. All I know is this: - the job was upgraded and resumed from a savepoint - After hours of working fine, it failed (like it shows in the logs) - the Metadata was cleaned up, again as shown in the logs - because I run this in Kubernetes, the conta

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-07 Thread Husky Zeng
I means that checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints). You could use "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION" to save your checkpoint when te cames to failure. When your zookeeper lost connect

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-05 Thread Cristian
> If you want to save your checkPoint,you could refer to this document What do you mean? We already persist our savepoints, and we do not delete them explicitly ever. The problem is that Flink deleted the data from zookeeper when it shouldn't have. Is it possible to start a job from a checkpo

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-05 Thread Husky Zeng
Hi Cristian, >From this code , we could see that the Exception or Error was ignored in dispatcher.shutDownCluster(applicationStatus) . `` org.apache.flink.runtime.dispatcher.DispatcherGateway#shutDownCluster return applicationCompletionFuture .handle((r, t) -> {

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-04 Thread Cristian
My suspicion is that somewhere in the path were it fails to connect yo zookeeper, the exception is swallowed, so instead of running the shutdown path for when the job fails, the general shutdown path is taken. This was fortunately a job for which we had a savepoint from yesterday. Otherwise

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-04 Thread Qingdong Zeng
Hi Cristian, In the log,we can see it went to the method shutDownAsync(applicationStatus,null,true); `` 2020-09-04 17:32:07,950 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Shutting StandaloneApplicationClusterEntryPoint down w

Re: Checkpoint failed because of TimeWindow cannot be cast to VoidNamespace

2020-07-12 Thread Si-li Liu
Someone told me that maybe this issue is Mesos specific. I'm kind of a newbie in Flink, and I digged into the code but can not get a conclusion. Here I just wanna have a better JoinWindow that emits the result and delete it from the window state immediately when joined successfully, is there any ot

Re: Checkpoint failed because of TimeWindow cannot be cast to VoidNamespace

2020-07-11 Thread Congxian Qiu
Hi Si-li Thanks for the notice. I just want to double-check is the original problem has been solved? As I found that the created issue FLINK-18464 has been closed with reason "can not reproduce". Am I missing something here? Best, Congxian Si-li Liu 于2020年7月10日周五 下午6:06写道: > Sorry > > I can'

Re: Checkpoint failed because of TimeWindow cannot be cast to VoidNamespace

2020-07-10 Thread Si-li Liu
Sorry I can't reproduce it with reduce/aggregate/fold/apply and due to some limitations in my working environment, I can't use flink 1.10 or 1.11. Congxian Qiu 于2020年7月5日周日 下午6:21写道: > Hi > > First, Could you please try this problem still there if use flink 1.10 or > 1.11? > > It seems strange,

Re: Checkpoint is disable, will history data in rocksdb be leak when job restart?

2020-07-05 Thread Congxian Qiu
that > problem should be first considered. > > Best > Yun Tang > -- > *From:* SmileSmile > *Sent:* Friday, July 3, 2020 14:30 > *To:* Yun Tang > *Cc:* 'user@flink.apache.org' > *Subject:* Re: Checkpoint is disable, will history d

Re: Checkpoint failed because of TimeWindow cannot be cast to VoidNamespace

2020-07-05 Thread Congxian Qiu
Hi First, Could you please try this problem still there if use flink 1.10 or 1.11? It seems strange, from the error message, here is an error when trying to convert a non-Window state(VoidNameSpace) to a Window State (serializer is the serializer of Window state, but the state is non-Window state

  1   2   3   >