We have also enabled unaligned checkpoints. Could it be because of that? We
were experiencing slowness and intermittent packet loss when this issue
occurred.
On Wed, Jul 31, 2024 at 7:43 PM Dhruv Patel wrote:
> Hi Everyone,
>
> We are observing an interesting issue with continuous checkpoint
>
d but with applied probability?
>
> Thanks.
>
>
> -- Original Message --
> From: "Yanfei Lei" ;
> Sent: Tuesday, July 30, 2024, 5:15 PM
> To: "Enric Ott"<243816...@qq.com>;
> Cc: "user";
> Subject: Re: checkpoint upload thread
>
> Hi Enric
Hi Enric,
If I understand correctly, one subtask would use one
`asyncOperationsThreadPool` [1,2]; it is possible to use the same
connection for an operator chain.
[1]
https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask
Hi Oscar,
> but we don't understand why this incremental checkpoint keeps increasing
AFAIK, when performing an incremental checkpoint, the RocksDBStateBackend will
upload the newly created SST files to remote storage. The total size of these
files is the incremental checkpoint size. However, the new c
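For reference, a minimal sketch (assuming Flink 1.13+ with the EmbeddedRocksDBStateBackend; the checkpoint path is a placeholder) of enabling incremental RocksDB checkpoints so that only newly created SST files are uploaded per checkpoint:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// true = incremental checkpoints: only SST files not uploaded before are shipped
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
// placeholder checkpoint location; any DFS path (HDFS/S3/...) works
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
env.enableCheckpointing(60_000); // one checkpoint per minute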
tate_backends/#enabling-changelog
>
>
>
>
>
> *From:* Őrhidi Mátyás
> *Sent:* Wednesday, September 13, 2023 2:47 PM
> *To:* Gyula Fóra
> *Cc:* Hangxiang Yu ; user@flink.apache.org
> *Subject:* Re: Checkpoint jitter?
>
>
>
> Correct, thanks for the clarif
: Wednesday, September 13, 2023 2:47 PM
To: Gyula Fóra
Cc: Hangxiang Yu ; user@flink.apache.org
Subject: Re: Checkpoint jitter?
Correct, thanks for the clarification Gyula!
On Wed, Sep 13, 2023 at 1:39 AM Gyula Fóra
mailto:gyula.f...@gmail.com>> wrote:
No, I think what he means is to trigg
Correct, thanks for the clarification Gyula!
On Wed, Sep 13, 2023 at 1:39 AM Gyula Fóra wrote:
> No, I think what he means is to trigger the checkpoint at slightly
> different times at the different sources so the different parts of the
> pipeline would not checkpoint at the same time.
>
> Gyula
No, I think what he means is to trigger the checkpoint at slightly
different times at the different sources so the different parts of the
pipeline would not checkpoint at the same time.
Gyula
On Wed, Sep 13, 2023 at 10:32 AM Hangxiang Yu wrote:
> Hi, Matyas.
> Do you mean something like adjusti
Hi, Matyas.
Do you mean something like dynamically adjusting checkpoint intervals
or the frequency of uploading files according to the pressure on the durable
storage?
On Wed, Sep 13, 2023 at 9:12 AM Őrhidi Mátyás
wrote:
> Hey folks,
>
> Is it possible to add some sort of jitter to the checkpointin
Hi Frederic,
I’ve once (upon a time 😊) had a similar situation when we changed from Flink
1.8 to Flink 1.13 … It took me a long time to figure out.
Some hints on where to start looking:
* _metadata file is used for
* Job manager state
* Smallish keyed state (in order to avoid too
Hi Neha,
The HOP window will increase the size of the checkpoint and I'm sorry
that I'm not very familiar with the HOP window.
If the configurations are all right and you want to confirm whether it's a HOP
window issue, I think you can submit a Flink job without the HOP window but
with a regular agg operat
Hi Shammon,
These configs exist in the Flink WebUI. We have set
exeEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime); do
you think it could create issues for the HOP(proctime, some interval,
some interval) window and prevent the state from being released for checkpoints?
I am really confused about why s
Hi Neha,
I think you can first check whether the options `state.backend` and
`state.backend.incremental` you mentioned above exist in
`JobManager`->`Configuration` in Flink webui. If they do not exist, you may
be using the wrong conf file.
Best,
Shammon FY
On Mon, Jul 17, 2023 at 5:04 PM Neha .
Hi Shammon,
state.backend: rocksdb
state.backend.incremental: true
This is already set in the Flink conf. Anything else that should be taken
care of for incremental checkpointing? Is there any related bug in
Flink 1.16.1? We have recently migrated to Flink 1.16.1 from Flink 1.13.6.
What can b
Hi Neha,
I noticed that the `Checkpointed Data Size` is always equal to the `Full
Checkpoint Data Size`, so I think the job is using full checkpoints instead of
incremental checkpoints; you can check it
Best,
Shammon FY
On Mon, Jul 17, 2023 at 10:25 AM Neha . wrote:
> Hello Shammon,
>
> Thank you for
Hello Shammon,
Thank you for your assistance.
I have already enabled incremental checkpointing; attaching the
screenshot. Can you please elaborate on what makes you think it is not
enabled? It might hint towards the issue. The problem is that the checkpoint size is
not going down and keeps on increasin
Hi Neha,
I think it is normal for the data size of a savepoint to be smaller than
the full data of a checkpoint. Flink uses RocksDB to store checkpointed
data, which is an LSM-structured store where the same key can have
multiple version records, while a savepoint will traverse all keys and store
From: James Sandys-Lumsdaine
Sent: Wednesday, May 18, 2022 2:53 PM
To: Schwalbe Matthias
Cc: Hangxiang Yu ; user@flink.apache.org
Subject: Re: Checkpoint directories not cleared as TaskManagers run
Hello Matthias,
Thanks for your reply. Yes indeed you are correct. My /tmp path is private so
you hav
in order to work
- as your jobmanager cannot access the checkpoint files, it can also not
clean up those files
Hope that helps
Regards
Thias
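To illustrate the point, a minimal sketch (hypothetical shared path; assuming the CheckpointConfig API of Flink 1.13+) of pointing checkpoints at storage that both the taskmanagers and the jobmanager can reach, instead of a TM-local /tmp:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30_000);
// A shared path visible to the JM and all TMs, so the JM can also delete old checkpoints.
// A file:///tmp/... path on each TM's private disk would leave the JM unable to clean up.
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");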
From: James Sandys-Lumsdaine
Sent: Tuesday, May 17, 2022 3:55 PM
To: Hangxiang Yu ; user@flink.apache.org
Subject: Re: Checkpoint directories not
jobmanager cannot access the checkpoint files, it can also not
clean up those files
Hope that helps
Regards
Thias
From: James Sandys-Lumsdaine
Sent: Tuesday, May 17, 2022 3:55 PM
To: Hangxiang Yu ; user@flink.apache.org
Subject: Re: Checkpoint directories not cleared as TaskManagers run
___
From: Hangxiang Yu
Sent: 17 May 2022 14:38
To: James Sandys-Lumsdaine ; user@flink.apache.org
Subject: Re: Checkpoint directories not cleared as TaskManagers run
Hi, James.
I may not get what the problem is.
All checkpoints will be stored at the address you set.
IIUC, TMs will write som
Hi, James.
I may not get what the problem is.
All checkpoints will be stored at the address you set.
IIUC, TMs will write some checkpoint info to their local dir, then
upload it to the address, and then delete the local copy.
JM will write some checkpoint metadata to the address and also do the
entir
Some further Googling turned up a StackOverflow posting saying it is the jobmanager that
does the deletion, not the taskmanagers.
Currently my taskmanagers are writing their checkpoints to their own private
disks (/tmp) rather than a share - so my suspicion is the jobmanager can't
access the folder o
Thank you for the help. To follow up, the issue went away when we reverted
to Flink 1.13. It may be related to FLINK-27481. Before reverting, we
tested unaligned checkpoints with a timeout of 10 minutes, which timed out.
Thanks.
On Thu, Apr 28, 2022, 5:38 PM Guowei Ma wrote:
> Hi Sam
>
> I thi
Hi Sam
I think the first step is to see which part of your Flink app is blocking
the completion of the checkpoint. Specifically, you can refer to the
"Checkpoint Details" section of the document [1]. Using these methods, you
should be able to observe where the checkpoint is blocked, for example, it
ma
Hi Patrick,
do you even have so much backpressure that unaligned checkpoints are
necessary? You seem to have only one network exchange where unaligned
checkpoints help.
The Flink 1.11 implementation of unaligned checkpoints was still
experimental and it might cause unexpected side effects. Afaik, w
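As a sketch (assuming a Flink version where these CheckpointConfig methods exist, roughly 1.13+; all values are placeholders), unaligned checkpoints can be enabled together with an aligned-checkpoint timeout, so alignment is only abandoned under back-pressure:

import java.time.Duration;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);
CheckpointConfig conf = env.getCheckpointConfig();
conf.enableUnalignedCheckpoints();                        // allow barriers to overtake buffered data
conf.setAlignedCheckpointTimeout(Duration.ofSeconds(30)); // start aligned, fall back after 30 s
conf.setCheckpointTimeout(10 * 60 * 1000);                // fail the checkpoint after 10 minutes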
Hi Patrick,
Could you also have a look at the stack of the tasks of the
second function to see what the main thread and netty
thread are doing during the checkpoint period?
Best,
Yun
--Original Mail --
Sender:
Send Date:Wed Oct 27 22:05:40 2021
Recipients:Fli
Since the state size is small, you can try the filesystem state backend rather than
RocksDB. You can check once. The rule of thumb is: if the filesystem state backend
performs worse, RocksDB is good.
Regards
Bhasakar
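A small sketch of that suggestion (assuming Flink 1.13+, where the filesystem backend is expressed as HashMapStateBackend plus a checkpoint storage path; older versions would use FsStateBackend instead; the path is a placeholder):

import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Keep state on the JVM heap (fine for small state), checkpoint it to a DFS path.
env.setStateBackend(new HashMapStateBackend());
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");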
On Tue, Oct 12, 2021 at 1:47 PM Yun Tang wrote:
> Hi Lei,
>
> RocksDB state-backend's checkpoint is composited b
Hi Lei,
RocksDB state-backend's checkpoint is composed of RocksDB's own files
(unmodified compressed SST format files), and incremental checkpointing means
Flink does not upload files which were uploaded before. As you can see,
incremental checkpoints highly depend on RocksDB's own mechanism
Hi!
Checkpoint sizes are highly related to your job. Incremental checkpointing
will help only when the values in the state are converging (for example a
distinct count aggregation).
If possible, could you provide your user code or explain what jobs you are
running?
Lei Wang wrote on Mon, Oct 11, 2021 at 4:1
Hi Yun. The UI was not useful for this case. I had a feeling beforehand
about what the issue was. We refactored the state and now the checkpoint
is 10x faster.
On Mon, Jun 14, 2021 at 5:47 AM Yun Gao wrote:
> Hi Dan,
>
> Flink already integrates a tool in the web UI to monitor
> t
Hi Padarn
Do these errors occur if the jobgraph is not modified?
In addition, is this the entire error stack? Is it possible that other errors
caused the stream to be closed?
Best,
Guowei
On Tue, Jun 15, 2021 at 9:54 PM Padarn Wilson wrote:
> Hi all,
>
> We have a job that has a medium size state
Hi Dan,
Flink already integrates a tool in the web UI to monitor
the detailed statistics of checkpoints [1]. It shows the time
consumed in each part and each task, so it can be used to debug
the checkpoint timeout.
Best,
Yun
[1]
https://ci.apache.org/projects/flink/fli
3 not 1.11.1.
>
> [1] https://issues.apache.org/jira/browse/FLINK-16753
>
> Best
> Yun Tang
> --
> *From:* Dan Hill
> *Sent:* Tuesday, April 27, 2021 7:50
> *To:* Yun Tang
> *Cc:* Robert Metzger ; user
> *Subject:* Re: Checkpoint error - "The jo
n Tang
Cc: Robert Metzger ; user
Subject: Re: Checkpoint error - "The job has failed"
Hey Yun and Robert,
I'm using Flink v1.11.1.
Robert, I'll send you a separate email with the logs.
On Mon, Apr 26, 2021 at 12:46 AM Yun Tang
mailto:myas...@live.com>> wrote:
Hi Dan,
Flink-1.10.3.
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-16753
>
> Best
> Yun Tang
> --
> *From:* Robert Metzger
> *Sent:* Monday, April 26, 2021 14:46
> *To:* Dan Hill
> *Cc:* user
> *Subject:* Re: Checkpoint error - "
Hill
Cc: user
Subject: Re: Checkpoint error - "The job has failed"
Hi Dan,
can you provide me with the JobManager logs to take a look as well? (This will
also tell me which Flink version you are using)
On Mon, Apr 26, 2021 at 7:20 AM Dan Hill
mailto:quietgol...@gmail.com>>
Hi Dan,
can you provide me with the JobManager logs to take a look as well? (This
will also tell me which Flink version you are using)
On Mon, Apr 26, 2021 at 7:20 AM Dan Hill wrote:
> My Flink job failed to checkpoint with a "The job has failed" error. The
> logs contained no other recent e
has been configured with 4GB of memory,
> there is a sliding window of 10 seconds with a slide of 1 second, and the
> cluster setup is using flink native.
>
>
> Any hints would be much appreciated!
>
>
> Regards,
>
> M.
>
>
> --
>
appreciated!
Regards,
M.
________
From: Guowei Ma
Sent: 01 April 2021 14:19
To: Geldenhuys, Morgan Karl
Cc: user
Subject: Re: Checkpoint timeouts at times of high load
Hi,
I think there are many reasons that could lead to the checkpoint timeout.
Would you like to shar
Hi,
I think there are many reasons that could lead to the checkpoint timeout.
Would you like to share some detailed information about the checkpoint? For
example, the detailed checkpoint information from the web UI [1]. And which
Flink version do you use?
[1]
https://ci.apache.org/projects/flink/flink-docs-
wojski
Sent: Tuesday, March 23, 2021 5:31 AM
To: Alexey Trenikhun
Cc: Arvid Heise ; ChangZhuo Chen (陳昌倬) ;
ro...@apache.org ; Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
Hi Alexey,
You should definitely investigate why the job is stuck.
1. First of all, is it completely s
uring next performance run.
Thanks,
Alexey
From: Roman Khachatryan
Sent: Tuesday, March 23, 2021 12:17 AM
To: Alexey Trenikhun
Cc: ChangZhuo Chen (陳昌倬) ; Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
Unfortunately, the lock can't be chang
kubernetes metrics report very little CPU usage by container, but unaligned
> checkpoint still times out after 3hr.
>
> --
> *From:* Arvid Heise
> *Sent:* Monday, March 22, 2021 6:58:20 AM
> *To:* ChangZhuo Chen (陳昌倬)
> *Cc:* Alexey Trenikhun ; ro
> Thanks,
> Alexey
>
> From: Roman Khachatryan
> Sent: Monday, March 22, 2021 1:36 AM
> To: ChangZhuo Chen (陳昌倬)
> Cc: Alexey Trenikhun ; Flink User Mail List
>
> Subject: Re: Checkpoint fail due to timeout
>
> Thanks for sharin
hread.run(SourceStreamTask.java:263)
Thanks,
Alexey
From: Roman Khachatryan
Sent: Monday, March 22, 2021 1:36 AM
To: ChangZhuo Chen (陳昌倬)
Cc: Alexey Trenikhun ; Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
Thanks for sharing the thread dump.
It
: ChangZhuo Chen (陳昌倬)
Cc: Alexey Trenikhun ; Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
Thanks for sharing the thread dump.
It shows that the source thread is indeed back-pressured
(checkpoint lock is held by a thread which is trying to emit but
unable to acquire any free buffers
checkpoint
still times out after 3hr.
From: Arvid Heise
Sent: Monday, March 22, 2021 6:58:20 AM
To: ChangZhuo Chen (陳昌倬)
Cc: Alexey Trenikhun ; ro...@apache.org ;
Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
Hi Alexey,
rescaling from
Hi Alexey,
rescaling from unaligned checkpoints will be supported with the upcoming
1.13 release (expected at the end of April).
Best,
Arvid
On Wed, Mar 17, 2021 at 8:29 AM ChangZhuo Chen (陳昌倬)
wrote:
> On Wed, Mar 17, 2021 at 05:45:38AM +, Alexey Trenikhun wrote:
> > In my opinion looks
Thanks for sharing the thread dump.
It shows that the source thread is indeed back-pressured
(checkpoint lock is held by a thread which is trying to emit but
unable to acquire any free buffers).
The lock is per task, so there can be several locks per TM.
@ChangZhuo Chen (陳昌倬) , in the thread you
("hdfs:///checkpoints-data/"));
Difference to Savepoints
ci.apache.org
From: ChangZhuo Chen (陳昌倬)
Sent: Wednesday, March 17, 2021 12:29 AM
To: Alexey Trenikhun
Cc: ro...@apache.org; Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
On Wed, Ma
On Wed, Mar 17, 2021 at 05:45:38AM +, Alexey Trenikhun wrote:
> In my opinion it looks similar. Were you able to tune Flink to make it work?
> I'm stuck with it; I wanted to scale up hoping to reduce backpressure, but to
> rescale I need to take a savepoint, which never completes (at least take
From: ChangZhuo Chen (陳昌倬)
Sent: Tuesday, March 16, 2021 6:59 AM
To: Alexey Trenikhun
Cc: ro...@apache.org; Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
On Tue, Mar 16, 2021 at 02:32:54AM +, Alexey Trenikhun wrote:
> Hi Roman,
> I took thread dump:
> "Source:
On Tue, Mar 16, 2021 at 02:32:54AM +, Alexey Trenikhun wrote:
> Hi Roman,
> I took thread dump:
> "Source: digital-itx-eastus2 -> Filter (6/6)#0" Id=200 BLOCKED on
> java.lang.Object@5366a0e2 owned by "Legacy Source Thread - Source:
> digital-itx-eastus2 -> Filter (6/6)#0" Id=202
> at
>
k or per TM? I see multiple
threads in SynchronizedStreamTaskActionExecutor.runThrowing blocked on
different Objects.
Thanks,
Alexey
From: Roman Khachatryan
Sent: Monday, March 15, 2021 2:16 AM
To: Alexey Trenikhun
Cc: Flink User Mail List
Subject: Re: Checkpoint fail due to timeout
Hello Alexey,
2.2 with same results
>
> Thanks,
> Alexey
>
> From: Roman Khachatryan
> Sent: Thursday, March 11, 2021 11:49 PM
> To: Alexey Trenikhun
> Cc: Flink User Mail List
> Subject: Re: Checkpoint fail due to timeout
>
> Hello,
>
>
Hello,
This can be caused by several reasons such as back-pressure, large
snapshots or bugs.
Could you please share:
- the stats of the previous (successful) checkpoints
- back-pressure metrics for sources
- which Flink version you are using
Regards,
Roman
On Thu, Mar 11, 2021 at 7:03 AM Alexey
og?
>
> Also, have you enabled concurrent checkpoint?
>
> Best,
> Yun
>
>
> --Original Mail --
> *Sender:*Navneeth Krishnan
> *Send Date:*Mon Mar 8 13:10:46 2021
> *Recipients:*Yun Gao
> *CC:*user
> *Subject:*Re: Re: Checkpoint
:46 2021
Recipients:Yun Gao
CC:user
Subject:Re: Re: Checkpoint Error
Hi Yun,
Thanks for the response. I checked the mounts and only the JMs and TMs are
mounted with this EFS. Not sure how to debug this.
Thanks
On Sun, Mar 7, 2021 at 8:29 PM Yun Gao wrote:
Hi Navneeth,
It seem
*Navneeth Krishnan
> *Send Date:*Sun Mar 7 15:44:59 2021
> *Recipients:*user
> *Subject:*Re: Checkpoint Error
>
>> Hi All,
>>
>> Any suggestions?
>>
>> Thanks
>>
>> On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <
>> reachnavnee.
Hi Navneeth,
It seems from the stack that the exception is caused by underlying EFS
problems. Have you checked
whether there are errors reported for EFS, or whether there might be duplicate mounts
of the same EFS, and whether anyone
has ever deleted the directory?
Best,
Yun
--Original
Hi All,
Any suggestions?
Thanks
On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan
wrote:
> Hi All,
>
> We are running our streaming job on flink 1.7.2 and we are noticing the
> below error. Not sure what's causing it, any pointers would help. We have
> 10 TM's checkpointing to AWS EFS.
>
> Asy
Thanks for reaching out to the Flink community. I will respond on the JIRA
ticket.
Cheers,
Till
On Wed, Feb 3, 2021 at 1:59 PM simpleusr wrote:
> Hi
>
> I am trying to upgrade from 1.5.5 to 1.12 and the checkpointing mechanism seems
> to be broken in our Kafka-connector-sourced datastream jobs.
>
>
t;
>>> yungao...@aliyun.com>:
>>>
>>>> Hi nick,
>>>>
>>>> Sorry, I initially thought that the data is also written into Kafka with
>>>> Flink, so it could be ensured that there is no delay on the write side,
>>>> rig
Hi nick,
>>>
>>> Sorry, I initially thought that the data is also written into Kafka with
>>> Flink, so it could be ensured that there is no delay on the write side,
>>> right? Does the delay on the read side still exist?
>>>
>>> Best,
write side,
>> right? Does the delay on the read side still exist?
>>
>> Best,
>> Yun
>>
>>
>>
>> --Original Mail ------
>> *Sender:*nick toker
>> *Send Date:*Tue Dec 22 01:43:50 2020
>> *Recipients:*Yun
delay in the write side,
> right? Does the delay on the read side still exist?
>
> Best,
> Yun
>
>
>
> --Original Mail --
> *Sender:*nick toker
> *Send Date:*Tue Dec 22 01:43:50 2020
> *Recipients:*Yun Gao
> *CC:*user
> *Su
Hi nick,
Sorry, I initially thought that the data is also written into Kafka with Flink,
so it could be ensured that there is no delay on the write side, right? Does
the delay on the read side still exist?
Best,
Yun
--Original Mail --
Sender:nick toker
Hi,
I am confused. The delay is in the source when reading messages, not in the sink.
Nick
On Mon, Dec 21, 2020 at 18:12 Yun Gao <yungao...@aliyun.com
>:
> Hi Nick,
>
> Are you using EXACTLY_ONCE semantics ? If so the sink would use
> transactions, and only commit the transa
Hi Nick,
Are you using EXACTLY_ONCE semantics? If so, the sink would use
transactions, and only commit the transaction on checkpoint completion to ensure
end-to-end exactly-once. A detailed description can be found in [1].
Best,
Yun
[1]
https://flink.apache.org/features/2018/03/01/end-t
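For context, a sketch (assuming the newer KafkaSink API from flink-connector-kafka, Flink 1.14+; brokers, topic and id prefix are placeholders) of an exactly-once sink whose transactions are committed only when a checkpoint completes:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers("broker-1:9092")             // placeholder brokers
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("output-topic")                 // placeholder topic
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .setTransactionalIdPrefix("my-app")               // required for EXACTLY_ONCE
        .build();

With this guarantee, consumers using read_committed isolation only see the records after the enclosing checkpoint succeeds, which is why a slow checkpoint shows up as read-side delay.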
Hi
Currently, the checkpoint discard logic is executed in an Executor [1], so maybe
it will not be deleted so quickly.
[1]
https://github.com/apache/flink/blob/91404f435f20c5cd6714ee18bf4ccf95c81fb73e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointsCleaner.java#L45
Best,
Congx
Hi Rex,
As per my understanding there are multiple levels of compaction (with
RocksDB), and files which have not been compacted recently would remain in older
checkpoint directories, and there will be references to those files in the
current checkpoint. There is no clear way of identifying these refere
Thanks.
I have some jobs with a checkpoint interval of 1000 ms, and the HDFS
files grow too large to work normally.
What I am curious about is: are writing and deleting performed
synchronously? Is it possible that files are added too fast for the old ones to be deleted?
Congxian Qiu wrote on Tue, Nov 10, 2020 at 2:16 PM:
> Hi
>
Hi
No matter what interval you set, Flink will take care of the
checkpoints (removing useless checkpoints when it can), but when you set a
very small checkpoint interval, there may be very high pressure on the
storage system (here, RPC pressure on the HDFS NameNode).
Best,
Congxian
lec ssmi wrote on 2020-1
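A sketch of how the interval and the pause between checkpoints can be tuned (values are placeholders) to keep the pressure on HDFS reasonable:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);                                  // checkpoint every 60 s instead of every 1 s
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);  // breathing room for the storage system
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);         // never overlap checkpoints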
Hi Song
Flink 1.4.2 is a bit too old, and I think this error is caused by FLINK-8876
[1][2], which should be fixed after Flink 1.5; please consider upgrading your
Flink version.
[1] https://issues.apache.org/jira/browse/FLINK-8876
[2] https://issues.apache.org/jira/browse/FLINK-8836
Best
Yun Tang
Thanks a lot for the confirmation.
Eleanore
On Fri, Oct 2, 2020 at 2:42 AM Chesnay Schepler wrote:
> Yes, the patch call only triggers the cancellation.
> You can check whether it is complete by polling the job status via
> jobs/ and checking whether state is CANCELED.
>
> On 9/27/2020 7:02 PM,
Yes, the patch call only triggers the cancellation.
You can check whether it is complete by polling the job status via
jobs/ and checking whether state is CANCELED.
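A rough sketch of that polling loop (plain JDK 11+ HTTP client; host, port and the job id from earlier in this thread are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WaitForCancel {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String jobUrl = "http://localhost:8081/jobs/d2c91a44f23efa2b6a0a89b9f1ca5a3d"; // placeholder job id
        while (true) {
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(jobUrl)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // The response JSON has a "state" field; a lenient contains check is enough for a sketch.
            if (resp.body().contains("CANCELED")) {
                break;
            }
            Thread.sleep(500);
        }
    }
}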
On 9/27/2020 7:02 PM, Eleanore Jin wrote:
I have noticed this: if I have Thread.sleep(1500); after the patch
call returned 202, t
Great, thanks Klou!
Cheers,
Till
On Mon, Sep 28, 2020 at 5:07 PM Kostas Kloudas wrote:
> Hi all,
>
> I will have a look.
>
> Kostas
>
> On Mon, Sep 28, 2020 at 3:56 PM Till Rohrmann
> wrote:
> >
> > Hi Cristian,
> >
> > thanks for reporting this issue. It looks indeed like a very critical
> pr
Hi all,
I will have a look.
Kostas
On Mon, Sep 28, 2020 at 3:56 PM Till Rohrmann wrote:
>
> Hi Cristian,
>
> thanks for reporting this issue. It looks indeed like a very critical problem.
>
> The problem seems to be that the ApplicationDispatcherBootstrap class
> produces an exception (that th
Hi Cristian,
thanks for reporting this issue. It looks indeed like a very critical
problem.
The problem seems to be that the ApplicationDispatcherBootstrap class
produces an exception (that the request job can no longer be found because
of a lost ZooKeeper connection) which will be interpreted as
I have noticed this: if I have Thread.sleep(1500); after the patch call
returns 202, then the directory gets cleaned up; in the meanwhile, it
shows the job-manager pod is in the completed state before getting terminated
(see screenshot: https://ibb.co/3F8HsvG).
So the patch call is async to terminate t
Hi Congxian,
I am making rest call to get the checkpoint config: curl -X GET \
http://localhost:8081/jobs/d2c91a44f23efa2b6a0a89b9f1ca5a3d/checkpoints/config
and here is the response:
{
"mode": "at_least_once",
"interval": 3000,
"timeout": 1,
"min_pause": 1000,
"max_concur
Hi Eleanore
What `CheckpointRetentionPolicy` [1] did you set for your job? If
`ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION` is set, then the
checkpoint will be kept when canceling a job.
PS: the image did not show
[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/stat
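A sketch of that setting (using the older CheckpointConfig method name; newer releases expose an equivalent setter), with the caveat that retained checkpoints then have to be cleaned up manually:

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30_000);
// Keep the last checkpoint on the DFS even when the job is cancelled,
// so it can be used to restore the job later.
env.getCheckpointConfig().enableExternalizedCheckpoints(
        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);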
> The job sub directory will be cleaned up when the job
finished/canceled/failed.
Since we could submit multiple jobs into a Flink session, what I mean is
when a job
reaches a terminal state, the sub node (e.g.
/flink/application_/running_job_registry/4d255397c7aeb5327adb567238c983c1)
on th
> The job sub directory will be cleaned up when the job
> finished/canceled/failed.
What does this mean?
Also, to clarify: I'm a very sloppy developer. My jobs crash ALL the time...
and yet, the jobs would ALWAYS resume from the last checkpoint.
The only cases where I expect Flink to clean u
AFAIK, the HA data, including Zookeeper metadata and real data on DFS,
will only be cleaned up
when the Flink cluster reaches the terminated state.
So if you are using a session cluster, the root cluster node on ZK will be
cleaned up after you manually
stop the session cluster. The job sub directory
I'm using the standalone script to start the cluster.
As far as I can tell, it's not easy to reproduce. We found that zookeeper lost
a node around the time this happened, but all of our other 75 Flink jobs which
use the same setup, version and zookeeper, didn't have any issues. They didn't
eve
Thanks a lot for reporting this problem here Cristian!
I am not super familiar with the involved components, but the behavior you
are describing doesn't sound right to me.
Which entrypoint are you using? This is logged at the beginning, like this:
"2020-09-08 14:45:32,807 INFO
org.apache.flink.ru
Hi Cristian,
I don't know if it was designed to be like this deliberately.
So I have already submitted an issue and am waiting for somebody to respond.
https://issues.apache.org/jira/browse/FLINK-19154
--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
That's an excellent question. I can't explain that. All I know is this:
- the job was upgraded and resumed from a savepoint
- After hours of working fine, it failed (like it shows in the logs)
- the Metadata was cleaned up, again as shown in the logs
- because I run this in Kubernetes, the conta
I mean that checkpoints are usually dropped after the job is terminated by
the user (except if explicitly configured as retained checkpoints). You
could use "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION" to keep
your checkpoint when it comes to a failure.
When your zookeeper lost connect
> If you want to save your checkpoint, you could refer to this document
What do you mean? We already persist our savepoints, and we do not delete them
explicitly ever.
The problem is that Flink deleted the data from zookeeper when it shouldn't
have. Is it possible to start a job from a checkpo
Hi Cristian,
From this code, we can see that the Exception or Error was ignored in
dispatcher.shutDownCluster(applicationStatus).
``
org.apache.flink.runtime.dispatcher.DispatcherGateway#shutDownCluster
return applicationCompletionFuture
.handle((r, t) -> {
My suspicion is that somewhere in the path where it fails to connect to
zookeeper, the exception is swallowed, so instead of running the shutdown path
for when the job fails, the general shutdown path is taken.
This was fortunately a job for which we had a savepoint from yesterday.
Otherwise
Hi Cristian,
In the log, we can see it went to the method
shutDownAsync(applicationStatus, null, true);
``
2020-09-04 17:32:07,950 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Shutting
StandaloneApplicationClusterEntryPoint down w
Someone told me that maybe this issue is Mesos-specific. I'm kind of a
newbie in Flink, and I dug into the code but could not reach a conclusion.
Here I just want a better JoinWindow that emits the result and deletes
it from the window state immediately when joined successfully; is there any
ot
Hi Si-li
Thanks for the notice.
I just want to double-check: has the original problem been solved? I
found that the created issue FLINK-18464 has been closed with the reason "cannot
reproduce". Am I missing something here?
Best,
Congxian
Si-li Liu wrote on Fri, Jul 10, 2020 at 6:06 PM:
> Sorry
>
> I can'
Sorry
I can't reproduce it with reduce/aggregate/fold/apply, and due to some
limitations in my working environment, I can't use Flink 1.10 or 1.11.
Congxian Qiu wrote on Sun, Jul 5, 2020 at 6:21 PM:
> Hi
>
> First, could you please check whether this problem is still there if you use Flink 1.10 or
> 1.11?
>
> It seems strange,
that
> problem should be first considered.
>
> Best
> Yun Tang
> --
> *From:* SmileSmile
> *Sent:* Friday, July 3, 2020 14:30
> *To:* Yun Tang
> *Cc:* 'user@flink.apache.org'
> *Subject:* Re: Checkpoint is disable, will history d
Hi
First, could you please check whether this problem is still there if you use Flink 1.10 or
1.11?
It seems strange; from the error message, there is an error when trying to
convert a non-window state (VoidNamespace) to a window state (the serializer is
the serializer of the window state, but the state is a non-window state