Re: Restore from checkpoint

2024-05-19 Thread archzi lu
Hi Phil, correction: "But the error you have is a familiar error if you have written some code to handle directory paths." --> "But the error you have is a familiar error if you have written some code to handle directory paths with Java." No offence. Best regards, Jiadong Lu. Jiadong Lu wrote on May 20, 2024

Re: Restore from checkpoint

2024-05-19 Thread Jiadong Lu
Hi Phil, I don't have much expertise with the flink-python module. But the error you have is a familiar error if you have written some code to handle directory paths. The correct form of the Path/URI would be: 1. "/home/foo" 2. "file:///home/foo/boo" 3. "hdfs:///home/foo/boo" 4. or Win32 director
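For illustration, a minimal sketch (the paths below are made-up examples, not from the thread) of checking these forms with Flink's org.apache.flink.core.fs.Path, which normalizes such strings into the URI the runtime works with:

    import org.apache.flink.core.fs.Path;

    public class CheckpointPathForms {
        public static void main(String[] args) {
            // The URI forms listed above, with assumed example paths.
            String[] candidates = {
                "/opt/flink/checkpoints/chk-42",         // 1. plain absolute path
                "file:///opt/flink/checkpoints/chk-42",  // 2. explicit local file scheme
                "hdfs:///flink/checkpoints/chk-42"       // 3. HDFS scheme
            };
            for (String candidate : candidates) {
                // Path parses and normalizes the string into the URI Flink will use.
                System.out.println(new Path(candidate).toUri());
            }
        }
    }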

Re: Restore from checkpoint

2024-05-19 Thread Jinzhong Li
Hi Phil, I think you can use the "-s :checkpointMetaDataPath" arg to resume the job from a retained checkpoint[1]. [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/#resuming-from-a-retained-checkpoint Best, Jinzhong Li On Mon, May 20, 2024 at 2:29 AM Phil St
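As a small illustration of what the checkpoint metadata path is: a completed retained checkpoint lives in a chk-N directory containing a _metadata file. The sketch below (plain JDK; the directory layout is the usual <checkpoint-dir>/<job-id>/chk-N convention and "<job-id>" is a placeholder) finds the newest one, which is the path you would hand to -s:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Comparator;
    import java.util.stream.Stream;

    public class FindLatestRetainedCheckpoint {
        public static void main(String[] args) throws IOException {
            // Assumed layout: <checkpoint-dir>/<job-id>/chk-<n>/_metadata
            Path jobCheckpointDir = Paths.get("/opt/flink/checkpoints", "<job-id>");
            try (Stream<Path> dirs = Files.list(jobCheckpointDir)) {
                dirs.filter(d -> d.getFileName().toString().startsWith("chk-"))
                    // only completed checkpoints have a _metadata file
                    .filter(d -> Files.exists(d.resolve("_metadata")))
                    .max(Comparator.comparingLong(
                            (Path d) -> Long.parseLong(d.getFileName().toString().substring(4))))
                    .ifPresent(d -> System.out.println(
                            "Checkpoint metadata path to pass via -s: " + d.toAbsolutePath()));
            }
        }
    }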

Re: Restore from checkpoint

2024-05-19 Thread Phil Stavridis
Hi Lu, Thanks for your reply. How should the paths be passed to the job that needs to use the checkpoint? Is the standard way to use -s :/ or to pass the path in the module as a Python arg? Kind regards Phil > On 18 May 2024, at 03:19, jiadong.lu wrote: > > Hi Phil, > > AFAIK,

Re: Restore from checkpoint

2024-05-17 Thread jiadong.lu
Hi Phil, AFAIK, the error indicated your path was incorrect. You should use '/opt/flink/checkpoints/1875588e19b1d8709ee62be1cdcc' or 'file:///opt/flink/checkpoints/1875588e19b1d8709ee62be1cdcc' instead. Best. Jiadong.Lu On 5/18/24 2:37 AM, Phil Stavridis wrote: Hi, I am trying to test how

Restore from checkpoint

2024-05-17 Thread Phil Stavridis
Hi, I am trying to test how checkpoints work for restoring state, but I am not sure how to run a new instance of a Flink job, after I have cancelled it, using the checkpoints which I store in the filesystem of the job manager, e.g. /opt/flink/checkpoints. I have tried passing the checkpoint as
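For context, a minimal hedged sketch of the kind of configuration that makes checkpoints survive cancellation so they can be reused like this (interval and storage path are assumed; the exact method names vary a little between Flink versions):

    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetainedCheckpointSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);  // assumed interval: checkpoint every 60s
            // Keep completed checkpoints when the job is cancelled, so they can be
            // used later with -s instead of being deleted.
            env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
            // Where the checkpoints are written (assumed to match the directory above).
            env.getCheckpointConfig().setCheckpointStorage("file:///opt/flink/checkpoints");
            // ... define the job and call env.execute(...)
        }
    }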

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
Thanks! On Tue, Dec 7, 2021, 22:55 Robert Metzger wrote: > 811d3b279c8b26ed99ff0883b7630242 is the operator id. > If I'm not mistaken, running the job graph generation (e.g. the main > method) in DEBUG log level will show you all the IDs generated. This should > help you map this ID to your code

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Robert Metzger
811d3b279c8b26ed99ff0883b7630242 is the operator id. If I'm not mistaken, running the job graph generation (e.g. the main method) in DEBUG log level will show you all the IDs generated. This should help you map this ID to your code. On Wed, Dec 8, 2021 at 7:52 AM Dan Hill wrote: > Nothing change

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
Nothing changed (as far as I know). It's the same binary and the same args. It's Flink v1.12.3. I'm going to switch away from auto-gen uids and see if that helps. The job randomly started failing to checkpoint. I cancelled the job and started it from the last successful checkpoint. I'm confus

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Robert Metzger
Hi Dan, When restoring a savepoint/checkpoint, Flink matches the state to operators based on the uid of the operator. The exception says that there is some state that doesn't match any operator. So from Flink's perspective, the operator is gone. Here is more information: https://nightlie
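For illustration, a small hedged sketch (operator names and uids below are made up) of pinning operator IDs explicitly, so that state keeps matching across restarts even when the auto-generated IDs would change:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StableOperatorIds {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromElements("a", "b", "c")
               .uid("source-letters")       // stable ID: state is matched by this, not by position
               .map(String::toUpperCase)
               .uid("uppercase-map")        // give every stateful operator its own uid
               .name("uppercase-map")       // name only affects the UI/metrics, not state matching
               .print();
            env.execute("stable-operator-ids");
        }
    }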

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
I'm restoring the job with the same binary and same flags/args. On Tue, Dec 7, 2021 at 8:48 PM Dan Hill wrote: > Hi. I noticed this warning has "operator > 811d3b279c8b26ed99ff0883b7630242" in it. I assume this should be an > operator uid or name. It looks like something else. What is it? I

Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
Hi. I noticed this warning has "operator 811d3b279c8b26ed99ff0883b7630242" in it. I assume this should be an operator uid or name. It looks like something else. What is it? Is something corrupted? org.apache.flink.runtime.client.JobInitializationException: Could not instantiate JobManager.

Re: Flink 1.13.0 reactive mode: Job stop and cannot restore from checkpoint

2021-05-17 Thread 陳昌倬
On Mon, May 17, 2021 at 01:22:16PM +0200, Arvid Heise wrote: > Hi ChangZhuo, > > This looks indeed like a bug. I created FLINK-22686 [1] to track it. It > looks unrelated to reactive mode to me and more related to unaligned > checkpoints. So, you can try out reactive mode with aligned checkpoints.

Re: Flink 1.13.0 reactive mode: Job stop and cannot restore from checkpoint

2021-05-17 Thread Arvid Heise
Hi ChangZhuo, This looks indeed like a bug. I created FLINK-22686 [1] to track it. It looks unrelated to reactive mode to me and more related to unaligned checkpoints. So, you can try out reactive mode with aligned checkpoints. If you can provide us with the topology, we can also fix it soonish:
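For reference, a minimal hedged sketch of the toggle Arvid suggests trying, i.e. running with aligned checkpoints; everything else about the job (interval, topology) is assumed:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AlignedCheckpointsOnly {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(30_000);  // assumed interval
            // Explicitly disable unaligned checkpoints to rule out the suspected
            // unaligned-checkpoint issue while keeping reactive mode enabled.
            env.getCheckpointConfig().enableUnalignedCheckpoints(false);
            // ... build the job as before and call env.execute(...)
        }
    }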

Re: Flink 1.13.0 reactive mode: Job stop and cannot restore from checkpoint

2021-05-17 Thread Till Rohrmann
One small addition: The old mapping appears to use SubtaskStateMapper.RANGE whereas the new mapping appears to use SubtaskStateMapper.ROUND_ROBIN. On Mon, May 17, 2021 at 11:56 AM Till Rohrmann wrote: > Hi ChangZhuo Chen, > > This looks like a bug in Flink. Could you provide us with the logs

Re: Flink 1.13.0 reactive mode: Job stop and cannot restore from checkpoint

2021-05-17 Thread Till Rohrmann
Hi ChangZhuo Chen, This looks like a bug in Flink. Could you provide us with the logs of the run and more information about your job? In particular, what does your topology look like? My suspicion is the following: You have an operator with two inputs. One input is keyed whereas the other input is

Flink 1.13.0 reactive mode: Job stop and cannot restore from checkpoint

2021-05-13 Thread 陳昌倬
Hi, We run our application on Flink 1.13.0, in a Kubernetes standalone application cluster with reactive mode enabled. The application stopped and could not recover today, so we tried to restore it from a checkpoint. However, the application cannot restart from the checkpoint due to the following

Re: Flink failing to restore from checkpoint

2021-03-29 Thread Piotr Nowojski
Hi, What Flink version are you using and what is the scenario that's happening? It can be a number of things, most likely an issue where the file mounted under: > /mnt/checkpoints/5dde50b6e70608c63708cbf979bce4aa/shared/47993871-c7eb-4fec-ae23-207d307c384a disappeared or stopped being accessible.

Re: Restore from Checkpoint from local Standalone Job

2021-03-29 Thread Piotr Nowojski
Hi Sandeep, I think it should work fine with `StandaloneCompletedCheckpointStore`. Have you checked if your directory /Users/test/savepoint is being populated in the first place? And if so, whether the restarted job is throwing exceptions such as not being able to access those files? Also note that

Flink failing to restore from checkpoint

2021-03-29 Thread Claude M
Hello, I executed a flink job in a Kubernetes Application cluster w/ four taskmanagers. The job was running fine for several hours but then crashed w/ the following exception, which seems to occur when restoring from a checkpoint. The UI shows the following for the checkpoint counts: Triggered: 6

Restore from Checkpoint from local Standalone Job

2021-03-26 Thread Sandeep khanzode
Hello I was reading this: https://stackoverflow.com/questions/61010970/flink-resume-from-externalised-checkpoint-question I am trying to run a standalone job locally with a single job manager and task manager. I have enabled checkpointing as below: env.setStateBackend(new RocksDBState
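For completeness, a hedged sketch of the kind of setup being described, using the Flink ~1.12-era API; the checkpoint directory is the /Users/test/savepoint path mentioned later in the thread, and the interval and incremental flag are assumptions:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalStandaloneCheckpointing {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // RocksDB keyed-state backend writing checkpoints to a local directory
            // (second argument true = incremental checkpoints).
            env.setStateBackend(new RocksDBStateBackend("file:///Users/test/savepoint", true));
            env.enableCheckpointing(10_000);  // assumed interval
            // Retain checkpoints on cancellation so the job can be resumed with -s later.
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
            // ... job definition and env.execute(...)
        }
    }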

Re: Automatically restore from checkpoint

2020-09-18 Thread Arpith P
Thanks David. In the case of a manual restart, to get the checkpoint path programmatically I'm using the following code to retrieve the JobId and CheckpointID so I could pass them along while restarting with "-s", but it seems I'm missing something as I'm getting an empty TimestampedFileSplit array. GlobFilePathFilter file

Re: Automatically restore from checkpoint

2020-09-18 Thread David Anderson
If your job crashes, Flink will automatically restart from the latest checkpoint, without any manual intervention. JobManager HA is only needed for automatic recovery after the failure of the Job Manager. You only need externalized checkpoints and "-s :checkpointPath" if you want to use checkpoint
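To illustrate the "automatically restart" part, a hedged sketch of a restart strategy, which is what lets a crashed job restart and restore from the latest completed checkpoint without manual intervention (attempt count and delay are arbitrary assumptions):

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AutomaticRecoverySetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);  // assumed interval; needed so there is state to restore
            // On task failure, Flink restarts the job and restores the latest completed checkpoint.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));  // assumed values
            // ... job definition and env.execute(...)
        }
    }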

Automatically restore from checkpoint

2020-09-17 Thread Arpith P
Hi, I'm running a Flink job in distributed mode deployed on Yarn; I've enabled externalized checkpoints saved to HDFS, but I don't have access to read the checkpoints folder. To restart the Flink job from the last saved checkpoint, is it possible to do so without passing "-s :checkpointPath"? If this is not po

Re: Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-17 Thread LakeShen
Thank you, I will do that. jinhai wang wrote on Tue, Mar 17, 2020 at 5:58 PM: > Hi LakeShen > > You also must assign IDs to all operators of an application. Otherwise, > you may not be able to recover from checkpoint > > Doc: > https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#matching-

Re: Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-17 Thread jinhai wang
Hi LakeShen, you must also assign IDs to all operators of an application. Otherwise, you may not be able to recover from a checkpoint. Doc: https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#matching-operator-state

Re: Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-17 Thread Till Rohrmann
Let us know if you should run into any problems. The state processor API is still young and could benefit from as much feedback as possible. Cheers, Till On Tue, Mar 17, 2020 at 2:57 AM LakeShen wrote: > Wow, this feature looks so good, I will take a look at it. > Thank you for the reply, Till😀.

Re: Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-16 Thread Till Rohrmann
If you want to change the max parallelism then you need to take a savepoint and use Flink's state processor API [1] to rewrite the max parallelism by creating a new savepoint from the old one. [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html Cheers, Til

Re: Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-13 Thread Piotr Nowojski
Hi, Yes, you can change the parallelism. One thing that you cannot change is “max parallelism”. Piotrek > On 13 Mar 2020, at 04:34, Sivaprasanna wrote: > > I think you can modify the operator’s parallelism. It is only if you have set > maxParallelism, and while restoring from a checkpoint,
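A small hedged sketch of the distinction Piotr describes (values are arbitrary): parallelism may differ between the run that took the checkpoint and the restored run, but maxParallelism must stay the same because it fixes the number of key groups that keyed state is partitioned into:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelismVsMaxParallelism {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Safe to change between the cancelled run and the restored run.
            env.setParallelism(4);
            // Must NOT change across a restore: it determines the number of key groups.
            env.setMaxParallelism(128);
            // ... job definition and env.execute(...)
        }
    }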

Re: Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-12 Thread Sivaprasanna
I think you can modify the operator’s parallelism. The only thing is that if you have set maxParallelism, you shouldn’t modify the maxParallelism while restoring from a checkpoint. Otherwise, I believe the state will be lost. - Sivaprasanna On Fri, 13 Mar 2020 at 9:01 AM, LakeShen wrote: > Hi communit

Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-12 Thread LakeShen
Hi community, I have a question: if I cancel the flink task and retain the checkpoint dir, then restore from the checkpoint dir, can I change the flink operator's parallelism? In my thoughts, I think I can't change the flink operator's parallelism, but I am not sure. Thanks for your rep

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-06 Thread Hao Sun
Thanks for the tip! I did change the jobGraph this time. Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Thu, Dec 6, 2018 at 2:47 AM Till Rohrmann wrote: > Hi Hao, > > if Flink tries to recover from a checkpoint, then the JobGraph should not > be modified and the system should

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-06 Thread Till Rohrmann
Hi Hao, if Flink tries to recover from a checkpoint, then the JobGraph should not be modified and the system should be able to restore the state. Have you changed the JobGraph and are you now trying to recover from the latest checkpoint which is stored in ZooKeeper? If so, then you can also start

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-05 Thread Hao Sun
Till, Flink is automatically trying to recover from a checkpoint, not a savepoint. How can I get allowNonRestoredState applied in this case? Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Wed, Dec 5, 2018 at 10:09 AM Till Rohrmann wrote: > Hi Hao, > > I think you need to provide

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-05 Thread Till Rohrmann
Hi Hao, I think you need to provide a savepoint file via --fromSavepoint to resume from in order to specify --allowNonRestoredState. Otherwise this option will be ignored because it only works if you resume from a savepoint. Cheers, Till On Wed, Dec 5, 2018 at 12:29 AM Hao Sun wrote: > I am us
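For illustration only, a hedged sketch of the client-side settings object that, as far as I know, these two CLI flags end up in (org.apache.flink.runtime.jobgraph.SavepointRestoreSettings; the savepoint path below is a placeholder). It shows why --allowNonRestoredState has no effect on its own: the flag only lives inside a restore request that carries a path.

    import org.apache.flink.runtime.jobgraph.SavepointRestoreSettings;

    public class RestoreSettingsIllustration {
        public static void main(String[] args) {
            // No restore path at all: there is no state mapping to relax,
            // so allowNonRestoredState has no meaning here.
            SavepointRestoreSettings noRestore = SavepointRestoreSettings.none();

            // Roughly: --fromSavepoint <path> --allowNonRestoredState
            // (placeholder path; the boolean is the allowNonRestoredState flag)
            SavepointRestoreSettings withFlag =
                    SavepointRestoreSettings.forPath("file:///path/to/savepoint", true);

            System.out.println(noRestore);
            System.out.println(withFlag.allowNonRestoredState());
        }
    }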

Flink 1.7 job cluster (restore from checkpoint error)

2018-12-04 Thread Hao Sun
I am using 1.7 and a job cluster on k8s. Here is how I start my job: docker-entrypoint.sh job-cluster -j com.zendesk.fraud_prevention.examples.ConnectedStreams --allowNonRestoredState *Seems like --allowNonRestoredState is not honored* === Logs === java","line":"1041","message":"Restoring

Re: Job fails to restore from checkpoint in Kubernetes with FileNotFoundException

2018-10-30 Thread Till Rohrmann
As Vino pointed out, you need to configure a checkpoint directory which is accessible from all TMs. Otherwise you won't be able to recover the state if the task gets scheduled to a different TaskManager. Usually, people use HDFS or S3 for that. Cheers, Till On Tue, Oct 30, 2018 at 9:50 AM vino ya
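As a hedged sketch of that advice (bucket and paths are assumptions, and this uses the CheckpointConfig.setCheckpointStorage API from newer Flink releases; on the 1.6-era versions in this thread the same is achieved via state.checkpoints.dir or the state backend's checkpoint URI):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SharedCheckpointStorage {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);  // assumed interval
            // A location every TaskManager can reach; a pod-local path breaks recovery
            // when tasks are rescheduled onto a different TaskManager.
            env.getCheckpointConfig().setCheckpointStorage("s3://my-flink-bucket/checkpoints");
            // or: env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
            // ... job definition and env.execute(...)
        }
    }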

Re: Job fails to restore from checkpoint in Kubernetes with FileNotFoundException

2018-10-30 Thread vino yang
Hi John, Is the file system configured for the RocksDBStateBackend HDFS? [1] Thanks, vino. [1]: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/state_backends.html#the-rocksdbstatebackend John Stone wrote on Tue, Oct 30, 2018 at 2:54 AM: > I am testing Flink in a Kubernetes cluster and am f

Job fails to restore from checkpoint in Kubernetes with FileNotFoundException

2018-10-29 Thread John Stone
I am testing Flink in a Kubernetes cluster and am finding that a job gets caught in a recovery loop. Logs show that the issue is that a checkpoint cannot be found although checkpoints are being taken per the Flink web UI. Any advice on how to resolve this is most appreciated. Note on below: I