Re: Savepoint failure along with JobManager crash

2021-08-31 Thread Matthias Pohl
Hi Prasanna, thanks for reaching out to the community. What you're experiencing is that the savepoint was created but the job itself ended up in an inconsistent state with Executions being cancelled instead of being finished. This should have triggered a global failover resulting in a job restart.

Re: savepoint failure

2021-07-14 Thread Till Rohrmann
Hi Dan, Can you provide us with more information about your job (maybe even the job code or a minimal working example), the Flink configuration, the exact workflow you are doing and the corresponding logs and error messages? Cheers, Till On Tue, Jul 13, 2021 at 9:39 PM Dan Hill wrote: > Coul

Re: savepoint failure

2021-07-13 Thread Dan Hill
Could this be caused by mixing configuration settings between runs? For example, running a job with one parallelism, stopping with a savepoint, and then recovering with a different parallelism? I'd assume that's fine and wouldn't create bad state. On Tue, Jul 13, 2021 at 12:34 PM Dan Hill wrote: > I checked m
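As general background on the rescaling question above (not something stated in this thread): Flink scopes keyed state to key groups, and the number of key groups is fixed by the job's max parallelism rather than by its current parallelism. The sketch below is a simplified model of that routing. The function names are illustrative, `zlib.crc32` is only a deterministic stand-in for the hash Flink applies to a key's `hashCode()`, and 128 is just an example max parallelism.

```python
import zlib


def assign_to_key_group(key: str, max_parallelism: int) -> int:
    # Stand-in hash: Flink hashes key.hashCode(); crc32 is used here
    # only so the example is deterministic across Python processes.
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def operator_index_for_key_group(key_group: int, parallelism: int,
                                 max_parallelism: int) -> int:
    # Key groups are split into contiguous ranges, one range per subtask.
    return key_group * parallelism // max_parallelism


def route(key: str, parallelism: int, max_parallelism: int = 128):
    """Return (key_group, subtask_index) for a key (illustrative helper)."""
    key_group = assign_to_key_group(key, max_parallelism)
    subtask = operator_index_for_key_group(key_group, parallelism,
                                           max_parallelism)
    return key_group, subtask
```

Under this model, restoring with a different parallelism moves whole key groups between subtasks but never changes which key group a key belongs to, provided the max parallelism and the key's hash stay the same. A key whose hash is not stable breaks that assumption regardless of parallelism.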

Re: savepoint failure

2021-07-13 Thread Dan Hill
I checked my code. Our keys for streams and map state only use (1) strings, (2) long IDs that don't change, or (3) a Tuple of (1) and (2). I don't know why my current case is breaking. Our job partitions and parallelism settings have not changed. On Tue, Jul 13, 2021 at 12:11 PM Dan Hill wrot

Re: savepoint failure

2021-07-13 Thread Dan Hill
Hey. I just hit a similar error in production when trying to savepoint. We also use protobufs. Has anyone found a better fix to this? On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann wrote: > Glad to hear that you solved your problem. Afaik Flink should not read the > fields of messages and call

Re: Savepoint failure with operation not found under key

2021-06-29 Thread Rainie Li
I see, so more than 5 minutes had passed. Thanks for the help. Best regards Rainie On Tue, Jun 29, 2021 at 12:29 AM Chesnay Schepler wrote: > How much time has passed between the requests? (You can only query the > status for about 5 minutes) > > On 6/29/2021 6:37 AM, Rainie Li wrote: > > Thank

Re: Savepoint failure with operation not found under key

2021-06-29 Thread Chesnay Schepler
How much time has passed between the requests? (You can only query the status for about 5 minutes) On 6/29/2021 6:37 AM, Rainie Li wrote: Thanks for the context Chesnay. Yes, I sent both requests to the same JM. Best regards Rainie On Mon, Jun 28, 2021 at 8:33 AM Chesnay Schepler

Re: Savepoint failure with operation not found under key

2021-06-28 Thread Rainie Li
Thanks for the context Chesnay. Yes, I sent both requests to the same JM. Best regards Rainie On Mon, Jun 28, 2021 at 8:33 AM Chesnay Schepler wrote: > Ordinarily this happens because the status request is sent to a different > JM than the one that received the request for creating a savepoint.

Re: Savepoint failure with operation not found under key

2021-06-28 Thread Chesnay Schepler
Ordinarily this happens because the status request is sent to a different JM than the one that received the request for creating a savepoint. The meta information for such requests is only stored locally on each JM and neither distributed to all JMs nor persisted anywhere. Did you send both requ
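A minimal sketch of what this implies for a client, assuming the savepoint was triggered through Flink's REST API and the returned trigger ID is polled at `/jobs/<jobid>/savepoints/<triggerid>`: keep one JobManager base URL for both calls, and stop polling well inside the roughly 5-minute window mentioned elsewhere in this thread. The helper names, the injected `fetch_json` callable, and the 240-second default are assumptions made for illustration, not part of Flink.

```python
import time
from typing import Callable, Optional


def savepoint_status_url(jm_base_url: str, job_id: str,
                         trigger_id: str) -> str:
    # Build the status URL against the SAME JobManager that was triggered.
    return f"{jm_base_url.rstrip('/')}/jobs/{job_id}/savepoints/{trigger_id}"


def poll_savepoint(fetch_json: Callable[[str], dict], jm_base_url: str,
                   job_id: str, trigger_id: str,
                   interval_s: float = 2.0, timeout_s: float = 240.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic
                   ) -> Optional[dict]:
    """Poll until the operation completes; return its 'operation' payload.

    fetch_json is a hypothetical callable that performs an HTTP GET and
    returns the decoded JSON body. Returns None if the timeout expires.
    """
    url = savepoint_status_url(jm_base_url, job_id, trigger_id)
    deadline = clock() + timeout_s
    while clock() < deadline:
        body = fetch_json(url)
        if body.get("status", {}).get("id") == "COMPLETED":
            # On success this carries the savepoint location; on failure
            # it carries a failure cause instead.
            return body.get("operation")
        sleep(interval_s)
    return None
```

Passing `fetch_json` in keeps the HTTP client out of the sketch. In a setup with several JobManagers behind a load balancer, the base URL would need to resolve to the same JM for the trigger and for every status request.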

Re: savepoint failure

2020-10-23 Thread Till Rohrmann
Glad to hear that you solved your problem. Afaik Flink should not read the fields of messages and call hashCode on them. Cheers, Till On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov < radoslav.smilya...@smule.com> wrote: > Hi Till, > > I found my problem. It was indeed related to a mutable ha
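The quoted message above is cut off, so the exact cause is not visible here. Since the reply mentions `hashCode`, the sketch below assumes the general hazard of a key whose hash changes after it has been used. It is written in Python as a language-neutral illustration, with `__hash__`/`__eq__` standing in for Java's `hashCode`/`equals`; the `MutableKey` class is hypothetical and does not come from the thread.

```python
class MutableKey:
    """Hypothetical key type whose hash depends on a mutable field."""

    def __init__(self, user_id: int):
        self.user_id = user_id

    def __eq__(self, other):
        return isinstance(other, MutableKey) and self.user_id == other.user_id

    def __hash__(self):
        return hash(self.user_id)


def demo():
    state = {}
    key = MutableKey(1)
    state[key] = "value"

    # Mutating the field changes the hash after the entry was stored.
    key.user_id = 2

    found_with_mutated_key = key in state              # False
    found_with_original_id = MutableKey(1) in state    # False
    return found_with_mutated_key, found_with_original_id, len(state)
```

The entry is still in the map (`len(state)` is 1), yet neither the mutated key nor a fresh key with the original ID finds it. Keyed state in a hash-partitioned system relies on the same property: the hash of a key must stay stable for as long as the key is in use.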

Re: savepoint failure

2020-10-23 Thread Till Rohrmann
Hi Rado, it is hard to tell the reason without a few more details. Could you share with us the complete logs of the problematic run? Also, the job you are running and the types of state you are storing in RocksDB and using as events in your job are very important. In the linked SO question, the probl