Re: Missing checkpoint when restarting failed job

Stefan Richter Tue, 21 Nov 2017 07:28:03 -0800

Ok, thanks for trying to reproduce this. If possible, could you also activate 
trace-level logging for the class 
org.apache.flink.runtime.state.SharedStateRegistry? If the problem occurs again, 
this would greatly help us understand what is going on.
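
As a minimal sketch, assuming you run with the default log4j setup that ships 
with Flink 1.3.x (conf/log4j.properties), a single additional logger line 
should be enough. If you use a different logging backend, the syntax will 
differ:

```properties
# Assumption: default log4j 1.x configuration in conf/log4j.properties.
# Raises only this one class to TRACE; all other loggers keep their level.
log4j.logger.org.apache.flink.runtime.state.SharedStateRegistry=TRACE
```

The processes most likely need a restart to pick up the change.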


> On 21.11.2017, at 15:16, gerardg <ger...@talaia.io> wrote:
> 
>> where exactly did you read many times that incremental checkpoints cannot
>> reference files from previous checkpoints, because we would have to correct
>> that information. In fact, this is how incremental checkpoints work.
> 
> My fault. I read it in some other posts on the mailing list, but now that I
> have reread them carefully, they were about savepoints, not checkpoints.
> 
>> Now for this case, I would consider it extremely unlikely that a
>> checkpoint 1620 would still reference a checkpoint 1,
>> in particular if the files for that checkpoint are already deleted, which
>> should only happen if it is no longer
>> referenced. Which version of Flink are you using and what is your
>> distributed filesystem? Is there any way to
>> reproduce the problem? 
> 
> We are using Flink version 1.3.2 and GlusterFS.  There are usually a few
> checkpoints around at the same time, for example right now: 
> 
> chk-1  chk-26  chk-27  chk-28  chk-29  chk-30  chk-31
> 
> I'm not sure how to reproduce the problem but I'll monitor the folder to see
> when chk-1 gets deleted and try to make the task fail when that happens.
> 
> Gerard
> 
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
