Let's do a separate JIRA, and not overload this already tough pull request...

On Fri, Nov 13, 2015 at 10:44 AM, Matthias J. Sax <mj...@apache.org> wrote:
> I was thinking about this issue too and wanted to include it in my
> current PR (I am just about to rebase it to the current master...
> https://github.com/apache/flink/pull/750).
>
> Or should we open a new JIRA for it and address it after the Stop signal
> is available?
>
> -Matthias
>
> On 11/12/2015 11:47 AM, Maximilian Michels wrote:
> > +1 for the proposed changes. But why not always create a snapshot on
> > shutdown? Does that break any assumptions about the checkpointing
> > interval? I see that if the user has checkpointing disabled, we can
> > just create a fake snapshot.
> >
> > On Thu, Nov 12, 2015 at 9:56 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
> >> Yes, I agree with you.
> >>
> >> Once we have the graceful shutdown, we can make this happen fairly
> >> simply with the mechanism you described :)
> >>
> >> Gyula
> >>
> >> On Wed, Nov 11, 2015 at 3:43 PM, Stephan Ewen <se...@apache.org> wrote:
> >>
> >>> I think you are touching on something important here.
> >>>
> >>> There is a discussion/pull request about graceful shutdown of
> >>> streaming jobs (like stopping the sources and letting the remainder
> >>> of the streams run out).
> >>>
> >>> With the work in progress to draw external checkpoints, it should be
> >>> easy to do checkpoint-and-close.
> >>> We may not even need the last ack in the "checkpoint -> ack -> notify
> >>> -> ack" sequence, if the operators simply wait for the
> >>> "notifyComplete" function to finish. Then the operators only finish
> >>> successfully when the "notifyComplete()" method succeeds; otherwise
> >>> they go to the state "failed".
> >>> That is good, because we need no extra mechanism (no extra message
> >>> type).
> >>>
> >>> What we do need anyway is a way to detect when the checkpoint did not
> >>> globally succeed, so that the functions where it succeeded do not
> >>> wait forever for the "notifySuccessful" message.
> >>>
> >>> We have two things here now:
> >>>
> >>> 1) Graceful shutdown should trigger an "internal" checkpoint (which
> >>> is immediately discarded), in order to commit pending data for cases
> >>> where data is staged between checkpoints.
> >>>
> >>> 2) An option to shut down with an external checkpoint would also be
> >>> important, to stop and resume from exactly there.
> >>>
> >>>
> >>> Stephan
> >>>
> >>>
> >>> On Wed, Nov 11, 2015 at 3:19 PM, Gyula Fóra <gyf...@apache.org> wrote:
> >>>
> >>>> Hey guys,
> >>>>
> >>>> With recent discussions around being able to shut down and restart
> >>>> streaming jobs from specific checkpoints, there is another issue
> >>>> that I think needs tackling.
> >>>>
> >>>> As far as I understand, when a streaming job finishes, the tasks are
> >>>> not notified of the last checkpoints, and jobs also don't take a
> >>>> final checkpoint before shutting down.
> >>>>
> >>>> In my opinion this might lead to situations where the user cannot
> >>>> tell whether the job finished properly (with consistent
> >>>> states/outputs). To give you a concrete example, let's say I am
> >>>> using the RollingSink to produce exactly-once output files. If the
> >>>> job finishes, I think there will be some files that remain in the
> >>>> pending state and are never completed. The user then sees some
> >>>> complete files and some pending files for the completed job. The
> >>>> question then is: how do I tell whether the pending files were
> >>>> actually completed properly, now that the job is finished?
> >>>>
> >>>> Another example would be that I want to manually shut down my job at
> >>>> 12:00 and make sure that I produce every output up to that point.
> >>>> Is there any way to achieve this currently?
> >>>>
> >>>> I think we need to do 2 things to make this work:
> >>>> 1. Job shutdowns (finish/manual) should trigger a final checkpoint
> >>>> 2. These final checkpoints should actually be 2-phase checkpoints:
> >>>> checkpoint -> ack -> notify -> ack. Then, when the checkpoint
> >>>> coordinator gets all the notification acks, it can tell the user
> >>>> that the system shut down cleanly.
> >>>>
> >>>> Unfortunately it can happen that for some reason the coordinator
> >>>> does not receive all the acks for a complete job; in that case it
> >>>> can warn the user that the checkpoint might be inconsistent.
> >>>>
> >>>> Let me know what you think!
> >>>>
> >>>> Cheers,
> >>>> Gyula
> >>>>
> >>>
> >
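For readers of the archive: the two-phase shutdown sequence Gyula proposes above ("checkpoint -> ack -> notify -> ack") can be sketched roughly as below. This is an illustrative sketch only, under the assumption of a coordinator that tracks both ack phases per task; the class and method names (`FinalCheckpointCoordinator`, `acknowledgeCheckpoint`, `acknowledgeNotify`) are hypothetical and are not Flink API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch (not Flink API) of the proposed two-phase final
 * checkpoint: checkpoint -> ack -> notify -> ack.
 */
class FinalCheckpointCoordinator {

    // Tasks that have not yet acknowledged their final snapshot (phase 1).
    private final Set<String> pendingCheckpointAcks = new HashSet<>();
    // Tasks that have not yet acknowledged the completion notification (phase 2).
    private final Set<String> pendingNotifyAcks = new HashSet<>();

    FinalCheckpointCoordinator(Set<String> taskIds) {
        pendingCheckpointAcks.addAll(taskIds);
        pendingNotifyAcks.addAll(taskIds);
    }

    /** Phase 1: a task reports that its final snapshot is complete. */
    void acknowledgeCheckpoint(String taskId) {
        pendingCheckpointAcks.remove(taskId);
        if (pendingCheckpointAcks.isEmpty()) {
            // All snapshots are in: now the coordinator would send the
            // "notify" message so sinks (e.g. a rolling file sink) can
            // finalize their pending output. Messaging is elided here.
        }
    }

    /** Phase 2: a task reports that its completion notification succeeded. */
    void acknowledgeNotify(String taskId) {
        pendingNotifyAcks.remove(taskId);
    }

    /** Only once all notify-acks arrived can a clean shutdown be reported. */
    boolean shutDownCleanly() {
        return pendingCheckpointAcks.isEmpty() && pendingNotifyAcks.isEmpty();
    }

    public static void main(String[] args) {
        FinalCheckpointCoordinator c =
            new FinalCheckpointCoordinator(new HashSet<>(Arrays.asList("src", "sink")));
        c.acknowledgeCheckpoint("src");
        c.acknowledgeCheckpoint("sink");
        System.out.println(c.shutDownCleanly()); // still missing notify-acks
        c.acknowledgeNotify("src");
        c.acknowledgeNotify("sink");
        System.out.println(c.shutDownCleanly());
    }
}
```

If some notify-ack never arrives (a task failed during `notifyComplete`), `shutDownCleanly()` stays false, which corresponds to the coordinator warning the user that the final checkpoint may be inconsistent.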