It might make sense to backport

- [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM side: https://github.com/apache/flink/pull/2742

as well. This will allow us to activate the quarantine monitoring by default in 1.1.4 without risking killing all TMs in case of a JM failure.

Cheers,
Till

On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi <u...@apache.org> wrote:

> As a quick update: the "pending review" issues have all been resolved.
>
> The open issues are still open:
>
> - FLINK-4904: Add a limit for how much data may be spilled in
> checkpoint alignments => fix pending
> - FLINK-4910: Introduce safety net for closing file system streams
>
> Any updates here?
>
> – Ufuk
>
>
> On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter
> <s.rich...@data-artisans.com> wrote:
> > Benefit of a backport, as I see it, is increased stability. The danger
> > is potentially breaking some code that was casting FileSystems to subtypes
> > like LocalFileSystem. I don’t know how common that would be in user code.
> >
> >> On 28.10.2016 at 14:27, Ufuk Celebi <u...@apache.org> wrote:
> >>
> >> Thanks for all your feedback.
> >>
> >> If there are no objections, I would like to stick to the mentioned
> >> issues in this thread and create RC1 as soon as they are all
> >> addressed. This will probably not be this week though, but it looks
> >> good for next week.
> >>
> >> DONE
> >> =====
> >> - FLINK-4619: Answer client if savepoint restore fails
> >> - FLINK-4715: Safety net for stuck task cancellation
> >> - FLINK-4510: Always create CheckpointCoordinator
> >> - FLINK-4894: Don't block on buffer request after broadcast event
> >> - FLINK-4298: Add proper repository for Closure dependencies
> >> - FLINK-4218: Do not fail checkpoints when state size cannot be determined
> >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in case
> >> they notice quarantine
> >> - FLINK-4875: Use correct operator name
> >> - FLINK-4913: Include user jars in system class loader
> >>
> >> PENDING REVIEW
> >> ===============
> >> - FLINK-4445: Add option to ignore unmatched state when restoring from
> >> savepoint => https://github.com/apache/flink/pull/2713
> >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
> >> => https://github.com/apache/flink/pull/2711
> >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
> >> ExecutionGraph => https://github.com/apache/flink/pull/2701
> >>
> >> OPEN
> >> =====
> >> - FLINK-4904: Add a limit for how much data may be spilled in
> >> checkpoint alignments => fix pending
> >> - FLINK-4910: Introduce safety net for closing file system streams =>
> >> @Stephan, Stefan: What's the conclusion of your discussion whether to
> >> backport this or not?
> >>
> >>
> >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress <danbr...@gmail.com> wrote:
> >>> +1 for this release,
> >>> also +1 to Chesnay's suggestion for including this: [FLINK-4875] [metrics]
> >>> Use correct operator name
> >>>
> >>> Dan
> >>>
> >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>
> >>>> I'll work on FLINK-3347.
> >>>> Additionally I would like to get in
> >>>>
> >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
> >>>> ExecutionGraph fail when in state Restarting
> >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
> >>>> ExecutionGraph.scheduleOrUpdateConsumers
> >>>> can fail the ExecutionGraph
> >>>>
> >>>> Cheers,
> >>>> Till
> >>>>
> >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <se...@apache.org> wrote:
> >>>>
> >>>>> Concerning backporting the "I/O streams safety net" - we need to make sure
> >>>>> that this does not change any behavior that users may implicitly expect.
> >>>>>
> >>>>>
> >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <m...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> +1 for a 1.1.4 release
> >>>>>>
> >>>>>> We could backport putting user jars into the system class loader for
> >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> >>>>>> Arguably, this is somewhat a new feature but it gets rid of duplicate
> >>>>>> class loading issues users experienced in practice.
> >>>>>>
> >>>>>> We already have the following commits on the release-1.1 branch:
> >>>>>>
> >>>>>> 05a5f46 [FLINK-4862] fix Timer register in ContinuousEventTimeTrigger
> >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable driver
> >>>>>> found for jdbc:calcite"
> >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent updates
> >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in InputStreamFSInputWrapper
> >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for InputFormats.
> >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of CsvOutputFormat about
> >>>>>> incorrect default of allowNullValues
> >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
> >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI examples.
> >>>>>>
> >>>>>> -Max
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> >>>>>> wrote:
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Looking forward to this release!
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Oct 25, 2016, at 14:43, Robert Metzger <rmetz...@apache.org> wrote:
> >>>>>>>> +1 for a bugfix release soon.
> >>>>>>>>
> >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <se...@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks for starting this, Ufuk.
> >>>>>>>>>
> >>>>>>>>> I would like to add the following issues to 1.1.4:
> >>>>>>>>>
> >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
> >>>>>>>>> - [FLINK-4298] [storm compatibility] Add proper repository for Closure
> >>>>>>>>> dependencies.
> >>>>>>>>>
> >>>>>>>>> Stability on S3 considering eventual consistency *(fix pending)*
> >>>>>>>>> - [FLINK-4218] [checkpoints] Do not fail checkpoints when state size
> >>>>>>>>> cannot be determined
> >>>>>>>>>
> >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
> >>>>>>>>> - [FLINK-3347] [akka] TaskManager (or its ActorSystem) need to restart
> >>>>>>>>> in case they notice quarantine
> >>>>>>>>>
> >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint alignments
> >>>>>>>>> *(fix is work in progress)*
> >>>>>>>>> - [FLINK-4904] [checkpoints] Add a limit for how much data may be
> >>>>>>>>> spilled in checkpoint alignments
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit, the fourth one
> >>>>>>>>> later today.
> >>>>>>>>> The third one (akka) is still pending.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Stephan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <u...@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey all,
> >>>>>>>>>>
> >>>>>>>>>> I would like to start the discussion for kicking off the next bug fix
> >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for a RC by end
> >>>>>>>>>> of this week?
> >>>>>>>>>>
> >>>>>>>>>> Users reported some instabilities/inconveniences that would be good to
> >>>>>>>>>> fix.
> >>>>>>>>>>
> >>>>>>>>>> Personally, I would like to backport the following fixes:
> >>>>>>>>>>
> >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer client if
> >>>>>>>>>> savepoint restore fails (Already merged for master, needs minimal
> >>>>>>>>>> adjustment for 1.1)
> >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety net for
> >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting for
> >>>>>>>>>> the backport's tests to finish)
> >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always create
> >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs minimal
> >>>>>>>>>> adjustments for 1.1)
> >>>>>>>>>>
> >>>>>>>>>> Furthermore, I would like to address the following:
> >>>>>>>>>>
> >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add option to
> >>>>>>>>>> ignore unmatched state when restoring from savepoint
> >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't block on
> >>>>>>>>>> buffer request after broadcast event
> >>>>>>>>>>
> >>>>>>>>>> Strictly speaking, (4) is not a bug fix. But given that it would
> >>>>>>>>>> only add an optional flag to savepoint restoring and should have been
> >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >
>