I opened a pull request for the backport of [FLINK-4904] <https://issues.apache.org/jira/browse/FLINK-4904>
https://github.com/apache/flink/pull/2773 On Tue, Nov 8, 2016 at 2:00 PM, Stephan Ewen <se...@apache.org> wrote: > The issue FLINK-4904 (Add a limit for how much data may be spilled in > checkpoint alignments) is doen for master and I am currently backporting > it. Hope to finish that this week... > > Stephan > > > On Wed, Nov 2, 2016 at 5:03 PM, Till Rohrmann <till.rohrm...@gmail.com> > wrote: > >> It might make sense to backport >> >> - [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM >> side: https://github.com/apache/flink/pull/2742 >> >> as well. This will allow us to activate the quarantine monitoring per >> default in 1.1.4 without risking to kill all TMs in case of a JM failure. >> >> Cheers, >> Till >> >> On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi <u...@apache.org> wrote: >> >> > As a quick update: the "pending review" issues have all been resolved. >> > >> > The open issues are still open: >> > >> > - FLINK-4904: Add a limit for how much data may be spilled in >> > checkpoint alignments => fix pending >> > - FLINK-4910: Introduce safety net for closing file system streams >> > >> > Any updates here? >> > >> > – Ufuk >> > >> > >> > On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter >> > <s.rich...@data-artisans.com> wrote: >> > > Benefit of a backport, as I see it, is increased stability. The danger >> > is potentially breaking some code that was casting FileSystems to >> subtypes >> > like LocalFileSytem. I don’t know how common that would be in user code. >> > > >> > >> Am 28.10.2016 um 14:27 schrieb Ufuk Celebi <u...@apache.org>: >> > >> >> > >> Thanks for all your feedback. >> > >> >> > >> If there are no objections, I would like to stick to the mentioned >> > >> issues in this thread and create RC1 as soon as they are all >> > >> addressed. This will probably not be this week though, but it looks >> > >> good for next week. >> > >> >> > >> DONE >> > >> ===== >> > >> - FLINK-4619: Answer client if savepoint restore fails >> > >> - FLINK-4715: Safety net for stuck task cancellation >> > >> - FLINK-4510: Always create CheckpointCoordinator >> > >> - FLINK-4894: Don't block on buffer request after broadcast event >> > >> - FLINK-4298: Add proper repository for Closure dependencies >> > >> - FLINK-4218: Do not fail checkpoints when state size cannot be >> > determined >> > >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in >> case >> > >> they notice quarantine >> > >> - FLINK-4875: Use correct operator name >> > >> - FLINK-4913: Include user jars in system class loader >> > >> >> > >> PENDING REVIEW >> > >> =============== >> > >> - FLINK-4445: Add option to ignore unmatched state when restoring >> from >> > >> savepoint => https://github.com/apache/flink/pull/2713 >> > >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting >> > >> => https://github.com/apache/flink/pull/2711 >> > >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the >> > >> ExecutionGraph => https://github.com/apache/flink/pull/2701 >> > >> >> > >> OPEN >> > >> ===== >> > >> - FLINK-4904: Add a limit for how much data may be spilled in >> > >> checkpoint alignments => fix pending >> > >> - FLINK-4910: Introduce safety net for closing file system streams => >> > >> @Stephan, Stefan: What's the conclusion of your discussion whether to >> > >> backport this or not? >> > >> >> > >> >> > >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress <danbr...@gmail.com> >> wrote: >> > >>> +1 for this release, >> > >>> also +1 to Chesnay's suggesting for including this: [FLINK-4875] >> > [metrics] >> > >>> Use correct operator name >> > >>> >> > >>> Dan >> > >>> >> > >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann <trohrm...@apache.org >> > >> > wrote: >> > >>> >> > >>>> I'll work on FLINK-3347. Additionally I would like to get in >> > >>>> >> > >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let >> > >>>> ExecutionGraph fail when in state Restarting >> > >>>> - https://issues.apache.org/jira/browse/FLINK-4933: >> > >>>> ExecutionGraph.scheduleOrUpdateConsumers >> > >>>> can fail the ExecutionGraph >> > >>>> >> > >>>> Cheers, >> > >>>> Till >> > >>>> >> > >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <se...@apache.org> >> > wrote: >> > >>>> >> > >>>>> Concerning backporting the "I/O streams safety net" - we need to >> make >> > >>>> sure >> > >>>>> that this does not change any behavior that users may implicitly >> > expect. >> > >>>>> >> > >>>>> >> > >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels < >> m...@apache.org >> > > >> > >>>>> wrote: >> > >>>>> >> > >>>>>> +1 for a 1.1.4 release >> > >>>>>> >> > >>>>>> We could backport putting user jars into the system class loader >> for >> > >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692 >> > >>>>>> Arguably, this is somewhat a new feature but it gets rid of >> > duplicate >> > >>>>>> class loading issues users experienced in practice. >> > >>>>>> >> > >>>>>> We already have the following commits on the release-1.1 branch: >> > >>>>>> >> > >>>>>> 05a5f46 [FLINK-4862] fix Timer register in >> > ContinuousEventTimeTrigger >> > >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable >> > driver >> > >>>>>> found for jdbc:calcite" >> > >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator >> > >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis >> > >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent >> > >>>> updates >> > >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in >> > >>>>> InputStreamFSInputWrapper >> > >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for >> > >>>>> InputFormats. >> > >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of >> CsvOutputFormat >> > >>>> about >> > >>>>>> incorrect default of allowNullValues >> > >>>>>> c9433bf [FLINK-3706] Fix YARN test instability >> > >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI >> > examples. >> > >>>>>> >> > >>>>>> -Max >> > >>>>>> >> > >>>>>> >> > >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré < >> > j...@nanthrax.net >> > >>>>> >> > >>>>>> wrote: >> > >>>>>>> +1 >> > >>>>>>> >> > >>>>>>> Looking forward this release ! >> > >>>>>>> >> > >>>>>>> Regards >> > >>>>>>> JB >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> On Oct 25, 2016, 14:43, at 14:43, Robert Metzger < >> > >>>> rmetz...@apache.org> >> > >>>>>> wrote: >> > >>>>>>>> +1 for a bugfix release soon. >> > >>>>>>>> >> > >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen < >> se...@apache.org> >> > >>>>>>>> wrote: >> > >>>>>>>> >> > >>>>>>>>> Thanks fort starting this Ufuk. >> > >>>>>>>>> >> > >>>>>>>>> I would like to add the following issues to 1.1.4: >> > >>>>>>>>> >> > >>>>>>>>> Build errors due to Storm dependencies *(fix pending)* >> > >>>>>>>>> - [FLINK-4298] [storm compatibility] Add proper repository >> for >> > >>>>>>>> Closure >> > >>>>>>>>> dependencies. >> > >>>>>>>>> >> > >>>>>>>>> Stability on S3 considering eventual consistency *(fix >> pending)* >> > >>>>>>>>> - [FLINK-4218] [checkpoints] Do not fail checkpoints when >> > state >> > >>>>>>>> size >> > >>>>>>>>> cannot be determined >> > >>>>>>>>> >> > >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)* >> > >>>>>>>>> - [FLINK-3347] [akka] TaskManager (or its ActorSystem) >> need to >> > >>>>>>>> restart >> > >>>>>>>>> in case they notice quarantine >> > >>>>>>>>> >> > >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint >> > >>>>>>>> alignments >> > >>>>>>>>> *(fix >> > >>>>>>>>> is work in progress)* >> > >>>>>>>>> - [FLINK-4904] [checkpoints] Add a limit for how much data >> may >> > >>>> be >> > >>>>>>>>> spilled in checkpoint alignments >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit, >> the >> > >>>>>>>> fourth one >> > >>>>>>>>> later today. >> > >>>>>>>>> The third one (akka) is still pending. >> > >>>>>>>>> >> > >>>>>>>>> Best, >> > >>>>>>>>> Stephan >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <u...@apache.org> >> > >>>> wrote: >> > >>>>>>>>> >> > >>>>>>>>>> Hey all, >> > >>>>>>>>>> >> > >>>>>>>>>> I would like to start the discussion for kicking off the next >> > bug >> > >>>>>>>> fix >> > >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for a >> RC by >> > >>>>>>>> end >> > >>>>>>>>>> of this week? >> > >>>>>>>>>> >> > >>>>>>>>>> Users reported some instabilities/inconveniences that would >> be >> > >>>> good >> > >>>>>>>> to >> > >>>>>>>>> fix. >> > >>>>>>>>>> >> > >>>>>>>>>> Personally, I would like to backport the following fixes: >> > >>>>>>>>>> >> > >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer >> > >>>>> client >> > >>>>>>>> if >> > >>>>>>>>>> savepoint restore fails (Already merged for master, needs >> > minimal >> > >>>>>>>>>> adjustment for 1.1) >> > >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety >> > net >> > >>>>>>>> for >> > >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting >> > for >> > >>>>>>>>>> tests to finish of backport) >> > >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always >> > >>>>> create >> > >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs >> minimal >> > >>>>>>>>>> adjustments for 1.1) >> > >>>>>>>>>> >> > >>>>>>>>>> Furthermore, I would like to address the following: >> > >>>>>>>>>> >> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add >> > option >> > >>>>> to >> > >>>>>>>>>> ignore unmatched state when restoring from savepoint >> > >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't >> > >>>> block >> > >>>>>>>> on >> > >>>>>>>>>> buffer request after broadcast event >> > >>>>>>>>>> >> > >>>>>>>>>> Strictly speaking, the (4) is not a bug fix. But given that >> it >> > >>>>>>>> would >> > >>>>>>>>>> only add an optional flag to savepoint restoring and should >> have >> > >>>>>>>> been >> > >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in. >> > >>>>>>>>>> >> > >>>>>>>>> >> > >>>>>> >> > >>>>> >> > >>>> >> > > >> > >> > >