The issue FLINK-4904 (Add a limit for how much data may be spilled in
checkpoint alignments) is doen for master and I am currently backporting
it. Hope to finish that this week...

Stephan


On Wed, Nov 2, 2016 at 5:03 PM, Till Rohrmann <till.rohrm...@gmail.com>
wrote:

> It might make sense to backport
>
> - [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM
> side: https://github.com/apache/flink/pull/2742
>
> as well. This will allow us to activate the quarantine monitoring per
> default in 1.1.4 without risking to kill all TMs in case of a JM failure.
>
> Cheers,
> Till
>
> On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi <u...@apache.org> wrote:
>
> > As a quick update: the "pending review" issues have all been resolved.
> >
> > The open issues are still open:
> >
> > - FLINK-4904: Add a limit for how much data may be spilled in
> > checkpoint alignments => fix pending
> > - FLINK-4910: Introduce safety net for closing file system streams
> >
> > Any updates here?
> >
> > – Ufuk
> >
> >
> > On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter
> > <s.rich...@data-artisans.com> wrote:
> > > Benefit of a backport, as I see it, is increased stability. The danger
> > is potentially breaking some code that was casting FileSystems to
> subtypes
> > like LocalFileSytem. I don’t know how common that would be in user code.
> > >
> > >> Am 28.10.2016 um 14:27 schrieb Ufuk Celebi <u...@apache.org>:
> > >>
> > >> Thanks for all your feedback.
> > >>
> > >> If there are no objections, I would like to stick to the mentioned
> > >> issues in this thread and create RC1 as soon as they are all
> > >> addressed. This will probably not be this week though, but it looks
> > >> good for next week.
> > >>
> > >> DONE
> > >> =====
> > >> - FLINK-4619: Answer client if savepoint restore fails
> > >> - FLINK-4715: Safety net for stuck task cancellation
> > >> - FLINK-4510: Always create CheckpointCoordinator
> > >> - FLINK-4894: Don't block on buffer request after broadcast event
> > >> - FLINK-4298: Add proper repository for Closure dependencies
> > >> - FLINK-4218: Do not fail checkpoints when state size cannot be
> > determined
> > >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in case
> > >> they notice quarantine
> > >> - FLINK-4875: Use correct operator name
> > >> - FLINK-4913: Include user jars in system class loader
> > >>
> > >> PENDING REVIEW
> > >> ===============
> > >> - FLINK-4445: Add option to ignore unmatched state when restoring from
> > >> savepoint => https://github.com/apache/flink/pull/2713
> > >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
> > >> => https://github.com/apache/flink/pull/2711
> > >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
> > >> ExecutionGraph => https://github.com/apache/flink/pull/2701
> > >>
> > >> OPEN
> > >> =====
> > >> - FLINK-4904: Add a limit for how much data may be spilled in
> > >> checkpoint alignments => fix pending
> > >> - FLINK-4910: Introduce safety net for closing file system streams =>
> > >> @Stephan, Stefan: What's the conclusion of your discussion whether to
> > >> backport this or not?
> > >>
> > >>
> > >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress <danbr...@gmail.com>
> wrote:
> > >>> +1 for this release,
> > >>> also +1 to Chesnay's suggesting for including this: [FLINK-4875]
> > [metrics]
> > >>> Use correct operator name
> > >>>
> > >>> Dan
> > >>>
> > >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann <trohrm...@apache.org>
> > wrote:
> > >>>
> > >>>> I'll work on FLINK-3347. Additionally I would like to get in
> > >>>>
> > >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
> > >>>> ExecutionGraph fail when in state Restarting
> > >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
> > >>>> ExecutionGraph.scheduleOrUpdateConsumers
> > >>>> can fail the ExecutionGraph
> > >>>>
> > >>>> Cheers,
> > >>>> Till
> > >>>>
> > >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <se...@apache.org>
> > wrote:
> > >>>>
> > >>>>> Concerning backporting the "I/O streams safety net" - we need to
> make
> > >>>> sure
> > >>>>> that this does not change any behavior that users may implicitly
> > expect.
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <
> m...@apache.org
> > >
> > >>>>> wrote:
> > >>>>>
> > >>>>>> +1 for a 1.1.4 release
> > >>>>>>
> > >>>>>> We could backport putting user jars into the system class loader
> for
> > >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> > >>>>>> Arguably, this is somewhat a new feature but it gets rid of
> > duplicate
> > >>>>>> class loading issues users experienced in practice.
> > >>>>>>
> > >>>>>> We already have the following commits on the release-1.1 branch:
> > >>>>>>
> > >>>>>> 05a5f46 [FLINK-4862] fix Timer register in
> > ContinuousEventTimeTrigger
> > >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable
> > driver
> > >>>>>> found for jdbc:calcite"
> > >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> > >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> > >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent
> > >>>> updates
> > >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in
> > >>>>> InputStreamFSInputWrapper
> > >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for
> > >>>>> InputFormats.
> > >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of
> CsvOutputFormat
> > >>>> about
> > >>>>>> incorrect default of allowNullValues
> > >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
> > >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI
> > examples.
> > >>>>>>
> > >>>>>> -Max
> > >>>>>>
> > >>>>>>
> > >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>> +1
> > >>>>>>>
> > >>>>>>> Looking forward this release !
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> ⁣
> > >>>>>>>
> > >>>>>>> On Oct 25, 2016, 14:43, at 14:43, Robert Metzger <
> > >>>> rmetz...@apache.org>
> > >>>>>> wrote:
> > >>>>>>>> +1 for a bugfix release soon.
> > >>>>>>>>
> > >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <
> se...@apache.org>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks fort starting this Ufuk.
> > >>>>>>>>>
> > >>>>>>>>> I would like to add the following issues to 1.1.4:
> > >>>>>>>>>
> > >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
> > >>>>>>>>>    - [FLINK-4298] [storm compatibility] Add proper repository
> for
> > >>>>>>>> Closure
> > >>>>>>>>> dependencies.
> > >>>>>>>>>
> > >>>>>>>>> Stability on S3 considering eventual consistency *(fix
> pending)*
> > >>>>>>>>>    - [FLINK-4218] [checkpoints] Do not fail checkpoints when
> > state
> > >>>>>>>> size
> > >>>>>>>>> cannot be determined
> > >>>>>>>>>
> > >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
> > >>>>>>>>>    - [FLINK-3347] [akka] TaskManager (or its ActorSystem) need
> to
> > >>>>>>>> restart
> > >>>>>>>>> in case they notice quarantine
> > >>>>>>>>>
> > >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint
> > >>>>>>>> alignments
> > >>>>>>>>> *(fix
> > >>>>>>>>> is work in progress)*
> > >>>>>>>>>    - [FLINK-4904] [checkpoints] Add a limit for how much data
> may
> > >>>> be
> > >>>>>>>>> spilled in checkpoint alignments
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit,
> the
> > >>>>>>>> fourth one
> > >>>>>>>>> later today.
> > >>>>>>>>> The third one (akka) is still pending.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Stephan
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <u...@apache.org>
> > >>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hey all,
> > >>>>>>>>>>
> > >>>>>>>>>> I would like to start the discussion for kicking off the next
> > bug
> > >>>>>>>> fix
> > >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for a RC
> by
> > >>>>>>>> end
> > >>>>>>>>>> of this week?
> > >>>>>>>>>>
> > >>>>>>>>>> Users reported some instabilities/inconveniences that would be
> > >>>> good
> > >>>>>>>> to
> > >>>>>>>>> fix.
> > >>>>>>>>>>
> > >>>>>>>>>> Personally, I would like to backport the following fixes:
> > >>>>>>>>>>
> > >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer
> > >>>>> client
> > >>>>>>>> if
> > >>>>>>>>>> savepoint restore fails (Already merged for master, needs
> > minimal
> > >>>>>>>>>> adjustment for 1.1)
> > >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety
> > net
> > >>>>>>>> for
> > >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting
> > for
> > >>>>>>>>>> tests to finish of backport)
> > >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always
> > >>>>> create
> > >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs
> minimal
> > >>>>>>>>>> adjustments for 1.1)
> > >>>>>>>>>>
> > >>>>>>>>>> Furthermore, I would like to address the following:
> > >>>>>>>>>>
> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add
> > option
> > >>>>> to
> > >>>>>>>>>> ignore unmatched state when restoring from savepoint
> > >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't
> > >>>> block
> > >>>>>>>> on
> > >>>>>>>>>> buffer request after broadcast event
> > >>>>>>>>>>
> > >>>>>>>>>> Strictly speaking, the (4) is not a bug fix. But given that it
> > >>>>>>>> would
> > >>>>>>>>>> only add an optional flag to savepoint restoring and should
> have
> > >>>>>>>> been
> > >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >
> >
>

Reply via email to