Re: [DISCUSS] Releasing Flink 1.1.4

Ufuk Celebi Thu, 10 Nov 2016 14:29:40 -0800

The last fixes are finally in. Thanks to everyone who participated in the 
discussion.


I will now create the release artifacts and start the vote tomorrow (CET).

– Ufuk

On 8 November 2016 at 19:02:46, Stephan Ewen ([email protected]) wrote:
> I opened a pull request for the backport of [FLINK-4904]
>  
>  
> https://github.com/apache/flink/pull/2773
>  
>  
> On Tue, Nov 8, 2016 at 2:00 PM, Stephan Ewen wrote:
>  
> > The issue FLINK-4904 (Add a limit for how much data may be spilled in
> > checkpoint alignments) is done for master and I am currently backporting
> > it. Hope to finish that this week...
> >
> > Stephan
> >
> >
> > On Wed, Nov 2, 2016 at 5:03 PM, Till Rohrmann wrote:
> >
> >> It might make sense to backport
> >>
> >> - [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM
> >> side: https://github.com/apache/flink/pull/2742
> >>
> >> as well. This will allow us to activate the quarantine monitoring by
> >> default in 1.1.4 without the risk of killing all TMs if the JM fails.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi wrote:
> >>
> >> > As a quick update: the "pending review" issues have all been resolved.
> >> >
> >> > The following issues are still open:
> >> >
> >> > - FLINK-4904: Add a limit for how much data may be spilled in
> >> > checkpoint alignments => fix pending
> >> > - FLINK-4910: Introduce safety net for closing file system streams
> >> >
> >> > Any updates here?
> >> >
> >> > – Ufuk
> >> >
> >> >
> >> > On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter wrote:
> >> > > The benefit of a backport, as I see it, is increased stability. The
> >> > > danger is potentially breaking some code that was casting FileSystems
> >> > > to subtypes like LocalFileSystem. I don’t know how common that would be
> >> > > in user code.
> >> > >
> >> > >> On 28.10.2016 at 14:27, Ufuk Celebi wrote:
> >> > >>
> >> > >> Thanks for all your feedback.
> >> > >>
> >> > >> If there are no objections, I would like to stick to the mentioned
> >> > >> issues in this thread and create RC1 as soon as they are all
> >> > >> addressed. This will probably not be this week though, but it looks
> >> > >> good for next week.
> >> > >>
> >> > >> DONE
> >> > >> =====
> >> > >> - FLINK-4619: Answer client if savepoint restore fails
> >> > >> - FLINK-4715: Safety net for stuck task cancellation
> >> > >> - FLINK-4510: Always create CheckpointCoordinator
> >> > >> - FLINK-4894: Don't block on buffer request after broadcast event
> >> > >> - FLINK-4298: Add proper repository for Closure dependencies
> >> > >> - FLINK-4218: Do not fail checkpoints when state size cannot be
> >> > >> determined
> >> > >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in
> >> > >> case they notice quarantine
> >> > >> - FLINK-4875: Use correct operator name
> >> > >> - FLINK-4913: Include user jars in system class loader
> >> > >>
> >> > >> PENDING REVIEW
> >> > >> ===============
> >> > >> - FLINK-4445: Add option to ignore unmatched state when restoring
> >> > >> from savepoint => https://github.com/apache/flink/pull/2713
> >> > >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
> >> > >> => https://github.com/apache/flink/pull/2711
> >> > >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
> >> > >> ExecutionGraph => https://github.com/apache/flink/pull/2701
> >> > >>
> >> > >> OPEN
> >> > >> =====
> >> > >> - FLINK-4904: Add a limit for how much data may be spilled in
> >> > >> checkpoint alignments => fix pending
> >> > >> - FLINK-4910: Introduce safety net for closing file system streams =>
> >> > >> @Stephan, Stefan: What's the conclusion of your discussion whether to
> >> > >> backport this or not?
> >> > >>
> >> > >>
> >> > >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress wrote:
> >> > >>> +1 for this release,
> >> > >>> also +1 to Chesnay's suggestion to include this: [FLINK-4875]
> >> > >>> [metrics] Use correct operator name
> >> > >>>
> >> > >>> Dan
> >> > >>>
> >> > >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann wrote:
> >> > >>>
> >> > >>>> I'll work on FLINK-3347. Additionally I would like to get in
> >> > >>>>
> >> > >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
> >> > >>>> ExecutionGraph fail when in state Restarting
> >> > >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
> >> > >>>> ExecutionGraph.scheduleOrUpdateConsumers
> >> > >>>> can fail the ExecutionGraph
> >> > >>>>
> >> > >>>> Cheers,
> >> > >>>> Till
> >> > >>>>
> >> > >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen wrote:
> >> > >>>>
> >> > >>>>> Concerning backporting the "I/O streams safety net" - we need to
> >> > >>>>> make sure that this does not change any behavior that users may
> >> > >>>>> implicitly expect.
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <[email protected]> wrote:
> >> > >>>>>
> >> > >>>>>> +1 for a 1.1.4 release
> >> > >>>>>>
> >> > >>>>>> We could backport putting user jars into the system class loader
> >> > >>>>>> for per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> >> > >>>>>> Arguably, this is somewhat of a new feature but it gets rid of
> >> > >>>>>> duplicate class loading issues users experienced in practice.
> >> > >>>>>>
> >> > >>>>>> We already have the following commits on the release-1.1 branch:
> >> > >>>>>>
> >> > >>>>>> 05a5f46 [FLINK-4862] fix Timer register in
> >> > >>>>>> ContinuousEventTimeTrigger
> >> > >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable
> >> > >>>>>> driver found for jdbc:calcite"
> >> > >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> >> > >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> >> > >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent
> >> > >>>>>> updates
> >> > >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in
> >> > >>>>>> InputStreamFSInputWrapper
> >> > >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for
> >> > >>>>>> InputFormats.
> >> > >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of
> >> > >>>>>> CsvOutputFormat about incorrect default of allowNullValues
> >> > >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
> >> > >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI
> >> > >>>>>> examples.
> >> > >>>>>>
> >> > >>>>>> -Max
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <[email protected]> wrote:
> >> > >>>>>>> +1
> >> > >>>>>>>
> >> > >>>>>>> Looking forward to this release!
> >> > >>>>>>>
> >> > >>>>>>> Regards
> >> > >>>>>>> JB
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>>>>> On Oct 25, 2016, at 14:43, Robert Metzger <[email protected]> wrote:
> >> > >>>>>>>> +1 for a bugfix release soon.
> >> > >>>>>>>>
> >> > >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <[email protected]> wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Thanks for starting this, Ufuk.
> >> > >>>>>>>>>
> >> > >>>>>>>>> I would like to add the following issues to 1.1.4:
> >> > >>>>>>>>>
> >> > >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
> >> > >>>>>>>>> - [FLINK-4298] [storm compatibility] Add proper repository
> >> > >>>>>>>>> for Closure dependencies.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Stability on S3 considering eventual consistency *(fix pending)*
> >> > >>>>>>>>> - [FLINK-4218] [checkpoints] Do not fail checkpoints when
> >> > >>>>>>>>> state size cannot be determined
> >> > >>>>>>>>>
> >> > >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
> >> > >>>>>>>>> - [FLINK-3347] [akka] TaskManager (or its ActorSystem) need
> >> > >>>>>>>>> to restart in case they notice quarantine
> >> > >>>>>>>>>
> >> > >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint
> >> > >>>>>>>>> alignments *(fix is work in progress)*
> >> > >>>>>>>>> - [FLINK-4904] [checkpoints] Add a limit for how much data
> >> > >>>>>>>>> may be spilled in checkpoint alignments
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit,
> >> > >>>>>>>>> the fourth one later today.
> >> > >>>>>>>>> The third one (akka) is still pending.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Best,
> >> > >>>>>>>>> Stephan
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>>> Hey all,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I would like to start the discussion for kicking off the next
> >> > >>>>>>>>>> bug fix release, Flink 1.1.4. What do you think about aiming
> >> > >>>>>>>>>> for an RC by the end of this week?
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Users reported some instabilities/inconveniences that would
> >> > >>>>>>>>>> be good to fix.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Personally, I would like to backport the following fixes:
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer
> >> > >>>>>>>>>> client if savepoint restore fails (Already merged for master,
> >> > >>>>>>>>>> needs minimal adjustment for 1.1)
> >> > >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety
> >> > >>>>>>>>>> net for stuck task cancellation (Already reviewed for master,
> >> > >>>>>>>>>> waiting for the backport's tests to finish)
> >> > >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always
> >> > >>>>>>>>>> create CheckpointCoordinator (Already merged for master, needs
> >> > >>>>>>>>>> minimal adjustments for 1.1)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Furthermore, I would like to address the following:
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add
> >> > >>>>>>>>>> option to ignore unmatched state when restoring from savepoint
> >> > >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't
> >> > >>>>>>>>>> block on buffer request after broadcast event
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Strictly speaking, (4) is not a bug fix. But given that it
> >> > >>>>>>>>>> would only add an optional flag to savepoint restoring and
> >> > >>>>>>>>>> should have been addressed for 1.1.0 already, I would like to
> >> > >>>>>>>>>> get it in.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>
> >> > >>>>>
> >> > >>>>
> >> > >
> >> >
> >>
> >
> >
>
