Re: [DISCUSS] Release Flink 1.1.5 / Flink 1.2.1

Jinkui Shi Thu, 16 Mar 2017 01:42:04 -0700

@Tzu-Li (Gordon) Tai

FLINK-5650 is fixed by [1]. Chesnay Schepler, please push a PR.


[1] https://github.com/zentol/flink/tree/5650_python_test_debug


> On Mar 16, 2017, at 3:37 AM, Stephan Ewen <[email protected]> wrote:
> 
> Thanks for the update!
> 
> Just merged to 1.2.1 also: [FLINK-5962] [checkpoints] Remove scheduled
> cancel-task from timer queue to prevent memory leaks
> 
> The remaining issue list looks good, but I would say that (5) is optional.
> It is not a critical production bug.
> 
> 
> 
> On Wed, Mar 15, 2017 at 5:38 PM, Tzu-Li (Gordon) Tai <[email protected]>
> wrote:
> 
>> Thanks a lot for the updates so far everyone!
>> 
>> From the discussion so far, the issues below are still pending for the
>> 1.1.5 / 1.2.1 releases.
>> 
>> Since there’s only one backport for 1.1.5 left, I think having an RC for
>> 1.1.5 near the end of this week / early next week is very promising, as
>> basically everything is already in.
>> I’d be happy to volunteer to help manage the release for 1.1.5, and
>> prepare the RC when it’s ready :)
>> 
>> For 1.2.1, we can leave the pending list here for tracking, and come back
>> to update it in the near future.
>> 
>> If there’s anything I missed, please let me know!
>> 
>> 
>> =========== Still pending for Flink 1.1.5 ===========
>> 
>> (1) https://issues.apache.org/jira/browse/FLINK-5701
>> Broken at-least-once Kafka producer.
>> Status: backport PR pending - https://github.com/apache/flink/pull/3549.
>> Since it is a relatively self-contained change, I expect this to be a fast
>> fix.
>> 
>> 
>> 
>> =========== Still pending for Flink 1.2.1 ===========
>> 
>> (1) https://issues.apache.org/jira/browse/FLINK-5808
>> Fix Missing verification for setParallelism and setMaxParallelism
>> Status: PR - https://github.com/apache/flink/pull/3509, review in progress
>> 
>> (2) https://issues.apache.org/jira/browse/FLINK-5713
>> Protect against NPE in WindowOperator window cleanup
>> Status: PR - https://github.com/apache/flink/pull/3535, review pending
>> 
>> (3) https://issues.apache.org/jira/browse/FLINK-6044
>> TypeSerializerSerializationProxy.read() doesn't verify the read buffer
>> length
>> Status: Fixed for master, 1.2 backport pending
>> 
>> (4) https://issues.apache.org/jira/browse/FLINK-5985
>> Flink treats every task as stateful (making topology changes impossible)
>> Status: PR - https://github.com/apache/flink/pull/3543, review in progress
>> 
>> (5) https://issues.apache.org/jira/browse/FLINK-5650
>> Flink-python tests taking up too much time
>> Status: I think Chesnay currently has some progress with this one, we can
>> see if we want to make this a blocker
>> 
>> 
>> Cheers,
>> Gordon
>> 
>> On March 15, 2017 at 7:16:53 PM, Jinkui Shi ([email protected]) wrote:
>> 
>> Can we fix this issue in 1.2.1:
>> 
>> Flink-python tests take too long
>> https://issues.apache.org/jira/browse/FLINK-5650
>> 
>>> On Mar 15, 2017, at 6:29 PM, Vladislav Pernin <[email protected]> wrote:
>>> 
>>> I just tested it in my reproducer. It works.
>>> 
>>> 2017-03-15 11:22 GMT+01:00 Aljoscha Krettek <[email protected]>:
>>> 
>>>> I did in fact just open a PR for
>>>>> https://issues.apache.org/jira/browse/FLINK-6001
>>>>> NPE on TumblingEventTimeWindows with ContinuousEventTimeTrigger and
>>>>> allowedLateness
>>>> 
>>>> 
>>>> On Tue, Mar 14, 2017, at 18:20, Vladislav Pernin wrote:
>>>>> Hi,
>>>>> 
>>>>> I would also include the following (not yet resolved) issue in the 1.2.1
>>>>> scope:
>>>>> 
>>>>> https://issues.apache.org/jira/browse/FLINK-6001
>>>>> NPE on TumblingEventTimeWindows with ContinuousEventTimeTrigger and
>>>>> allowedLateness
>>>>> 
>>>>> 2017-03-14 17:34 GMT+01:00 Ufuk Celebi <[email protected]>:
>>>>> 
>>>>>> Big +1 Gordon!
>>>>>> 
>>>>>> I think (10) is very critical to have in 1.2.1.
>>>>>> 
>>>>>> – Ufuk
>>>>>> 
>>>>>> 
>>>>>> On Tue, Mar 14, 2017 at 3:37 PM, Stefan Richter
>>>>>> <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I would suggest also including the following in 1.2.1:
>>>>>>> 
>>>>>>> (9) https://issues.apache.org/jira/browse/FLINK-6044
>>>>>>> Replaces unintentional calls to InputStream#read(…) with the intended
>>>>>>> and correct InputStream#readFully(…)
>>>>>>> Status: PR
>>>>>>> 
>>>>>>> (10) https://issues.apache.org/jira/browse/FLINK-5985
>>>>>>> Flink 1.2 was creating state handles for stateless tasks, which caused
>>>>>>> trouble at restore time for users who wanted to make changes to their
>>>>>>> topology that only involve stateless operators.
>>>>>>> Status: PR
>>>>>>> 
>>>>>>> 
>>>>>>>> On 14.03.2017 at 15:15, Till Rohrmann <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Thanks for kicking off the discussion Tzu-Li. I'd like to add the
>>>>>>>> following issues, which have already been merged into the 1.2-release
>>>>>>>> and 1.1-release branches:
>>>>>>>> 
>>>>>>>> 1.2.1:
>>>>>>>> 
>>>>>>>> (7) https://issues.apache.org/jira/browse/FLINK-5942
>>>>>>>> Hardens the checkpoint recovery in case of corrupted ZooKeeper data.
>>>>>>>> Corrupted checkpoints will now be skipped.
>>>>>>>> Status: Merged
>>>>>>>> 
>>>>>>>> (8) https://issues.apache.org/jira/browse/FLINK-5940
>>>>>>>> Hardens the checkpoint recovery in case that we cannot retrieve the
>>>>>>>> completed checkpoint from the meta data state handle retrieved from
>>>>>>>> ZooKeeper. This can, for example, happen if the meta data is
>>>> deleted.
>>>>>>>> Checkpoints with unretrievable state handles are skipped.
>>>>>>>> Status: Merged
>>>>>>>> 
>>>>>>>> 1.1.5:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (7) https://issues.apache.org/jira/browse/FLINK-5942
>>>>>>>> Hardens the checkpoint recovery in case of corrupted ZooKeeper data.
>>>>>>>> Corrupted checkpoints will now be skipped.
>>>>>>>> Status: Merged
>>>>>>>> 
>>>>>>>> (8) https://issues.apache.org/jira/browse/FLINK-5940
>>>>>>>> Hardens the checkpoint recovery in case that we cannot retrieve the
>>>>>>>> completed checkpoint from the meta data state handle retrieved from
>>>>>>>> ZooKeeper. This can, for example, happen if the meta data is
>>>> deleted.
>>>>>>>> Checkpoints with unretrievable state handles are skipped.
>>>>>>>> Status: Merged
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>> 
>>>>>>>> On Tue, Mar 14, 2017 at 12:02 PM, Tzu-Li (Gordon) Tai <
>>>>>> [email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi all!
>>>>>>>>> 
>>>>>>>>> I would like to start a discussion for the next bugfix release for
>>>>>> 1.1.x
>>>>>>>>> and 1.2.x.
>>>>>>>>> There have been quite a few critical fixes for bugs in both releases
>>>>>>>>> recently, and I think they deserve a bugfix release soon.
>>>>>>>>> Most of the bugs were reported by users.
>>>>>>>>> 
>>>>>>>>> I’m starting the discussion for both bugfix releases because most
>>>> fixes
>>>>>>>>> span both releases (almost identical).
>>>>>>>>> Of course, the actual RC votes and RC creation process don’t have to
>>>>>>>>> be started together.
>>>>>>>>> 
>>>>>>>>> Here’s an overview of what’s been collected so far, for both bugfix
>>>>>>>>> releases -
>>>>>>>>> (it’s a list of what I’m aware of so far, and may be missing stuff;
>>>>>> please
>>>>>>>>> append and bring to attention as necessary :-) )
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> For Flink 1.2.1:
>>>>>>>>> 
>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-5701:
>>>>>>>>> Async exceptions in the FlinkKafkaProducer are not checked on
>>>>>> checkpoints.
>>>>>>>>> This compromises the producer’s at-least-once guarantee.
>>>>>>>>> Status: merged
>>>>>>>>> 
>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-5949:
>>>>>>>>> Do not check Kerberos credentials for non-Kerberos authentications.
>>>>>> MapR
>>>>>>>>> users are affected by this, and cannot submit Flink on YARN jobs
>>>> on a
>>>>>>>>> secured MapR cluster.
>>>>>>>>> Status: PR - https://github.com/apache/flink/pull/3528, one +1
>>>> already
>>>>>>>>> 
>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-6006:
>>>>>>>>> Kafka Consumer can lose state if queried partition list is
>>>> incomplete
>>>>>> on
>>>>>>>>> restore.
>>>>>>>>> Status: PR - https://github.com/apache/flink/pull/3505, one +1
>>>> already
>>>>>>>>> 
>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-6025:
>>>>>>>>> KryoSerializer may use the wrong classloader when Kryo’s
>>>>>> JavaSerializer is
>>>>>>>>> used.
>>>>>>>>> Status: merged
>>>>>>>>> 
>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-5771:
>>>>>>>>> Fix multi-char delimiters in Batch InputFormats.
>>>>>>>>> Status: merged
>>>>>>>>> 
>>>>>>>>> (6) https://issues.apache.org/jira/browse/FLINK-5934:
>>>>>>>>> Set the Scheduler in the ExecutionGraph via its constructor. This
>>>>>> fixes a
>>>>>>>>> bug that causes HA recovery to fail.
>>>>>>>>> Status: merged
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> For Flink 1.1.5:
>>>>>>>>> 
>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-5701:
>>>>>>>>> Async exceptions in the FlinkKafkaProducer are not checked on
>>>>>> checkpoints.
>>>>>>>>> This compromises the producer’s at-least-once guarantee.
>>>>>>>>> Status: This is already merged for 1.2.1. I would personally like
>>>> to
>>>>>>>>> backport the fix for this to 1.1.5 also.
>>>>>>>>> 
>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-6006:
>>>>>>>>> Kafka Consumer can lose state if queried partition list is
>>>> incomplete
>>>>>> on
>>>>>>>>> restore.
>>>>>>>>> Status: PR - https://github.com/apache/flink/pull/3507, one +1
>>>> already
>>>>>>>>> 
>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-6025:
>>>>>>>>> KryoSerializer may use the wrong classloader when Kryo’s
>>>>>> JavaSerializer is
>>>>>>>>> used.
>>>>>>>>> Status: merged
>>>>>>>>> 
>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-5771:
>>>>>>>>> Fix multi-char delimiters in Batch InputFormats.
>>>>>>>>> Status: merged
>>>>>>>>> 
>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-5934:
>>>>>>>>> Set the Scheduler in the ExecutionGraph via its constructor. This
>>>>>> fixes a
>>>>>>>>> bug that causes HA recovery to fail.
>>>>>>>>> Status: merged
>>>>>>>>> 
>>>>>>>>> (6) https://issues.apache.org/jira/browse/FLINK-5048:
>>>>>>>>> Kafka Consumer (0.9/0.10) threading model leads to problematic
>>>>>>>>> cancellation behavior.
>>>>>>>>> Status: This fix was already released in 1.2.0, but never made it
>>>> into
>>>>>> the
>>>>>>>>> 1.1.x bugfixes. Do we want to backport this also for 1.1.5?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> What do you think? From the list so far, we pretty much already
>>>> have
>>>>>>>>> everything in, so I think it would be nice to aim for RCs by the
>>>> end of
>>>>>>>>> this week.
>>>>>>>>> Since both bugfix releases cover almost the same list of issues, I
>>>>>> think
>>>>>>>>> it shouldn’t be too hard for us to kick off both bugfix releases
>>>>>> around the
>>>>>>>>> same time.
>>>>>>>>> 
>>>>>>>>> Also FYI, here are the lists of JIRA tickets tagged with “1.2.1” /
>>>>>>>>> “1.1.5” as the Fix Version that are still open.
>>>>>>>>> We should probably check whether there’s anything on there that we
>>>>>>>>> should block on for the releases:
>>>>>>>>> 
>>>>>>>>> For 1.2.1:
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-5711?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%201.2.1
>>>>>>>>> 
>>>>>>>>> For 1.1.5:
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-6006?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%201.1.5
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
>>
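
For readers unfamiliar with the bug class behind FLINK-6044 (item (9) in the quoted thread, "read(…)" versus "readFully(…)"), here is a minimal sketch. It is not taken from the Flink code base. The class ReadFullyDemo and the helper TrickleStream are hypothetical names made up for illustration. Note that readFully is a method of java.io.DataInputStream (via DataInput), not of InputStream itself, so the sketch wraps the stream. The point is only that read(byte[]) may return after filling part of the buffer, while readFully keeps reading until the buffer is full or throws EOFException.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {

    /**
     * Hypothetical stream that hands out at most one byte per bulk read,
     * mimicking a source (e.g. a network stream) that returns short reads.
     */
    static final class TrickleStream extends InputStream {
        private final InputStream in;

        TrickleStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Deliberately deliver no more than one byte per call.
            return in.read(b, off, Math.min(len, 1));
        }
    }

    /** Single read(byte[]) call: returns how many bytes were actually filled. */
    public static int plainRead(byte[] data) throws IOException {
        byte[] buf = new byte[data.length];
        try (InputStream in = new TrickleStream(data)) {
            return in.read(buf);
        }
    }

    /** readFully: loops internally until the whole buffer is filled. */
    public static byte[] fullRead(byte[] data) throws IOException {
        byte[] buf = new byte[data.length];
        try (DataInputStream in = new DataInputStream(new TrickleStream(data))) {
            in.readFully(buf);
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4};
        System.out.println("read() filled " + plainRead(data) + " of " + data.length + " bytes");
        System.out.println("readFully() filled " + fullRead(data).length + " bytes");
    }
}
```

Code that ignores the return value of read(byte[]) silently works with a partially filled buffer whenever the underlying stream returns a short read, which is why unverified reads in deserialization paths are worth treating as a release-relevant bug.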
