Hi all!

According to the plan <https://cwiki.apache.org/confluence/display/FLINK/1.12+Release> discussed earlier in the release cycle, the feature freeze is expected to happen in the week of October 26th. That's 2.5 weeks from now.

I believe now is the time to discuss whether we want to postpone the feature freeze. I would prefer to stick to the original schedule and defer features that are not ready yet to the 1.13 release.

From a stability perspective, we currently have the following situation:
- 6 blockers: https://issues.apache.org/jira/browse/FLINK-19154?filter=12349334
  Most of them are making progress; I notified people on those where the status is unclear.
- 80 test instabilities: https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC
- The CI system is a bit unstable these days: the e2e tests are often timing out. I will look into options to mitigate this.

Drilling deeper into the test instabilities, these are some notable clusters (with recent failures, usually more than once).
[tests marked with >> have nobody assigned]

E2E tests, probably all test infrastructure
>> "Kerberized YARN per-job on Docker test" fails with "Could not start hadoop cluster."
https://issues.apache.org/jira/browse/FLINK-18117
>> SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1) failed due to download error
https://issues.apache.org/jira/browse/FLINK-17424
- "ES6 ElasticsearchSinkITCase unstable"
https://issues.apache.org/jira/browse/FLINK-17159
- "Avro Confluent Schema Registry nightly end-to-end test failed with "Register operation timed out; error code: 50002""
https://issues.apache.org/jira/browse/FLINK-19422
- "SQLClientHBaseITCase.testHBase fails on azure"
https://issues.apache.org/jira/browse/FLINK-18570

New Source API
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable"
https://issues.apache.org/jira/browse/FLINK-19427
>> "CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs"
https://issues.apache.org/jira/browse/FLINK-19448
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent gets stuck"
https://issues.apache.org/jira/browse/FLINK-19489

Distributed Coordination
- "LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with "NoResourceAvailableException: Could not allocate the required slot within slot request timeout""
https://issues.apache.org/jira/browse/FLINK-19237
- "TaskExecutorSubmissionTest#testFailingScheduleOrUpdateConsumers"
https://issues.apache.org/jira/browse/FLINK-17458
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange times out"
https://issues.apache.org/jira/browse/FLINK-19514
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange: ZooKeeper unexpectedly modified"
https://issues.apache.org/jira/browse/FLINK-19458

Kafka
>> "KafkaITCase failing with "Failed to send data to Kafka: This server does not host this topic-partition""
https://issues.apache.org/jira/browse/FLINK-18444
>> "KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388 expected:<310> but was:<0>"
https://issues.apache.org/jira/browse/FLINK-17949
- "KafkaITCase.testKeyValueSupport failure due to assertion error."
https://issues.apache.org/jira/browse/FLINK-15745
- "KafkaITCase.testStartFromGroupOffsets times out on azure"
https://issues.apache.org/jira/browse/FLINK-18648
- "FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis"
https://issues.apache.org/jira/browse/FLINK-13733

On Tue, Sep 29, 2020 at 11:49 AM Dian Fu <dian0511...@gmail.com> wrote:

> Hi all,
>
> I'd like to update the status about the blocker issues and build instabilities, as there is only one month left and the number of blocker issues has increased a lot compared to last week.
>
> == Blockers:
> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334
>
> Currently there are 10 blocker issues:
> - 3 performance regression (https://issues.apache.org/jira/browse/FLINK-19439, https://issues.apache.org/jira/browse/FLINK-19440, https://issues.apache.org/jira/browse/FLINK-19441)
> - 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264, https://issues.apache.org/jira/browse/FLINK-19388, https://issues.apache.org/jira/browse/FLINK-19249)
> - 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445)
> - 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154)
> - 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384)
> - 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332)
>
> == Recent notable build instabilities which still have no owners:
> - New source API
>   https://issues.apache.org/jira/browse/FLINK-19253
>   SourceReaderTestBase.testAddSplitToExistingFetcher hangs
>   https://issues.apache.org/jira/browse/FLINK-19370
>   FileSourceTextLinesITCase.testContinuousTextFileSource failed as results mismatch
>   https://issues.apache.org/jira/browse/FLINK-19427
>   SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable
>   https://issues.apache.org/jira/browse/FLINK-19437
>   FileSourceTextLinesITCase.testContinuousTextFileSource failed with "SimpleStreamFormat is not splittable, but found split end (0) different from file length (198)"
>   https://issues.apache.org/jira/browse/FLINK-19448
>   CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
> - Runtime/Network
>   https://issues.apache.org/jira/browse/FLINK-19426
>   End-to-end test sometimes fails with PartitionConnectionException
> - Unaligned Checkpoint
>   https://issues.apache.org/jira/browse/FLINK-19027
>   UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel failed because of test timeout
> - Table
>   https://issues.apache.org/jira/browse/FLINK-19340
>   AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A, 2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
> - HBase connector
>   https://issues.apache.org/jira/browse/FLINK-18570
>   SQLClientHBaseITCase.testHBase fails on azure
>   https://issues.apache.org/jira/browse/FLINK-19447
>   HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master not initialized after 200000ms"
> - Avro
>   https://issues.apache.org/jira/browse/FLINK-19422
>   Avro Confluent Schema Registry nightly end-to-end test failed with "Register operation timed out; error code: 50002"
>
> Regards,
> Dian
>
> > On Sep 21, 2020, at 2:32 PM, Robert Metzger <rmetz...@apache.org> wrote:
> >
> > Hi all,
> >
> > An update on the release status:
> > 1. We have 35 days = *5 weeks left until feature freeze*
> > 2. There are currently 2 blockers for Flink <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, all making progress
> > 3. We have 72 test instabilities <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks ago). I have pinged people to help address frequent or critical issues.
> >
> > Best,
> > Robert
> >
> > On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rmetz...@apache.org> wrote:
> >
> >> Hi all,
> >>
> >> another two weeks have passed. We now have 5 blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (up 3 from 2 weeks ago), but they are all making progress.
> >>
> >> We currently have 79 test-instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>. Since the last report, a few have been resolved, and some others have been added.
> >> I have checked the tickets, closed some old ones and pinged people to help resolve new or frequent ones.
> >> Except for Kafka, there are no major clusters of test instabilities. Most failures are rarely failing tests across the entire system.
> >>
> >> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <lirui.fu...@gmail.com> wrote:
> >>
> >>> Thanks Dian for the pointer. I'll take a look.
> >>>
> >>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <dian0511...@gmail.com> wrote:
> >>>
> >>>> Thanks Rui for the info. This issue (hive related) https://issues.apache.org/jira/browse/FLINK-19025 is marked as a blocker.
> >>>>
> >>>> Regards,
> >>>> Dian
> >>>>
> >>>>> On Aug 25, 2020, at 2:58 PM, Rui Li <lirui.fu...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Dian,
> >>>>>
> >>>>> FLINK-18682 has been fixed. Is there any other blocker in the hive connector?
> >>>>>
> >>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Two weeks have passed and it seems that none of the test stability issues have been addressed since then.
> >>>>>>
> >>>>>> Here is an updated status report of Blockers and Test instabilities:
> >>>>>>
> >>>>>> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>: Currently 2 blockers (1x Hive, 1x CI Infra)
> >>>>>>
> >>>>>> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>: (total 80)
> >>>>>>
> >>>>>> Besides the issues already posted in the previous mail, here are the new instability issues which should be taken care of:
> >>>>>>
> >>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
> >>>>>> E2E test fails with "Cannot register Closeable, this subtaskCheckpointCoordinator is already closed. Closing argument."
> >>>>>>
> >>>>>> -> This is a new issue that occurred recently. It has occurred several times, may indicate a bug somewhere, and should be taken care of.
> >>>>>>
> >>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
> >>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
> >>>>>>
> >>>>>> -> There is already a PR for it, and it needs review.
> >>>>>>
> >>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
> >>>>>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on Docker test"
> >>>>>>
> >>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rmetz...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hi team,
> >>>>>>>
> >>>>>>> 2 weeks have passed since the last update. None of the test instabilities I've mentioned have been addressed since then.
> >>>>>>>
> >>>>>>> Here's an updated status report of Blockers and Test instabilities:
> >>>>>>>
> >>>>>>> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>: Currently 3 blockers (2x Hive, 1x CI Infra)
> >>>>>>>
> >>>>>>> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580> (total 79) which failed recently or frequently:
> >>>>>>>
> >>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown failed with "Timeout expired after 60000milliseconds while awaiting EndTxn(COMMIT)"
> >>>>>>>
> >>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634> FlinkKafkaProducerITCase.testRecoverCommittedTransaction failed with "Timeout expired after 60000milliseconds while awaiting InitProducerId"
> >>>>>>>
> >>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908> FlinkKafkaProducerITCase testScaleUpAfterScalingDown Timeout expired while initializing transactional state in 60000ms.
> >>>>>>>
> >>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
> >>>>>>>
> >>>>>>> --> The first three tickets seem related.
> >>>>>>>
> >>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260> StreamingKafkaITCase failure on Azure
> >>>>>>>
> >>>>>>> --> This one seems really hard to reproduce
> >>>>>>>
> >>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart hangs
> >>>>>>>
> >>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart produced no output for 900 seconds
> >>>>>>>
> >>>>>>> --> nobody seems to feel responsible for these tickets. My guess is that the S3 connector should have shorter timeouts / faster retries to finish within the 15 minutes test timeout. OR there is really something wrong with the code.
> >>>>>>>
> >>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333> UnsignedTypeConversionITCase failed caused by MariaDB4j "Asked to waitFor Program"
> >>>>>>>
> >>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159> ES6 ElasticsearchSinkITCase unstable
> >>>>>>>
> >>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388 expected:<310> but was:<0>
> >>>>>>>
> >>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222> "Avro Confluent Schema Registry nightly end-to-end test" unstable with "Kafka cluster did not start after 120 seconds"
> >>>>>>>
> >>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511> "RocksDB Memory Management end-to-end test" fails with "Current block cache usage 202123272 larger than expected memory limit 200000000"
> >>>>>>>
> >>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetz...@apache.org> wrote:
> >>>>>>>
> >>>>>>>> Hi team,
> >>>>>>>>
> >>>>>>>> We would like to use this thread as a permanent thread for regularly syncing on stale blockers (need to have somebody assigned within a week and progress, or a good plan) and build instabilities (need to check if it's a blocker).
> >>>>>>>>
> >>>>>>>> Recent test-instabilities:
> >>>>>>>>
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949 (KafkaShuffleITCase)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka transactions)
> >>>>>>>>
> >>>>>>>> It would be nice if the committers taking care of these components could look into the test failures.
> >>>>>>>> If nothing happens, we'll personally reach out to people we believe could look into the tickets.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Dian & Robert
> >>>>>
> >>>>> --
> >>>>> Best regards!
> >>>>> Rui Li
> >>>
> >>> --
> >>> Best regards!
> >>> Rui Li