Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Robert Metzger Mon, 12 Oct 2020 12:03:47 -0700

Hi all!

According to the plan
<https://cwiki.apache.org/confluence/display/FLINK/1.12+Release> discussed
earlier in the release cycle, the feature freeze is expected to happen in
the week of October 26th, which is 2.5 weeks from now.


I believe now is the time to discuss whether we want to postpone the
feature freeze.
I would prefer to stick to the original schedule and instead move features
that are not ready yet to the 1.13 release.

From a stability perspective, we currently have the following situation:
- 6 blockers:
https://issues.apache.org/jira/browse/FLINK-19154?filter=12349334
Most of them are making progress; I have notified people on those where
the status is unclear.
- 80 test instabilities:
https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC
- The CI system is a bit unstable these days: The e2e tests are often
timing out. I will look into options to mitigate this.
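
For anyone who wants to see what the test-instability link above actually
filters on, the JQL query can be read off by decoding the URL. This is only a
minimal sketch using the Python standard library; the helper name decode_jql
is made up for illustration and is not part of any Flink or Jira tooling.

```python
from urllib.parse import urlsplit, parse_qs

# The test-instability link from the list above, split across lines for readability.
url = (
    "https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580"
    "&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved"
    "%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC"
    "%2C%20created%20DESC"
)


def decode_jql(link: str) -> str:
    """Return the percent-decoded value of the 'jql' query parameter (hypothetical helper)."""
    query = urlsplit(link).query
    # parse_qs percent-decodes values and returns a list per parameter name.
    return parse_qs(query)["jql"][0]


print(decode_jql(url))
```

The decoded query selects unresolved FLINK tickets labelled test-stability,
sorted by most recently updated.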



Drilling deeper into the test instabilities, these are some notable
clusters (with recent failures, usually more than once). Tests marked
with >> have nobody assigned.

E2E tests, probably all test infrastructure
>> "Kerberized YARN per-job on Docker test" fails with "Could not start
hadoop cluster." https://issues.apache.org/jira/browse/FLINK-18117
>> SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1) failed
due to download error https://issues.apache.org/jira/browse/FLINK-17424
- "ES6 ElasticsearchSinkITCase unstable"
https://issues.apache.org/jira/browse/FLINK-17159
- "Avro Confluent Schema Registry nightly end-to-end test failed with
"Register operation timed out; error code: 50002""
https://issues.apache.org/jira/browse/FLINK-19422
- "SQLClientHBaseITCase.testHBase fails on azure"
https://issues.apache.org/jira/browse/FLINK-18570

New Source API
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable"
https://issues.apache.org/jira/browse/FLINK-19427
>> "CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs"
https://issues.apache.org/jira/browse/FLINK-19448
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent gets stuck"
https://issues.apache.org/jira/browse/FLINK-19489


Distributed Coordination
- "LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with
"NoResourceAvailableException: Could not allocate the required slot within
slot request timeout"" https://issues.apache.org/jira/browse/FLINK-19237
- "TaskExecutorSubmissionTest#testFailingScheduleOrUpdateConsumers"
https://issues.apache.org/jira/browse/FLINK-17458
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange
times out" https://issues.apache.org/jira/browse/FLINK-19514
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange:
ZooKeeper unexpectedly modified"
https://issues.apache.org/jira/browse/FLINK-19458

Kafka
>> "KafkaITCase failing with "Failed to send data to Kafka: This server
does not host this topic-partition""
https://issues.apache.org/jira/browse/FLINK-18444
>> "KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
expected:<310> but was:<0>"
https://issues.apache.org/jira/browse/FLINK-17949
- "KafkaITCase.testKeyValueSupport failure due to assertion error."
https://issues.apache.org/jira/browse/FLINK-15745
- "KafkaITCase.testStartFromGroupOffsets times out on azure"
https://issues.apache.org/jira/browse/FLINK-18648
- "FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis"
https://issues.apache.org/jira/browse/FLINK-13733



On Tue, Sep 29, 2020 at 11:49 AM Dian Fu <dian0511...@gmail.com> wrote:

> Hi all,
>
> I'd like to give an update on the status of the blocker issues and build
> instabilities, as there is only one month left and the number of blocker
> issues has increased a lot compared to last week.
>
> == Blockers:
> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334
>
> Currently there are 10 blocker issues:
> - 3 performance regressions (
> https://issues.apache.org/jira/browse/FLINK-19439,
> https://issues.apache.org/jira/browse/FLINK-19440,
> https://issues.apache.org/jira/browse/FLINK-19441)
> - 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264,
> https://issues.apache.org/jira/browse/FLINK-19388,
> https://issues.apache.org/jira/browse/FLINK-19249)
> - 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445)
> - 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154)
> - 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384)
> - 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332)
>
> == Recent notable build instabilities which still have no owners:
> - New source API
>    https://issues.apache.org/jira/browse/FLINK-19253
>    SourceReaderTestBase.testAddSplitToExistingFetcher hangs
>    https://issues.apache.org/jira/browse/FLINK-19370
>    FileSourceTextLinesITCase.testContinuousTextFileSource failed as results
>    mismatch
>    https://issues.apache.org/jira/browse/FLINK-19427
>    SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable
>    https://issues.apache.org/jira/browse/FLINK-19437
>    FileSourceTextLinesITCase.testContinuousTextFileSource failed with
>    "SimpleStreamFormat is not splittable, but found split end (0) different
>    from file length (198)"
>    https://issues.apache.org/jira/browse/FLINK-19448
>    CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
> - Runtime/Network
>    https://issues.apache.org/jira/browse/FLINK-19426
>    End-to-end test sometimes fails with PartitionConnectionException
> - Unaligned Checkpoint
>    https://issues.apache.org/jira/browse/FLINK-19027
>    UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel
>    failed because of test timeout
> - Table
>    https://issues.apache.org/jira/browse/FLINK-19340
>    AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A,
>    2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
> - HBase connector
>    https://issues.apache.org/jira/browse/FLINK-18570
>    SQLClientHBaseITCase.testHBase fails on azure
>    https://issues.apache.org/jira/browse/FLINK-19447
>    HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master
>    not initialized after 200000ms"
> - Avro
>    https://issues.apache.org/jira/browse/FLINK-19422
>    Avro Confluent Schema Registry nightly end-to-end test failed with
>    "Register operation timed out; error code: 50002"
>
> Regards,
> Dian
>
> > On Sep 21, 2020, at 2:32 PM, Robert Metzger <rmetz...@apache.org> wrote:
> >
> > Hi all,
> >
> > An update on the release status:
> > 1. We have 35 days = *5 weeks left until feature freeze*
> > 2. There are currently 2 blockers for Flink
> > <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, all
> > making progress
> > 3. We have 72 test instabilities
> > <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks
> > ago). I have pinged people to help address frequent or critical
> > issues.
> >
> > Best,
> > Robert
> >
> >
> > On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rmetz...@apache.org>
> > wrote:
> >
> >> Hi all,
> >>
> >> Another two weeks have passed. We now have 5 blockers
> >> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (up
> >> 3 from 2 weeks ago), but they are all making progress.
> >>
> >> We currently have 79 test instabilities
> >> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>.
> >> Since the last report, a few have been resolved and some others have
> >> been added.
> >> I have checked the tickets, closed some old ones, and pinged people to
> >> help resolve new or frequent ones.
> >> Except for Kafka, there are no major clusters of test instabilities.
> >> Most failures come from tests that fail only rarely, spread across the
> >> entire system.
> >>
> >>
> >> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <lirui.fu...@gmail.com> wrote:
> >>
> >>> Thanks Dian for the pointer. I'll take a look.
> >>>
> >>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <dian0511...@gmail.com> wrote:
> >>>
> >>>> Thanks Rui for the info. This issue (Hive related)
> >>>> https://issues.apache.org/jira/browse/FLINK-19025 is marked as a
> >>>> blocker.
> >>>>
> >>>> Regards,
> >>>> Dian
> >>>>
> >>>>> On Aug 25, 2020, at 2:58 PM, Rui Li <lirui.fu...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Dian,
> >>>>>
> >>>>> FLINK-18682 has been fixed. Are there any other blockers in the Hive
> >>>>> connector?
> >>>>>
> >>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Two weeks have passed and it seems that none of the test stability
> >>>>>> issues have been addressed since then.
> >>>>>>
> >>>>>> Here is an updated status report of Blockers and Test instabilities:
> >>>>>>
> >>>>>> Blockers
> >>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> >>>>>> Currently 2 blockers (1x Hive, 1x CI Infra)
> >>>>>>
> >>>>>> Test-Instabilities
> >>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
> >>>>>> (total 80)
> >>>>>>
> >>>>>> Besides the issues already posted in the previous mail, here are the
> >>>>>> new instability issues which should be taken care of:
> >>>>>>
> >>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
> >>>>>> E2E test fails with "Cannot register Closeable, this
> >>>>>> subtaskCheckpointCoordinator is already closed. Closing argument."
> >>>>>>
> >>>>>> -> This is a new issue that appeared recently. It has occurred
> >>>>>> several times, may indicate a bug somewhere, and should be taken
> >>>>>> care of.
> >>>>>>
> >>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
> >>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
> >>>>>>
> >>>>>> -> There is already a PR for it, which needs review.
> >>>>>>
> >>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
> >>>>>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount
> >>>>>> on Docker test"
> >>>>>>
> >>>>>>
> >>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rmetz...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hi team,
> >>>>>>>
> >>>>>>> 2 weeks have passed since the last update. None of the test
> >>>>>>> instabilities I've mentioned have been addressed since then.
> >>>>>>>
> >>>>>>> Here's an updated status report of Blockers and Test instabilities:
> >>>>>>>
> >>>>>>> Blockers
> >>>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> >>>>>>> Currently 3 blockers (2x Hive, 1x CI Infra)
> >>>>>>>
> >>>>>>> Test-Instabilities
> >>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
> >>>>>>> (total 79) which failed recently or frequently:
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
> >>>>>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
> >>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
> >>>>>>> EndTxn(COMMIT)"
> >>>>>>>
> >>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
> >>>>>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
> >>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
> >>>>>>> InitProducerId"
> >>>>>>>
> >>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
> >>>>>>> FlinkKafkaProducerITCase
> >>>>>>> testScaleUpAfterScalingDown Timeout expired while initializing
> >>>>>>> transactional state in 60000ms.
> >>>>>>>
> >>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
> >>>>>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
> >>>>>>>
> >>>>>>> --> The first three tickets seem related.
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
> >>>>>>> StreamingKafkaITCase failure on Azure
> >>>>>>>
> >>>>>>> --> This one seems really hard to reproduce
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
> >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
> >>>>>>> hangs
> >>>>>>>
> >>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
> >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
> >>>>>>> produced no output for 900 seconds
> >>>>>>>
> >>>>>>> --> Nobody seems to feel responsible for these tickets. My guess is
> >>>>>>> that the S3 connector should have shorter timeouts / faster retries
> >>>>>>> to finish within the 15-minute test timeout, or there is really
> >>>>>>> something wrong with the code.
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
> >>>>>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
> >>>>>>> "Asked to waitFor Program"
> >>>>>>>
> >>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
> >>>>>>> ES6 ElasticsearchSinkITCase unstable
> >>>>>>>
> >>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
> >>>>>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
> >>>>>>> expected:<310> but was:<0>
> >>>>>>>
> >>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
> >>>>>>> "Avro Confluent Schema Registry nightly end-to-end test" unstable
> >>>>>>> with "Kafka cluster did not start after 120 seconds"
> >>>>>>>
> >>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
> >>>>>>> "RocksDB Memory Management end-to-end test" fails with "Current
> >>>>>>> block cache usage 202123272 larger than expected memory limit
> >>>>>>> 200000000"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetz...@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi team,
> >>>>>>>>
> >>>>>>>> We would like to use this thread as a permanent thread for
> >>>>>>>> regularly syncing on stale blockers (need to have somebody
> >>>>>>>> assigned within a week and progress, or a good plan) and build
> >>>>>>>> instabilities (need to check if it's a blocker).
> >>>>>>>>
> >>>>>>>> Recent test-instabilities:
> >>>>>>>>
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test
> >>>>>>>> unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test
> >>>>>>>> unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949
> >>>>>>>> (KafkaShuffleITCase)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
> >>>>>>>> transactions)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It would be nice if the committers taking care of these components
> >>>>>>>> could look into the test failures.
> >>>>>>>> If nothing happens, we'll personally reach out to people who we
> >>>>>>>> believe could look into the tickets.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Dian & Robert
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Best regards!
> >>>>> Rui Li
> >>>>
> >>>>
> >>>
> >>> --
> >>> Best regards!
> >>> Rui Li
> >>>
> >>
>
>
