Hi all,

Two weeks have passed and it seems that none of the test stability issues have been addressed since then.
Here is an updated status report of blockers and test instabilities:

Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
Currently 2 blockers (1x Hive, 1x CI Infra)

Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>: (total 80)

Besides the issues already posted in the previous mail, here are the new instability issues which should be taken care of:

- FLINK-19012 <https://issues.apache.org/jira/browse/FLINK-19012>
  E2E test fails with "Cannot register Closeable, this subtaskCheckpointCoordinator is already closed. Closing argument."
  -> This is a new issue that appeared recently. It has occurred several times, may indicate a bug somewhere, and should be taken care of.

- FLINK-9992 <https://issues.apache.org/jira/browse/FLINK-9992>
  FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
  -> There is already a PR for it which needs review.

- FLINK-18842 <https://issues.apache.org/jira/browse/FLINK-18842>
  e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on Docker test"

(Regarding the S3 instabilities FLINK-16768 / FLINK-18374 quoted below, there is a small configuration sketch at the end of this mail.)

> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi team,
>
> 2 weeks have passed since the last update. None of the test instabilities
> I've mentioned have been addressed since then.
>
> Here's an updated status report of Blockers and Test instabilities:
>
> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> Currently 3 blockers (2x Hive, 1x CI Infra)
>
> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
> (total 79) which failed recently or frequently:
>
> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
>   FlinkKafkaProducerITCase.testScaleUpAfterScalingDown failed with
>   "Timeout expired after 60000milliseconds while awaiting EndTxn(COMMIT)"
>
> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
>   FlinkKafkaProducerITCase.testRecoverCommittedTransaction failed with
>   "Timeout expired after 60000milliseconds while awaiting InitProducerId"
>
> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
>   FlinkKafkaProducerITCase.testScaleUpAfterScalingDown: "Timeout expired
>   while initializing transactional state in 60000ms."
>
> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
>   FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
>
> --> The first three tickets seem related.
>
> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
>   StreamingKafkaITCase failure on Azure
>
> --> This one seems really hard to reproduce.
>
> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
>   HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart hangs
>
> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
>   HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
>   produced no output for 900 seconds
>
> --> Nobody seems to feel responsible for these tickets. My guess is that
> the S3 connector should have shorter timeouts / faster retries to finish
> within the 15-minute test timeout, OR there is really something wrong
> with the code.
> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
>   UnsignedTypeConversionITCase failed, caused by MariaDB4j "Asked to waitFor Program"
>
> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
>   ES6 ElasticsearchSinkITCase unstable
>
> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
>   KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>   expected:<310> but was:<0>
>
> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
>   "Avro Confluent Schema Registry nightly end-to-end test" unstable with
>   "Kafka cluster did not start after 120 seconds"
>
> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
>   "RocksDB Memory Management end-to-end test" fails with "Current block
>   cache usage 202123272 larger than expected memory limit 200000000"
>
>
> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi team,
>>
>> We would like to use this thread as a permanent thread for regularly
>> syncing on stale blockers (these need somebody assigned within a week,
>> plus progress or a good plan) and build instabilities (these need to be
>> checked to see whether they are blockers).
>>
>> Recent test-instabilities:
>>
>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>> - https://issues.apache.org/jira/browse/FLINK-16768 (S3 test unstable)
>> - https://issues.apache.org/jira/browse/FLINK-18374 (S3 test unstable)
>> - https://issues.apache.org/jira/browse/FLINK-17949 (KafkaShuffleITCase)
>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka transactions)
>>
>> It would be nice if the committers taking care of these components could
>> look into the test failures. If nothing happens, we'll personally reach
>> out to the people we believe could look into the ticket.
>>
>> Best,
>> Dian & Robert
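
Regarding the S3 tickets FLINK-16768 / FLINK-18374 quoted above: if shorter timeouts / faster retries are the way to go, the sketch below shows one way the ITCase setup could tighten them. This is a minimal, untested sketch, not a concrete proposal: the class and method names are made up, the values are illustrative guesses, and it relies on Flink's s3-hadoop filesystem mirroring "s3.*" keys onto the underlying Hadoop "fs.s3a.*" options.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileSystem;

    final class S3TestSetup {
        // Sketch: make the S3A client give up quickly, so a broken test fails
        // with an actionable exception instead of hanging for 900+ seconds.
        static void initS3WithFastFailure() {
            Configuration conf = new Configuration();
            // Flink's s3-hadoop plugin forwards "s3.*" keys to "fs.s3a.*".
            conf.setString("s3.attempts.maximum", "3");                // AWS SDK retries per request
            conf.setString("s3.retry.limit", "3");                     // S3A filesystem-level retries
            conf.setString("s3.connection.establish.timeout", "5000"); // ms to establish a connection
            conf.setString("s3.connection.timeout", "10000");          // ms socket read timeout
            FileSystem.initialize(conf);                               // apply to Flink's FileSystem registry
        }
    }

That wouldn't fix a real bug in the recoverable writer, of course, but at least the tests would fail fast with a real stack trace inside the 15-minute limit.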