Hi all,

Two weeks have passed and it seems that none of the test stability issues have been addressed since then.
Here is an updated status report of blockers and test instabilities:

Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
Currently 2 blockers (1x Hive, 1x CI Infra)

Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>: (total 80)

Besides the issues already posted in the previous mail, here are the new instability issues which should be taken care of:

- FLINK-19012 <https://issues.apache.org/jira/browse/FLINK-19012>
  E2E test fails with "Cannot register Closeable, this subtaskCheckpointCoordinator is already closed. Closing argument."
  -> This is a new issue that appeared recently. It has occurred several times, may indicate a bug somewhere, and should be taken care of.

- FLINK-9992 <https://issues.apache.org/jira/browse/FLINK-9992>
  FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
  -> There is already a PR for it which needs review.

- FLINK-18842 <https://issues.apache.org/jira/browse/FLINK-18842>
  e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on Docker test"

(Regarding the S3 instabilities FLINK-16768 / FLINK-18374 quoted below, there is a small configuration sketch at the end of this mail.)

> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi team,
>
> 2 weeks have passed since the last update. None of the test instabilities
> I've mentioned have been addressed since then.
>
> Here's an updated status report of Blockers and Test instabilities:
>
> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> Currently 3 blockers (2x Hive, 1x CI Infra)
>
> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
> (total 79) which failed recently or frequently:
>
> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
>   FlinkKafkaProducerITCase.testScaleUpAfterScalingDown failed with
>   "Timeout expired after 60000milliseconds while awaiting EndTxn(COMMIT)"
>
> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
>   FlinkKafkaProducerITCase.testRecoverCommittedTransaction failed with
>   "Timeout expired after 60000milliseconds while awaiting InitProducerId"
>
> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
>   FlinkKafkaProducerITCase.testScaleUpAfterScalingDown: "Timeout expired
>   while initializing transactional state in 60000ms."
>
> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
>   FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
>
> --> The first three tickets seem related.
>
> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
>   StreamingKafkaITCase failure on Azure
>
> --> This one seems really hard to reproduce.
>
> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
>   HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart hangs
>
> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
>   HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
>   produced no output for 900 seconds
>
> --> Nobody seems to feel responsible for these tickets. My guess is that
> the S3 connector should have shorter timeouts / faster retries to finish
> within the 15-minute test timeout, OR there is really something wrong
> with the code.
> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
>   UnsignedTypeConversionITCase failed, caused by MariaDB4j "Asked to waitFor Program"
>
> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
>   ES6 ElasticsearchSinkITCase unstable
>
> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
>   KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>   expected:<310> but was:<0>
>
> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
>   "Avro Confluent Schema Registry nightly end-to-end test" unstable with
>   "Kafka cluster did not start after 120 seconds"
>
> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
>   "RocksDB Memory Management end-to-end test" fails with "Current block
>   cache usage 202123272 larger than expected memory limit 200000000"
>
>
> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi team,
>>
>> We would like to use this thread as a permanent thread for regularly
>> syncing on stale blockers (these need somebody assigned within a week,
>> plus progress or a good plan) and build instabilities (these need to be
>> checked to see whether they are blockers).
>>
>> Recent test-instabilities:
>>
>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>> - https://issues.apache.org/jira/browse/FLINK-16768 (S3 test unstable)
>> - https://issues.apache.org/jira/browse/FLINK-18374 (S3 test unstable)
>> - https://issues.apache.org/jira/browse/FLINK-17949 (KafkaShuffleITCase)
>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka transactions)
>>
>> It would be nice if the committers taking care of these components could
>> look into the test failures. If nothing happens, we'll personally reach
>> out to the people we believe could look into the ticket.
>>
>> Best,
>> Dian & Robert
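
Regarding the S3 tickets FLINK-16768 / FLINK-18374 quoted above: if shorter timeouts / faster retries are the way to go, the sketch below shows one way the ITCase setup could tighten them. This is a minimal, untested sketch, not a concrete proposal: the class and method names are made up, the values are illustrative guesses, and it relies on Flink's s3-hadoop filesystem mirroring "s3.*" keys onto the underlying Hadoop "fs.s3a.*" options.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileSystem;

    final class S3TestSetup {
        // Sketch: make the S3A client give up quickly, so a broken test fails
        // with an actionable exception instead of hanging for 900+ seconds.
        static void initS3WithFastFailure() {
            Configuration conf = new Configuration();
            // Flink's s3-hadoop plugin forwards "s3.*" keys to "fs.s3a.*".
            conf.setString("s3.attempts.maximum", "3");                // AWS SDK retries per request
            conf.setString("s3.retry.limit", "3");                     // S3A filesystem-level retries
            conf.setString("s3.connection.establish.timeout", "5000"); // ms to establish a connection
            conf.setString("s3.connection.timeout", "10000");          // ms socket read timeout
            FileSystem.initialize(conf);                               // apply to Flink's FileSystem registry
        }
    }

That wouldn't fix a real bug in the recoverable writer, of course, but at least the tests would fail fast with a real stack trace inside the 15-minute limit.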