Roman Khachatryan created FLINK-22232:
-----------------------------------------

             Summary: Improve test coverage for network stack
                 Key: FLINK-22232
                 URL: https://issues.apache.org/jira/browse/FLINK-22232
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing, Runtime / Network, Tests
            Reporter: Roman Khachatryan
            Assignee: Piotr Nowojski
             Fix For: 1.13.0


This is a follow-up ticket after FLINK-20097.

With the current setup (UnalignedITCase):
- race conditions are not detected reliably (roughly once per tens of runs)
- detecting them requires changing the configuration (low checkpoint timeout)
- adding a new job graph often reveals a new bug

An additional issue with the current setup is that it is difficult to git bisect over long commit ranges.

Changes that might hide the bugs:
- having Preconditions in ChannelStatePersister (they slow down processing)
- some Preconditions may mask errors by causing a job restart
- timings in tests (UnalignedITCase)

Some options to consider:
1. chaos monkey tests, including induced latency and/or CPU bursts, on different workloads/configs
2. side-by-side tests with randomized inputs/configs

Extending Jepsen coverage further (validating output) does not seem promising in the context of Flink because its output isn't linearisable.

Some tools for (1) that could be used:
1. https://github.com/chaosblade-io/chaosblade (docs need translation)
2. https://github.com/Netflix/chaosmonkey - requires Spinnaker (CD)
3. JVM agent: https://github.com/mrwilson/byte-monkey
4. https://vmware.github.io/mangle/ - supports Java method latency; UI-oriented?; not actively maintained?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
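
To illustrate option 2 (side-by-side tests with randomized inputs/configs), here is a minimal sketch of the idea rather than a proposal for the actual test. Everything in it is hypothetical: `SideBySideCheck`, `Config.bufferSize`, and the toy `copyThroughBuffers` stand in for real Flink components and options. The point it shows is that input and config are both derived from a single seed, so a failing run can be replayed exactly, which should also make bisecting easier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;
import java.util.Random;
import java.util.function.BiFunction;
import java.util.function.Function;

final class SideBySideCheck {

    /** Hypothetical knob standing in for a real setting (not an actual Flink option). */
    static final class Config {
        final int bufferSize;
        Config(int bufferSize) { this.bufferSize = bufferSize; }
    }

    static List<Integer> randomInput(Random rnd) {
        int n = rnd.nextInt(50);
        List<Integer> input = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            input.add(rnd.nextInt(1000));
        }
        return input;
    }

    static Config randomConfig(Random rnd) {
        return new Config(1 + rnd.nextInt(8));
    }

    /** Runs reference and implementation under test on the same seed-derived input/config. */
    static boolean agrees(
            long seed,
            Function<List<Integer>, List<Integer>> reference,
            BiFunction<List<Integer>, Config, List<Integer>> underTest) {
        Random rnd = new Random(seed);
        List<Integer> input = randomInput(rnd);
        Config config = randomConfig(rnd);
        return reference.apply(input).equals(underTest.apply(input, config));
    }

    /** Returns the first seed on which the two sides disagree, if any. */
    static OptionalLong firstFailingSeed(
            long startSeed,
            int runs,
            Function<List<Integer>, List<Integer>> reference,
            BiFunction<List<Integer>, Config, List<Integer>> underTest) {
        for (long seed = startSeed; seed < startSeed + runs; seed++) {
            if (!agrees(seed, reference, underTest)) {
                return OptionalLong.of(seed);
            }
        }
        return OptionalLong.empty();
    }

    /** Toy code under test: copies input through fixed-size buffers. */
    static List<Integer> copyThroughBuffers(List<Integer> input, Config config, boolean flushTail) {
        List<Integer> out = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        for (int value : input) {
            buffer.add(value);
            if (buffer.size() == config.bufferSize) {
                out.addAll(buffer);
                buffer.clear();
            }
        }
        if (flushTail) {
            out.addAll(buffer); // omitting this simulates a bug that drops the last partial buffer
        }
        return out;
    }

    public static void main(String[] args) {
        Function<List<Integer>, List<Integer>> reference = in -> new ArrayList<>(in);
        System.out.println("correct: " + firstFailingSeed(0L, 200, reference,
                (in, c) -> copyThroughBuffers(in, c, true)));
        System.out.println("buggy:   " + firstFailingSeed(0L, 200, reference,
                (in, c) -> copyThroughBuffers(in, c, false)));
    }
}
```

In a real test the reference side would presumably be the same job run with a known-good configuration (for example aligned checkpoints), and the reported seed would be logged so the failing case can be rerun on any commit.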