[ https://issues.apache.org/jira/browse/FLINK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Piotr Nowojski updated FLINK-20103:
-----------------------------------
    Fix Version/s:     (was: 1.17.0)

> Improve test coverage with chaos testing & side-by-side tests
> -------------------------------------------------------------
>
>                 Key: FLINK-20103
>                 URL: https://issues.apache.org/jira/browse/FLINK-20103
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Network, Runtime / State Backends, Tests
>            Reporter: Roman Khachatryan
>            Priority: Minor
>              Labels: auto-deprioritized-major, pull-request-available
>
> This is a follow-up ticket after FLINK-20097.
> With the current setup (UnalignedITCase):
> - race conditions are not detected reliably (roughly 1 per tens of runs)
> - detecting them requires changing the configuration (low checkpoint timeout)
> - adding a new job graph often reveals a new bug
> An additional issue with the current setup is that it is difficult to git bisect (for long ranges).
> Changes that might hide the bugs:
> - having Preconditions in ChannelStatePersister (they slow down processing)
> - some Preconditions may mask errors by causing a job restart
> - timings in tests (UnalignedITCase)
> Some options to consider:
> 1. chaos monkey tests including induced latency and/or CPU bursts, on different workloads/configs
> 2. side-by-side tests with randomized inputs/configs
> Extending Jepsen coverage further (validating output) does not seem promising in the context of Flink because its output isn't linearisable.
>
> Some tools for option 1 (chaos monkey tests) that could be used:
> 1. https://github.com/chaosblade-io/chaosblade (docs need translation)
> 2. https://github.com/Netflix/chaosmonkey - requires Spinnaker (CD)
> 3. JVM agent: https://github.com/mrwilson/byte-monkey
> 4. https://vmware.github.io/mangle/ - supports Java method latency; UI oriented?; not actively maintained?
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
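
For illustration only, the "side-by-side tests with randomized inputs/configs" option from the ticket could look roughly like the sketch below. It is not taken from the Flink code base: the class `SideBySideCheck`, its methods, and the toy workload (summing a list of integers) are all hypothetical stand-ins. The idea it shows is to feed the same seeded random input to a reference implementation and to the implementation under test, and to report the first seed on which they disagree so that the failure can be replayed deterministically (which also helps with git bisect).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

/** Hypothetical side-by-side harness; not part of Flink. */
public class SideBySideCheck {

    /** Builds a reproducible random input from a seed (toy workload: a list of ints). */
    public static List<Integer> randomInput(long seed) {
        Random rnd = new Random(seed);
        int size = rnd.nextInt(50);
        List<Integer> input = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            input.add(rnd.nextInt(1000));
        }
        return input;
    }

    /**
     * Runs both implementations on identical seeded inputs and returns the first
     * seed whose outputs differ, or -1 if all runs agree.
     */
    public static <R> long firstDivergingSeed(
            Function<List<Integer>, R> reference,
            Function<List<Integer>, R> candidate,
            long firstSeed,
            int runs) {
        for (long seed = firstSeed; seed < firstSeed + runs; seed++) {
            // Each side gets its own copy of the input, generated from the same seed.
            R expected = reference.apply(randomInput(seed));
            R actual = candidate.apply(randomInput(seed));
            if (!expected.equals(actual)) {
                return seed;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Reference: plain sequential sum.
        Function<List<Integer>, Long> reference =
                in -> in.stream().mapToLong(Integer::longValue).sum();
        // Candidate with a deliberately planted bug: ignores everything past the 40th element.
        Function<List<Integer>, Long> buggy =
                in -> in.stream().limit(40).mapToLong(Integer::longValue).sum();

        System.out.println("same impl:  " + firstDivergingSeed(reference, reference, 0, 100));
        System.out.println("buggy impl: " + firstDivergingSeed(reference, buggy, 0, 100));
    }
}
```

In a real setup the two functions would presumably be replaced by two job configurations (for example aligned versus unaligned checkpoints) whose outputs are compared; that mapping is an assumption and is not described in the ticket.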