Hongshun Wang created FLINK-32668:
-------------------------------------

             Summary: fix up watchdog timeout bug in common.sh(e2e test) ?
                 Key: FLINK-32668
                 URL: https://issues.apache.org/jira/browse/FLINK-32668
             Project: Flink
          Issue Type: Improvement
          Components: Build System / CI
    Affects Versions: 1.17.1
            Reporter: Hongshun Wang
             Fix For: 1.17.2
         Attachments: image-2023-07-25-15-27-37-441.png

When running the e2e tests, an error like this occurs:
!image-2023-07-25-15-27-37-441.png|width=733,height=115!

I then found a problem in the corresponding code:
{code:java}
kill_test_watchdog() {
    local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
    echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
    kill $watchdog_pid
}

internal_run_with_timeout() {
    local timeout_in_seconds="$1"
    local on_failure="$2"
    local command_label="$3"
    local command="${@:4}"

    on_exit kill_test_watchdog

    (
        command_pid=$BASHPID
        (sleep "${timeout_in_seconds}" # set a timeout for this command
        echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds."
        eval "${on_failure}"
        kill "$command_pid") & watchdog_pid=$!
        echo $watchdog_pid > $TEST_DATA_DIR/job_watchdog.pid
        # invoke
        $command
    )
}
{code}
When {{$command}} completes before the timeout, the watchdog process is killed successfully. However, when {{$command}} times out, the watchdog process kills {{$command}} and then exits on its own. The {{on_exit}} handler {{kill_test_watchdog}} then runs {{kill $watchdog_pid}} against a process that no longer exists, which produces the error message.

So I propose modifying it like this:
{code:java}
kill_test_watchdog() {
    local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
    if kill -0 $watchdog_pid > /dev/null 2>&1; then
        echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
        kill $watchdog_pid
    else
        echo "watchdog (with pid=$watchdog_pid) does not exist now"
    fi
}
{code}
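
To illustrate the two cases described above, here is a minimal standalone sketch of the guarded {{kill_test_watchdog}}. It is not the actual common.sh code: {{TEST_DATA_DIR}} is pointed at a throwaway temp directory, and {{sleep 30}} and {{true}} are stand-ins I picked for a still-running and an already-exited watchdog. The variable names {{alive_pid}}, {{dead_pid}}, {{alive_output}} and {{dead_output}} are hypothetical and exist only for this demo.

```shell
#!/bin/sh
# Demo assumption: a temporary directory plays the role of TEST_DATA_DIR.
TEST_DATA_DIR=$(mktemp -d)

# Guarded variant: only send the signal if the recorded pid is still alive.
kill_test_watchdog() {
    local watchdog_pid
    watchdog_pid=$(cat "$TEST_DATA_DIR/job_watchdog.pid")
    if kill -0 "$watchdog_pid" > /dev/null 2>&1; then
        echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
        kill "$watchdog_pid"
    else
        echo "watchdog (with pid=$watchdog_pid) does not exist now"
    fi
}

# Case 1: the command finished before the timeout, so the watchdog
# (here a stand-in "sleep 30") is still running and gets killed.
sleep 30 &
alive_pid=$!
echo "$alive_pid" > "$TEST_DATA_DIR/job_watchdog.pid"
alive_output=$(kill_test_watchdog)
wait "$alive_pid" 2>/dev/null || true   # reap the killed stand-in quietly

# Case 2: the timeout fired and the watchdog already exited
# (here a stand-in "true" that has been reaped by wait).
true &
dead_pid=$!
wait "$dead_pid"
echo "$dead_pid" > "$TEST_DATA_DIR/job_watchdog.pid"
dead_output=$(kill_test_watchdog)

echo "$alive_output"
echo "$dead_output"
rm -rf "$TEST_DATA_DIR"
```

Note that {{kill -0}} followed by {{kill}} is still a check-then-act sequence: if the watchdog exits between the two calls, {{kill}} can still print an error. If that matters, silencing the final {{kill}} (for example {{kill "$watchdog_pid" 2>/dev/null || true}}) may be a simpler alternative.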