[ https://issues.apache.org/jira/browse/FLINK-32668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747441#comment-17747441 ]
Matthias Pohl commented on FLINK-32668:
---------------------------------------

As far as I can see, this failure can only occur if the watchdog exited on its own or the watchdog process was killed by some other process (which would be problematic if the corresponding test is still running). The latter case is the problematic one because we might miss killing the test. We should throw an error in that case. We don't have to worry if both processes have already exited (in this situation, maybe not even a warning is needed, just an informational message).

WDYT?

> fix up watchdog timeout error msg in common.sh(e2e test)
> ----------------------------------------------------------
>
>                 Key: FLINK-32668
>                 URL: https://issues.apache.org/jira/browse/FLINK-32668
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / CI
>    Affects Versions: 1.16.2, 1.18.0, 1.17.1
>            Reporter: Hongshun Wang
>            Assignee: Hongshun Wang
>            Priority: Minor
>         Attachments: image-2023-07-25-15-27-37-441.png
>
>
> When running the e2e tests, an error like this occurs:
> !image-2023-07-25-15-27-37-441.png|width=733,height=115!
>
> The corresponding code:
> {code:java}
> kill_test_watchdog() {
>     local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
>     echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
>     kill $watchdog_pid
> }
> internal_run_with_timeout() {
>     local timeout_in_seconds="$1"
>     local on_failure="$2"
>     local command_label="$3"
>     local command="${@:4}"
>     on_exit kill_test_watchdog
>     (
>         command_pid=$BASHPID
>         (sleep "${timeout_in_seconds}" # set a timeout for this command
>         echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds."
>         eval "${on_failure}"
>         kill "$command_pid") & watchdog_pid=$!
>         echo $watchdog_pid > $TEST_DATA_DIR/job_watchdog.pid
>         # invoke
>         $command
>     )
> }
> {code}
>
> When {{$command}} completes before the timeout, the watchdog process is
> killed successfully.
> However, when {{$command}} times out, the watchdog
> process kills {{$command}} and then exits itself, so the later
> {{kill $watchdog_pid}} in {{kill_test_watchdog}} targets a process that
> is already gone and leaves behind an error message. This error message,
> "no such process", is hard to understand.
>
> So, I will modify it like this, with a better error message:
>
> {code:java}
> kill_test_watchdog() {
>     local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
>     if kill -0 $watchdog_pid > /dev/null 2>&1; then
>         echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
>         kill $watchdog_pid
>     else
>         echo "[ERROR] Test is timeout"
>         exit 1
>     fi
> }
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
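
The liveness check discussed above can be tried in isolation. The following is only a minimal sketch, assuming bash; the helper names watchdog_state and stop_watchdog and the pid-file argument are hypothetical and are not part of Flink's common.sh. It shows how kill -0 can separate "watchdog still running" from "watchdog already gone" before any signal is sent, so that the raw "No such process" message never reaches the log.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in common.sh): report the state of the process
# whose pid is stored in the given pid file.
watchdog_state() {
    local pid_file="$1"
    local pid
    # No readable pid file means no watchdog was ever registered.
    pid=$(cat "$pid_file" 2>/dev/null) || { echo "missing"; return 0; }
    # kill -0 sends no signal; it only checks that the pid exists and
    # that the caller is allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "gone"
    fi
}

# Hypothetical variant of kill_test_watchdog: stop the watchdog only if it
# is still alive, otherwise print an informational message.
stop_watchdog() {
    local pid_file="$1"
    local pid
    case "$(watchdog_state "$pid_file")" in
        running)
            pid=$(cat "$pid_file")
            echo "Stopping job timeout watchdog (with pid=$pid)"
            kill "$pid"
            ;;
        *)
            echo "Watchdog is no longer running (it already fired or was stopped)"
            ;;
    esac
}
```

Whether the "gone" branch should be an error or just an informational message is exactly the open question in this thread; the sketch takes no side and only prints a message. Note that kill -0 can also succeed for an unrelated process if the pid has been reused in the meantime, so the check narrows the problem rather than eliminating it.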