[ 
https://issues.apache.org/jira/browse/FLINK-32668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755421#comment-17755421
 ] 

Matthias Pohl commented on FLINK-32668:
---------------------------------------

If the test times out, the failure handling should be triggered (see 
[common.sh:957|https://github.com/apache/flink/blob/c8ae39d4ac73f81873e1d8ac37e17c29ae330b23/flink-end-to-end-tests/test-scripts/common.sh#L957])
 to print a meaningful message. In addition, the kill command that follows 
causes the actual test run to return with a non-zero exit code, so the script 
fails. Your scenario should therefore already be covered by the existing code.
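
To illustrate why a timed-out run already surfaces as a failure, here is a 
minimal sketch. It is not the actual common.sh code: {{run_with_watchdog}} is 
a hypothetical, simplified stand-in for {{internal_run_with_timeout}} that 
runs the command in the background instead of in a subshell.

```shell
# Hypothetical helper (an assumption for illustration, not part of common.sh).
# Usage: run_with_watchdog <timeout_in_seconds> <command> [args...]
run_with_watchdog() {
    local timeout_in_seconds="$1"
    shift

    # Start the actual command in the background and remember its pid.
    "$@" &
    local command_pid=$!

    # Watchdog: after the timeout, kill the command if it is still running.
    # Output is discarded so the watchdog never holds on to the caller's pipes.
    ( sleep "$timeout_in_seconds"; kill "$command_pid" ) > /dev/null 2>&1 &
    local watchdog_pid=$!

    # A command terminated by a signal makes wait report a non-zero status.
    local exit_code=0
    wait "$command_pid" || exit_code=$?

    # Clean up the watchdog; it may already be gone after a timeout,
    # so a failing kill is deliberately ignored here.
    kill "$watchdog_pid" 2> /dev/null || true

    return "$exit_code"
}
```

With this sketch, {{run_with_watchdog 1 sleep 5}} returns a non-zero exit 
code because the watchdog kills {{sleep}}, whereas {{run_with_watchdog 2 true}} 
returns 0. The caller can decide about success or failure from that exit code 
alone, without looking at the watchdog.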

The {{kill_test_watchdog}} function covers a separate scenario: killing the 
watchdog process if it's still around. Here, we shouldn't deal with the 
corresponding test run anymore (or infer whether the test succeeded from 
whether the watchdog process is still around). The error handling, as 
described in the previous paragraph, should be covered by the test execution 
itself.
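
If the only goal is to get rid of the confusing "no such process" output, a 
cleanup-only variant could look roughly like the sketch below. This is an 
assumption about one possible shape, not a tested patch for common.sh; the 
pid-file existence check and the wording of the messages are my additions.

```shell
# Sketch: stop the watchdog if it is still running, otherwise do nothing.
# It deliberately does not draw any conclusion about the test result.
kill_test_watchdog() {
    local pid_file="${TEST_DATA_DIR}/job_watchdog.pid"

    # Assumption: no pid file means no watchdog was ever started.
    [ -f "$pid_file" ] || return 0

    local watchdog_pid
    watchdog_pid=$(cat "$pid_file")

    # kill -0 sends no signal; it only checks whether the process exists.
    if kill -0 "$watchdog_pid" 2> /dev/null; then
        echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
        kill "$watchdog_pid"
    else
        echo "Job timeout watchdog (pid=$watchdog_pid) is no longer running; nothing to stop."
    fi
}
```

In this form the function returns successfully in both cases, so whether the 
test passes or fails is still determined solely by the exit code of the test 
run.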

Please correct me if I'm missing anything.

> fix up watchdog timeout error msg  in common.sh(e2e test) 
> ----------------------------------------------------------
>
>                 Key: FLINK-32668
>                 URL: https://issues.apache.org/jira/browse/FLINK-32668
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / CI
>    Affects Versions: 1.16.2, 1.18.0, 1.17.1
>            Reporter: Hongshun Wang
>            Assignee: Hongshun Wang
>            Priority: Minor
>         Attachments: image-2023-07-25-15-27-37-441.png
>
>
> When running an e2e test, an error like this occurs:
> !image-2023-07-25-15-27-37-441.png|width=733,height=115!
>  
> The corresponding code:
> {code:java}
> kill_test_watchdog() {
>     local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
>     echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
>     kill $watchdog_pid
> } 
> internal_run_with_timeout() {
>     local timeout_in_seconds="$1"
>     local on_failure="$2"
>     local command_label="$3"
>     local command="${@:4}"
>     on_exit kill_test_watchdog
>     (
>         command_pid=$BASHPID
>         (sleep "${timeout_in_seconds}" # set a timeout for this command
>         echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds."
>         eval "${on_failure}"
>         kill "$command_pid") & watchdog_pid=$!
>         echo $watchdog_pid > $TEST_DATA_DIR/job_watchdog.pid
>         # invoke
>         $command
>     )
> }{code}
>  
> When {{$command}} completes before the timeout, the watchdog process is 
> killed successfully. However, when {{$command}} times out, the watchdog 
> process kills {{$command}} and then exits on its own. The subsequent 
> {{kill $watchdog_pid}} in {{kill_test_watchdog}} then fails because that 
> process no longer exists, and the resulting "no such process" error message 
> is hard to understand.
>  
> So I propose changing it as follows to print a clearer error message:
>  
> {code:java}
> kill_test_watchdog() {
>     local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
>     # kill -0 only checks whether the process still exists
>     if kill -0 $watchdog_pid > /dev/null 2>&1; then
>         echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
>         kill $watchdog_pid
>     else
>         echo "[ERROR] Test timed out"
>         exit 1
>     fi
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
