[ 
https://issues.apache.org/jira/browse/KUDU-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901299#comment-17901299
 ] 

ASF subversion and git services commented on KUDU-3624:
-------------------------------------------------------

Commit 7eadba5c2d0fbcd9571c4de7d08ac6f8352338c8 in kudu's branch 
refs/heads/branch-1.18.x from Ádám Bakai
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=7eadba5c2 ]

[subprocess] KUDU-3624 Fix DoWait thread-safety

waitpid() returns an error if it is called for a child process whose
exit status has already been collected. Subprocess::DoWait() therefore
stores the result of the previous waitpid() call and returns it instead
of calling waitpid() again. However, in
EchoSubprocessTest.TestSubprocessMetricsOnError it can happen that
SubprocessServer::ExitCheckerThread() and Subprocess::KillAndWait()
both call Subprocess::DoWait() concurrently, so both of them end up
calling waitpid(). If ExitCheckerThread() is the second caller, it
fails the following check:
Check failed: _s.ok() Bad status: Runtime error: Unable
to wait on child: No child processes (error 10)
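
The failure mode described above can be reproduced outside of Kudu.
The following is a minimal sketch, not code from the Kudu tree: the
helper name ErrnoOfSecondWait is hypothetical, and it assumes a POSIX
system where SIGCHLD has its default disposition. It reaps a child
once and then calls waitpid() on the same pid again, which is what the
second caller of DoWait() effectively did.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of Kudu): forks a child that exits
// immediately, reaps it once, then calls waitpid() on the same pid again.
// Returns the errno set by the second waitpid() call, 0 if that call
// unexpectedly succeeded, or -1 if the setup (fork or first wait) failed.
int ErrnoOfSecondWait() {
  pid_t pid = fork();
  if (pid < 0) return -1;   // fork() failed
  if (pid == 0) _exit(0);   // child: exit right away

  int status = 0;
  // First wait: collects the child's exit status (reaps the child).
  if (waitpid(pid, &status, 0) != pid) return -1;

  // Second wait: there is no longer a child with this pid to wait for.
  errno = 0;
  pid_t rc = waitpid(pid, &status, 0);
  return rc == -1 ? errno : 0;
}
```

On Linux the value returned by the second call is ECHILD, which is
reported as "No child processes (error 10)" in the check failure above.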

To fix this, wait_mutex_ is added. While one thread is calling
waitpid(), other threads cannot call it at the same time. If the lock
cannot be acquired and the WaitMode is NON_BLOCKING, DoWait() returns
as if nothing happened. Unit test SubprocessTest.TestMultiThreadWait
was added to verify executing two wait calls concurrently.
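
The locking scheme described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual
Subprocess::DoWait() implementation: the class ChildWaiter, the enum
WaitResult and the helper SpawnChildExiting are hypothetical, and only
the names wait_mutex_, WaitMode and NON_BLOCKING are taken from the
commit message.

```cpp
#include <cassert>
#include <cerrno>
#include <mutex>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class WaitMode { BLOCKING, NON_BLOCKING };
enum class WaitResult { EXITED, STILL_RUNNING, ERROR };  // hypothetical

// Hypothetical stand-in for the waiting logic; not Kudu's Subprocess class.
class ChildWaiter {
 public:
  explicit ChildWaiter(pid_t pid) : pid_(pid) {}

  WaitResult Wait(WaitMode mode, int* status) {
    std::unique_lock<std::mutex> lock(wait_mutex_, std::defer_lock);
    if (mode == WaitMode::NON_BLOCKING) {
      // Another thread is inside waitpid(): report "still running"
      // instead of blocking or calling waitpid() a second time.
      if (!lock.try_lock()) return WaitResult::STILL_RUNNING;
    } else {
      lock.lock();
    }
    // The child was already reaped by an earlier call: return the
    // cached status rather than calling waitpid() again.
    if (reaped_) {
      *status = cached_status_;
      return WaitResult::EXITED;
    }
    int s = 0;
    pid_t rc;
    do {
      rc = waitpid(pid_, &s, mode == WaitMode::NON_BLOCKING ? WNOHANG : 0);
    } while (rc == -1 && errno == EINTR);  // retry if interrupted
    if (rc == 0) return WaitResult::STILL_RUNNING;  // WNOHANG, not exited yet
    if (rc == -1) return WaitResult::ERROR;
    reaped_ = true;
    cached_status_ = s;
    *status = s;
    return WaitResult::EXITED;
  }

 private:
  const pid_t pid_;
  std::mutex wait_mutex_;   // serializes waitpid() and guards the cache
  bool reaped_ = false;
  int cached_status_ = 0;
};

// Hypothetical helper: forks a child that exits with the given code.
pid_t SpawnChildExiting(int code) {
  pid_t pid = fork();
  if (pid == 0) _exit(code);
  return pid;
}
```

In this sketch the mutex is held across the waitpid() call itself, so
the cached status is always written before a second caller can read
it. A NON_BLOCKING caller that loses the race simply sees the child as
still running and can retry later.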

Change-Id: I1cb540860b439c26e1c8529123c8b29940d9f84f
Reviewed-on: http://gerrit.cloudera.org:8080/22056
Tested-by: Alexey Serbin <ale...@apache.org>
Reviewed-by: Alexey Serbin <ale...@apache.org>
(cherry picked from commit da6211df5f8df0c53ceedd542b61634f3bab7205)
Reviewed-on: http://gerrit.cloudera.org:8080/22127
Reviewed-by: Abhishek Chennaka <achenn...@cloudera.com>
Tested-by: Kudu Jenkins


> EchoSubprocessTest.TestSubprocessMetricsOnError is flaky
> --------------------------------------------------------
>
>                 Key: KUDU-3624
>                 URL: https://issues.apache.org/jira/browse/KUDU-3624
>             Project: Kudu
>          Issue Type: Sub-task
>            Reporter: Bakai Ádám
>            Assignee: Bakai Ádám
>            Priority: Major
>
> {code:java}
> I20241104 14:09:34.411099 543202 server.cc:273] Received an EOF from the subprocess
> W20241104 14:09:34.411275 543165 server.cc:408] The subprocess has exited with status 9
> I20241104 14:09:34.417060 543203 server.cc:440] outbound queue shut down: Aborted:
> I20241104 14:09:34.417075 543200 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:34.417109 543201 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:34.417068 543199 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:37.790342 543244 server.cc:273] Received an EOF from the subprocess
> F20241104 14:09:37.790630 543207 server.cc:401] Check failed: _s.ok() Bad status: Runtime error: Unable to wait on child: No child processes (error 10)
> *** Check failure stack trace: ***
> I20241104 14:09:37.790678 543242 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:37.790673 543243 server.cc:366] get failed, inbound queue shut down: Aborted:
> *** Aborted at 1730729377 (unix time) try "date -d @1730729377" if you are using GNU date ***
> I20241104 14:09:37.790684 543241 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:37.790699 543245 server.cc:440] outbound queue shut down: Aborted:
> PC: @                0x0 (unknown)
> *** SIGABRT (@0x848a9) received by PID 542889 (TID 0x704a47600700) from PID 542889; stack trace: ***
>     @     0x704a4d478980 (unknown)
>     @     0x704a4d0b3e87 gsignal
>     @     0x704a4d0b57f1 abort
>     @     0x704a4e171d8d google::LogMessage::Fail()
>     @     0x704a4e175b53 google::LogMessage::SendToLog()
>     @     0x704a4e17178c google::LogMessage::Flush()
>     @     0x704a4e172f19 google::LogMessageFatal::~LogMessageFatal()
>     @     0x704a4f855a7e kudu::subprocess::SubprocessServer::ExitCheckerThread()
>     @     0x704a4f8529bd _ZZN4kudu10subprocess16SubprocessServer4InitEvENKUlvE0_clEv
>     @     0x704a4f8568ab _ZNSt17_Function_handlerIFvvEZN4kudu10subprocess16SubprocessServer4InitEvEUlvE0_E9_M_invokeERKSt9_Any_data
>     @     0x704a4f8cb06a std::function<>::operator()()
>     @     0x704a4eff682b kudu::Thread::SuperviseThread()
>     @     0x704a4d46d6db start_thread
>     @     0x704a4d19661f clone
> zsh: abort (core dumped)  ./bin/subprocess_proxy-test --gtest_repeat=999 --gtest_break_on_failure
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
