[ https://issues.apache.org/jira/browse/KUDU-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901299#comment-17901299 ]
ASF subversion and git services commented on KUDU-3624:
-------------------------------------------------------

Commit 7eadba5c2d0fbcd9571c4de7d08ac6f8352338c8 in kudu's branch refs/heads/branch-1.18.x from Ádám Bakai
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=7eadba5c2 ]

[subprocess] KUDU-3624 Fix DoWait thread-safety

waitpid() returns an error if it is called with the pid of a process
that has already been waited on. For that reason, Subprocess::DoWait()
stores the result of the previous waitpid() call and returns it instead
of calling waitpid() again.

However, in EchoSubprocessTest.TestSubprocessMetricsOnError it can
happen that SubprocessServer::ExitCheckerThread() and
Subprocess::KillAndWait() both call Subprocess::DoWait() concurrently,
and both of them end up calling waitpid(). If ExitCheckerThread() is the
second one to call it, the following check fails:

  Check failed: _s.ok() Bad status: Runtime error: Unable to wait on child: No child processes (error 10)

To fix this behaviour, wait_mutex_ is added. While one thread is calling
waitpid(), other threads won't execute it at the same time. If the lock
cannot be acquired and the WaitMode is NON_BLOCKING, DoWait() returns as
if nothing happened.

The unit test SubprocessTest.TestMultiThreadWait was added to verify
that two wait calls can be executed concurrently.
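
The locking scheme described in the commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the Kudu patch: ChildWaiter, WaitResult and RunTwoConcurrentWaits() are hypothetical names, and the choice to let a BLOCKING caller queue on the mutex is an assumption, since the message only spells out the NON_BLOCKING case.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <mutex>
#include <thread>

// Hypothetical names throughout; this is not Kudu's Subprocess class.
enum class WaitMode { BLOCKING, NON_BLOCKING };
enum class WaitResult { EXITED, STILL_RUNNING, FAILED };

class ChildWaiter {
 public:
  explicit ChildWaiter(pid_t pid) : pid_(pid) { assert(pid_ > 0); }

  WaitResult DoWait(WaitMode mode) {
    std::unique_lock<std::mutex> lock(wait_mutex_, std::defer_lock);
    if (mode == WaitMode::NON_BLOCKING) {
      // Another thread is already inside waitpid(): behave as if the
      // child simply had not exited yet.
      if (!lock.try_lock()) return WaitResult::STILL_RUNNING;
    } else {
      // Assumption: a blocking caller queues behind the other waiter.
      lock.lock();
    }
    // Reuse the cached result instead of calling waitpid() twice.
    if (reaped_) return WaitResult::EXITED;

    int status = 0;
    pid_t rc;
    do {
      rc = waitpid(pid_, &status,
                   mode == WaitMode::NON_BLOCKING ? WNOHANG : 0);
    } while (rc == -1 && errno == EINTR);

    if (rc == 0) return WaitResult::STILL_RUNNING;  // Only with WNOHANG.
    if (rc == -1) return WaitResult::FAILED;        // E.g. ECHILD.
    reaped_ = true;
    wait_status_ = status;
    return WaitResult::EXITED;
  }

 private:
  const pid_t pid_;
  std::mutex wait_mutex_;  // Serializes waitpid(); guards the fields below.
  bool reaped_ = false;
  int wait_status_ = 0;
};

// Forks a child that exits immediately, waits on it from two threads at
// once, and returns true if both waits observed the exit without error.
bool RunTwoConcurrentWaits() {
  pid_t pid = fork();
  if (pid < 0) return false;
  if (pid == 0) _exit(0);  // Child process.

  ChildWaiter waiter(pid);
  WaitResult a = WaitResult::FAILED;
  WaitResult b = WaitResult::FAILED;
  std::thread t1([&] { a = waiter.DoWait(WaitMode::BLOCKING); });
  std::thread t2([&] { b = waiter.DoWait(WaitMode::BLOCKING); });
  t1.join();
  t2.join();
  return a == WaitResult::EXITED && b == WaitResult::EXITED;
}
```

Without the mutex and the cached flag, whichever thread calls waitpid() second could get ECHILD ("No child processes", error 10 on Linux), which matches the failure reported in the issue.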

Change-Id: I1cb540860b439c26e1c8529123c8b29940d9f84f
Reviewed-on: http://gerrit.cloudera.org:8080/22056
Tested-by: Alexey Serbin <ale...@apache.org>
Reviewed-by: Alexey Serbin <ale...@apache.org>
(cherry picked from commit da6211df5f8df0c53ceedd542b61634f3bab7205)
Reviewed-on: http://gerrit.cloudera.org:8080/22127
Reviewed-by: Abhishek Chennaka <achenn...@cloudera.com>
Tested-by: Kudu Jenkins

> EchoSubprocessTest.TestSubprocessMetricsOnError is flaky
> --------------------------------------------------------
>
> Key: KUDU-3624
> URL: https://issues.apache.org/jira/browse/KUDU-3624
> Project: Kudu
> Issue Type: Sub-task
> Reporter: Bakai Ádám
> Assignee: Bakai Ádám
> Priority: Major
>
> {code:java}
> I20241104 14:09:34.411099 543202 server.cc:273] Received an EOF from the subprocess
> W20241104 14:09:34.411275 543165 server.cc:408] The subprocess has exited with status 9
> I20241104 14:09:34.417060 543203 server.cc:440] outbound queue shut down: Aborted:
> I20241104 14:09:34.417075 543200 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:34.417109 543201 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:34.417068 543199 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:37.790342 543244 server.cc:273] Received an EOF from the subprocess
> F20241104 14:09:37.790630 543207 server.cc:401] Check failed: _s.ok() Bad status: Runtime error: Unable to wait on child: No child processes (error 10)
> *** Check failure stack trace: ***
> I20241104 14:09:37.790678 543242 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:37.790673 543243 server.cc:366] get failed, inbound queue shut down: Aborted:
> *** Aborted at 1730729377 (unix time) try "date -d @1730729377" if you are using GNU date ***
> I20241104 14:09:37.790684 543241 server.cc:366] get failed, inbound queue shut down: Aborted:
> I20241104 14:09:37.790699 543245 server.cc:440] outbound queue shut down: Aborted:
> PC: @ 0x0 (unknown)
> *** SIGABRT (@0x848a9) received by PID 542889 (TID 0x704a47600700) from PID 542889; stack trace: ***
> @ 0x704a4d478980 (unknown)
> @ 0x704a4d0b3e87 gsignal
> @ 0x704a4d0b57f1 abort
> @ 0x704a4e171d8d google::LogMessage::Fail()
> @ 0x704a4e175b53 google::LogMessage::SendToLog()
> @ 0x704a4e17178c google::LogMessage::Flush()
> @ 0x704a4e172f19 google::LogMessageFatal::~LogMessageFatal()
> @ 0x704a4f855a7e kudu::subprocess::SubprocessServer::ExitCheckerThread()
> @ 0x704a4f8529bd _ZZN4kudu10subprocess16SubprocessServer4InitEvENKUlvE0_clEv
> @ 0x704a4f8568ab _ZNSt17_Function_handlerIFvvEZN4kudu10subprocess16SubprocessServer4InitEvEUlvE0_E9_M_invokeERKSt9_Any_data
> @ 0x704a4f8cb06a std::function<>::operator()()
> @ 0x704a4eff682b kudu::Thread::SuperviseThread()
> @ 0x704a4d46d6db start_thread
> @ 0x704a4d19661f clone
> zsh: abort (core dumped) ./bin/subprocess_proxy-test --gtest_repeat=999 --gtest_break_on_failure
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)