Tanya-W opened a new issue, #16606:
URL: https://github.com/apache/doris/issues/16606

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   master
   
   ### What's Wrong?
   
   # There are two problems
   ## one:
   - Problem phenomenon:
   All threads in the fragment thread pool are stuck in
   `VNodeChannel::close_wait`, calling `std::this_thread::sleep_for` while
   waiting for the channel to finish.
   
   - Problem analysis:
   `FragmentMgr::_exec_actual` calls `exec_state->execute()` but does not check
   its return status. If `_executor.open()` fails, nothing actively calls
   cancel. After `exec_state->execute()` returns, `FragmentMgr::_exec_actual`
   also erases the `fragment_instance_id` from `_fragment_map`, so the fragment
   can no longer be cancelled through the timeout path either. When
   `exec_state` is destructed, `VNodeChannel::close_wait` is called, and it
   loops on `std::this_thread::sleep_for` waiting for the channel to finish.
   Because the executor failed to open and was never cancelled,
   `_add_batches_finished` and `_cancelled` both stay false, so the thread
   hangs.
   
   The error status is not handled, and the fragment instance id is erased from
   the map directly:
   ```
   void FragmentMgr::_exec_actual(std::shared_ptr<FragmentExecState> exec_state, FinishCallback cb) {
   ...
       exec_state->execute();
   
   ...
   
       // remove exec state after this fragment finished
       {
           std::lock_guard<std::mutex> lock(_lock);
           _fragment_map.erase(exec_state->fragment_instance_id());
   ...
       }
   
   ...
   }
   ```
   
   The error status is not handled; only a warning is logged:
   ```
   Status FragmentExecState::execute() {
   ...
   
       {
   ...
   
           WARN_IF_ERROR(_executor.open(),
                         strings::Substitute("Got error while opening fragment $0, query id: $1",
                                             print_id(_fragment_instance_id), print_id(_query_id)));
   
   ...
       }
   
   ...
       return Status::OK();
   }
   ```
   
   ```
   Status VNodeChannel::close_wait(RuntimeState* state) {
   ...
   
       // waiting for finished, it may take a long time, so we couldn't set a timeout
       while (!_add_batches_finished && !_cancelled) {
           std::this_thread::sleep_for(std::chrono::milliseconds(1));
       }
   
   ...
   }
   ```
   
   
   ## two:
   - Problem phenomenon:
   The bthread workers are exhausted, so BE cannot receive new RPC requests and
   RPCs sent by FE time out.
   
   - Problem analysis:
   When a brpc request reaches `FragmentMgr::exec_plan_fragment` on BE and the
   pthread pool is full, submitting the task to the thread pool fails. The
   bthread handling the request then has to destruct the local variable
   `std::shared_ptr<FragmentExecState> exec_state`, which calls
   `VNodeChannel::close_wait`. `VNodeChannel::close_wait` waits with
   `std::this_thread::sleep_for`, and when `_add_batches_finished` and
   `_cancelled` stay false the bthread cannot be switched out in time. This
   exhausts the bthread workers, and BE can no longer receive new RPC requests.
   
   ```
   Status VNodeChannel::close_wait(RuntimeState* state) {
   ...
   
       // waiting for finished, it may take a long time, so we couldn't set a timeout
       while (!_add_batches_finished && !_cancelled) {
           std::this_thread::sleep_for(std::chrono::milliseconds(1));
       }
   
   ...
   }
   ```
   
   ### What You Expected?
   
   ## For problem one:
   Handle the return status of `exec_state->execute()` in
   `FragmentMgr::_exec_actual`.
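
A minimal sketch of what handling the status could look like. It is not the Doris code: `Status`, `ExecState`, `exec_actual` and `close_wait_would_spin` are hypothetical stand-ins for `doris::Status`, `FragmentExecState`, `FragmentMgr::_exec_actual` and the loop condition in `VNodeChannel::close_wait`. The only point is that `execute()` propagates the open error and the caller cancels before the state is dropped.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for doris::Status, reduced to what the sketch needs.
struct Status {
    bool is_ok;
    std::string msg;
    static Status OK() { return Status{true, ""}; }
    static Status InternalError(std::string m) { return Status{false, std::move(m)}; }
    bool ok() const { return is_ok; }
};

// Hypothetical stand-in for FragmentExecState.
class ExecState {
public:
    explicit ExecState(bool open_fails) : _open_fails(open_fails) {}

    // Assumption: the open error is returned to the caller instead of
    // only being logged.
    Status execute() {
        if (_open_fails) {
            return Status::InternalError("open failed");
        }
        _add_batches_finished = true;
        return Status::OK();
    }

    void cancel() { _cancelled = true; }

    // Same condition as the wait loop quoted in the issue.
    bool close_wait_would_spin() const {
        return !_add_batches_finished && !_cancelled;
    }

private:
    bool _open_fails;
    std::atomic<bool> _add_batches_finished{false};
    std::atomic<bool> _cancelled{false};
};

// Sketch of the caller: check the status and cancel on failure, so the
// later close_wait-style loop has an exit condition.
Status exec_actual(ExecState& exec_state) {
    Status st = exec_state.execute();
    if (!st.ok()) {
        exec_state.cancel();
    }
    // Either the fragment finished or it was cancelled.
    assert(!exec_state.close_wait_would_spin());
    // ... removing the state from the fragment map would follow here ...
    return st;
}
```

With this shape, a failed open leaves `_cancelled` set before the state is destructed, so the wait loop terminates instead of spinning.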
   
   ## For problem two:
   - Define the member variable `_cancelled` in `FragmentExecState` as atomic.
   - Use `bthread_usleep` instead of `std::this_thread::sleep_for` in
   `VNodeChannel::close_wait`.
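
The two points can be sketched together as follows. This is an illustration under assumptions, not the actual patch: `Channel`, `SleepFn` and `std_sleep_us` are hypothetical names, and the sleep call is passed in as a hook so the sketch compiles without brpc. In a brpc build the hook would presumably wrap `bthread_usleep`, which suspends only the calling bthread and lets the worker run other bthreads.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Sleep hook taking microseconds. Assumption: production code would pass a
// wrapper around bthread_usleep here.
using SleepFn = std::function<void(uint64_t)>;

// Fallback hook that blocks the whole OS thread; this is the behaviour the
// issue wants to avoid on bthread workers.
inline void std_sleep_us(uint64_t us) {
    std::this_thread::sleep_for(std::chrono::microseconds(us));
}

// Hypothetical stand-in for the channel / exec state pair.
class Channel {
public:
    void cancel() { _cancelled.store(true); }
    void mark_finished() { _add_batches_finished.store(true); }

    // Polls until finished or cancelled; returns how many times it slept.
    int close_wait(const SleepFn& sleep_us) {
        assert(sleep_us && "sleep hook must be callable");
        int polls = 0;
        while (!_add_batches_finished.load() && !_cancelled.load()) {
            sleep_us(1000);  // 1 ms, as in the original loop
            ++polls;
        }
        return polls;
    }

private:
    // Atomic flags so a cancel from another thread is visible to the loop.
    std::atomic<bool> _add_batches_finished{false};
    std::atomic<bool> _cancelled{false};
};
```

Making both flags `std::atomic<bool>` is what lets a cancel issued from another thread end the loop; the sleep hook only decides whether the waiting worker is blocked or yielded in the meantime.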
   
   ### How to Reproduce?
   
   _No response_
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org