zhiqiang-hhhh opened a new issue, #23704:
URL: https://github.com/apache/doris/issues/23704

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   ### Problem background
   
   As the title suggests, in the current version (2.0), when a node (FE or BE) in the Doris cluster restarts or goes down, queries involving that node cannot be cancelled promptly by the rest of the cluster; they are only cancelled after the query times out (30 minutes).
   
   ### Overall idea
   The essence of this problem is how BE/FE nodes can perceive changes in the status of other nodes in the cluster. Once some mechanism reliably delivers the event "an FE/BE process restarted or went down" to every node in the cluster, the rest reduces to an in-memory bookkeeping problem, which is easy to solve:
   
   1. For FE
   Each FE records in memory all queries it has initiated, together with which BE each fragment of those queries runs on. If the FE observes that a BE in the cluster has gone down or restarted, it only needs to check whether any of its queries has a fragment on the faulty BE; if so, the FE can actively issue cancel_query(list of queryIds). If the FE observes that another FE has restarted: the master broadcasts an "FE x failed" event to all BEs, while a follower can choose to wait for leader election or adopt another strategy. (A minimal sketch of this bookkeeping follows below.)
   2. For BE
   There is currently no heartbeat between BEs, and a BE, being only the execution carrier of fragments, does not know how a query is distributed across the cluster. So today a BE can only do one thing actively: release the resources in its own process for the query IDs passed down by FE. That does not cover the case where an FE fails. This case can be handled by adding a field to the query parameters FE sends to BE: the originating FE's ip:port plus an FE start UUID. Each FE generates a start UUID at startup and keeps it in memory. Once an FE restarts or fails, the other surviving FEs broadcast an event to all BEs in the cluster carrying the field: the failed FE's ip:port plus the new start UUID. On receiving this event, a BE first checks whether it is currently executing any query whose originating FE ip:port equals the failed FE's ip:port; if so, it further compares the FE start UUID to decide whether the query must be cancelled (this avoids mistakenly cancelling new queries issued after the FE restart, due to timing differences around the restart).
   
   In short, once we can broadcast node status changes to all nodes in the cluster, the subsequent query cancellation becomes a purely in-memory bookkeeping problem.
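   
   For case 1 above, the FE-side bookkeeping could look roughly like the following. This is a minimal, hypothetical Java sketch: `QueryFragmentTracker`, `registerQuery`, and `cancelQuery` are illustrative names, not existing Doris classes; in Doris the actual cancellation would go through the query's Coordinator.
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.stream.Collectors;
   
   // Hypothetical sketch: each FE remembers, for every query it coordinates,
   // which BEs host fragments of that query.
   public class QueryFragmentTracker {
       // queryId -> set of BE identifiers (e.g. "host:port") running fragments of this query
       private final Map<String, Set<String>> queryToBackends = new ConcurrentHashMap<>();
   
       public void registerQuery(String queryId, Set<String> backends) {
           queryToBackends.put(queryId, backends);
       }
   
       public void unregisterQuery(String queryId) {
           queryToBackends.remove(queryId);
       }
   
       // Called when this FE observes (via heartbeat/oplog) that a BE went down or restarted.
       public List<String> onBackendDown(String backend) {
           // Collect every query that has at least one fragment on the faulty BE ...
           List<String> affected = queryToBackends.entrySet().stream()
                   .filter(e -> e.getValue().contains(backend))
                   .map(Map.Entry::getKey)
                   .collect(Collectors.toList());
           // ... and cancel them actively instead of waiting for the query timeout.
           affected.forEach(this::cancelQuery);
           return affected;
       }
   
       private void cancelQuery(String queryId) {
           // Placeholder: in Doris this would go through the query's Coordinator.
           queryToBackends.remove(queryId);
       }
   }
   ```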
   
   ### Implementation plan
   #### Current situation
   In simplified form, the current overall control and data model of Doris is as follows:
   
   Control flow:
   The FE (master) process keeps the cluster's real-time, full metadata in memory. On startup the master creates a HeartbeatMgr thread, which sends heartbeat requests to all nodes in the cluster except itself and then asynchronously waits for the heartbeat results. For each heartbeat result it judges whether the node's status has changed; if the status changed or the heartbeat failed, this information is written as an oplog into BDBJE. FE (follower) asynchronously replays the oplogs since the last checkpoint into its own memory.
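   
   As a rough illustration of this control flow (not the actual HeartbeatMgr code), the master-side loop can be sketched as below; `HeartbeatResult`, `Node`, and `EditLog` are placeholders for the corresponding Doris concepts.
   ```java
   import java.util.function.Function;
   
   // Illustrative sketch of the master-side heartbeat loop described above;
   // all nested types are placeholders, not the real Doris classes.
   class HeartbeatLoopSketch {
       static class HeartbeatResult {
           final boolean ok;
           HeartbeatResult(boolean ok) { this.ok = ok; }
       }
       interface Node {
           String id();
           boolean applyHeartbeat(HeartbeatResult result); // returns true if the node's status changed
       }
       interface EditLog {
           void logNodeChange(String nodeId, HeartbeatResult result); // persisted as an oplog in BDBJE
       }
   
       private final EditLog editLog;
       HeartbeatLoopSketch(EditLog editLog) { this.editLog = editLog; }
   
       // One heartbeat round: probe every node except ourselves and persist status changes.
       void runOnce(Iterable<Node> allNodesExceptSelf, Function<Node, HeartbeatResult> sendHeartbeat) {
           for (Node node : allNodesExceptSelf) {
               HeartbeatResult result = sendHeartbeat.apply(node); // asynchronous in the real HeartbeatMgr
               boolean changed = node.applyHeartbeat(result);
               if (changed || !result.ok) {
                   editLog.logNodeChange(node.id(), result); // followers replay this oplog into their memory
               }
           }
       }
   }
   ```
   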
   #### Modification
   Before proposing modifications, some assumptions need to be made:
   - FE leader-election failures (no leader / multiple leaders) are out of scope, because a leader-election failure is far more severe, and far higher priority to solve, than BE resources not being released.
   - The BDBJE log synchronization mechanism is assumed to be reasonably stable and able to complete oplog synchronization within a reasonable time interval.
   - The interval from an FE process shutting down to restarting, and then to the "QE service start" log line being printed, is assumed to be long enough that no FE node restarts multiple times within a single heartbeat interval.
   
   ##### Solve FE restart
   FE generates a UUID at startup; this UUID identifies one FE start, so a restart produces a new one. FE (follower) carries this UUID in its heartbeat response to FE (master), and the heartbeat request FE (master) sends to each BE contains the startup UUIDs of all FE nodes in the cluster.
   If some FE (follower) fails to return a heartbeat response, FE (master) generates a temporary UUID for that FE and sends this temporary UUID to the BEs.
   
   When a BE receives the heartbeat request from FE (master), it compares the frontend_infos in the request with the previous frontend_infos saved in its memory. If it finds that the startup UUID of some FE has changed, it locally cancels all fragments of queries that originated from that FE.
   If the restarted node is the master, it first has to wait for leader election to succeed. After the election succeeds, the heartbeat request sent to all BEs will contain the startup UUID of the restarted node (or mark that node as removed), and BE follows the same processing logic as above.
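   
   In Doris the BE is written in C++, so the following Java-style sketch only illustrates the comparison logic; `FrontendInfoChecker`, `onMasterHeartbeat`, and `cancelQueriesFromFe` are hypothetical names.
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Hypothetical sketch of the BE-side check described above (the real BE is C++).
   public class FrontendInfoChecker {
       // key: FE "ip:port", value: the FE's startup UUID (or a temporary UUID assigned by the master)
       private Map<String, String> lastFrontendInfos = new HashMap<>();
   
       // Called when a heartbeat from FE (master) arrives carrying the current FE startup UUIDs.
       public void onMasterHeartbeat(Map<String, String> currentFrontendInfos) {
           for (Map.Entry<String, String> entry : currentFrontendInfos.entrySet()) {
               String previousUuid = lastFrontendInfos.get(entry.getKey());
               if (previousUuid != null && !previousUuid.equals(entry.getValue())) {
                   // The FE restarted (new startup UUID): cancel fragments of queries
                   // whose originating FE ip:port and start UUID match the old instance.
                   cancelQueriesFromFe(entry.getKey(), previousUuid);
               }
           }
           lastFrontendInfos = new HashMap<>(currentFrontendInfos);
       }
   
       private void cancelQueriesFromFe(String feAddress, String oldStartUuid) {
           // Placeholder: release fragment resources for queries carrying
           // originating FE == feAddress and FE start UUID == oldStartUuid.
       }
   }
   ```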
   
   ##### Solve BE restart
   The current BE already reports its start time in the heartbeat response to FE, and FE compares it with the last recorded start time:
   ```java
   public class Backend {
       ...
       public boolean handleHbResponse(BackendHbResponse hbResponse, boolean 
isReplay) {
           ...
           if (this.lastStartTime != hbResponse.getBeStartTime() && 
hbResponse.getBeStartTime() > 0) {
               LOG.info("{} update last start time to {}", this.toString(), 
hbResponse.getBeStartTime());
               this.lastStartTime = hbResponse.getBeStartTime();
               isChanged = true;
           }
           ...
           return isChanged;
       }       
   }
   ```
   When FE (master) detects a BE status change, it generates an oplog for this heartbeat, so that when FE (follower) replays the log it learns that some BE has restarted. Since each FE keeps a BE -> coordinator mapping in memory, the follower only needs to call Coordinator.cancel for the affected queries (see the sketch below).
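   
   A minimal sketch of the follower-side handling, assuming a hypothetical `backendToCoordinators` map kept alongside the existing coordinator bookkeeping; `Coordinator` here stands in for the real class:
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.CopyOnWriteArrayList;
   
   // Hypothetical sketch of the follower-side reaction when replaying a "BE restarted" oplog.
   public class BackendRestartHandler {
       interface Coordinator { void cancel(); }
   
       // backend id -> coordinators of queries that have fragments on that backend
       private final Map<Long, List<Coordinator>> backendToCoordinators = new ConcurrentHashMap<>();
   
       public void registerCoordinator(long backendId, Coordinator coord) {
           backendToCoordinators.computeIfAbsent(backendId, k -> new CopyOnWriteArrayList<>()).add(coord);
       }
   
       // Called when the replayed heartbeat oplog shows that lastStartTime of a BE changed.
       public void onBackendRestarted(long backendId) {
           List<Coordinator> coords = backendToCoordinators.remove(backendId);
           if (coords != null) {
               // Cancel immediately instead of waiting for the query timeout.
               coords.forEach(Coordinator::cancel);
           }
       }
   }
   ```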
   
   
   ### Solution
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

