zhiqiang-hhhh opened a new issue, #23704: URL: https://github.com/apache/doris/issues/23704
### Search before asking - [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues. ### Description ### Problem background As the title suggests, in the current version (2.0), when a node (FE + BE) in the doris cluster restarts or hangs up, the query undertaken by that node cannot be cancelled in time within the cluster, and can only be cancelled after the query times out (30 minutes). ### Overall idea The essence of this problem is actually how BE/FE can perceive the changes in the node status in the cluster. Because once we can use some mechanism to correctly send the event of a FE/BE process restart/hang up to all nodes in the cluster, then the problem will become a memory computation problem, which is easy to solve: 1. For FE Each FE memory records all queries initiated from the current FE, as well as which be each query's fragment is on. If FE observes that a BE in the cluster hangs up/restarts, it only needs to judge whether there is a query fragment that falls on the faulty be. If so, FE can actively initiate cancel_query(list of queryIds); if FE observes that a FE restarts, if it is master, it broadcasts an event FEx fault to all BEs, if it is follower, it can choose to wait for leader election/or other strategies. 2. For BE There is currently no heartbeat mechanism between BE and BE, and as the execution carrier of the fragment, it does not know the distribution of a query in the cluster. Therefore, at present be can only do one thing actively: release the relevant resources in the current process according to the query id passed by FE. But this cannot handle the situation where a FE fails. This situation can be solved by adding a field in the query parameters sent by FE to BE: original fe ip:port + fe start uuid. Each fe starts and generates a fe start start id saved in memory. Once a fe restarts/fails, then other surviving fe broadcast an event to all be in the cluster, which records a field: breakdown fe ip:port + new start id. After receiving this event, be first judges whether there is a query with original fe ip:port == breakdown fe ip:port in the current execution, and if so, further judges whether a query needs to be cancelled by fe start uuid (to solve the problem of wrong cancellation of ne w queries due to fe restart time difference). In short, when we can implement the mechanism of broadcasting node status changes to all nodes in the cluster, subsequent query cancellation will actually become a memory computation problem. ### Implementation plan #### Current situation Simplified, the current doris overall control and data model are as follows: Control flow: FE (master) process memory saves the cluster's real-time full metadata. The master process starts and creates a HearbeatMgr thread, which sends heartbeat requests to all nodes in the cluster except itself, and then waits asynchronously for heartbeat results. For each heartbeat result obtained, judge whether the node has changed status. If the node status changes or fails to get heartbeat, then write this information as oplog into bdb. FE (follower) will asynchronously replay oplog after last checkpoint and replay oplog to its own memory. #### Modification Before proposing modifications, some assumptions need to be made: - Do not consider FE leader election failure related (no leader/multiple leaders) situations, because FE leader election failure is a serious level and solution priority far higher than BE resource not being released problem - Consider based on bdbje log synchronization mechanism relatively stable, can complete oplog synchronization within a reasonable time interval - Consider FE node from process shutdown to restart, and then to log print Qe service start this time interval long enough, will not appear within one heartbeat interval some FE node multiple restarts situation ##### Solve FE restart FE generates a UUID at startup stage, this UUID is used to identify a FE restart action. FE (folloer) and FE (master) heartbeat response will carry this UUID. And FE (master) sends heartbeat request to BE will contain all nodes' startup uuid in cluster. If there is some FE (follower) unable to return heartbeat response, then FE (master) needs to generate a temporary uuid for that FE and send this temporary uuid to BE. When BE receives the heartbeat request from FE (master), it needs to compare the current frontend_infos with the previous frontend_infos saved in memory. If it finds that the startup uuid of some fe has changed, it cancels all fragments related to that query locally. If the restart node is master, it needs to wait for leader election to succeed first. After leader election succeeds, the heartbeat request sent to all BEs will contain the startup uuid of the restart node (or the node is deleted). BE continues the previous processing logic. ##### Solve BE restart The current BE has BE start time in the heartbeat response to FE, and FE will compare the two start times ``` public class Backend { ... public boolean handleHbResponse(BackendHbResponse hbResponse, boolean isReplay) { ... if (this.lastStartTime != hbResponse.getBeStartTime() && hbResponse.getBeStartTime() > 0) { LOG.info("{} update last start time to {}", this.toString(), hbResponse.getBeStartTime()); this.lastStartTime = hbResponse.getBeStartTime(); isChanged = true; } ... return isChanged; } } ``` When FE detects BE status change, it will generate an oplog for this heartbeat, so that when FE (follower) replays this log, it can know that some BE has restarted. FE memory retains be -> coordinator information, follower only needs to call Coordinator.cancel. ### Solution _No response_ ### Are you willing to submit PR? - [X] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org