github-actions[bot] commented on code in PR #60812:
URL: https://github.com/apache/doris/pull/60812#discussion_r2882336531
##########
be/src/runtime/fragment_mgr.cpp:
##########
@@ -674,6 +678,18 @@ void FragmentMgr::remove_pipeline_context(std::pair<TUniqueId, int> key) {
 }
 void FragmentMgr::remove_query_context(const TUniqueId& key) {
+    // Clean up any saved rerunnable params for this query to avoid memory leaks.
+    // This covers both cancel and normal destruction paths.
+    {
+        std::lock_guard<std::mutex> lk(_rerunnable_params_lock);
+        for (auto it = _rerunnable_params_map.begin(); it != _rerunnable_params_map.end();) {
+            if (it->first.first == key) {
+                it = _rerunnable_params_map.erase(it);
+            } else {
+                ++it;
Review Comment:
**Bug: Potential use-after-free of raw pointer**
`query_ctx` is a raw pointer obtained from `fragment_ctx->get_query_ctx()`.
After `fragment_ctx.reset()` (line 1497) and `remove_pipeline_context()` (line
1498) drop the PFC's `shared_ptr<QueryContext>`, the `QueryContext` lifetime
depends on other owners — primarily the `shared_ptr<QueryContext>` in
`_rerunnable_params_map` and `_query_ctx_map_delay_delete`.
If a concurrent `remove_query_context()` (triggered by query cancellation)
runs at this point, it erases the `_rerunnable_params_map` entries (lines
683-691) and the `_query_ctx_map_delay_delete` entry (line 693), potentially
destroying the `QueryContext` while the raw pointer is still in use (in the
lambda at line 1505).
Suggested fix: Use a `shared_ptr<QueryContext>` to keep the `QueryContext`
alive through this scope, similar to how the `recreate_and_submit` stage does
it:
```cpp
auto q_ctx = get_query_ctx(query_id);
if (!q_ctx) {
return Status::NotFound(...);
}
SCOPED_ATTACH_TASK(q_ctx.get());
```
Or extract it from the PFC before resetting:
```cpp
auto query_ctx = fragment_ctx->shared_query_ctx(); // if such accessor exists
```
##########
gensrc/proto/internal_service.proto:
##########
@@ -75,11 +75,10 @@ message PTransmitRecCTEBlockResult {
 message PRerunFragmentParams {
     enum Opcode {
-        wait = 1; // wait fragment execute done
-        release = 2; // release current round's resource
-        rebuild = 3; // rebuild next round pipeline tasks
-        submit = 4; // submit tasks to execute
-        close = 5; // close fragment
+        wait_for_close = 1; // wait for all tasks to close, without sending report or removing PFC
+        wait_for_destroy = 2; // deregister RF, destroy old PFC, wait for external threads to release refs
Review Comment:
**Compatibility: Enum value remapping may break rolling upgrades**
The old enum values on master are:
```
wait = 1; release = 2; rebuild = 3; submit = 4; close = 5;
```
The new values reuse 1-4 with different semantics:
```
wait_for_close = 1; wait_for_destroy = 2; recreate_and_submit = 3;
final_close = 4;
```
During a rolling upgrade of a Doris cluster, a node running the old code that
receives `wait_for_destroy = 2` would interpret it as `release = 2`, leading
to incorrect behavior. Similarly, `recreate_and_submit = 3` would be
interpreted as `rebuild = 3`.
If recursive CTE queries run during a rolling upgrade, this could cause
subtle failures. Consider:
1. Using higher enum values (e.g., starting at 10) to avoid overlap, or
2. Adding a version/compatibility check, or
3. Documenting that recursive CTE queries should not be run during rolling
upgrades.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]