zclllyybb commented on issue #64708: URL: https://github.com/apache/doris/issues/64708#issuecomment-4769903567
Breakwater-GitHub-Analysis-Slot: slot_f98169949cb7

This content is generated by AI for reference only.

Initial triage: this report looks credible and should be treated as a BE write-path availability bug, not as a normal backend liveness issue.

The strongest evidence in the issue is that writes time out on `:8060`, including loopback/self cases, while raw TCP is still open and the target-side `PInternalService::open_load_stream` entry log is absent. If those attached logs match the described window, the request is not reaching the Doris `open_load_stream` handler and the stall is below the Doris load-stream service logic.

Code check against the 4.0.6 tag/commit reported here also shows a Doris-side recovery gap:

- In 4.0.6, `VTabletWriterV2::_open_streams_to_backend()` opens V2 load streams through `ExecEnv::brpc_streaming_client_cache()`, not the ordinary internal brpc cache.
- `LoadStreamStub::open()` creates a brpc stream, sets `open_load_stream_timeout_ms` to 60000 ms, gets the cached streaming stub, and calls `open_load_stream()` synchronously. On `cntl.Failed()`, it closes the stream and returns the error, but it does not evict the cached streaming stub.
- `FailureDetectChannel` only marks a cached channel unhealthy when `cntl->ErrorCode() == EHOSTDOWN`; a plain `[E1008] Reached timeout` does not poison the cached channel.
- `enable_brpc_connection_check` does not appear to cover the V2 streaming cache. In 4.0.6 it checks query-context brpc stubs and erases `brpc_internal_client_cache()`. I did not find a corresponding `available()` / `erase()` path for `brpc_streaming_client_cache()`.

That makes the reporter's observation internally consistent: once a cached streaming-channel/socket gets into a state where `open_load_stream` only times out, later V2 load-stream opens can keep reusing the same cached stub instead of forcing a new channel/socket.
It also explains why a full BE-fleet restart clears the mesh and why a single-BE restart may not, because peers can still hold their cached client-side streaming stubs.

I would not yet state that the exact root cause is proven to be Apache brpc #1168. The brpc hypothesis is plausible from the symptom, but the current public evidence pins the first actionable Doris-side issue to missing invalidation/recreation of the streaming brpc client on load-stream open timeout. A maintainer can investigate this without needing to prove the exact brpc internal line first.

I also checked the likely upgrade pointer: 4.1.2 still uses bundled brpc `1.4.0`, and the same streaming-cache open path and `EHOSTDOWN`-only channel invalidation pattern are still present. Current `branch-4.0`, `branch-4.1`, and `master` also still show brpc `1.4.0` and no obvious streaming-cache eviction on this path. So 4.1.2 should still be tested, but I would not present it as a known brpc-version fix based on source inspection alone.

Suggested next maintainer actions:

1. Reproduce or fault-inject an `open_load_stream` timeout on the V2 path and verify whether the next V2 open reuses the same `brpc_streaming_client_cache()` entry.
2. Add cache invalidation for `brpc_streaming_client_cache()` when `LoadStreamStub::open()` fails with network/open-timeout style errors, or broaden `FailureDetectChannel` health marking for this open-RPC path beyond only `EHOSTDOWN`.
3. Add a regression/fault-injection test that forces one load-stream open failure and asserts the following open uses a fresh channel and does not wedge future writes.
4. Keep `experimental_enable_single_replica_insert` documented only as a mitigation; it reduces multi-replica load-stream exposure but does not fix the cached broken streaming-channel recovery gap.
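
To make actions 2 and 3 concrete, here is a minimal, self-contained toy model of the recovery gap. It is a sketch under stated assumptions, not Doris or brpc code: `ToyChannel`, `ToyStreamingCache`, `EvictPolicy`, `should_evict`, and the free function `open_load_stream` below are all hypothetical names invented for illustration, and the constant `1008` only mirrors the `[E1008] Reached timeout` text quoted in the report. It assumes a platform where `<cerrno>` defines `EHOSTDOWN` (for example Linux).

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <memory>
#include <string>

// Hypothetical constant: mirrors the "[E1008] Reached timeout" in the report.
constexpr int kRpcTimedOut = 1008;

// Toy channel: once "wedged", every open attempt times out.
struct ToyChannel {
    int id;
    bool wedged;
    int open() const { return wedged ? kRpcTimedOut : 0; }
};

// Toy stand-in for a per-host cache of streaming stubs.
class ToyStreamingCache {
public:
    // Returns the cached channel for `host`, creating a healthy one on a miss.
    std::shared_ptr<ToyChannel> get(const std::string& host) {
        auto it = _stubs.find(host);
        if (it != _stubs.end()) return it->second;
        auto ch = std::make_shared<ToyChannel>(ToyChannel{_next_id++, false});
        _stubs[host] = ch;
        return ch;
    }
    void erase(const std::string& host) { _stubs.erase(host); }
    // Fault-injection hook: put the currently cached channel into the wedged state.
    void wedge(const std::string& host) { get(host)->wedged = true; }

private:
    std::map<std::string, std::shared_ptr<ToyChannel>> _stubs;
    int _next_id = 0;
};

// HostDownOnly models the behavior described above; HostDownOrTimeout models
// the broadened invalidation proposed in action 2.
enum class EvictPolicy { HostDownOnly, HostDownOrTimeout };

inline bool should_evict(int error_code, EvictPolicy policy) {
    if (error_code == EHOSTDOWN) return true;
    return policy == EvictPolicy::HostDownOrTimeout && error_code == kRpcTimedOut;
}

// Opens through the cache and evicts the entry when the policy says so.
// Returns 0 on success, otherwise the error code of the failed open.
inline int open_load_stream(ToyStreamingCache& cache, const std::string& host,
                            EvictPolicy policy) {
    auto ch = cache.get(host);
    int rc = ch->open();
    if (rc != 0 && should_evict(rc, policy)) cache.erase(host);
    return rc;
}
```

In this model, repeated opens under `EvictPolicy::HostDownOnly` keep returning the timeout code because the wedged entry is never replaced, while under `EvictPolicy::HostDownOrTimeout` the first failing open evicts the entry and the next open gets a fresh channel. A real regression test would need to inject the failure into the actual streaming cache path rather than a toy channel.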
Information that would make the public diagnosis stronger:

- For one affected load id / txn id, provide the target BE `be.INFO` window proving there is no `open load stream, load_id=...` handler-entry line while the coordinator logs the timeout.
- Capture `rpcz` or equivalent brpc socket-level state with diagnostics enabled before reproducing; runtime enabling apparently did not work on this build.
- Provide a minimal scripted reproducer if possible, or the exact sequence of CTAS / INSERT / UPDATE statements with table DDL, concurrency, and the load ids that correspond to the attached stacks.
- If the cluster is still available, capture `gdb -p <doris_be_pid> -batch -ex "thread apply all bt"` plus `ss -tanp` for the affected BE pairs during the wedge, including loopback.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
