zclllyybb commented on issue #64708:
URL: https://github.com/apache/doris/issues/64708#issuecomment-4769903567

   Breakwater-GitHub-Analysis-Slot: slot_f98169949cb7
   This content is generated by AI for reference only.
   
   Initial triage: this report looks credible and should be treated as a BE 
write-path availability bug, not as a normal backend liveness issue. The 
strongest evidence in the issue is that writes time out on `:8060`, including 
loopback/self cases, while raw TCP is still open and the target-side 
`PInternalService::open_load_stream` entry log is absent. If those attached 
logs match the described window, the request is not reaching the Doris 
`open_load_stream` handler and the stall is below the Doris load-stream service 
logic.
   
   Code check against the 4.0.6 tag/commit reported here also shows a 
Doris-side recovery gap:
   
   - In 4.0.6, `VTabletWriterV2::_open_streams_to_backend()` opens V2 load 
streams through `ExecEnv::brpc_streaming_client_cache()`, not the ordinary 
internal brpc cache.
   - `LoadStreamStub::open()` creates a brpc stream, sets 
`open_load_stream_timeout_ms` to 60000 ms, gets the cached streaming stub, and 
calls `open_load_stream()` synchronously. On `cntl.Failed()`, it closes the 
stream and returns the error, but it does not evict the cached streaming stub.
   - `FailureDetectChannel` only marks a cached channel unhealthy when 
`cntl->ErrorCode() == EHOSTDOWN`; a plain `[E1008] Reached timeout` does not 
poison the cached channel.
   - `enable_brpc_connection_check` does not appear to cover the V2 streaming 
cache. In 4.0.6 it checks query-context brpc stubs and erases entries from 
`brpc_internal_client_cache()`. I did not find a corresponding `available()` / 
`erase()` path for `brpc_streaming_client_cache()`.
   
   That makes the reporter's observation internally consistent: once a cached 
streaming-channel/socket gets into a state where `open_load_stream` only times 
out, later V2 load-stream opens can keep reusing the same cached stub instead 
of forcing a new channel/socket. It also explains why a full BE-fleet restart 
clears the mesh and why a single-BE restart may not, because peers can still 
hold their cached client-side streaming stubs.
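   
   To make the reuse behaviour concrete, here is a minimal toy model of a stub 
cache whose health marking reacts only to a host-down error. It is a sketch 
under stated assumptions, not Doris or brpc code: `ToyStub`, `ToyStubCache`, 
`stub_id_after_error`, and both error constants are hypothetical names 
introduced for illustration only.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-ins, not Doris or brpc definitions.
constexpr int kErrHostDown = 112;     // plays the role of EHOSTDOWN
constexpr int kErrRpcTimeout = 1008;  // plays the role of "[E1008] Reached timeout"

struct ToyStub {
    int id = 0;
    bool healthy = true;
};

class ToyStubCache {
public:
    // Returns the cached stub for addr. A new stub is created only when
    // there is no entry yet or the cached entry was marked unhealthy.
    std::shared_ptr<ToyStub> get(const std::string& addr) {
        auto it = _stubs.find(addr);
        if (it != _stubs.end() && it->second->healthy) {
            return it->second;
        }
        auto stub = std::make_shared<ToyStub>();
        stub->id = _next_id++;
        _stubs[addr] = stub;
        return stub;
    }

    // Models health marking that reacts to the host-down code only.
    void on_rpc_error(const std::string& addr, int error_code) {
        auto it = _stubs.find(addr);
        if (it != _stubs.end() && error_code == kErrHostDown) {
            it->second->healthy = false;
        }
    }

private:
    std::unordered_map<std::string, std::shared_ptr<ToyStub>> _stubs;
    int _next_id = 0;
};

// Fetches a stub, reports error_code against it, then fetches again and
// returns the id of the stub handed out by the second fetch.
int stub_id_after_error(int error_code) {
    ToyStubCache cache;
    const std::string addr = "be-1:8060";
    auto first = cache.get(addr);
    assert(first->id == 0);
    cache.on_rpc_error(addr, error_code);
    return cache.get(addr)->id;
}
```

   In this model, `stub_id_after_error(kErrRpcTimeout)` returns the id of the 
original stub, while `stub_id_after_error(kErrHostDown)` returns the id of a 
newly created one. That is the shape of the gap described above: a timeout 
alone never forces a new entry.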
   
   I would not yet state that the exact root cause is proven to be Apache brpc 
#1168. The brpc hypothesis is plausible from the symptom, but the current 
public evidence pins the first actionable Doris-side issue to missing 
invalidation/recreation of the streaming brpc client on load-stream open 
timeout. A maintainer can investigate this without needing to prove the exact 
brpc internal line first.
   
   I also checked the likely upgrade pointer: 4.1.2 still uses bundled brpc 
`1.4.0`, and the same streaming-cache open path and `EHOSTDOWN`-only channel 
invalidation pattern are still present. Current `branch-4.0`, `branch-4.1`, and 
`master` also still show brpc `1.4.0` and no obvious streaming-cache eviction 
on this path. So 4.1.2 should still be tested, but I would not present it as a 
known brpc-version fix based on source inspection alone.
   
   Suggested next maintainer actions:
   
   1. Reproduce or fault-inject an `open_load_stream` timeout on the V2 path 
and verify whether the next V2 open reuses the same 
`brpc_streaming_client_cache()` entry.
   2. Add cache invalidation for `brpc_streaming_client_cache()` when 
`LoadStreamStub::open()` fails with network/open-timeout style errors, or 
broaden `FailureDetectChannel` health marking for this open-RPC path beyond 
only `EHOSTDOWN`.
   3. Add a regression/fault-injection test that forces one load-stream open 
failure and asserts the following open uses a fresh channel and does not wedge 
future writes.
   4. Keep `experimental_enable_single_replica_insert` documented only as a 
mitigation; it reduces multi-replica load-stream exposure but does not fix the 
cached broken streaming-channel recovery gap.
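   
   Actions 2 and 3 could be prototyped along the following lines. This is a 
hedged sketch against a fake cache, not a patch: `FakeChannel`, 
`FakeStreamingCache`, `should_evict`, `open_with_eviction`, and 
`next_open_uses_fresh_channel` are hypothetical names, the error constants are 
stand-ins, and the choice of which error codes trigger eviction is an 
assumption that a real change would need to settle.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-ins, not Doris or brpc definitions.
constexpr int kOk = 0;
constexpr int kErrHostDown = 112;     // plays the role of EHOSTDOWN
constexpr int kErrRpcTimeout = 1008;  // plays the role of "[E1008] Reached timeout"

struct FakeChannel {
    int generation;
};

class FakeStreamingCache {
public:
    // Returns the cached channel for addr, creating one if absent.
    std::shared_ptr<FakeChannel> get(const std::string& addr) {
        auto& slot = _channels[addr];
        if (!slot) {
            slot = std::make_shared<FakeChannel>(FakeChannel{_next_generation++});
        }
        return slot;
    }

    // Drops the cached channel so the next get() builds a new one.
    void erase(const std::string& addr) { _channels.erase(addr); }

private:
    std::unordered_map<std::string, std::shared_ptr<FakeChannel>> _channels;
    int _next_generation = 0;
};

// Assumed policy: evict on host-down and on open timeout.
inline bool should_evict(int error_code) {
    return error_code == kErrHostDown || error_code == kErrRpcTimeout;
}

// open_rpc stands in for the synchronous open call and returns an error code.
int open_with_eviction(FakeStreamingCache& cache, const std::string& addr,
                       const std::function<int(FakeChannel&)>& open_rpc) {
    auto channel = cache.get(addr);
    assert(channel != nullptr);
    const int rc = open_rpc(*channel);
    if (rc != kOk && should_evict(rc)) {
        cache.erase(addr);
    }
    return rc;
}

// Regression-style check: inject one open timeout, then verify that the
// following open succeeds on a different channel generation.
bool next_open_uses_fresh_channel() {
    FakeStreamingCache cache;
    const std::string addr = "be-1:8060";
    int first_gen = -1;
    int second_gen = -1;
    open_with_eviction(cache, addr, [&](FakeChannel& ch) {
        first_gen = ch.generation;
        return kErrRpcTimeout;  // injected fault
    });
    const int rc = open_with_eviction(cache, addr, [&](FakeChannel& ch) {
        second_gen = ch.generation;
        return kOk;
    });
    return rc == kOk && second_gen != first_gen;
}
```

   Keeping the eviction decision in one small predicate makes it easy to 
widen or narrow the set of error codes later, and the check at the end mirrors 
the assertion proposed in action 3.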
   
   Information that would make the public diagnosis stronger:
   
   - For one affected load id / txn id, provide the target BE `be.INFO` window 
proving there is no `open load stream, load_id=...` handler-entry line while 
the coordinator logs the timeout.
   - Capture `rpcz` or equivalent brpc socket-level state with diagnostics 
enabled before reproducing; runtime enabling apparently did not work on this 
build.
   - Provide a minimal scripted reproducer if possible, or the exact sequence 
of CTAS / INSERT / UPDATE statements with table DDL, concurrency, and the load 
ids that correspond to the attached stacks.
   - If the cluster is still available, capture 
`gdb -p <doris_be pid> -batch -ex "thread apply all bt"` plus `ss -tanp` for 
the affected BE pairs during the wedge, including loopback.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

