ojalberts-itc commented on issue #64708:
URL: https://github.com/apache/doris/issues/64708#issuecomment-4777922226

   ## Update: the wedge **reproduces on Doris 4.1.2** (`doris-4.1.2-rc01`)
   
   Following the previous comment (where we promised a 4.1.2 test): we redeployed the cluster
   **fresh on 4.1.2** (`doris-4.1.2-rc01-aec169d2025`, official x64 GA tarball; 3 FE + 4 BE, repl=3;
   the cause-#1 BE→BE `:8040` self-ingress SG rule present from t=0; empty data volumes) and ran a
   heavy multi-replica write workload with `experimental_enable_single_replica_insert` **OFF**.
   
   **It wedged, with a signature identical to 4.0.6.** The workload (a ~222M-row repl=3 build + churn) ran
   clean *during* the load, then the write path wedged at **idle, ~10 minutes after the last heavy
   write**. We captured all 4 BEs live before any restart.
   
   ### The parked write thread: same `LoadStreamStub::open` as before, now on 4.1.2
   
   Doris's own `be.WARNING` printed the stack at the failure point:
   
   ```text
   open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>: [E1008]Reached timeout=60000ms @10.0.0.227:8060
     0#  doris::LoadStreamStub::open(...)                       be/src/exec/sink/load_stream_stub.cpp:208
     1#  doris::LoadStreamStubs::open(...)                      be/src/exec/sink/load_stream_stub.cpp:574
     2#  doris::VTabletWriterV2::_open_streams_to_backend(...)  be/src/exec/sink/writer/vtablet_writer_v2.cpp:317
     3#  doris::VTabletWriterV2::_open_streams()                be/src/exec/sink/writer/vtablet_writer_v2.cpp:298
   ```
   
   (Build path `/home/zcp/repo_center/doris_release/doris/be/src/...` confirms the official 4.1.2 GA build.)
   
   ### `[E1008]` on `:8060` across all 4 BEs
   
   `be.WARNING` E1008/broken-socket counts at the wedge were **23 / 34 / 30 / 24** across the four
   BEs. The `enable_brpc_connection_check` path detects the broken socket and evicts it from cache,
   but the connection never recovers:
   
   ```text
   fragment_mgr.cpp:953   brpc stub: 10.0.0.227:8060 check failed: [E1008]Reached timeout=10000ms @10.0.0.227:8060
   fragment_mgr.cpp:974   remove brpc stub from cache: 10.0.0.227:8060, error: [E1008]...
   brpc_client_cache.h:326 open brpc connection to 10.0.0.137:8060 failed: [E1008]Reached timeout=2000ms @10.0.0.137:8060
   vtablet_writer.cpp:715  failed to open tablet writer may caused by timeout ...
   ```
   
   Raw TCP to `:8060` stays open throughout; the stall is in the brpc application layer.
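
To make the per-BE comparison repeatable, the `[E1008]` lines can be tallied per remote endpoint and timeout. The following is a minimal sketch, not the method used for the counts above: the regex is inferred only from the log lines quoted in this comment, `tally_e1008` is a hypothetical helper name, and lines where the message is elided (`[E1008]...`) are deliberately not counted.

```python
import re
from collections import Counter

# Pattern inferred from the be.WARNING lines quoted in this comment, e.g.
# "[E1008]Reached timeout=10000ms @10.0.0.227:8060". This is an assumption
# about the log shape, not a documented format.
E1008_RE = re.compile(r"\[E1008\]Reached timeout=(\d+)ms @(\d+\.\d+\.\d+\.\d+:\d+)")


def tally_e1008(lines):
    """Count E1008 timeouts per (remote endpoint, timeout in ms)."""
    counts = Counter()
    for line in lines:
        for timeout_ms, endpoint in E1008_RE.findall(line):
            counts[(endpoint, int(timeout_ms))] += 1
    return counts


# Sample input copied from the excerpt above.
SAMPLE = [
    "fragment_mgr.cpp:953   brpc stub: 10.0.0.227:8060 check failed: "
    "[E1008]Reached timeout=10000ms @10.0.0.227:8060",
    "fragment_mgr.cpp:974   remove brpc stub from cache: 10.0.0.227:8060, "
    "error: [E1008]...",
    "brpc_client_cache.h:326 open brpc connection to 10.0.0.137:8060 failed: "
    "[E1008]Reached timeout=2000ms @10.0.0.137:8060",
]

for (endpoint, timeout_ms), n in sorted(tally_e1008(SAMPLE).items()):
    print(f"{endpoint} timeout={timeout_ms}ms count={n}")
```

Grouping by timeout as well as endpoint keeps the 2000ms connection-open failures separate from the 10000ms stub checks and the 60000ms stream opens, which come from different code paths in the excerpts.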
   
   ### Workers parked, not saturated — and zero clone activity
   
   brpc `:8060` `/vars` at the wedge: `bthread_worker_usage` **0.46–1.76**, `bthread_count` 5, with
   `rpc_server_8060_connection_count` 58–59. The write threads are blocked on the RPC, not starved.
   FE `SHOW PROC '/cluster_balance/{running,pending}_tablets'` was **empty**: a zero-clone wedge,
   purely the brpc load-stream socket going Broken. `SHOW BACKENDS` showed all 4 BEs `Alive=true`
   throughout (the 9050 heartbeat is a separate threadpool). Both repl=3 **and** repl=1 writes hung;
   only a **simultaneous full-BE restart** cleared it.
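
The "parked, not saturated" reading can be checked mechanically against a saved `/vars` dump. This is a rough sketch under stated assumptions: it assumes the dump is plain text with one `name : value` pair per line, and `worker_count` and `threshold` are hypothetical caller-supplied values (the configured worker count is not stated in this comment). The sample uses the upper-end figures quoted above.

```python
def parse_vars(text):
    """Parse 'name : value' lines into a dict, keeping only numeric values."""
    out = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no separator on this line
        try:
            out[name.strip()] = float(value.strip())
        except ValueError:
            continue  # non-numeric value, skip it
    return out


def workers_saturated(vars_, worker_count, threshold=0.9):
    """True if bthread worker usage is close to the number of workers.

    worker_count and threshold are assumptions supplied by the caller.
    """
    return vars_["bthread_worker_usage"] >= threshold * worker_count


# Values taken from the figures quoted above (upper end of each range).
SAMPLE_VARS = """\
bthread_worker_usage : 1.76
bthread_count : 5
rpc_server_8060_connection_count : 59
"""

snapshot = parse_vars(SAMPLE_VARS)
# worker_count=16 is a made-up value for illustration only.
print("saturated:", workers_saturated(snapshot, worker_count=16))
```

A usage far below the worker count, combined with hung writes, is what points toward threads waiting on a remote RPC rather than a starved worker pool.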
   
   ### Preconditions: this is why a fresh-cluster smoke test will NOT reproduce it
   
   This is the important part if you try to reproduce. **A fresh cluster running a single heavy load
   does not wedge.** In our run, the identical 85M-row repl=3 load on a freshly-bootstrapped 4.1.2
   cluster completed in **73s, clean**. The same SQL, with the same `single_replica_insert=OFF`, then
   **failed** on a cluster that had **accumulated history** (125s, then 365s on retry). The discriminating
   precondition is the cluster's accumulated/degraded state; heavy multi-replica load is the proximate
   trigger *on top of* that. To reproduce, expect to need sustained accumulation (a full multi-table
   build plus a burst of concurrent repl=3 loads), not one load on a clean cluster.
   
   We had two occurrences, with **complementary** confounds that cancel out:
   
   1. **First (fresh bootstrap, no restart):** a full ~222M-row build + churn ran clean, then the
      write path wedged at idle ~10 min later. This was on a cluster that had **never** been
      `systemctl restart`-ed, so it is **not** explained by the known clone-storm-after-hard-restart
      path. (It did have `single_replica_insert` toggled OFF→ON ~2 min before the wedge, so this
      occurrence alone couldn't rule out the toggle.)
   2. **Replay (no toggle):** we re-ran with `single_replica_insert` **OFF the whole time, no toggle**;
      the heavy load wedged again (the `[E1008]` / `failed to write enough replicas 1/3` /
      `failed to open streams to any BE` above). So the **toggle is not necessary**.
   
   Union of the two: the wedge appears **without the toggle** and **without a hard-restart-degraded
   cluster**, so neither is the cause. The common factor is an accumulated cluster under sustained
   multi-replica load, on 4.1.2.
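
The elimination argument can be written down as a small set computation. This is only an illustration of the reasoning, with hypothetical factor names; each occurrence records just what this comment states as present or absent, and anything unstated is treated as unknown.

```python
# Factors per wedge occurrence, as described in the list above.
# The factor names are labels invented for this sketch.
OCCURRENCES = {
    "first": {
        "present": {"toggle_off_to_on", "accumulated_state", "multi_replica_load"},
        "absent": {"hard_restart"},
    },
    "replay": {
        "present": {"accumulated_state", "multi_replica_load"},
        "absent": {"toggle_off_to_on"},
    },
}


def not_necessary(occurrences):
    """A factor absent from at least one wedge cannot be required for it."""
    return set().union(*(o["absent"] for o in occurrences.values()))


def common_factors(occurrences):
    """Factors present in every wedge remain candidate preconditions."""
    return set.intersection(*(o["present"] for o in occurrences.values()))


print("ruled out as necessary:", sorted(not_necessary(OCCURRENCES)))
print("still candidates:", sorted(common_factors(OCCURRENCES)))
```

Two occurrences only narrow the candidates; they show what is not required, not that the remaining factors are sufficient.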
   
   Net: **the brpc BE write-wedge is present on Doris 4.1.2; it is not fixed.** The in-process
   signature (the `LoadStreamStub::open` stack, `[E1008]` on `:8060` across all BEs, zero-clone,
   all-`Alive=true`, full-restart-only recovery) is identical to the 4.0.6 reports: the brpc #1168
   class. Full per-BE `be.WARNING` excerpts, the brpc `/vars`, and FE cluster-state for **both**
   occurrences are attached as `64708-4.1.2-evidence.tar.gz`
   (sha256 `e52e4c76c2f25dd87d8aab2434dc332530e20ea196eff0c85f67196d7855c2fd`). Original Question 1
   stands: is this a known brpc 1.4.0 load-stream socket defect, and is there a fixing PR or a
   version that resolves it?
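
For anyone downloading the evidence archive, its integrity can be checked against the checksum quoted above. A minimal sketch using only the Python standard library; the helper names are hypothetical and the local file path is whatever the archive was saved as.

```python
import hashlib

# Checksum quoted above for 64708-4.1.2-evidence.tar.gz.
EXPECTED_SHA256 = "e52e4c76c2f25dd87d8aab2434dc332530e20ea196eff0c85f67196d7855c2fd"


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file at path hashes to the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Usage would be `matches_expected("64708-4.1.2-evidence.tar.gz")` from the directory holding the download.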


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

