rajvarun77 opened a new pull request, #3339:
URL: https://github.com/apache/brpc/pull/3339
### Problem
`brpc_channel_unittest` bundles dozens of timing-sensitive `TEST_F` cases
(backup request, retry/backoff, timeouts, connection-failure) into a single
test binary. Each test does real-time waits (server-side `sleep_us`,
backup-request timers, connection retries). gtest runs them serially in one
process, so the binary's wall time is the **sum** of all those waits.
On contended CI runners (GitHub-hosted `ubuntu-22.04`, ~4 shared vCPU with
hypervisor steal) that cumulative time exceeds Bazel's **default per-test
300s limit** (`size = "medium"`), so the binary intermittently fails with
`TIMEOUT` even though every assertion would pass given enough time.
### Evidence (reproduced on GitHub Actions)
Measured on `ubuntu-22.04`, `--nocache_test_results`:
| Configuration | Result |
| --- | --- |
| current (`size=medium`, 300s), 5 runs under load | **TIMEOUT in 4/5 @ 300.0s** |
| `size=large` (900s), single run | **PASSED in 91.7s** |
| `size=large` (900s), 20 serialized no-cache runs | **20/20 PASSED, slowest 114.0s** |
The nominal run is ~92–114s, but under parallel-job contention the same
binary balloons past 300s, a ~3× slowdown that crosses the medium ceiling.
Raising the limit to `large` (900s) gives ~8× nominal headroom and absorbs
the spike.
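The ratios quoted above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the four numbers come from the table, and the variable and function names are made up for illustration.

```python
# Figures taken from the Evidence table; names are illustrative only.
MEDIUM_LIMIT_S = 300.0  # Bazel default timeout for size = "medium"
LARGE_LIMIT_S = 900.0   # Bazel default timeout for size = "large"
FASTEST_RUN_S = 91.7    # single uncontended run
SLOWEST_RUN_S = 114.0   # slowest of the 20 serialized runs


def ratio(limit_s, run_s):
    """How many times a nominal run fits inside a timeout limit."""
    return limit_s / run_s


# Slowdown needed before a nominal run hits the medium (300s) ceiling.
slowdown_to_timeout = (
    ratio(MEDIUM_LIMIT_S, SLOWEST_RUN_S),
    ratio(MEDIUM_LIMIT_S, FASTEST_RUN_S),
)

# Headroom the large (900s) limit leaves over a nominal run.
large_headroom = (
    ratio(LARGE_LIMIT_S, SLOWEST_RUN_S),
    ratio(LARGE_LIMIT_S, FASTEST_RUN_S),
)

print("slowdown to hit 300s: %.1fx to %.1fx" % slowdown_to_timeout)
print("headroom under 900s:  %.1fx to %.1fx" % large_headroom)
```

The "~8×" figure corresponds to the slowest measured run (114.0s), which is the conservative end of the range.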
Bench runs (throwaway branch, not part of this PR):
- baseline `TIMEOUT 4/5` + rejected `shard_count=4` experiment `FAILED 20/20`:
https://github.com/rajvarun77/brpc/actions/runs/27396621709
- `size=large` validation (20/20 serialized + 91.7s timing):
https://github.com/rajvarun77/brpc/actions/runs/27453397271
### Fix
Add an optional `per_test_size` override to the `generate_unittests` macro
and set `brpc_channel_unittest` to `size = "large"`. **No test source
changes.**
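For reviewers who have not opened the diff, the shape of the override is roughly the following. This is a simplified sketch, not the macro as it appears in this PR: it assumes `per_test_size` is a dict from test name to Bazel size, and the `test_names` and `default_size` parameters are hypothetical. So that the snippet runs as plain Python, it returns the attributes it would pass to each test rule instead of calling `native.cc_test`.

```python
# Bazel's documented default timeouts per test size, in seconds.
DEFAULT_TIMEOUT_S = {
    "small": 60,
    "medium": 300,
    "large": 900,
    "enormous": 3600,
}


def generate_unittests(test_names, per_test_size = None, default_size = "medium"):
    """Build one attribute dict per test target.

    Hypothetical signature: in a real .bzl macro each dict would instead be
    passed to a test rule such as native.cc_test(**attrs).
    """
    overrides = per_test_size or {}
    targets = []
    for name in test_names:
        # Fall back to the default size unless this test has an override.
        size = overrides.get(name, default_size)
        targets.append({"name": name, "size": size})
    return targets


targets = generate_unittests(
    ["brpc_channel_unittest", "some_other_unittest"],  # second name is made up
    per_test_size = {"brpc_channel_unittest": "large"},
)
for attrs in targets:
    print(attrs["name"], attrs["size"], DEFAULT_TIMEOUT_S[attrs["size"]])
```

Keeping the override optional means every other test keeps its current size and timeout; only the one named target moves to the 900s limit.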
### Why not shard it?
Sharding (`shard_count`) was tried first and **rejected**: it fails
deterministically (20/20). The `TEST_F` cases in `brpc_channel_unittest`
share fixed loopback endpoints and global state, so running shards as
parallel processes makes a "connection should be refused" test
(`ChannelTest.connection_failed_selective`) observe **another shard's live
server** on the same port and see a successful connection instead of
`ECONNREFUSED`. The tests are not shard-safe; raising the size limit is the
only safe lever without rewriting the suite for isolation.
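The failure mode can be reproduced outside brpc with plain sockets. The sketch below is an illustration only, not code from the test suite: a listener on a loopback port stands in for "another shard's live server", and `can_connect` stands in for the test that expects `ECONNREFUSED`. It runs in a single process and uses an OS-assigned port, whereas the real tests use fixed ports across separate shard processes.

```python
import socket


def can_connect(port, host="127.0.0.1", timeout_s=1.0):
    """Return True if a TCP connect succeeds, False if it is refused."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except ConnectionRefusedError:
        return False


def demo():
    # Stand-in for a server started by a different shard on the shared port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    # A "should be refused" check now connects successfully instead.
    seen_while_listening = can_connect(port)

    # With no listener on the port, the same check gets the expected refusal.
    listener.close()
    seen_after_close = can_connect(port)
    return seen_while_listening, seen_after_close


if __name__ == "__main__":
    print(demo())
```

The check itself is identical in both calls; only the presence of an unrelated listener on the port changes the outcome, which is why the sharded runs fail regardless of timing.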
---
cc @chenBright for review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]