The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design: the first FQS saves dyntick-idle snapshots and the second
FQS compares them. On idle systems this adds long and unnecessary latency
to synchronize_rcu(): two FQS waits of ~3ms each with HZ=1000, even when
one FQS wait would have sufficed.

Investigation showed that the GP kthread's CPU is often the holdout CPU
after the first FQS: it cannot be detected as "idle" because it is
actively running the FQS scan in the GP kthread.

Therefore, at the start of the first FQS, immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
the FQS scan, so this is safe and results in significant tail latency
improvements.

I benchmarked 6 runs of 100 synchronize_rcu() calls each. The per-call
latency shows good tail latency improvements (default settings for fqs
jiffies):

Baseline (without fix):
| Run | Mean     | Min      | Max       |
|-----|----------|----------|-----------|
| 1   | 4.036 ms | 3.509 ms | 7.973 ms  |
| 2   | 4.049 ms | 3.904 ms | 8.003 ms  |
| 3   | 4.033 ms | 1.160 ms | 10.083 ms |
| 4   | 3.993 ms | 3.145 ms | 4.093 ms  |
| 5   | 3.988 ms | 2.675 ms | 4.123 ms  |
| 6   | 4.019 ms | 3.894 ms | 5.845 ms  |

With fix:
| Run | Mean     | Min      | Max      |
|-----|----------|----------|----------|
| 1   | 3.991 ms | 2.953 ms | 4.125 ms |
| 2   | 3.995 ms | 3.439 ms | 4.081 ms |
| 3   | 3.989 ms | 2.974 ms | 4.079 ms |
| 4   | 3.997 ms | 3.667 ms | 4.072 ms |
| 5   | 4.027 ms | 2.550 ms | 7.928 ms |
| 6   | 3.989 ms | 2.886 ms | 4.076 ms |

The fix reduces worst-case latency because the second FQS wait no longer
runs when it is not needed.

Tested rcutorture TREE and SRCU configurations.

Signed-off-by: Joel Fernandes <[email protected]>
---
 kernel/rcu/tree.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8293bae1dec1..c116ed7633d3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -160,6 +160,7 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
 			      unsigned long gps, unsigned long flags);
 static void invoke_rcu_core(void);
 static void rcu_report_exp_rdp(struct rcu_data *rdp);
+static void rcu_report_qs_rdp(struct rcu_data *rdp);
 static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
 static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
@@ -2032,6 +2033,17 @@ static void rcu_gp_fqs(bool first_time)
 	}
 
 	if (first_time) {
+		/*
+		 * Immediately report QS for the GP kthread's CPU. The GP kthread
+		 * cannot be in an RCU read-side critical section while running
+		 * the FQS scan. This eliminates the need for a second FQS wait
+		 * when all CPUs are idle.
+		 */
+		preempt_disable();
+		rcu_qs();
+		rcu_report_qs_rdp(this_cpu_ptr(&rcu_data));
+		preempt_enable();
+
 		/* Collect dyntick-idle snapshots. */
 		force_qs_rnp(rcu_watching_snap_save);
 	} else {
-- 
2.34.1
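
For readers who want to see the reasoning in the commit message in
isolation, here is a user-space toy model of the holdout behaviour. It is
only a sketch and not kernel code: every name in it (toy_cpu,
toy_fqs_scans_needed, and so on) is hypothetical. It assumes that idle
CPUs are accounted for by the first scan, and that any CPU still holding
out passes through a quiescent state before the second scan.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not kernel APIs. */
enum toy_cpu_state { TOY_CPU_IDLE, TOY_CPU_BUSY };

struct toy_cpu {
	enum toy_cpu_state state;	/* state seen by the FQS scan */
	bool qs_reported;		/* quiescent state already reported? */
};

/*
 * Return how many FQS scans the model needs before every CPU has
 * reported a quiescent state.
 *
 * gp_cpu is the CPU running the scan; it is always busy during the scan.
 * report_gp_cpu_qs models the patch: report a QS for gp_cpu up front.
 */
int toy_fqs_scans_needed(struct toy_cpu *cpus, size_t ncpus,
			 size_t gp_cpu, bool report_gp_cpu_qs)
{
	size_t holdouts = 0;
	size_t i;

	/* The scanning CPU cannot look idle while it runs the scan. */
	cpus[gp_cpu].state = TOY_CPU_BUSY;

	/* Modelled fix: the scanner is not in a read-side critical section. */
	if (report_gp_cpu_qs)
		cpus[gp_cpu].qs_reported = true;

	/* First scan: idle CPUs are accounted for immediately. */
	for (i = 0; i < ncpus; i++) {
		if (cpus[i].state == TOY_CPU_IDLE)
			cpus[i].qs_reported = true;
		if (!cpus[i].qs_reported)
			holdouts++;
	}
	if (holdouts == 0)
		return 1;

	/*
	 * Second scan. Simplifying assumption of this model: every
	 * remaining holdout has passed through a quiescent state by now.
	 */
	for (i = 0; i < ncpus; i++)
		cpus[i].qs_reported = true;
	return 2;
}
```

With all other CPUs idle, the model needs two scans without the up-front
report and one scan with it. If another CPU is busy, the second scan is
still needed either way, which matches the commit message's claim that
the saving applies when the GP kthread's CPU is the only holdout.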

