This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
mptcp: better mptcp-level RTT estimator

The current MPTCP-level RTT estimator has several issues. On high-speed
links, the MPTCP-level receive buffer auto-tuning happens at a
frequency well above that of the TCP level. That in turn can cause
excessive/unneeded receive buffer increases.

On such links, the initial rtt_us value is considerably higher than the
actual delay, and the current mptcp_rcv_space_adjust() updates
msk->rcvq_space.rtt_us with a period equal to that field's previous
value. If the initial rtt_us is 40ms, its first update will happen after
40ms, even if the subflows see an actual RTT orders of magnitude lower.

Additionally:
- setting the msk RTT to the maximum among all the subflows' RTTs makes
  DRS constantly overshoot the rcvbuf size when one subflow has
  considerably higher latency than the other(s).

- during unidirectional bulk transfers with multiple active subflows,
  the TCP-level RTT estimator occasionally sees considerably higher
  values than the real link delay, e.g. when the packet scheduler reacts
  to an incoming ACK on a given subflow by pushing data on a different
  subflow.

- currently inactive but still open subflows (i.e. switched to backup
  mode) are always considered when computing the msk-level RTT.

Address all the issues above with a more accurate RTT estimation
strategy: the MPTCP-level RTT is set to the minimum across all the
subflows actually feeding data into the MPTCP receive buffer, using a
small sliding window of samples.

> While at it, also use EWMA to compute the msk-level scaling_ratio, to
> that MPTCP can avoid traversing the subflow list is
> mptcp_rcv_space_adjust().

This isn't a bug, but there are two typos here that make the sentence
hard to parse. Should this be 'so that MPTCP can avoid traversing the
subflow list in mptcp_rcv_space_adjust()'?

Use some care to avoid updating msk and ssk level fields too often.

Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")

[ ... ]

> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 0bd1ee860316..6ec65c0ae655 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h

[ ... ]

> @@ -422,6 +435,27 @@ static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
>       return msk->first_pending;
>  }
>
> +static inline void mptcp_init_rtt_est(struct mptcp_sock *msk)
> +{
> +     int i;
> +
> +     for (i = 0; i < MPTCP_RTT_SAMPLES; ++i)
> +             msk->rcv_rtt_est.samples[i] = U32_MAX;
> +     msk->rcv_rtt_est.next_sample = 0;
> +     msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
> +}
> +
> +static inline u32 mptcp_rtt_us_est(const struct mptcp_sock *msk)
> +{
> +     u32 rtt_us = msk->rcv_rtt_est.samples[0];
                         ^^^^

Is samples[0] missing a READ_ONCE()?

The function comment says 'Lockless access of collected samples' and the
writer mptcp_rcv_rtt_update() uses WRITE_ONCE() for all sample indices
(including index 0) with a comment 'Pairs with READ_ONCE() in
mptcp_rtt_us_est()':

net/mptcp/protocol.c:mptcp_rcv_rtt_update() {
        ...
        /* Pairs with READ_ONCE() in mptcp_rtt_us_est(). */
        id = msk->rcv_rtt_est.next_sample;
        WRITE_ONCE(msk->rcv_rtt_est.samples[id], rtt_us);
        ...
}

However, samples[0] is read with a plain load while samples[1..4] correctly
use READ_ONCE():

> +     int i;
> +
> +     /* Lockless access of collected samples. */
> +     for (i = 1; i < MPTCP_RTT_SAMPLES; ++i)
> +             rtt_us = min(rtt_us, READ_ONCE(msk->rcv_rtt_est.samples[i]));
> +     return rtt_us;
> +}

The writer runs under mptcp_data_lock (spin_lock_bh) while the reader runs
under the socket lock (msk_owned_by_me) -- these are different locks, so
WRITE_ONCE/READ_ONCE pairing is required. KCSAN would flag this as a data
race.

> +
>  static inline struct mptcp_data_frag *mptcp_send_next(struct sock *sk)
>  {
>       struct mptcp_sock *msk = mptcp_sk(sk);

[ ... ]

> @@ -523,6 +557,8 @@ struct mptcp_subflow_context {
>       u32     map_data_len;
>       __wsum  map_data_csum;
>       u32     map_csum_len;
> +     u32     prev_rtt_us;
                    ^^^^

Is this field used anywhere?

The field 'prev_rtt_us' is added to struct mptcp_subflow_context but doesn't
appear to be referenced anywhere in the codebase. Only 'prev_rtt_seq' is used
in mptcp_rcv_rtt_update().

> +     u32     prev_rtt_seq;
>       u32     request_mptcp : 1,  /* send MP_CAPABLE */
>               request_join : 1,   /* send MP_JOIN */
>               request_bkup : 1,
-- 
pw-bot: cr
