On Wed, Jul 08, 2020 at 11:56:21PM +0300, Vladimir Oltean wrote:
> The concept of timestamping DSA switches / Ethernet PHYs is becoming
> more and more popular, however the Linux kernel timestamping code has
> evolved quite organically and there's layers upon layers of new and old
> code that need to work together for things to behave as expected.
> 
> Add this chapter to explain what the overall goals are.

This is great.  I have one correction and a few formatting and style
suggestions, below.  With the correction in place,

Reviewed-by: Richard Cochran <richardcoch...@gmail.com>

> Loosely based upon this email discussion plus some more info:
> https://lkml.org/lkml/2020/7/6/481
> 
> Signed-off-by: Vladimir Oltean <olte...@gmail.com>
> ---
>  Documentation/networking/timestamping.rst | 149 ++++++++++++++++++++++
>  1 file changed, 149 insertions(+)
> 
> diff --git a/Documentation/networking/timestamping.rst 
> b/Documentation/networking/timestamping.rst
> index 1adead6a4527..14df58c24e8c 100644
> --- a/Documentation/networking/timestamping.rst
> +++ b/Documentation/networking/timestamping.rst
> @@ -589,3 +589,152 @@ Time stamps for outgoing packets are to be generated as 
> follows:
>    this would occur at a later time in the processing pipeline than other
>    software time stamping and therefore could lead to unexpected deltas
>    between time stamps.
> +
> +3.2 Special considerations for stacked PTP Hardware Clocks
> +----------------------------------------------------------
> +
> +There are situations when there may be more than one PHC (PTP Hardware Clock)
> +in the data path of a packet. The kernel has no explicit mechanism to allow 
> the
> +user to select which PHC to use for timestamping Ethernet frames. Instead, 
> the
> +assumption is that the outermost PHC is always the most preferable, and that
> +kernel drivers collaborate towards achieving that goal. Currently there are 3
> +cases of stacked PHCs, detailed below:
> +
> +- DSA (Distributed Switch Architecture) switches.

For readability, please start a new paragraph here.

     These are Ethernet switches
> +  which have one of their ports connected to an (otherwise completely 
> unaware)
> +  host Ethernet interface, and perform the role of a port multiplier with
> +  optional forwarding acceleration features.  Each DSA switch port is visible
> +  to the user as a standalone (virtual) network interface, however network 
> I/O
> +  is performed under the hood indirectly through the host interface.

new paragraph.

> +  When a DSA switch is attached to a host port, PTP synchronization has to
> +  suffer, since the switch's variable queuing delay introduces a path delay
> +  jitter between the host port and its PTP partner. For this reason, some DSA
> +  switches include a timestamping clock of their own, and have the ability to
> +  perform network timestamping on their own MAC, such that path delays only
> +  measure wire and PHY propagation latencies. Timestamping DSA switches are
> +  supported in Linux and expose the same ABI as any other network interface
> +  (save for the fact that the DSA interfaces are in fact virtual in terms of
> +  network I/O, they do have their own PHC).  It is typical, but not 
> mandatory,
> +  for all interfaces of a DSA switch to share the same PHC.

new paragraph.

     By design, PTP
> +  timestamping with a DSA switch does not need any special handling in the
> +  driver for the host port it is attached to.  However, when the host port 
> also
> +  supports PTP timestamping, DSA will take care of intercepting the
> +  ``.ndo_do_ioctl`` calls towards the host port, and block attempts to enable
> +  hardware timestamping on it. This is because the SO_TIMESTAMPING API does 
> not
> +  allow the delivery of multiple hardware timestamps for the same packet, so
> +  anybody else except for the DSA switch port must be prevented from doing 
> so.

new paragraph.

> +  In code, DSA provides for most of the infrastructure for timestamping
> +  already, in generic code: a BPF classifier (``ptp_classify_raw``) is used 
> to
> +  identify PTP event messages (any other packets, including PTP general
> +  messages, are not timestamped), and provides two hooks to drivers:
> +
> +    - ``.port_txtstamp()``: A clone of the timestampable skb to be 
> transmitted,
> +      before actually transmitting it. Typically, a switch will have a PTP TX

"A clone of the ..."  is not a sentence.

> +      timestamp register (or sometimes a FIFO) where the timestamp becomes
> +      available. There may be an IRQ that is raised upon this timestamp's
> +      availability, or the driver might have to poll after invoking
> +      ``dev_queue_xmit()`` towards the host interface. Either way, in the
> +      ``.port_txtstamp()`` method, the driver only needs to save the clone 
> for
> +      later use (when the timestamp becomes available). Each skb is annotated
> +      with a pointer to its clone, in ``DSA_SKB_CB(skb)->clone``, to ease the
> +      driver's job of keeping track of which clone belongs to which skb.
> +
> +   - ``.port_rxtstamp()``: The original (and only) timestampable skb is
> +     provided to the driver, for it to annotate it with a timestamp, if that 
> is
> +     immediately available, or defer to later. On reception, timestamps might
> +     either be available in-band (through metadata in the DSA header, or
> +     attached in other ways to the packet), or out-of-band (through another 
> RX
> +     timestamping FIFO). Deferral on RX is typically necessary when 
> retrieving
> +     the timestamp needs a sleepable context. In that case, it is the
> +     responsibility of the driver to call ``netif_rx_ni()`` on the freshly

responsibility of the *DSA* driver

> +     timestamped skb.
> +
> +
> +- Ethernet PHYs.

new paragraph.

     These are devices that typically fulfill a Layer 1 role in the
> +  network stack, hence they do not have a representation in terms of a 
> network
> +  interface as DSA switches do. However, PHYs may be able to detect and
> +  timestamp PTP packets, for performance reasons: timestamps taken as close 
> as
> +  possible to the wire have the potential to yield a more stable and precise
> +  synchronization.
> +  A PHY driver that supports PTP timestamping must create a ``struct
> +  mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The 
> presence
> +  of this pointer will be checked by the networking stack.

new paragraph.

> +  Since PHYs do not have network interface representations, the timestamping
> +  and ethtool ioctl operations for them need to be mediated by their 
> respective
> +  MAC driver. Therefore, as opposed to DSA switches, modifications need to be
> +  done to each individual MAC driver for PHY timestamping support. This
> +  entails:
> +
> +    - Checking, in ``.ndo_do_ioctl``, whether
> +      ``phy_has_hwtstamp(netdev->phydev)`` is true or not. If it is, then 
> the MAC
> +      driver should not process this request but instead pass it on to the 
> PHY
> +      using ``phy_mii_ioctl()``.
> +
> +    - Checking, before delivering received skb's up the network stack (using
> +      ``napi_gro_receive`` or similar), whether 
> ``skb_defer_rx_timestamp(skb)``
> +      is necessary or not - and if it is, don't call ``napi_gro_receive`` at 
> all.

Actually, only drivers that use the traditional netif_rx() will need
to call skb_defer_rx_timestamp().

In the specific case of napi_gro_receive, we have:

napi_gro_receive
 napi_skb_finish
  gro_normal_one
   gro_normal_list
    netif_receive_skb_list_internal
     if (!skb_defer_rx_timestamp(skb))
      ...

The stack automatically also checks skb_defer_rx_timestamp() in other
cases, for example:

netif_receive_skb
 ret = netif_receive_skb_internal(skb);
  if (skb_defer_rx_timestamp(skb))
   return NET_RX_SUCCESS;

> +      If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is enabled, and
> +      ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook will be
> +      called now, to determine, using logic very similar to DSA, whether 
> deferral
> +      for RX timestamping is necessary.  Again like DSA, it becomes the
> +      responsibility of the PHY driver to send the packet up the stack when 
> the
> +      timestamp is available.
> +
> +    - On TX, special intervention from the MAC driver might or might not be
> +      needed. The function that calls the ``mii_ts->txtstamp()`` hook is 
> named
> +      ``skb_clone_tx_timestamp()``. This function can either be called 
> directly
> +      (case in which explicit MAC driver support is indeed needed), but the
> +      function also piggybacks from the ``skb_tx_timestamp()`` call, which 
> many
> +      MAC drivers already perform for software timestamping purposes. 
> Therefore,
> +      if a MAC supports software timestamping, it does not need to do 
> anything
> +      further at this stage.
> +
> +
> +- MII bus snooping devices.

new paragraph.

     These perform the same role as timestamping
> +  Ethernet PHYs, save for the fact that they are discrete devices and can
> +  therefore be used in conjunction with any PHY even if it doesn't support
> +  timestamping. In Linux, they are discoverable and attachable to a ``struct
> +  phy_device`` through Device Tree, and for the rest, they use the same 
> mii_ts
> +  infrastructure as those. See
> +  Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
> +
> +One caveat with stacked PHCs, especially with DSA (but not only) - since that
> +doesn't require any modification to MAC drivers, so it is more difficult to
> +ensure correctness of all possible code paths - is that they uncover bugs 
> which
> +were impossible to trigger before the existence of stacked PTP clocks.
> +One example has to do with this line of code, already presented earlier::
> +
> +      skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +
> +Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a 
> PHY
> +driver or a MII bus snooping device driver, should set this flag.
> +But a MAC driver that is unaware of PHC stacking might get tripped up by
> +somebody other than itself setting this flag, and deliver a duplicate
> +timestamp.
> +For example, a typical driver design for TX timestamping might be to split 
> the
> +transmission part into 2 portions:
> +
> +1. "TX": checks whether PTP timestamping has been previously enabled through
> +   the ``.ndo_do_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
> +   current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
> +   SKBTX_HW_TSTAMP``"). If this is true, it sets the
> +   "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
> +   described above, in the case of a stacked PHC system, this condition 
> should
> +   never trigger, as this MAC is certainly not the outermost PHC. But this is
> +   not where the typical issue is.  Transmission proceeds with this packet.
> +
> +2. "TX confirmation": Transmission has finished. The driver checks whether it
> +   is necessary to collect any TX timestamp for it. Here is where the typical
> +   issues are: the MAC driver takes a shortcut and only checks whether
> +   "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a 
> stacked
> +   PHC system, this is incorrect because this MAC driver is not the only 
> entity
> +   in the TX data path who could have enabled SKBTX_IN_PROGRESS in the first
> +   place.
> +
> +The correct solution for this problem is for MAC drivers to have a compound
> +check in their "TX confirmation" portion, not only for
> +"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
> +"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system 
> ensures
> +that PTP timestamping is not enabled for anything other than the outermost 
> PHC,
> +this enhanced check will avoid delivering a duplicated TX timestamp to user
> +space.
> -- 
> 2.25.1
> 

Reply via email to