Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

Stephen Hemminger Tue, 03 Apr 2018 16:17:50 -0700

On Tue,  3 Apr 2018 17:55:25 -0400
Jon Rosen <jro...@cisco.com> wrote:


> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> casues the ring to get corrupted by allowing multiple kernel threads
> to claim ownership of the same ring entry, Mark the ring entry as
> already being used within the spin_lock to prevent other kernel
> threads from reusing the same entry before it's fully filled in,
> passed to user space, and then eventually passed back to the kernel
> for use with a new packet.
> 
> Note that the proposed change may modify the semantics of the
> interface between kernel space and user space in a way which may cause
> some applications to no longer work properly. More discussion on this
> change can be found in the additional comments section titled
> "3. Discussion on packet_mmap ownership semantics:".
> 
> Signed-off-by: Jon Rosen <jro...@cisco.com>
> ---
> 
> Additional Comments Section
> ---------------------------
> 
> 1. Description of the diffs:
> ----------------------------
> 
>    TPACKET_V1 and TPACKET_V2 format rings:
>    -------------------------------------------
>    Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
>    prevent other kernel threads from re-using the same entry.
> 
>    This is necessary because there may be a delay from the time the
>    spin_lock is released to the time that the packet is completed and
>    the corresponding ring entry is marked as owned by user space.  If
>    during this time other kernel threads enqueue more packets to the
>    ring than the size of the ring then it will cause mutliple kernel
>    threads to operate on the same entry at the same time, corrupting
>    packets and the ring state.
> 
>    By marking the entry as allocated (IN_PROGRESS) we prevent other
>    kernel threads from incorrectly re-using an entry that is still in
>    the progress of being filled in before it is passed to user space.
> 
>    This forces each entry through the following states:
> 
>    +-> 1. (tp_status == TP_STATUS_KERNEL)
>    |      Free: For use by any kernel thread to store a new packet
>    |
>    |   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
>    |      Allocated: In use by a *specific* kernel thread
>    |
>    |   3. (tp_status & TP_STATUS_USER)
>    |      Available: Packet available for user space to process
>    |
>    +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL
> 
> 
>    No impact on TPACKET_V3 format rings:
>    -------------------------------------
>    Packet entry ownership is already protected from other kernel
>    threads potentially re-using the same entry. This is done inside
>    packet_current_rx_frame() where storage is allocated for the
>    current packet. Since this is done within the spin_lock no
>    additional status updates for this entry are required.
> 
> 
>    Defining TP_STATUS_IN_PROGRESS:
>    -------------------------------
>    Rather than defining a new-bit we re-use an existing bit for this
>    intermediate state.  Status will eventually be overwritten with the
>    actual true status when passed to user space.  Any bit used to pass
>    information to user space other than the one that passes ownership
>    is suitable (can't use TP_STATUS_USER).  Alternatively a new bit
>    could be defined.
> 
> 
> 2. More detailed discussion:
> ----------------------------
>    Ring entries basically have 2 states, owned by the kernel or owned by
>    user space. For single producer/single consumer this works fine. For
>    multiple producers there is a window between the call to spin_unlock
>    [F] and the call to __packet_set_status [J] where if there are enough
>    packets added to the ring by other kernel threads then the ring can
>    wrap and multiple threads will end up using the same ring entries.
> 
>    This occurs because the ring entry alocated at [C] did not modify the
>    state of the entry so it continues to appear as owned by the kernel
>    and available for use for new packets even though it has already been
>    allocated.
> 
>    A simple fix is to temporarily mark the ring entries within the spin
>    lock such that user space will still think it?s owned by the kernel
>    and other kernel threads will not see it as available to be used for
>    new packets. If a kernel thread gets delayed between [F] and [J] for
>    an extended period of time and the ring wraps back to the same point
>    then subsiquent kernel threads attempts to allocate will fail and be
>    treated as the ring being full.
> 
>    The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
>    to prevent other kernel threads from re-using the same entry. Note that
>    any existing bit other than TP_STATUS_USER could have been used.
> 
>    af_packet.c:tpacket_rcv()
>       ... code removed for brevity ...
> 
>       // Acquire spin lock
> A:    spin_lock(&sk->sk_receive_queue.lock);
> 
>             // Preemption is disabled
> 
>             // Get current ring entry
> B:          h.raw = packet_current_rx_frame(
>                         po, skb, TP_STATUS_KERNEL, (macoff+snaplen));
> 
>             // Get out if ring is full
>             // Code not show but it will also release lock
>             if (!h.raw) goto ring_is_full;
> 
>             // Allocate ring entry
> C:          packet_increment_rx_head(po, &po->rx_ring);
> 
>             // Protect against allocation overrun (the simple fix)
>             // by changing TP_STATUS_KERNEL to TP_STATUS_IN_PROGRESS
> D:          if (po->tp_version <= TPACKET_V2) 
>                __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> 
>             // Update counter
> E:          po->stats.stats1.tp_packets++;
> 
>       // Release spin lock
> F:    spin_unlock(&sk->sk_receive_queue.lock);
> 
>       // Copy packet payload
> G:    skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
> 
>       // Get current timestamp
> H:    if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp))) 
>          getnstimeofday(&ts);
>       status |= ts_status;
> 
>       // Fill in header portions of ring entry (removed a bunch of
>       // writes for brevity)
>       ...
>       h.h1->tp_sec = ts.tv_sec;
>       h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
>       sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
>       ...
> 
>       // Memory Barrier to make sure all data is written before
>       // passing ownership to user space
> I:    smp_mb();
> 
>       // Update Status, passing ownership to user space
> J:    __packet_set_status(po, h.raw, status | TP_STATUS_USER);
> 
> 
> 3. Discussion on packet_mmap ownership semantics:
> -------------------------------------------------
>    One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
>    is that the documentation of the tp_status field is somewhat
>    inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
>    meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
>    meaning the entry is owned by user space.  In other places ownership
>    by user space is defined by the TP_STATUS_USER(1) bit being set.
> 
>    Relevant section of packet_mmap documentation from
>    https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
>    follow but a summary of the semantics in question are described as:
> 
>    - At the beginning of each frame there is an status field
>    - If this field is 0 means that the frame is ready to be
>      used for the kernel
>    - If not, there is a frame the user can read
> 
>    This would indicate that 0 means owned by the kernel and non-0 is
>    owned by the user.  The simple fix above is to set the status to
>    something that neither the kernel nor the user thinks they own.  That
>    means that 0 vs. non-0 would be insufficient.
> 
>    The description and examples in packet_mmap.txt actually talk about
>    using a specific bit, the TP_STATUS_USER bit, to indicate something is
>    owned by the kernel vs something is owned by the user.  The relevant
>    references from packet_mmap.txt to this bit are:
> 
>    - The kernel initializes all frames to TP_STATUS_KERNEL(0)
>    - when the kernel receives a packet it puts in the buffer and
>      updates the status with at least the TP_STATUS_USER(1) flag
> 
>    This description contradicts the previous description which implied
>    that non-0 means owned by the user. In the above statement it implies
>    that there needs to be more than just non-0 and that the
>    TP_STATUS_USER bit must be set for it to be owned by user space.
> 
>    If the above proposed fix of using a new TP_STATUS_IN_PROGRESS bit (or
>    any existing bit other than TP_STATUS_USER) were to be used and an
>    application were written assuming !TP_STATUS_KERNEL implies owned by
>    user space then the application would incorrectly assume there was a
>    valid packet entry to be processed when the entry is not ready.  If an
>    application were instead written to look specificaly for
>    TP_STATUS_USER then the above proposed fix would work correctly.
> 
>    A more complex solution might be to create a shadow data structure
>    which is private to the kernel and keeps status bits for each ring
>    entry to indicate the intermediate state between owned by kernel and
>    owned by user space.
> 
> 
> 4. Snipet from packet_mmap.txt
> ------------------------------
>    At the beginning of each frame there is an status field (see 
>    struct tpacket_hdr). If this field is 0 means that the frame is ready
>    to be used for the kernel, If not, there is a frame the user can read 
>    and the following flags apply:
> 
>    +++ Capture process:
>    from include/linux/if_packet.h
> 
>    #define TP_STATUS_COPY          (1 << 1)
>    #define TP_STATUS_LOSING        (1 << 2)
>    #define TP_STATUS_CSUMNOTREADY  (1 << 3)
>    #define TP_STATUS_CSUM_VALID    (1 << 7)
> 
>    TP_STATUS_COPY        : This flag indicates that the frame (and associated
>    meta information) has been truncated because it's 
>    larger than tp_frame_size. This packet can be 
>    read entirely with recvfrom().
> 
>    In order to make this work it must to be
>    enabled previously with setsockopt() and 
>    the PACKET_COPY_THRESH option. 
> 
>    The number of frames that can be buffered to
>    be read with recvfrom is limited like a normal socket.
>    See the SO_RCVBUF option in the socket (7) man page.
> 
>    TP_STATUS_LOSING      : indicates there were packet drops from last time 
>    statistics where checked with getsockopt() and
>    the PACKET_STATISTICS option.
> 
>    TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which 
>    its checksum will be done in hardware. So while
>    reading the packet we should not try to check the 
>    checksum. 
> 
>    TP_STATUS_CSUM_VALID  : This flag indicates that at least the transport
>    header checksum of the packet has been already
>    validated on the kernel side. If the flag is not set
>    then we are free to check the checksum by ourselves
>    provided that TP_STATUS_CSUMNOTREADY is also not set.
> 
>    for convenience there are also the following defines:
> 
>    #define TP_STATUS_KERNEL        0
>    #define TP_STATUS_USER          1
> 
>    The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
>    receives a packet it puts in the buffer and updates the status with
>    at least the TP_STATUS_USER flag. Then the user can read the packet,
>    once the packet is read the user must zero the status field, so the kernel 
>    can use again that frame buffer.
> 
>    The user can use poll (any other variant should apply too) to check if new
>    packets are in the ring:
> 
>    struct pollfd pfd;
> 
>    pfd.fd = fd;
>    pfd.revents = 0;
>    pfd.events = POLLIN|POLLRDNORM|POLLERR;
> 
>    if (status == TP_STATUS_KERNEL)
>    retval = poll(&pfd, 1, timeout);
> 
>    It doesn't incur in a race condition to first check the status value and 
>    then poll for frames.
> 
> 
> 5. Snipet from man pages for packet(7):
> ---------------------------------------
> 
>    See portion marked with "***" for reference to use of TP_STATUS_USER bit.
> 
>        PACKET_RX_RING
> 
>               Create a memory-mapped ring buffer for asynchronous
>               packet reception.  The packet socket reserves a
>               contiguous region of application address space, lays it
>               out into an array of packet slots and copies packets (up
>               to tp_snaplen) into subsequent slots.  Each packet is
>               preceded by a metadata structure similar to
>               tpacket_auxdata.  The protocol fields encode the offset
>               to the data from the start of the metadata header.
>               tp_net stores the offset to the network layer.  If the
>               packet socket is of type SOCK_DGRAM, then tp_mac is the
>               same.  If it is of type SOCK_RAW, then that field stores
>               the offset to the link-layer frame.  Packet socket and
>               application communi? cate the head and tail of the ring
>               through the tp_status field.  The packet socket owns all
>               slots with tp_status equal to TP_STATUS_KERNEL.  After
>               filling a slot, it changes the status of the slot to
>        ***    transfer ownership to the application.  During normal
>        ***    operation, the new tp_status value has at least the
>        ***    TP_STATUS_USER bit set to signal that a received packet
>        ***    has been stored.  When the application has finished
>               processing a packet, it transfers ownership of the slot
>               back to the socket by setting tp_status equal to
>               TP_STATUS_KERNEL.
> 
>               Packet sockets implement multiple variants of the packet
>               ring.  The implementation details are described in
>               Documentation/networking/packet_mmap.txt in the Linux
>               kernel source tree.
> 
> 
> 6. Relevant files:
> ------------------
>   net/packet/af_packet.c
>   include/uapi/linux/if_packet.h
>   Documentation/networking/packet_mmap.txt
> 
> Jon Rosen (1):
>   packet: mark ring entry as in-use inside spin_lock to prevent RX ring
>     overrun
> 
>  net/packet/af_packet.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index e0f3f4a..264d7b2 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> net_device *dev,
>               if (po->stats.stats1.tp_drops)
>                       status |= TP_STATUS_LOSING;
>       }
> +
> +        /*
> +         * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> +         * kernel threads from re-using this same entry.
> +         */
> +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> +     if (po->tp_version <= TPACKET_V2)
> +            __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> +
>       po->stats.stats1.tp_packets++;
>       if (copy_skb) {
>               status |= TP_STATUS_COPY;

This patch looks correct. Please resend it with proper signed-off-by
and with a kernel code indenting style (tabs).  Is this bug present
since the beginning of af_packet and multiqueue devices or did it get
introduced in some previous kernel?

Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

Reply via email to