On Tue, 3 Apr 2018 17:55:25 -0400 Jon Rosen <jro...@cisco.com> wrote:
> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which > casues the ring to get corrupted by allowing multiple kernel threads > to claim ownership of the same ring entry, Mark the ring entry as > already being used within the spin_lock to prevent other kernel > threads from reusing the same entry before it's fully filled in, > passed to user space, and then eventually passed back to the kernel > for use with a new packet. > > Note that the proposed change may modify the semantics of the > interface between kernel space and user space in a way which may cause > some applications to no longer work properly. More discussion on this > change can be found in the additional comments section titled > "3. Discussion on packet_mmap ownership semantics:". > > Signed-off-by: Jon Rosen <jro...@cisco.com> > --- > > Additional Comments Section > --------------------------- > > 1. Description of the diffs: > ---------------------------- > > TPACKET_V1 and TPACKET_V2 format rings: > ------------------------------------------- > Mark each entry as TP_STATUS_IN_PROGRESS after allocating to > prevent other kernel threads from re-using the same entry. > > This is necessary because there may be a delay from the time the > spin_lock is released to the time that the packet is completed and > the corresponding ring entry is marked as owned by user space. If > during this time other kernel threads enqueue more packets to the > ring than the size of the ring then it will cause mutliple kernel > threads to operate on the same entry at the same time, corrupting > packets and the ring state. > > By marking the entry as allocated (IN_PROGRESS) we prevent other > kernel threads from incorrectly re-using an entry that is still in > the progress of being filled in before it is passed to user space. > > This forces each entry through the following states: > > +-> 1. (tp_status == TP_STATUS_KERNEL) > | Free: For use by any kernel thread to store a new packet > | > | 2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER) > | Allocated: In use by a *specific* kernel thread > | > | 3. (tp_status & TP_STATUS_USER) > | Available: Packet available for user space to process > | > +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL > > > No impact on TPACKET_V3 format rings: > ------------------------------------- > Packet entry ownership is already protected from other kernel > threads potentially re-using the same entry. This is done inside > packet_current_rx_frame() where storage is allocated for the > current packet. Since this is done within the spin_lock no > additional status updates for this entry are required. > > > Defining TP_STATUS_IN_PROGRESS: > ------------------------------- > Rather than defining a new-bit we re-use an existing bit for this > intermediate state. Status will eventually be overwritten with the > actual true status when passed to user space. Any bit used to pass > information to user space other than the one that passes ownership > is suitable (can't use TP_STATUS_USER). Alternatively a new bit > could be defined. > > > 2. More detailed discussion: > ---------------------------- > Ring entries basically have 2 states, owned by the kernel or owned by > user space. For single producer/single consumer this works fine. For > multiple producers there is a window between the call to spin_unlock > [F] and the call to __packet_set_status [J] where if there are enough > packets added to the ring by other kernel threads then the ring can > wrap and multiple threads will end up using the same ring entries. > > This occurs because the ring entry alocated at [C] did not modify the > state of the entry so it continues to appear as owned by the kernel > and available for use for new packets even though it has already been > allocated. > > A simple fix is to temporarily mark the ring entries within the spin > lock such that user space will still think it?s owned by the kernel > and other kernel threads will not see it as available to be used for > new packets. If a kernel thread gets delayed between [F] and [J] for > an extended period of time and the ring wraps back to the same point > then subsiquent kernel threads attempts to allocate will fail and be > treated as the ring being full. > > The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit > to prevent other kernel threads from re-using the same entry. Note that > any existing bit other than TP_STATUS_USER could have been used. > > af_packet.c:tpacket_rcv() > ... code removed for brevity ... > > // Acquire spin lock > A: spin_lock(&sk->sk_receive_queue.lock); > > // Preemption is disabled > > // Get current ring entry > B: h.raw = packet_current_rx_frame( > po, skb, TP_STATUS_KERNEL, (macoff+snaplen)); > > // Get out if ring is full > // Code not show but it will also release lock > if (!h.raw) goto ring_is_full; > > // Allocate ring entry > C: packet_increment_rx_head(po, &po->rx_ring); > > // Protect against allocation overrun (the simple fix) > // by changing TP_STATUS_KERNEL to TP_STATUS_IN_PROGRESS > D: if (po->tp_version <= TPACKET_V2) > __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); > > // Update counter > E: po->stats.stats1.tp_packets++; > > // Release spin lock > F: spin_unlock(&sk->sk_receive_queue.lock); > > // Copy packet payload > G: skb_copy_bits(skb, 0, h.raw + macoff, snaplen); > > // Get current timestamp > H: if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp))) > getnstimeofday(&ts); > status |= ts_status; > > // Fill in header portions of ring entry (removed a bunch of > // writes for brevity) > ... > h.h1->tp_sec = ts.tv_sec; > h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC; > sll->sll_halen = dev_parse_header(skb, sll->sll_addr); > ... > > // Memory Barrier to make sure all data is written before > // passing ownership to user space > I: smp_mb(); > > // Update Status, passing ownership to user space > J: __packet_set_status(po, h.raw, status | TP_STATUS_USER); > > > 3. Discussion on packet_mmap ownership semantics: > ------------------------------------------------- > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS > is that the documentation of the tp_status field is somewhat > inconsistent. In some places it's described as TP_STATUS_KERNEL(0) > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) > meaning the entry is owned by user space. In other places ownership > by user space is defined by the TP_STATUS_USER(1) bit being set. > > Relevant section of packet_mmap documentation from > https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt > follow but a summary of the semantics in question are described as: > > - At the beginning of each frame there is an status field > - If this field is 0 means that the frame is ready to be > used for the kernel > - If not, there is a frame the user can read > > This would indicate that 0 means owned by the kernel and non-0 is > owned by the user. The simple fix above is to set the status to > something that neither the kernel nor the user thinks they own. That > means that 0 vs. non-0 would be insufficient. > > The description and examples in packet_mmap.txt actually talk about > using a specific bit, the TP_STATUS_USER bit, to indicate something is > owned by the kernel vs something is owned by the user. The relevant > references from packet_mmap.txt to this bit are: > > - The kernel initializes all frames to TP_STATUS_KERNEL(0) > - when the kernel receives a packet it puts in the buffer and > updates the status with at least the TP_STATUS_USER(1) flag > > This description contradicts the previous description which implied > that non-0 means owned by the user. In the above statement it implies > that there needs to be more than just non-0 and that the > TP_STATUS_USER bit must be set for it to be owned by user space. > > If the above proposed fix of using a new TP_STATUS_IN_PROGRESS bit (or > any existing bit other than TP_STATUS_USER) were to be used and an > application were written assuming !TP_STATUS_KERNEL implies owned by > user space then the application would incorrectly assume there was a > valid packet entry to be processed when the entry is not ready. If an > application were instead written to look specificaly for > TP_STATUS_USER then the above proposed fix would work correctly. > > A more complex solution might be to create a shadow data structure > which is private to the kernel and keeps status bits for each ring > entry to indicate the intermediate state between owned by kernel and > owned by user space. > > > 4. Snipet from packet_mmap.txt > ------------------------------ > At the beginning of each frame there is an status field (see > struct tpacket_hdr). If this field is 0 means that the frame is ready > to be used for the kernel, If not, there is a frame the user can read > and the following flags apply: > > +++ Capture process: > from include/linux/if_packet.h > > #define TP_STATUS_COPY (1 << 1) > #define TP_STATUS_LOSING (1 << 2) > #define TP_STATUS_CSUMNOTREADY (1 << 3) > #define TP_STATUS_CSUM_VALID (1 << 7) > > TP_STATUS_COPY : This flag indicates that the frame (and associated > meta information) has been truncated because it's > larger than tp_frame_size. This packet can be > read entirely with recvfrom(). > > In order to make this work it must to be > enabled previously with setsockopt() and > the PACKET_COPY_THRESH option. > > The number of frames that can be buffered to > be read with recvfrom is limited like a normal socket. > See the SO_RCVBUF option in the socket (7) man page. > > TP_STATUS_LOSING : indicates there were packet drops from last time > statistics where checked with getsockopt() and > the PACKET_STATISTICS option. > > TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which > its checksum will be done in hardware. So while > reading the packet we should not try to check the > checksum. > > TP_STATUS_CSUM_VALID : This flag indicates that at least the transport > header checksum of the packet has been already > validated on the kernel side. If the flag is not set > then we are free to check the checksum by ourselves > provided that TP_STATUS_CSUMNOTREADY is also not set. > > for convenience there are also the following defines: > > #define TP_STATUS_KERNEL 0 > #define TP_STATUS_USER 1 > > The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel > receives a packet it puts in the buffer and updates the status with > at least the TP_STATUS_USER flag. Then the user can read the packet, > once the packet is read the user must zero the status field, so the kernel > can use again that frame buffer. > > The user can use poll (any other variant should apply too) to check if new > packets are in the ring: > > struct pollfd pfd; > > pfd.fd = fd; > pfd.revents = 0; > pfd.events = POLLIN|POLLRDNORM|POLLERR; > > if (status == TP_STATUS_KERNEL) > retval = poll(&pfd, 1, timeout); > > It doesn't incur in a race condition to first check the status value and > then poll for frames. > > > 5. Snipet from man pages for packet(7): > --------------------------------------- > > See portion marked with "***" for reference to use of TP_STATUS_USER bit. > > PACKET_RX_RING > > Create a memory-mapped ring buffer for asynchronous > packet reception. The packet socket reserves a > contiguous region of application address space, lays it > out into an array of packet slots and copies packets (up > to tp_snaplen) into subsequent slots. Each packet is > preceded by a metadata structure similar to > tpacket_auxdata. The protocol fields encode the offset > to the data from the start of the metadata header. > tp_net stores the offset to the network layer. If the > packet socket is of type SOCK_DGRAM, then tp_mac is the > same. If it is of type SOCK_RAW, then that field stores > the offset to the link-layer frame. Packet socket and > application communi? cate the head and tail of the ring > through the tp_status field. The packet socket owns all > slots with tp_status equal to TP_STATUS_KERNEL. After > filling a slot, it changes the status of the slot to > *** transfer ownership to the application. During normal > *** operation, the new tp_status value has at least the > *** TP_STATUS_USER bit set to signal that a received packet > *** has been stored. When the application has finished > processing a packet, it transfers ownership of the slot > back to the socket by setting tp_status equal to > TP_STATUS_KERNEL. > > Packet sockets implement multiple variants of the packet > ring. The implementation details are described in > Documentation/networking/packet_mmap.txt in the Linux > kernel source tree. > > > 6. Relevant files: > ------------------ > net/packet/af_packet.c > include/uapi/linux/if_packet.h > Documentation/networking/packet_mmap.txt > > Jon Rosen (1): > packet: mark ring entry as in-use inside spin_lock to prevent RX ring > overrun > > net/packet/af_packet.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index e0f3f4a..264d7b2 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct > net_device *dev, > if (po->stats.stats1.tp_drops) > status |= TP_STATUS_LOSING; > } > + > + /* > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other > + * kernel threads from re-using this same entry. > + */ > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING > + if (po->tp_version <= TPACKET_V2) > + __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); > + > po->stats.stats1.tp_packets++; > if (copy_skb) { > status |= TP_STATUS_COPY; This patch looks correct. Please resend it with proper signed-off-by and with a kernel code indenting style (tabs). Is this bug present since the beginning of af_packet and multiqueue devices or did it get introduced in some previous kernel?