> -----Original Message-----
> From: Morten Brørup [mailto:m...@smartsharesystems.com]
> Sent: Thursday, October 10, 2019 8:30 AM
> To: Stephen Hemminger <step...@networkplumber.org>
> Cc: Ananyev, Konstantin <konstantin.anan...@intel.com>; dpdk-dev
> <dev@dpdk.org>; Jerin Jacob <jer...@marvell.com>
> Subject: RE: [dpdk-dev] packet data access bug in bpf and pdump libs
>
> > -----Original Message-----
> > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Stephen Hemminger
> > Sent: Wednesday, October 9, 2019 7:25 PM
> >
> > On Wed, 9 Oct 2019 17:20:58 +0200
> > Morten Brørup <m...@smartsharesystems.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > > Sent: Wednesday, October 9, 2019 5:15 PM
> > > >
> > > > On Wed, 9 Oct 2019 17:06:24 +0200
> > > > Morten Brørup <m...@smartsharesystems.com> wrote:
> > > >
> > > > > > -----Original Message-----
> > > > > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > > > > Sent: Wednesday, October 9, 2019 5:02 PM
> > > > > >
> > > > > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > > > > "Ananyev, Konstantin" <konstantin.anan...@intel.com> wrote:
> > > > > >
> > > > > > > Hi Morten,
> > > > > > >
> > > > > > > > Hi Konstantin and Stephen,
> > > > > > > >
> > > > > > > > I just noticed the same bug in your bpf and pcap libraries:
> > > > > > > >
> > > > > > > > You are using rte_pktmbuf_mtod(), but should be using
> > > > > > > > rte_pktmbuf_read(). Otherwise you cannot read data across
> > > > > > > > multiple segments.
> > > > > > >
> > > > > > > In plain data buffer mode expected input for BPF program is
> > > > > > > start of first segment packet data.
> > > > > > > Other segments are simply not available to BPF program in
> > > > > > > that mode.
> > > > > > > AFAIK, cBPF uses the same model.
> > > > > > >
> > > > > > > >
> > > > > > > > Med venlig hilsen / kind regards
> > > > > > > > - Morten Brørup
> > > > > > > >
> > > > > >
> > > > > > For packet capture, the BPF program is only allowed to look at
> > > > > > first segment.
> > > > > > pktmbuf_read is expensive and can cause a copy.
> > > > >
> > > > > It is only expensive if going beyond the first segment:
> > > > >
> > > > > static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
> > > > > 	uint32_t off, uint32_t len, void *buf)
> > > > > {
> > > > > 	if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > > > > 		return rte_pktmbuf_mtod_offset(m, char *, off);
> > > > > 	else
> > > > > 		return __rte_pktmbuf_read(m, off, len, buf);
> > > > > }
> > > >
> > > > But it would mean potentially big buffer on the stack (in case)
> > >
> > > No, the buffer only needs to be the size of the accessed data. I use
> > > it like this:
> > >
> > > char buffer[sizeof(uint32_t)];
> > >
> > > for (;; pc++) {
> > >     switch (pc->code) {
> > >     case BPF_LD_ABS_32:
> > >         p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t), buffer);
> > >         if (unlikely(p == NULL)) return 0; /* Attempting to read beyond packet. Bail out. */
> > >         a = rte_be_to_cpu_32(*(const uint32_t *)p);
> > >         continue;
> > >     case BPF_LD_ABS_16:
> > >         p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t), buffer);
> > >         if (unlikely(p == NULL)) return 0; /* Attempting to read beyond packet. Bail out. */
> > >         a = rte_be_to_cpu_16(*(const uint16_t *)p);
> > >         continue;
> > >
> >
> > Reading down the chain of mbuf segments to find a uint32_t (and that
> > potentially crosses) seems like a waste.
+1.
Again, just imagine how painful it would be to support that in JIT...

Another thing: in librte_bpf, if RTE_BPF_ARG_PTR is used, it means that the
input parameter for exec/jit is just a plain data buffer, and the library
doesn't make any assumptions about it (mbuf or not, etc.).
There is a possibility to access mbuf metadata and call rte_pktmbuf_read()
or any other function from your eBPF code, but for that you need to specify
RTE_BPF_ARG_PTR_MBUF at load stage.

>
> Slow and painful is the only way to read beyond the first segment, I agree.
> But when reading from the first segment, rte_pktmbuf_read() basically does
> the same as your code. So there shouldn't be any performance penalty from
> supporting both by using rte_pktmbuf_read() instead.
>
> I think the modification in the pdump library is simple, as you already pass
> the mbuf. But the bpf library requires more work, as it passes a pointer to
> the data in the first segment to the processing function instead of passing
> the mbuf.
>
> > The purpose of the filter is to look at packet headers.
>
> Some might look deeper. So why prevent it? E.g. our StraighShaper appliance
> sometimes looks deeper, but for performance reasons we stopped using BPF for
> this a long time ago.
>
> > Any driver making mbufs that are dripples of data is broken.
>
> I agree very much with you on this regarding general-purpose NICs! Although
> I know of an exception that confirms the rule... a few years ago we worked
> with a component vendor with some very clever performance optimizations
> doing exactly this for specific purposes. Unfortunately it's under NDA, so I
> can't go into details.
>
> > chaining is really meant for case of jumbo or tso.
>
> Try thinking beyond PMD ingress. There are multiple use cases in egress.
> Here are a couple:
>
> - IP Multicast to multiple subnets on a Layer 3 switch. The VLAN ID and
>   Source MAC must be replaced in each packet; this can be done using
>   segments.
> - Tunnel encapsulation. E.g. putting a packet into a VXLAN tunnel could be
>   done using segments.
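A minimal sketch of the load-stage distinction described in the reply above,
assuming the rte_bpf API as of DPDK 19.x; the file name "filter.o", the
section name ".text" and the wrapper functions are placeholders, not taken
from the thread:

#include <rte_bpf.h>
#include <rte_mbuf.h>

/*
 * Illustration only: load an eBPF program whose argument is the mbuf itself
 * (RTE_BPF_ARG_PTR_MBUF) rather than a plain data buffer (RTE_BPF_ARG_PTR).
 * rte_bpf_elf_load() requires librte_bpf to be built with ELF support.
 */
static struct rte_bpf *
load_mbuf_filter(void)
{
	struct rte_bpf_prm prm = {
		.prog_arg = {
			.type = RTE_BPF_ARG_PTR_MBUF,	/* program receives the mbuf */
			.size = sizeof(struct rte_mbuf),
			.buf_size = RTE_MBUF_DEFAULT_BUF_SIZE,
		},
		/* .xsym / .nb_xsym would describe any helper functions
		 * (e.g. rte_pktmbuf_read) the eBPF code is allowed to call. */
	};

	return rte_bpf_elf_load(&prm, "filter.o", ".text");
}

/* Executing the filter against a (possibly multi-segment) packet: */
static uint64_t
run_filter(const struct rte_bpf *bpf, struct rte_mbuf *pkt)
{
	return rte_bpf_exec(bpf, pkt);
}

With RTE_BPF_ARG_PTR instead, the context passed to rte_bpf_exec() would be
rte_pktmbuf_mtod(pkt, void *), i.e. only the first segment's data, which is
the limitation discussed above.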