On Wed, Apr 11, 2018 at 05:02:50PM +0300, Andrew Rybchenko wrote:
> On 04/11/2018 02:39 PM, Ananyev, Konstantin wrote:
> > Hi Yongseok,
> >
> > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > Hi Yongseok,
> > > > > >
> > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > When attaching an mbuf, an indirect mbuf has to point to the start
> > > > > > > > > of the direct mbuf's buffer. By adding a buf_off field to rte_mbuf,
> > > > > > > > > this becomes more flexible. An indirect mbuf can point to any part
> > > > > > > > > of a direct mbuf by calling rte_pktmbuf_attach_at().
> > > > > > > > >
> > > > > > > > > Possible use-cases could be:
> > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Yongseok Koh <ys...@mellanox.com>
> > > > > > > >
> > > > > > > > I think the current API is already able to do what you want.
> > > > > > > >
> > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > >
> > > > > > > >                off
> > > > > > > >                <-->
> > > > > > > >                       len
> > > > > > > >          +----+   <---------->
> > > > > > > >          |    |
> > > > > > > >        +-|----v----------------------+
> > > > > > > >        | |    -----------------------|
> > > > > > > > m      | buf  |    XXXXXXXXXXX      ||
> > > > > > > >        |      -----------------------|
> > > > > > > >        +-----------------------------+
> > > > > > > >
> > > > > > > > 2/ clone m:
> > > > > > > >
> > > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > > >
> > > > > > > >   Note that c has its own offset and length fields.
> > > > > > > >
> > > > > > > >                off
> > > > > > > >                <-->
> > > > > > > >                       len
> > > > > > > >          +----+   <---------->
> > > > > > > >          |    |
> > > > > > > >        +-|----v----------------------+
> > > > > > > >        | |    -----------------------|
> > > > > > > > m      | buf  |    XXXXXXXXXXX      ||
> > > > > > > >        |      -----------------------|
> > > > > > > >        +------^----------------------+
> > > > > > > >               |
> > > > > > > >          +----+
> > > > > > > > indirect |
> > > > > > > >        +-|---------------------------+
> > > > > > > >        | |    -----------------------|
> > > > > > > > c      | buf  |                     ||
> > > > > > > >        |      -----------------------|
> > > > > > > >        +-----------------------------+
> > > > > > > >
> > > > > > > >               off    len
> > > > > > > >               <--><---------->
> > > > > > > >
> > > > > > > > 3/ remove some data from c without changing m
> > > > > > > >
> > > > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > >
> > > > > > > > Please let me know if it fits your needs.
> > > > > > >
> > > > > > > No, it doesn't.
> > > > > > >
> > > > > > > Trimming head and tail with the current APIs removes data and makes the
> > > > > > > space available. Adjusting the packet head means giving more headroom,
> > > > > > > not shifting the buffer itself.
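To make the flow quoted above concrete, a minimal sketch against the current
API (the helper name is mine, not DPDK's, and error handling is pared down;
it only assumes an ordinary pktmbuf pool):

    #include <rte_mbuf.h>

    /* Clone m into an indirect mbuf and shrink the clone's view. */
    static struct rte_mbuf *
    clone_and_shrink(struct rte_mbuf *m, struct rte_mempool *pool)
    {
        struct rte_mbuf *c = rte_pktmbuf_alloc(pool);

        if (c == NULL)
            return NULL;

        /* c becomes indirect: it references m's buffer and bumps the
         * direct mbuf's refcnt, but keeps its own data_off/data_len. */
        rte_pktmbuf_attach(c, m);

        /* Shrink c's view without touching m or its data. */
        rte_pktmbuf_adj(c, 10);     /* drop 10B at head */
        rte_pktmbuf_trim(c, 10);    /* drop 10B at tail */

        return c;
    }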
> > > > > > > If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > > different offsets in m,
> > > > > > >
> > > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > > >
> > > > > > > then the owner of c2 regards the first (off+20)B as available headroom.
> > > > > > > If it wants to attach an outer header, it will overwrite the headroom
> > > > > > > even though the owner of c1 is still accessing it. Instead, another
> > > > > > > mbuf (h1) for the outer header should be linked by h1->next = c2.
> > > > > >
> > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > prepend headers, another mbuf has to be inserted before, as you suggest.
> > > > > > It is possible to wrap this in a function rte_pktmbuf_clone_area(m,
> > > > > > offset, length) that will:
> > > > > > - alloc and attach an indirect mbuf for each segment of m that is
> > > > > >   in the range [offset : length+offset].
> > > > > > - prepend an empty and writable mbuf for the headers
> > > > > >
> > > > > > > If c1 and c2 are attached with shifting the buffer address by adjusting
> > > > > > > buf_off, which actually shrinks the headroom, this case can be properly
> > > > > > > handled.
> > > > > >
> > > > > > What do you mean by properly handled?
> > > > > >
> > > > > > Yes, prepending data or adding data in the indirect mbuf won't overwrite
> > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > won't be protected.
> > > > > >
> > > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > have refcnt != 1, should both be considered read-only because they
> > > > > > may share their data. How can an application know if the data is shared
> > > > > > or not?
> > > > > >
> > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > >
> > > > > Agree that an indirect mbuf must be treated as read-only. Then the current
> > > > > code is enough to handle that use-case.
> > > > >
> > > > > > > And another use-case (this is my actual use-case) is to make a large
> > > > > > > mbuf have multiple packets in series. AFAIK, this will also be helpful
> > > > > > > for some FPGA NICs because it transfers multiple packets to a single
> > > > > > > large buffer to reduce PCIe overhead for small-packet traffic, like the
> > > > > > > Multi-Packet Rx of mlx5 does. Otherwise, packets should be memcpy'd to
> > > > > > > regular mbufs one by one instead of using indirect referencing.
> > > >
> > > > But just to make HW RX multiple packets into one mbuf,
> > > > data_off inside the indirect mbuf should be enough, correct?
> > >
> > > Right. The current max buffer len of an mbuf is 64kB (16 bits) but it is
> > > enough for mlx5 to reach 100Gbps with 64B traffic (149Mpps). I made mlx5 HW
> > > put 16 packets in a buffer, so it needs a ~32kB buffer. Having more bits in
> > > the length fields would be better, but 16-bit is good enough to overcome the
> > > PCIe Gen3 bottleneck in order to saturate the network link.
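For reference, the h1->next = c2 pattern quoted earlier in this mail can be
written with the existing chaining helpers roughly as below; the helper name
and its error handling are mine, not from the patch:

    #include <string.h>
    #include <rte_mbuf.h>

    /* c2 is a read-only indirect clone; the new outer header goes into
     * its own writable mbuf chained in front of it. */
    static struct rte_mbuf *
    prepend_outer_header(struct rte_mbuf *c2, struct rte_mempool *pool,
                         const void *hdr, uint16_t hdr_len)
    {
        struct rte_mbuf *h1 = rte_pktmbuf_alloc(pool);
        char *dst;

        if (h1 == NULL)
            return NULL;

        dst = rte_pktmbuf_append(h1, hdr_len);  /* room in the fresh mbuf */
        if (dst == NULL || rte_pktmbuf_chain(h1, c2) != 0) {
            rte_pktmbuf_free(h1);   /* frees h1 only; c2 was not linked */
            return NULL;
        }
        memcpy(dst, hdr, hdr_len);  /* write header into writable h1 */

        return h1;
    }

rte_pktmbuf_chain() updates h1's nb_segs and pkt_len, so the result is a
normal multi-segment packet whose first segment is writable and whose second
segment shares m's data without touching it.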
> >
> > There were a few complaints that the 64KB max is a limitation for some
> > use-cases. I am not against increasing it, but I don't think we have free
> > space on the first cache-line for that without another big rework of the
> > mbuf layout, considering that we'd need to increase the size of buf_len,
> > data_off, data_len, and probably priv_size too.
> >
> > > > As I understand, what you'd like to achieve with this new field is the
> > > > ability to manipulate packet boundaries after RX, probably at an upper
> > > > layer. As Olivier pointed out above, that doesn't sound like a safe
> > > > approach - as you have multiple indirect mbufs trying to modify the same
> > > > direct buffer.
> > >
> > > I agree that there's an implication that an indirect mbuf, or an mbuf
> > > having refcnt > 1, is read-only. What that means is that all the entities
> > > which own such mbufs have to be aware of that and keep the principle, as
> > > DPDK can't enforce the rule and there can't be such a sanity check. In
> > > this sense, HW doesn't violate it because the direct mbuf is injected to
> > > HW before indirection. When packets are written by HW, the PMD attaches
> > > indirect mbufs to the direct mbuf and delivers those to the application
> > > layer, freeing the original direct mbuf (decrementing its refcnt by 1).
> > > So, HW doesn't touch the direct buffer once it reaches the upper layer.
> >
> > Yes, I understand that. But as I can see, you introduced functions to
> > adjust head and tail, which implies that it should be possible for some
> > entity (upper layer?) to manipulate these indirect mbufs.
> > And we don't know how exactly it will be done.
> >
> > > The direct buffer will be freed and become available for reuse when all
> > > the attached indirect mbufs are freed.
> > >
> > > > Though if you really need to do that, why can't it be achieved by
> > > > updating the buf_len and priv_size fields of indirect mbufs, straight
> > > > after attach()?
> > >
> > > Good point.
> > > Actually that was my draft (Mellanox internal) version of this patch :-)
> > > But I had to consider a case where priv_size is really given by the user.
> > > Even though it is less likely, if the original priv_size is quite big, it
> > > can't cover the entire buf_len. For this, I had to increase priv_size to
> > > 32-bit, but adding another 16-bit field (buf_off) looked more plausible.
> >
> > As I remember, we can't have mbufs bigger than 64K,
> > so priv_size + buf_len should always be less than 64K, correct?
>
> It sounds like it is suggested to use/customize priv_size to limit the
> indirect mbuf's range in the direct one. That does not work out of the box,
> since priv_size is used to find the direct mbuf from the indirect one (see
> rte_mbuf_from_indirect()).
?? That's exactly why he suggested to use priv_size...

Thanks,
Yongseok
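PS: a sketch of what the priv_size suggestion above boils down to, for anyone
following along. This is a hypothetical helper, not an existing or proposed
API; it pokes mbuf fields directly and leaves data_len/pkt_len to the caller:

    #include <rte_mbuf.h>

    /* Shift an indirect mbuf's view "off" bytes into the direct buffer,
     * right after attach(). Growing priv_size by the same amount keeps
     * rte_mbuf_from_indirect() correct, since it computes
     *   direct = mi->buf_addr - (sizeof(struct rte_mbuf) + mi->priv_size)
     * and both terms move by "off". */
    static void
    attach_at(struct rte_mbuf *mi, struct rte_mbuf *m, uint16_t off)
    {
        rte_pktmbuf_attach(mi, m);

        mi->buf_addr = (char *)mi->buf_addr + off;
        mi->buf_iova += off;        /* keep the IOVA view in sync */
        mi->buf_len -= off;
        mi->priv_size += off;       /* back-pointer to m stays intact */
        mi->data_off = 0;           /* data view starts at new buf_addr */
    }

And priv_size + buf_len staying under 64K, as Konstantin notes, is what
guarantees that priv_size + off still fits in 16 bits.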