Hi Saurabh, That may not be a very easy answer, but, a better way would be, I believe: headRoom -> encapBytes copySize -> headersSize copySize + headRoom -> (new variable) newMdlSize
It would also be very useful to have documentation comments on the parameters, e.g. /* * headersSize: the number of bytes from the beginning of the packet buffer to be copied to a new MDL. * The original NBL will be advanced by headersSize, after which the NBL will be cloned, causing * the cloned NBL's NB to have DataOffset = 0. This will make a Retreat operation on the cloned NBL * to allocate a new MDL. This allows us to modify the packet buffer headers on the clone later on, * while preserving the original NBL unmodified. * encapBytes: additional bytes, which will be used for prepending the encapsulation headers (e.g. VXLAN + ipv4 + eth) */ /* * newMdlSize: the size of the new MDL that will be allocated for the cloned NBL. It takes into account the size of the * packet headers (if some header field needs to be modified), and the additional bytes required for encapsulation. */ This is a suggestion. As for the documentary comments suggestion, please feel free to alter it in any way you think it would make code clearer. Thanks, Sam ________________________________________ From: Saurabh Shah [ssaur...@vmware.com] Sent: Tuesday, August 26, 2014 1:30 AM To: Samuel Ghinet Cc: dev@openvswitch.org Subject: Re: Cloning packets for "action: set field" Hi Samuel, >Hello Saurabh, > >Thanks for the clarifications! >The param name "copySize" was very ambiguous. What name do you think will make it easy to understand? I can take this in to account when I post a patch for - https://github.com/openvswitch/ovs-issues/issues/28. > >I did not debug this exact scenario, but you must be right: >- Cloning an original NB that has DataOffset > 0 causes the clone to have >DataOffset = 0 >- NdisRetreatNetBufferDataStart doc says >" >DataOffsetDelta [in] > > The amount of used data space to add. NDIS adjusts the DataOffset >member of the NET_BUFFER structure accordingly. If there is not enough >unused data space to satisfy the request, NDIS allocates additional >memory. >" >Therefore, a retreat on the clone must allocate a new MDL. The >NdisCopyFromNetBufferToNetBuffer, then, makes sense. > >And how do you do when you need to, say, set dest ipv4 addr, then set src >ipv4 addr, then out to ports, for the same packet? Do you make two >clones, one for each set action? Yes, we would clone from the first clone. > >And if you need to: >clone to clone1, set ipv4 dest addr1, out to port 1; >then >clone to clone2, set ipv4 dest addr2, out to port 2; >clone2 will be a clone of clone1? Yes. > >Thanks! >Sam >________________________________________ >From: Saurabh Shah [ssaur...@vmware.com] >Sent: Friday, August 15, 2014 8:03 AM >To: Samuel Ghinet >Cc: dev@openvswitch.org >Subject: Re: Cloning packets for "action: set field" > >Hi Samuel, > >Thanks for your patience. I believe all of your concerns are arising due >to a fundamental misunderstanding of the NDIS apis & the buffer management >code. I thought it would useful if I walk you through the >OvsPartialCopyNBL logic by giving an example. > >Let's say there is a call to OvsPartialCopyNBL(), with copySize = 50. This >is presumably the size of the header that you want to be able to modify >it. > >Say, the original NBL looks like following: > >Original NBL >------------ > |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL] > |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0, >MappedSystemVa = 0x1000, Next = NULL] > > >So, here is how the partial copy logic works - > > >01. Inside partial copy API, first we advance the original NBL by copySize >(50 bytes). > >After this operation the original NBL looks like following: > >Original NBL >------------ > |-> NB1 [DataLength = 1450, CurrentMdlOffset = 50, Next = NULL] > |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0, >MappedSystemVa = 0x1000, Next = NULL] > >Note: CurrentMdlOffset & DataLength get modified. > >02. Allocate a clone from the Original NBL. > >The original NBL is unchanged & the cloned NBL looks like following: > > > >Clone NBL >--------- > |-> NB1 [DataLength = 1450, CurrentMdlOffset = 0, Next = NULL] > |-> MdlChain --> MDL2 [ByteCount = 1450, ByteOffset = 50, >MappedSystemVa >= 0x1000, Next = NULL] > >Note: The cloned NBL has a new MDL shell (i.e MDL2), but is pointing to >the same data buffer VA:0x1000. The ByteOffset is appropriately adjusted >in MDL2 (by 50 bytes). > >03. The original NBL gets retreated by copySize. > >With this operation, the original NBL becomes *exactly* like how the >caller had given it. > >Original NBL >------------ > |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL] > |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0, >MappedSystemVa = 0x1000, Next = NULL] > > >04. The cloned NBL is retreated by copySize bytes (assume headRoom = 0). > >The clone NBL will look like following: > >Clone NBL >--------- > |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL] > |-> MdlChain --> MDL3 [ByteCount = 50, ByteOffset = 0, >MappedSystemVa = >0x2000, Next = MDL2] > |--> MDL2 [ByteCount = 1450, ByteOffset = 50, >MappedSystemVa = >0x1000, Next = NULL] > > > >Note: A new memory descriptor (MDL3) is added in front of the chain and >the new descriptor points to a completely new data buffer, VA:0x2000. > >05. We copy ŒcopySize¹ bytes from original NBL to the head of the cloned >NBL. For this we use the ndis call, NdisCopyFromNetBufferToNetBuffer. NDIS >api takes care that 'copySize¹ bytes get copied correctly. This means that >even if the bytes where split in multiple MDLs they will correctly be >copied to a contiguous new data buffer of the cloned NBL. > >Note: At the end of this step, the cloned NBL is guaranteed to have >ŒcopySize¹ amount of data in a contiguous data buffer. And the data buffer >is solely owned by the cloned NBL. > >06. Allocate NBL context for the cloned NBL, set appropriate flags in the >NBL context & take a ref on the parent NBL context. > >To summarize, when partial copy returns, the caller gets a new NBL with >copySize amount of data in a contiguous *writeable* buffer. > > >Completion Path: >---------------- > >For most operations that require you to modify the packet, we would have >done a partial copy at some point. After manipulating the partially copied >NBL, the packet gets sent to the appropriate port. Completion is triggered >when NDIS calls our 'FilterSendNetBufferListsComplete' callback. The >complete callback will go ahead and complete the NBL it receives from >NDIS. This will trigger a chain of NBL free's, till it reaches the root >NBL. The root NBL is the original NBL that the extension received in the >ŒFilterSendNetBufferLists¹ callback. When the refcount of the root NBL >reaches 0, we propagate the completion by calling >NdisFSendNetBufferListsComplete for all those root NBLs whose refcount >have become 0. > > >In theory this is how things are supposed to work. If you still think that >this approach has holes, then let me know and we can do a separate IRC >session just to discuss this. :) > >Thanks, >Saurabh > > >From: Samuel Ghinet <sghi...@cloudbasesolutions.com> >Date: Thursday, August 7, 2014 at 10:45 AM >To: Saurabh Shah <ssaur...@vmware.com> >Cc: "dev@openvswitch.org" <dev@openvswitch.org> >Subject: RE: Cloning packets for "action: set field" > > >>"it sometimes happens for packets to come in FilterSentNetBufferLists >>callback: NBL1, NBL2, NBL3 and NBL4, and only after NBL4 was finished >>being sent >>via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to >>be called." >> >>I mean, to enter the FilterSentNetBufferLists callback 4 times, each time >>for a NBL. And only after the 4th was done, the >>FilterSentNetBufferListsComplete callback to start being called. >> >>This may mean that it is possible that: >>1) cloning NBL1 into NBL2 >>2) call NdisFSendNetBufferLists on NBL1 >>3) modify NBL2 >>may still corrupt the buffer of NBL1. >> >>Reading from MSDN: >>"As soon as a filter driver calls the NdisFSendNetBufferLists function, >>it relinquishes ownership of the NET_BUFFER_LIST structures and all >>associated resources. " >> >>"Until NDIS calls FilterSendNetBufferListsComplete, the current status of >>the send request is not available to the filter driver. A filter driver >>temporarily releases ownership of all resources that are associated with >>a send request when it calls NdisFSendNetBufferLists. A filter driver >>should never try to examine the NET_BUFFER_LIST structures or any >>associated data after calling NdisFSendNetBufferLists." >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en- >>u >>s/library/windows/hardware/ff562616%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOo >>M >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O >>0 >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=891f0324d2d10174698 >>9 >>9e64476a53d80d5825ba5053c86f396bfdee07663515) >>________________________________________ >>From: Samuel Ghinet >>Sent: Thursday, August 07, 2014 8:39 PM >>To: Saurabh Shah >>Cc: dev@openvswitch.org >>Subject: RE: Cloning packets for "action: set field" >> >>Hello Saurabh, >> >>[QUOTE]For the set case, we actually send out the packet to all the >>existing output ports before going ahead with the set. This is what the >>function OvsOutputBeforeSetAction does. So, at any given point we only >>have one NBL to deal with.[/QUOTE] >>This may still not solve the problem: when the NdisFSendNetBufferLists >>function is called, the packet is being forwarded to the underlying >>drivers, and thus may spend some time until it is actually sent out. I >>believe this is an asynchronous operation >>(meaning that you just request a "send packet to other drivers" and may >>move forward before all is finished). I say this because I remember in my >>tests, it sometimes happens for packets to come in >>FilterSentNetBufferLists callback: NBL1, NBL2, NBL3 and NBL4, and only >>after NBL4 was finished being sent >>via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to >>be called. >> >>[QUOTE]We actually can copy parts of the header (even if they are in a >>single MDL), in fact our code has the infrastructure to do this. Please >>take a look at OvsPartialCopyNBL.[/QUOTE] >>I believe you missunderstand me: I was speaking about how >>NdisAllocateCloneNetBufferList and NdisRetreatNetBufferDataStart work. >>Both of them are being used by OvsPartialCopyNBL and the problem is not >>solved. >>The point with NdisAllocateCloneNetBufferList: >>"The clone NET_BUFFER_LIST structure describes the same data that is >>described by the NET_BUFFER_LIST structure at OriginalNetBufferList. NDIS >>does not copy the data that is described by the original MDLs to new data >>buffers. Instead, the cloned structures reference the original data >>buffers. " >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en- >>u >>s/library/windows/hardware/ff560705%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOo >>M >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O >>0 >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=f85e9ea4310586763be >>f >>feb77684b3e99a81a8f4a0cb5eadfb4298ee124d7d4f) >>This means that by writing to the clone buffer you actually write to the >>original buffer. The buffer is REFERENCED, not a new one allocated. >> >>The point with NdisRetreatNetBufferDataStart: >>"If there isn't enough unused data space, this function allocates a new >>buffer and an MDL to describe the new buffer and chains the new MDL to >>the beginning of the MDL chain. " >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en- >>u >>s/library/windows/hardware/ff564527%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOo >>M >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O >>0 >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=6cd96cceddd38b61d06 >>4 >>fdc3c602d2b0e4e4d8c0f4629ff45f16361d29cf8418) >>This means that if you have an NB with data offset = 40, you retreat by >>40 and write 40 bytes from the "new beginning" of the buffer, you >>actually write to the same MDL. >> >>[QUOTE]The Buffer Management subsystem is one of the major difference >>between both the drivers. Since doing a deep copy of all packets is >>pretty sub-optimal, we (Guolin, in particular) designed the sophisticated >>buffer management system. Yes, it is intricate at first, but we already >>have it! (And for a while now. :))[/QUOTE] >>I particularly find it very intricate. Perhaps it should be refactored a >>bit and documented a bit more. >>For instance, functions such as OvsPartialCopyNBL and >>OvsPartialCopyToMultipleNBLs and OvsCopySinglePacketNBL and >>OvsFullCopyNBL and OvsFullCopyToMultipleNBLs and OvsCompleteNBL I find >>very daunting. >>Perhaps some comments for several parameters used and some comments in >>the function body would help as well. >>What does "Partial copy" actually mean? I understand full copy should >>mean something like "deep copy" / duplicate, but partial copy means >>create new NBL and new NB, and copy src buffer -> dst buffer ( = src >>buffer, because it's the same buffer)?? >> >>IMPORTANT: >>I have just made a test with your functionality from OvsPartialCopyNBL, >>and checked the cloned NBL against original NBL: >>original NBL: 0xffffe000`02e74b20 >>cloned NBL: 0xffffe000`02a53ac0 (ok) >> >>NBL's FirstNetBuffer >>original FirstNetBuffer: 0xffffe000`02e74c80; DataOffset = 0x26; >>DataLength = 0x5c >>cloned FirstNetBuffer: 0xffffe000`02fe4890; DataOffset = 0; DataLength = >>0x5c (ok) >> >>CurrentMdl of the FirstNetBuffer >>original CurrentMdl: 0xffffe000`02e74da0; ByteCount = 0x50; (used space >>starting after 0x26 bytes, see above); Next MDL count bytes = 0x32; (0x50 >>- 0x26) + 0x32 = 0x5c >>cloned CurrentMdl: 0xffffe000`02e74dc6; ByteCount = 0x2a; Next MDL count >>bytes = 0x32; total size from MDLs = 0x5c (NOT OK -- see below) >> >>If you study CurrentMdl of both, you will see: >>original: is spanning from ...74da0 -> ...74DF0 (i.e. + 0x50), but >>excluding the unused area: (offset = 0x26): 74DC6 -> 74DF0 >>cloned: is spanning from ...74DC6 -> ...74DF0 (i.e. + 0x32). >>BOTH MDLS MAP THE SAME MEMORY AREA! So writing to cloneNb also writes to >>originalNb! >> >>Did you study the layout of NBLs, NBs and MDLs and buffers in your memory >>management functions? >> >>Sam >> >>________________________________________ >>From: Saurabh Shah [ssaur...@vmware.com] >>Sent: Wednesday, August 06, 2014 12:50 AM >>To: Samuel Ghinet >>Cc: dev@openvswitch.org >>Subject: RE: Cloning packets for "action: set field" >> >>Hi Samuel, >> >>> -----Original Message----- >>> From: Samuel Ghinet [mailto:sghi...@cloudbasesolutions.com] >>> Sent: Tuesday, August 05, 2014 5:00 AM >>> To: Saurabh Shah >>> Cc: dev@openvswitch.org >>> Subject: RE: Cloning packets for "action: set field" >>> >>> Hello Saurabh, >>> >>> I believe there are more complexities associated to this packet >>>cloning. >>> Consider for instance the following flow: >>> if (conditions on packet) => >>> out_to_port: vm1; >>> set ip_dest: X1; out_to_port: vm2 >>> set ip_dest: X2; out_to_port: vm3 >>> >>> Given that multiple destinations are set, and only then the packet is >>> outputted, then, given the issue with cloning, all packets (that will >>>be >>> outputted to vm1, vm2, vm3) will have the same, last destination ip. >>> >> >>For the set case, we actually send out the packet to all the existing >>output ports before going ahead with the set. This is what the function >>OvsOutputBeforeSetAction does. So, at any given point we only have one >>NBL to deal with. >> >>> I believe we cannot actually copy only the headers: if the packet >>>contains a >>> single MDL, there is nothing else we can do but to duplicate the >>>packet. >>> For the case where the NB has multiple MDLs chained, we must copy the >>>first; >>> MDL(s), i.e. the MDL or MDLs that contain the headers that must be >>>copied. >>> >> >>We actually can copy parts of the header (even if they are in a single >>MDL), in fact our code has the infrastructure to do this. Please take a >>look at OvsPartialCopyNBL. >> >>> There is also a problem with NBL completion (say we do the MDL copy and >>> replacement): >>> for the case above, consider: >>> out_to_port: vm1; (NB1) >>> set ip_dest: X1; out_to_port: vm2 (clone to NB2, create new MDL, store >>>ptr >>> to original MDL) set ip_dest: X2; out_to_port: vm3(clone to NB3, create >>>new >>> MDL, store ptr to original MDL) >>> >>> At completion, we may need to do this: >>> NB3: Put back old MDL into original NB, free clone NB3 >>> NB2: Put back old MDL (but already done before), and free NB2 >>> NB1: it's the original, so nothing special. >>> >>> In order to implement this "but already done before" case, we need to >>>use >>> ref counting for NBL context data (which I have seen you do have some >>>use >>> of ref counting) However, consider the complexity of restoring an >>>original NB >>> when multiple MDLs have been replaced, by multiple clones. The big >>> problem here is that the whole MDL link would be broken. >>> >> >>The Buffer Management subsystem is one of the major difference between >>both the drivers. Since doing a deep copy of all packets is pretty >>sub-optimal, we (Guolin, in particular) designed the sophisticated buffer >>management system. Yes, it is intricate at first, but we already have it! >>(And for a while now. :)) So, all that you have mentioned here has >>already been implemented. Please take a look at our Buffer Management >>code & let me know if something is still unclear. >> >>> I strongly believe that building this component with the mind set >>>"optimize >>> first, fix bugs later" is not at all a good way to do it. Plus, it >>>would make the >>> code more and more intrincate, as we continue to both find new problems >>> and we get to have large pieces of code (dealing with NBLs) spread all >>>over. >>> I personally believe a better approach would be "start with something >>> simpler, which is fully functional even though suboptimal, and then >>>continue >>> by building on top of it, improving incrementally for each of various >>>cases". >>> >> >>Quite the contrary. The whole point of implementing the buffer management >>code was to layout the right infrastructure to help simplify NBL >>management for all the consumers & avoid doing a major enhancement at a >>later point. I can go off on a rant explaining our rationales in detail, >>but to allay your concerns, let me assure you that we don't have a >>mindset of optimize-over-bugfix. Once the Netlink integration is >>complete, I expect all of us to work towards closing all critical bugs at >>a priority. >> >>> Thanks! >>> Sam >> >>HTH, >>Saurabh _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev