"Let me think a little more about this. The functionality is meant to be generic, so I don't like calling the variable 'encapBytes', since it becomes a specific usage of the API."
I know, but I think "encapBytes" is the only usage we could ever have for the project. We could also use something like "additionalBytes", but it might be too vague. Sam ________________________________________ From: Saurabh Shah [ssaur...@vmware.com] Sent: Tuesday, August 26, 2014 10:29 PM To: Samuel Ghinet Cc: dev@openvswitch.org Subject: RE: Cloning packets for "action: set field" Hi Samuel, > -----Original Message----- > From: Samuel Ghinet [mailto:sghi...@cloudbasesolutions.com] > Sent: Tuesday, August 26, 2014 10:40 AM > To: Saurabh Shah > Cc: dev@openvswitch.org > Subject: RE: Cloning packets for "action: set field" > > Hi Saurabh, > > That may not be a very easy answer, but, a better way would be, I believe: > headRoom -> encapBytes > copySize -> headersSize > copySize + headRoom -> (new variable) newMdlSize > Let me think a little more about this. The functionality is meant to be generic, so I don't like calling the variable 'encapBytes', since it becomes a specific usage of the API. > It would also be very useful to have documentation comments on the > parameters, e.g. > /* > * headersSize: the number of bytes from the beginning of the packet > buffer to be copied to a new MDL. > * The original NBL will be advanced by headersSize, > after which > the NBL will be cloned, causing > * the cloned NBL's NB to have DataOffset = 0. This will > make a > Retreat operation on the cloned NBL > * to allocate a new MDL. This allows us to modify the > packet > buffer headers on the clone later on, > * while preserving the original NBL unmodified. > * encapBytes: additional bytes, which will be used for prepending the > encapsulation headers (e.g. VXLAN + ipv4 + eth) > */ > > /* > * newMdlSize: the size of the new MDL that will be allocated for the cloned > NBL. It takes into account the size of the > * packet headers (if some header field needs to be > modified), and > the additional bytes required for encapsulation. > */ > I definitely plan to document each parameter, return value and any other assumption that the API may have. > This is a suggestion. > As for the documentary comments suggestion, please feel free to alter it in > any way you think it would make code clearer. > Thanks for the suggestions, it helps to get a different prespective. I will send a patch soon. -- Saurabh > Thanks, > Sam > > ________________________________________ > From: Saurabh Shah [ssaur...@vmware.com] > Sent: Tuesday, August 26, 2014 1:30 AM > To: Samuel Ghinet > Cc: dev@openvswitch.org > Subject: Re: Cloning packets for "action: set field" > > Hi Samuel, > > >Hello Saurabh, > > > >Thanks for the clarifications! > >The param name "copySize" was very ambiguous. > > What name do you think will make it easy to understand? I can take this in to > account when I post a patch for - > https://urldefense.proofpoint.com/v1/url?u=https://github.com/openvswit > ch/ovs- > issues/issues/28&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=pEkjsHfy > tvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=v6onB5k72z1unZ > NkjO1VAZRXD%2Fkz98AovnuDeYkye6A%3D%0A&s=9a95e049cbdfedbaabc91 > 364ac73e0d558e18c32e79bfcd2f1cb1eb8aaf71894. > > > > >I did not debug this exact scenario, but you must be right: > >- Cloning an original NB that has DataOffset > 0 causes the clone to > >have DataOffset = 0 > >- NdisRetreatNetBufferDataStart doc says " > >DataOffsetDelta [in] > > > > The amount of used data space to add. NDIS adjusts the DataOffset > >member of the NET_BUFFER structure accordingly. If there is not enough > >unused data space to satisfy the request, NDIS allocates additional > >memory. > >" > >Therefore, a retreat on the clone must allocate a new MDL. The > >NdisCopyFromNetBufferToNetBuffer, then, makes sense. > > > >And how do you do when you need to, say, set dest ipv4 addr, then set > >src > >ipv4 addr, then out to ports, for the same packet? Do you make two > >clones, one for each set action? > > Yes, we would clone from the first clone. > > > > >And if you need to: > >clone to clone1, set ipv4 dest addr1, out to port 1; then clone to > >clone2, set ipv4 dest addr2, out to port 2; > >clone2 will be a clone of clone1? > > Yes. > > > > >Thanks! > >Sam > >________________________________________ > >From: Saurabh Shah [ssaur...@vmware.com] > >Sent: Friday, August 15, 2014 8:03 AM > >To: Samuel Ghinet > >Cc: dev@openvswitch.org > >Subject: Re: Cloning packets for "action: set field" > > > >Hi Samuel, > > > >Thanks for your patience. I believe all of your concerns are arising > >due to a fundamental misunderstanding of the NDIS apis & the buffer > >management code. I thought it would useful if I walk you through the > >OvsPartialCopyNBL logic by giving an example. > > > >Let's say there is a call to OvsPartialCopyNBL(), with copySize = 50. > >This is presumably the size of the header that you want to be able to > >modify it. > > > >Say, the original NBL looks like following: > > > >Original NBL > >------------ > > |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL] > > |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0, > >MappedSystemVa = 0x1000, Next = NULL] > > > > > >So, here is how the partial copy logic works - > > > > > >01. Inside partial copy API, first we advance the original NBL by > >copySize > >(50 bytes). > > > >After this operation the original NBL looks like following: > > > >Original NBL > >------------ > > |-> NB1 [DataLength = 1450, CurrentMdlOffset = 50, Next = NULL] > > |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0, > >MappedSystemVa = 0x1000, Next = NULL] > > > >Note: CurrentMdlOffset & DataLength get modified. > > > >02. Allocate a clone from the Original NBL. > > > >The original NBL is unchanged & the cloned NBL looks like following: > > > > > > > >Clone NBL > >--------- > > |-> NB1 [DataLength = 1450, CurrentMdlOffset = 0, Next = NULL] > > |-> MdlChain --> MDL2 [ByteCount = 1450, ByteOffset = 50, > >MappedSystemVa = 0x1000, Next = NULL] > > > >Note: The cloned NBL has a new MDL shell (i.e MDL2), but is pointing to > >the same data buffer VA:0x1000. The ByteOffset is appropriately > >adjusted in MDL2 (by 50 bytes). > > > >03. The original NBL gets retreated by copySize. > > > >With this operation, the original NBL becomes *exactly* like how the > >caller had given it. > > > >Original NBL > >------------ > > |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL] > > |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0, > >MappedSystemVa = 0x1000, Next = NULL] > > > > > >04. The cloned NBL is retreated by copySize bytes (assume headRoom = 0). > > > >The clone NBL will look like following: > > > >Clone NBL > >--------- > > |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL] > > |-> MdlChain --> MDL3 [ByteCount = 50, ByteOffset = 0, > >MappedSystemVa = 0x2000, Next = MDL2] > > |--> MDL2 [ByteCount = 1450, ByteOffset = > >50, MappedSystemVa = 0x1000, Next = NULL] > > > > > > > >Note: A new memory descriptor (MDL3) is added in front of the chain and > >the new descriptor points to a completely new data buffer, VA:0x2000. > > > >05. We copy ŒcopySize¹ bytes from original NBL to the head of the > >cloned NBL. For this we use the ndis call, > >NdisCopyFromNetBufferToNetBuffer. NDIS api takes care that 'copySize¹ > >bytes get copied correctly. This means that even if the bytes where > >split in multiple MDLs they will correctly be copied to a contiguous new data > buffer of the cloned NBL. > > > >Note: At the end of this step, the cloned NBL is guaranteed to have > >ŒcopySize¹ amount of data in a contiguous data buffer. And the data > >buffer is solely owned by the cloned NBL. > > > >06. Allocate NBL context for the cloned NBL, set appropriate flags in > >the NBL context & take a ref on the parent NBL context. > > > >To summarize, when partial copy returns, the caller gets a new NBL with > >copySize amount of data in a contiguous *writeable* buffer. > > > > > >Completion Path: > >---------------- > > > >For most operations that require you to modify the packet, we would > >have done a partial copy at some point. After manipulating the > >partially copied NBL, the packet gets sent to the appropriate port. > >Completion is triggered when NDIS calls our > >'FilterSendNetBufferListsComplete' callback. The complete callback will > >go ahead and complete the NBL it receives from NDIS. This will trigger > >a chain of NBL free's, till it reaches the root NBL. The root NBL is > >the original NBL that the extension received in the > >ŒFilterSendNetBufferLists¹ callback. When the refcount of the root NBL > >reaches 0, we propagate the completion by calling > >NdisFSendNetBufferListsComplete for all those root NBLs whose refcount > have become 0. > > > > > >In theory this is how things are supposed to work. If you still think > >that this approach has holes, then let me know and we can do a separate > >IRC session just to discuss this. :) > > > >Thanks, > >Saurabh > > > > > >From: Samuel Ghinet <sghi...@cloudbasesolutions.com> > >Date: Thursday, August 7, 2014 at 10:45 AM > >To: Saurabh Shah <ssaur...@vmware.com> > >Cc: "dev@openvswitch.org" <dev@openvswitch.org> > >Subject: RE: Cloning packets for "action: set field" > > > > > >>"it sometimes happens for packets to come in FilterSentNetBufferLists > >>callback: NBL1, NBL2, NBL3 and NBL4, and only after NBL4 was finished > >>being sent via NdisFSendNetBufferLists, > >>FilterSentNetBufferListsComplete starts to be called." > >> > >>I mean, to enter the FilterSentNetBufferLists callback 4 times, each > >>time for a NBL. And only after the 4th was done, the > >>FilterSentNetBufferListsComplete callback to start being called. > >> > >>This may mean that it is possible that: > >>1) cloning NBL1 into NBL2 > >>2) call NdisFSendNetBufferLists on NBL1 > >>3) modify NBL2 > >>may still corrupt the buffer of NBL1. > >> > >>Reading from MSDN: > >>"As soon as a filter driver calls the NdisFSendNetBufferLists > >>function, it relinquishes ownership of the NET_BUFFER_LIST structures > >>and all associated resources. " > >> > >>"Until NDIS calls FilterSendNetBufferListsComplete, the current status > >>of the send request is not available to the filter driver. A filter > >>driver temporarily releases ownership of all resources that are > >>associated with a send request when it calls NdisFSendNetBufferLists. > >>A filter driver should never try to examine the NET_BUFFER_LIST > >>structures or any associated data after calling NdisFSendNetBufferLists." > >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com > / > >>en- > >>u > >>s/library/windows/hardware/ff562616%28v%3Dvs.85%29.aspx&k=oIvRg1 > %2BdGA > >>gOo > >>M > >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNh > CUc0E%3D%0A& > >>m=O > >>0 > >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=891f03 > 24d2d10174 > >>698 > >>9 > >>9e64476a53d80d5825ba5053c86f396bfdee07663515) > >>________________________________________ > >>From: Samuel Ghinet > >>Sent: Thursday, August 07, 2014 8:39 PM > >>To: Saurabh Shah > >>Cc: dev@openvswitch.org > >>Subject: RE: Cloning packets for "action: set field" > >> > >>Hello Saurabh, > >> > >>[QUOTE]For the set case, we actually send out the packet to all the > >>existing output ports before going ahead with the set. This is what > >>the function OvsOutputBeforeSetAction does. So, at any given point we > >>only have one NBL to deal with.[/QUOTE] This may still not solve the > >>problem: when the NdisFSendNetBufferLists function is called, the > >>packet is being forwarded to the underlying drivers, and thus may > >>spend some time until it is actually sent out. I believe this is an > >>asynchronous operation (meaning that you just request a "send packet > >>to other drivers" and may move forward before all is finished). I say > >>this because I remember in my tests, it sometimes happens for packets > >>to come in FilterSentNetBufferLists callback: NBL1, NBL2, NBL3 and > >>NBL4, and only after NBL4 was finished being sent via > >>NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to be > >>called. > >> > >>[QUOTE]We actually can copy parts of the header (even if they are in a > >>single MDL), in fact our code has the infrastructure to do this. > >>Please take a look at OvsPartialCopyNBL.[/QUOTE] I believe you > >>missunderstand me: I was speaking about how > >>NdisAllocateCloneNetBufferList and NdisRetreatNetBufferDataStart work. > >>Both of them are being used by OvsPartialCopyNBL and the problem is > >>not solved. > >>The point with NdisAllocateCloneNetBufferList: > >>"The clone NET_BUFFER_LIST structure describes the same data that is > >>described by the NET_BUFFER_LIST structure at OriginalNetBufferList. > >>NDIS does not copy the data that is described by the original MDLs to > >>new data buffers. Instead, the cloned structures reference the > >>original data buffers. " > >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com > / > >>en- > >>u > >>s/library/windows/hardware/ff560705%28v%3Dvs.85%29.aspx&k=oIvRg1 > %2BdGA > >>gOo > >>M > >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNh > CUc0E%3D%0A& > >>m=O > >>0 > >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=f85e9e > a431058676 > >>3be > >>f > >>feb77684b3e99a81a8f4a0cb5eadfb4298ee124d7d4f) > >>This means that by writing to the clone buffer you actually write to > >>the original buffer. The buffer is REFERENCED, not a new one allocated. > >> > >>The point with NdisRetreatNetBufferDataStart: > >>"If there isn't enough unused data space, this function allocates a > >>new buffer and an MDL to describe the new buffer and chains the new > >>MDL to the beginning of the MDL chain. " > >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com > / > >>en- > >>u > >>s/library/windows/hardware/ff564527%28v%3Dvs.85%29.aspx&k=oIvRg1 > %2BdGA > >>gOo > >>M > >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNh > CUc0E%3D%0A& > >>m=O > >>0 > >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=6cd96c > ceddd38b61 > >>d06 > >>4 > >>fdc3c602d2b0e4e4d8c0f4629ff45f16361d29cf8418) > >>This means that if you have an NB with data offset = 40, you retreat > >>by > >>40 and write 40 bytes from the "new beginning" of the buffer, you > >>actually write to the same MDL. > >> > >>[QUOTE]The Buffer Management subsystem is one of the major > difference > >>between both the drivers. Since doing a deep copy of all packets is > >>pretty sub-optimal, we (Guolin, in particular) designed the > >>sophisticated buffer management system. Yes, it is intricate at first, > >>but we already have it! (And for a while now. :))[/QUOTE] I > >>particularly find it very intricate. Perhaps it should be refactored a > >>bit and documented a bit more. > >>For instance, functions such as OvsPartialCopyNBL and > >>OvsPartialCopyToMultipleNBLs and OvsCopySinglePacketNBL and > >>OvsFullCopyNBL and OvsFullCopyToMultipleNBLs and OvsCompleteNBL I > find > >>very daunting. > >>Perhaps some comments for several parameters used and some > comments in > >>the function body would help as well. > >>What does "Partial copy" actually mean? I understand full copy should > >>mean something like "deep copy" / duplicate, but partial copy means > >>create new NBL and new NB, and copy src buffer -> dst buffer ( = src > >>buffer, because it's the same buffer)?? > >> > >>IMPORTANT: > >>I have just made a test with your functionality from > >>OvsPartialCopyNBL, and checked the cloned NBL against original NBL: > >>original NBL: 0xffffe000`02e74b20 > >>cloned NBL: 0xffffe000`02a53ac0 (ok) > >> > >>NBL's FirstNetBuffer > >>original FirstNetBuffer: 0xffffe000`02e74c80; DataOffset = 0x26; > >>DataLength = 0x5c cloned FirstNetBuffer: 0xffffe000`02fe4890; > >>DataOffset = 0; DataLength = 0x5c (ok) > >> > >>CurrentMdl of the FirstNetBuffer > >>original CurrentMdl: 0xffffe000`02e74da0; ByteCount = 0x50; (used > >>space starting after 0x26 bytes, see above); Next MDL count bytes = > >>0x32; (0x50 > >>- 0x26) + 0x32 = 0x5c > >>cloned CurrentMdl: 0xffffe000`02e74dc6; ByteCount = 0x2a; Next MDL > >>count bytes = 0x32; total size from MDLs = 0x5c (NOT OK -- see below) > >> > >>If you study CurrentMdl of both, you will see: > >>original: is spanning from ...74da0 -> ...74DF0 (i.e. + 0x50), but > >>excluding the unused area: (offset = 0x26): 74DC6 -> 74DF0 > >>cloned: is spanning from ...74DC6 -> ...74DF0 (i.e. + 0x32). > >>BOTH MDLS MAP THE SAME MEMORY AREA! So writing to cloneNb also > writes > >>to originalNb! > >> > >>Did you study the layout of NBLs, NBs and MDLs and buffers in your > >>memory management functions? > >> > >>Sam > >> > >>________________________________________ > >>From: Saurabh Shah [ssaur...@vmware.com] > >>Sent: Wednesday, August 06, 2014 12:50 AM > >>To: Samuel Ghinet > >>Cc: dev@openvswitch.org > >>Subject: RE: Cloning packets for "action: set field" > >> > >>Hi Samuel, > >> > >>> -----Original Message----- > >>> From: Samuel Ghinet [mailto:sghi...@cloudbasesolutions.com] > >>> Sent: Tuesday, August 05, 2014 5:00 AM > >>> To: Saurabh Shah > >>> Cc: dev@openvswitch.org > >>> Subject: RE: Cloning packets for "action: set field" > >>> > >>> Hello Saurabh, > >>> > >>> I believe there are more complexities associated to this packet > >>>cloning. > >>> Consider for instance the following flow: > >>> if (conditions on packet) => > >>> out_to_port: vm1; > >>> set ip_dest: X1; out_to_port: vm2 > >>> set ip_dest: X2; out_to_port: vm3 > >>> > >>> Given that multiple destinations are set, and only then the packet > >>>is outputted, then, given the issue with cloning, all packets (that > >>>will be outputted to vm1, vm2, vm3) will have the same, last > >>>destination ip. > >>> > >> > >>For the set case, we actually send out the packet to all the existing > >>output ports before going ahead with the set. This is what the > >>function OvsOutputBeforeSetAction does. So, at any given point we only > >>have one NBL to deal with. > >> > >>> I believe we cannot actually copy only the headers: if the packet > >>>contains a single MDL, there is nothing else we can do but to > >>>duplicate the packet. > >>> For the case where the NB has multiple MDLs chained, we must copy > >>>the first; MDL(s), i.e. the MDL or MDLs that contain the headers > >>>that must be copied. > >>> > >> > >>We actually can copy parts of the header (even if they are in a single > >>MDL), in fact our code has the infrastructure to do this. Please take > >>a look at OvsPartialCopyNBL. > >> > >>> There is also a problem with NBL completion (say we do the MDL copy > >>>and > >>> replacement): > >>> for the case above, consider: > >>> out_to_port: vm1; (NB1) > >>> set ip_dest: X1; out_to_port: vm2 (clone to NB2, create new MDL, > >>>store ptr to original MDL) set ip_dest: X2; out_to_port: vm3(clone > >>>to NB3, create new MDL, store ptr to original MDL) > >>> > >>> At completion, we may need to do this: > >>> NB3: Put back old MDL into original NB, free clone NB3 > >>> NB2: Put back old MDL (but already done before), and free NB2 > >>> NB1: it's the original, so nothing special. > >>> > >>> In order to implement this "but already done before" case, we need > >>>to use ref counting for NBL context data (which I have seen you do > >>>have some use of ref counting) However, consider the complexity of > >>>restoring an original NB when multiple MDLs have been replaced, by > >>>multiple clones. The big problem here is that the whole MDL link > >>>would be broken. > >>> > >> > >>The Buffer Management subsystem is one of the major difference > between > >>both the drivers. Since doing a deep copy of all packets is pretty > >>sub-optimal, we (Guolin, in particular) designed the sophisticated > >>buffer management system. Yes, it is intricate at first, but we already have > it! > >>(And for a while now. :)) So, all that you have mentioned here has > >>already been implemented. Please take a look at our Buffer Management > >>code & let me know if something is still unclear. > >> > >>> I strongly believe that building this component with the mind set > >>>"optimize first, fix bugs later" is not at all a good way to do it. > >>>Plus, it would make the code more and more intrincate, as we > >>>continue to both find new problems and we get to have large pieces > >>>of code (dealing with NBLs) spread all over. > >>> I personally believe a better approach would be "start with > >>>something simpler, which is fully functional even though suboptimal, > >>>and then continue by building on top of it, improving incrementally > >>>for each of various cases". > >>> > >> > >>Quite the contrary. The whole point of implementing the buffer > >>management code was to layout the right infrastructure to help > >>simplify NBL management for all the consumers & avoid doing a major > >>enhancement at a later point. I can go off on a rant explaining our > >>rationales in detail, but to allay your concerns, let me assure you > >>that we don't have a mindset of optimize-over-bugfix. Once the Netlink > >>integration is complete, I expect all of us to work towards closing > >>all critical bugs at a priority. > >> > >>> Thanks! > >>> Sam > >> > >>HTH, > >>Saurabh _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev