On 6/3/21 12:33 PM, Andrew Rybchenko wrote: > On 6/3/21 12:29 PM, Ilya Maximets wrote: >> On 6/2/21 9:35 PM, Ivan Malov wrote: >>> On 02/06/2021 20:35, Ilya Maximets wrote: >>>> (Dropped Broadcom folks from CC. Mail server refuses to accept their >>>> emails for some reason: "Recipient address rejected: Domain not found." >>>> Please, try to ad them back on reply.) >>>> >>>> On 6/2/21 6:26 PM, Andrew Rybchenko wrote: >>>>> On 6/2/21 3:46 PM, Ilya Maximets wrote: >>>>>> On 6/1/21 4:28 PM, Ivan Malov wrote: >>>>>>> Hi Ilya, >>>>>>> >>>>>>> Thank you for reviewing the proposal at such short notice. I'm afraid >>>>>>> that prior discussions overlook the simple fact that the whole problem >>>>>>> is not limited to just VF representors. Action PORT_ID is also used >>>>>>> with respect to the admin PF's ethdev, which "represents itself" (and >>>>>>> by no means it represents the underlying physical/network port). In >>>>>>> this case, one cannot state that the application treats it as a >>>>>>> physical port, just like one states that the application perceives >>>>>>> representors as VFs themselves. >>>>>> >>>>>> >>>>>> I don't think that it was overlooked. If device is in a switchdev mode >>>>>> than >>>>>> there is a PF representor and VF representors. Application typically >>>>>> works >>>>>> only with representors in this case is it doesn't make much sense to have >>>>>> representor and the upstream port attached to the same application at the >>>>>> same time. Configuration that is applied by application to the >>>>>> representor >>>>>> (PF or VF, it doesn't matter) applies to the corresponding upstream port >>>>>> (actual PF or VF) by default. >>>>> >>>>> PF is not necessarily associated with a network port. It >>>>> could be many PFs and just one network port on NIC. >>>>> Extra PFs are like VFs in this case. These PFs may be >>>>> passed to a VM in a similar way. So, we can have PF >>>>> representors similar to VF representors. I.e. it is >>>>> incorrect to say that PF in the case of switchdev is >>>>> a representor of a network port. >>>>> >>>>> If we prefer to talk in representors terminology, we >>>>> need 4 types of prepresentors: >>>>> - PF representor for PCIe physical function >>>>> - VF representor for PCIe virtual function >>>>> - SF representor for PCIe sub-function (PASID) >>>>> - network port representor >>>>> In fact above is PCIe oriented, but there are >>>>> other buses and ways to deliver traffic to applications. >>>>> Basically representor for any virtual port in virtual >>>>> switch which DPDK app can control using transfer rules. >>>>> >>>>>> Exactly same thing here with PORT_ID action. You have a packet and >>>>>> action >>>>>> to send it to the port, but it's not specified if HW needs to send it to >>>>>> the representor or the upstream port (again, VF or PF, it doesn't >>>>>> matter). >>>>>> Since there is no extra information, HW should send it to the upstream >>>>>> port by default. The same as configuration applies by default to the >>>>>> upstream port. >>>>>> >>>>>> Let's look at some workflow examples: >>>>>> >>>>>> DPDK Application >>>>>> | | >>>>>> | | >>>>>> +--PF-rep------VF-rep---+ >>>>>> | | >>>>>> | NIC (switchdev) | >>>>>> | | >>>>>> +---PF---------VF-------+ >>>>>> | | >>>>>> | | >>>>>> External VM or whatever >>>>>> Network >>>>> >>>>> See above. PF <-> External Network is incorrect above >>>>> since it not always the case. It should be >>>>> "NP <-> External network" and "NP-rep" above (NP - >>>>> network port). Sometimes PF is an NP-rep, but sometimes >>>>> it is not. It is just a question of default rules in >>>>> switchdev on what to do with traffic incoming from >>>>> network port. >>>>> >>>>> A bit more complicated picture is: >>>>> >>>>> +----------------------------------------+ >>>>> | DPDK Application | >>>>> +----+---------+---------+---------+-----+ >>>>> |PF0 |PF1 | | >>>>> | | | | >>>>> +--NP1-rep---NP2-rep---PF2-rep---VF-rep--+ >>>>> | | >>>>> | NIC (switchdev) | >>>>> | | >>>>> +---NP1-------NP2-------PF2--------VF----+ >>>>> | | | | >>>>> | | | | >>>>> External External VM or VM or >>>>> Network 1 Network 2 whatever whatever >>>>> >>>>> So, sometimes PF plays network port representor role (PF0, >>>>> PF1), sometimes it requires representor itself (PF2). >>>>> What to do if PF2 itself is attached to application? >>>>> Can we route traffic to it using PORT_ID action? >>>>> It has DPDK ethdev port. It is one of arguments why >>>>> plain PORT_ID should route DPDK application. >>>> >>>> OK. This is not very different from my understanding. The key >>>> is that there is a pair of interfaces, one is more visible than >>>> the other one. >>>> >>>>> >>>>> Of course, some applications would like to see it as >>>>> (simpler is better): >>>>> >>>>> +----------------------------------------+ >>>>> | DPDK Application | >>>>> | | >>>>> +---PF0-------PF1------PF2-rep---VF-rep--+ >>>>> | | | | >>>>> | | | | >>>>> External External VM or VM or >>>>> Network 1 Network 2 whatever whatever >>>>> >>>>> but some, I believe, require full picture. For examples, >>>>> I'd really like to know how much traffic goes via all 8 >>>>> switchdev ports and running rte_eth_stats_get(0, ...) >>>>> (i.e. DPDK port 0 attached to PF0) I'd like to get >>>>> NP1-rep stats (not NP1 stats). It will match exactly >>>>> what I see in DPDK application. It is an argument why >>>>> plain PORT_ID should be treated as a DPDK ethdev port, >>>>> not a represented (upstream) entity. >>>> >>>> The point is that if application doesn't require full picture, >>>> it should not care. If application requires the full picture, >>>> it could take extra steps by setting extra bits. I don't >>>> understand why we need to force all applications to care about >>>> the full picture if we can avoid that? >>>> >>>>> >>>>>> a. Workflow for "DPDK Application" to set MAC to VF: >>>>>> >>>>>> 1. "DPDK Application" calls rte_set_etheraddr("VF-rep", new_mac); >>>>>> 2. DPDK sets MAC for "VF". >>>>>> >>>>>> b. Workflow for "DPDK Application" to set MAC to PF: >>>>>> >>>>>> 1. "DPDK Application" calls rte_set_etheraddr("PF-rep", new_mac); >>>>>> 2. DPDK sets MAC for "PF". >>>>>> >>>>>> c. Workflow for "DPDK Application" to send packet to the external >>>>>> network: >>>>>> >>>>>> 1. "DPDK Application" calls rte_eth_tx_burst("PF-rep", packet); >>>>>> 2. NIC receives the packet from "PF-rep" and sends it to "PF". >>>>>> 3. packet egresses to the external network from "PF". >>>>>> >>>>>> d. Workflow for "DPDK Application" to send packet to the "VM or >>>>>> whatever": >>>>>> >>>>>> 1. "DPDK Application" calls rte_eth_tx_burst("VF-rep", packet); >>>>>> 2. NIC receives the packet from "VF-rep" and sends it to "VF". >>>>>> 3. "VM or whatever" receives the packet from "VF". >>>>>> >>>>>> In two workflows above there is no rte_flow processing on step 2, i.e., >>>>>> NIC does not perform any lookups/matches/actions, because it's not >>>>>> possible >>>>>> to configure actions for packets received from "PF-rep" or >>>>>> "VF-rep" as these ports doesn't own a port id and all the configuration >>>>>> and rte_flow actions translated and applied for the devices that these >>>>>> ports represents ("PF" and "VF") and not representors themselves >>>>>> ("PF-rep" >>>>>> or "VF-rep"). >>>>>> >>>>>> e. Workflow for the packet received on PF and PORT_ID action: >>>>>> >>>>>> 1. "DPDK Application" configures rte_flow for all packets from "PF-rep" >>>>>> to execute PORT_ID "VF-rep". >>>>>> 2. NIC receives packet on "PF". >>>>>> 3. NIC executes 'PORT_ID "VF-rep"' action by sending packet to "VF". >>>>>> 4. "VM or whatever" receives the packet from "VF". >>>>>> >>>>>> f. Workflow for the packet received on VF and PORT_ID action: >>>>>> >>>>>> 1. "DPDK Application" configures rte_flow for all packets from "VF-rep" >>>>>> to execute 'PORT_ID "PF-rep"'. >>>>>> 2. NIC receives packet on "VF". >>>>>> 3. NIC executes 'PORT_ID "PF-rep"' action by sending packet to "PF". >>>>>> 4. Packet egresses from the "PF" to the external network. >>>>>> >>>>>> Above is what, IMHO, the logic should look like and this matches with >>>>>> the overall switchdev design in kernel. >>>>>> >>>>>> I understand that this logic could seem flipped-over from the HW point >>>>>> of view, but it's perfectly logical from the user's perspective, because >>>>>> user should not care if the application works with representors or >>>>>> some real devices. If application configures that all packets from port >>>>>> A should be sent to port B, user will expect that these packets will >>>>>> egress from port B once received from port A. That will be highly >>>>>> inconvenient if the packet will ingress from port B back to the >>>>>> application instead. >>>>>> >>>>>> DPDK Application >>>>>> | | >>>>>> | | >>>>>> port A port B >>>>>> | | >>>>>> *****MAGIC***** >>>>>> | | >>>>>> External Another Network >>>>>> Network or VM or whatever >>>>>> >>>>>> It should not matter if there is an extra layer between ports A and B >>>>>> and the external network and VM. Everything should work in exactly the >>>>>> same way, transparently for the application. >>>>>> >>>>>> The point of hardware offloading, and therefore rte_flow API, is to take >>>>>> what user does in software and make this "magically" work in hardware in >>>>>> the exactly same way. And this will be broken if user will have to >>>>>> use different logic based on the mode the hardware works in, i.e. based >>>>>> on >>>>>> the fact if the application works with ports or their representors. >>>>>> >>>>>> If some specific use case requires application to know if it's an >>>>>> upstream port or the representor and demystify the internals of the >>>>>> switchdev >>>>>> NIC, there should be a different port id for the representor itself that >>>>>> could be used in all DPDK APIs including rte_flow API or a special bit >>>>>> for >>>>>> that matter. IIRC, there was an idea to add a bit directly to the >>>>>> port_id >>>>>> for that purpose that will flip over behavior in all the workflow >>>>>> scenarios >>>>>> that I described above. >>>>> >>>>> As I understand we're basically on the same page, but just >>>>> fighting for defaults in DPDK. >>>> >>>> Yep. >>>> >>>>> >>>>>>> >>>>>>> Given these facts, it would not be quite right to just align the >>>>>>> documentation with the de-facto action meaning assumed by OvS. >>>>>> >>>>>> It's not a "meaning assumed by OvS", it's the original design and the >>>>>> main idea of a switchdev based on a common sense. >>>>> >>>>> If so, common sense is not that common :) >>>>> My "common sense" says me that PORT_ID action >>>>> should route traffic to DPDK ethdev port to be >>>>> received by the DPDK application. >>>> >>>> By this logic rte_eth_tx_burst("VF-rep", packet) should send a packet >>>> to "VF-rep", i.e. this packet will be received back by the application >>>> on this same interface. But that is counter-intuitive and this is not >>>> how it works in linux kernel if you're opening socket and sending a >>>> packet to the "VF-rep" network interface. >>>> >>>> And if rte_eth_tx_burst("VF-rep", packet) sends packet to "VF" and not >>>> to "VF-rep", than I don't understand why PORT_ID action should work in >>>> the opposite way. >>> >>> There's no contradiction here. >>> >>> In rte_eth_tx_burst(X, packet) example, "X" is the port which the >>> application sits on and from where it sends the packet. In other words, >>> it's the point where the packet originates from, and not where it goes to. >>> >>> At the same time, flow *action* PORT_ID (ID = "X") is clearly the opposite: >>> it specifies where the packet will go. Port ID is the characteristic of a >>> DPDK ethdev. So the packet goes *to* an ethdev with the given ID ("X"). >>> >>> Perhaps consider action PHY_PORT: the index is the characteristic of the >>> network port. The packet goes *to* network through this NP. And not the >>> opposite way. Hopefully, nobody is going to claim that action PHY_PORT >>> should mean re-injecting the packet back to the HW flow engine "as if it >>> just came from the network port". Then why does one try to skew the PORT_ID >>> meaning this way? PORT_ID points to an ethdev - the packet goes *to* the >>> ethdev. Isn't that simple? >> >> It's not simple. And PHY_PORT action would be hard to use from the >> application that doesn't really need to know how underlying hardware >> structured. > > Yes, I agree. Basically above paragraph just try to highlight > existing consistent semantics in various actions which set > traffic direction and highlight inconsistency if we interpret > PORT_ID default as egress in accordance with terminology > suggested below. PORT_ID is a DPDK port and default direction > should be to DPDK port. I'll continue on the topic below. > >>> >>>> >>>> Application receives a packet from port A and puts it to the port B. >>>> TC rule to forward packets from port A to port B will provide same result. >>>> So, why the similar rte_flow should do the opposite and send the packet >>>> back to the application? >>> >>> Please see above. Action VF sends the packet *to* VF and *not* to the >>> upstream entity which this VF is connected to. Action PHY_PORT sends the >>> packet *to* network and does *not* make it appear as if it entered the NIC >>> from the network side. Action QUEUE sends the packet *to* the Rx queue and >>> does *not* make it appear as if it just egressed from the Tx queue with the >>> same index. Action PORT_ID sends the packet *to* an ethdev with the given >>> ID and *not* to the upstream entity which this ethdev is connected to. It's >>> just that transparent. It's just "do what the name suggests". >>> >>> Yes, an application (say, OvS) might have a high level design which >>> perceives the "high-level" ports plugged to it as a "patch-panel" of sorts. >>> Yes, when a high-level mechanism/logic of such application invokes a >>> *datapath-unaware* wrapper to offload a rule and request that the packet be >>> delivered to the given "high-level" port, it therefore requests that the >>> packet be delivered to the opposite end of the wire. But then the >>> lower-level datapath-specific (DPDK) handler kicks in. Since it's >>> DPDK-specific, it knows *everything* about the underlying flow library it >>> works with. In particular it knows that action PORT_ID delivers the packet >>> to an *ethdev*, at the same time, it knows that the upper caller >>> (high-level logic) for sure wants the opposite, so it (the lower-level DPDK >>> component) sets the "upstream" bit when translating the higher-level port >>> action to an RTE action "PORT_ID". >> >> I don't understand that. DPDK user is the application and DPDK >> doesn't translate anything, application creates PORT_ID action >> directly and passes it to DPDK. So, you're forcing the *end user* >> (a.k.a. application) to know *everything* about the hardware the >> application runs on. Of course, it gets this information about >> the hardware from the DPDK library (otherwise this would be >> completely ridiculous), but this doesn't change the fact that it's >> the application that needs to think about the structure of the >> underlying hardware while it's absolutely not necessary in vast >> majority of cases. > > Yes, that's all true, but I think that specification of the > direction is *not* diving to deep in hardware details. > > For DPDK I think it is important to have consistent semantics > and interpretation of input parameters. That will make the > library easier to use and make it less error-prone. > >>> Then the resulting action is correct, and the packet indeed doesn't end up >>> in the ethdev but goes >>> to the opposite end of the wire. That's it. >>> >>> I have an impression that for some reason people are tempted to ignore the >>> two nominal "layers" in such applications (generic, or high-level one and >>> DPDK-specific one) thus trying to align DPDK logic with high-level logic of >>> the applications. That's simply not right. What I'm trying to point out is >>> that it *is* the true job of DPDK-specific data path handler in such >>> application - to properly translate generic flow actions to DPDK-specific >>> ones. It's the duty of DPDK component in such applications to be aware of >>> the genuine meaning of action PORT_ID. >> >> The reason is very simple: if application don't need to know the >> full picture (how the hardware structured inside) it shouldn't >> care and it's a duty of DPDK to abstract the hardware and provide >> programming interfaces that could be easily used by application >> developers who are not experts in the architecture of a hardware >> that they want to use (basically, application developer should not >> care at all in most cases on which hardware application will work). >> It's basically in almost every single DPDK API, EAL means environment >> *abstraction* layer, not an environment *proxy/passthrough* layer. >> We can't assume that DPDK-specific layers in applications are always >> written by hardware experts and, IMHO, DPDK should not force users >> to learn underlying structures of switchdev devices. They might not >> even have such devices for testing, so the application that works >> on simple NICs should be able to run correctly on switchdev-capable >> NICs too. >> >> I think that "***MAGIC***" abstraction (see one of my previous ascii >> graphics) is very important here. > > I've answered it above. Specification of the direction is *not* > diving to deep in HW details.
Yes, I agree that specification of direction doesn't require any knowledge of any HW details and this option is perfectly fine for me as described in 'ingress/egress' suggestion below. My argument is about 'upstream' flag specifically. Wording is important, because I think that 'upstream' implies some knowledge that there are two different ports. > >>> >>> This way, mixing up the two meanings is ruled out. >> >> Looking closer to how tc flower rules configured I noticed that >> 'mirred' action requires user to specify the direction in which >> the packet will appear on the destination port. And I suppose >> this will solve your issues with PORT_ID action without exposing >> the "full picture" of the architecture of an underlying hardware. >> >> It looks something like this: >> >> tc filter add dev A ... action mirred egress redirect dev B >> ^^^^^^ >> >> Direction could be 'ingress' or 'egress', so the packet will >> ingress from the port B back to application/kernel or it will >> egress from this port to the external network. Same thing >> could be implemented in rte_flow like this: >> >> flow create A ingress transfer pattern eth / end >> action port_id id B egress / end >> >> So, application that needs to receive the packet from the port B >> will specify 'ingress', others that just want to send packet from >> the port B will specify 'egress'. Will that work for you? >> >> (BTW, 'ingress' seems to be not implemented in TC and that kind >> of suggests that it's not very useful at least for kernel use cases) >> >> One might say that it's actually the same what is proposed in >> this RFC, but I will argue that 'ingress/egress' schema doesn't >> break the "***MAGIC***" abstraction because user is not obligated >> to know the structure of the underlying hardware, while 'upstream' >> flag is something very unclear from that perspective and makes >> no sense for plane ports (non-representors). > > I think it is really an excellent idea and suggested > terminology looks very good to me. However, we should > agree on technical details on API level (not testpmd > commands). I think we have 4 options: > > A. Add "ingress" bit with "egress" as unset meaning. > Yes, that's what is current behaviour assumed and > used by OvS and implemented in some PMDs. > My problem with it that it is, IMHO, inconsistent > default value (as explained above). > > B. Add "egress" bit with "ingress" as unset meaning. > Basically it is what is suggested in the RFC, but > the problem of the suggestion is the silent breakage > of existing users (let's put it a side if it is > correct usage or misuse). It is still the fact. > > C. Encode above in ethdev port ID MSB. > The problem of the solution is that encoding > makes sense for representors, but the problem > exists for non-representor ports as well. > I have no good ideas on terminology in the case > if we try to solve it for non-representors. > > D. Break API and ABI and add enum with unset(default)/ > ingress/egress members to enforce application to > specify direction. > > It is unclear what we'll do in the case of A, B and D > if we encode representor in port ID MSB in any case. My opinion: - Option D is the best choice for rte_flow. No defaults, users forced to explicitly choose the direction in HW-independent way. - I agree that option C somewhat conflicts with the 'ingress/egress' flag idea and it is more hardware-specific. Therefore if option C is going to be implemented it should be implemented in concept of option A, i.e. 'egress' is default option if port ID MSB is not set. > >>>>>>> On 01/06/2021 15:10, Ilya Maximets wrote: >>>>>>>> On 6/1/21 1:14 PM, Ivan Malov wrote: >>>>>>>>> By its very name, action PORT_ID means that packets hit an ethdev >>>>>>>>> with the >>>>>>>>> given DPDK port ID. At least the current comments don't state the >>>>>>>>> opposite. >>>>>>>>> That said, since port representors had been adopted, applications >>>>>>>>> like OvS >>>>>>>>> have been misusing the action. They misread its purpose as sending >>>>>>>>> packets >>>>>>>>> to the opposite end of the "wire" plugged to the given ethdev, for >>>>>>>>> example, >>>>>>>>> redirecting packets to the VF itself rather than to its representor >>>>>>>>> ethdev. >>>>>>>>> Another example: OvS relies on this action with the admin PF's ethdev >>>>>>>>> port >>>>>>>>> ID specified in it in order to send offloaded packets to the physical >>>>>>>>> port. >>>>>>>>> >>>>>>>>> Since there might be applications which use this action in its valid >>>>>>>>> sense, >>>>>>>>> one can't just change the documentation to greenlight the opposite >>>>>>>>> meaning. >>>>>>>>> This patch adds an explicit bit to the action configuration which >>>>>>>>> will let >>>>>>>>> applications, depending on their needs, leverage the two meanings >>>>>>>>> properly. >>>>>>>>> Applications like OvS, as well as PMDs, will have to be corrected >>>>>>>>> when the >>>>>>>>> patch has been applied. But the improved clarity of the action is >>>>>>>>> worth it. >>>>>>>>> >>>>>>>>> The proposed change is not the only option. One could avoid changes >>>>>>>>> in OvS >>>>>>>>> and PMDs if the new configuration field had the opposite meaning, >>>>>>>>> with the >>>>>>>>> action itself meaning delivery to the represented port and not to >>>>>>>>> DPDK one. >>>>>>>>> Alternatively, one could define a brand new action with the said >>>>>>>>> behaviour. >>>>>>>> >>>>>>>> We had already very similar discussions regarding the understanding of >>>>>>>> what >>>>>>>> the representor really is from the DPDK API's point of view, and the >>>>>>>> last >>>>>>>> time, IIUC, it was concluded by a tech. board that representor should >>>>>>>> be >>>>>>>> a "ghost of a VF", i.e. DPDK APIs should apply configuration by >>>>>>>> default to >>>>>>>> VF and not to the representor device: >>>>>>>> >>>>>>>> https://patches.dpdk.org/project/dpdk/cover/20191029185051.32203-1-tho...@monjalon.net/#104376 >>>>>>>> This wasn't enforced though, IIUC, for existing code and semantics is >>>>>>>> still mixed. >>>>>>>> >>>>>>>> I still think that configuration should be applied to VF, and the same >>>>>>>> applies >>>>>>>> to rte_flow API. IMHO, average application should not care if device >>>>>>>> is >>>>>>>> a VF itself or its representor. Everything should work exactly the >>>>>>>> same. >>>>>>>> I think this matches with the original idea/design of the switchdev >>>>>>>> functionality >>>>>>>> in the linux kernel and also matches with how the average user thinks >>>>>>>> about >>>>>>>> representor devices. >>>>>>>> >>>>>>>> If some specific use-case requires to distinguish VF from the >>>>>>>> representor, >>>>>>>> there should probably be a separate special API/flag for that. >>>>>>>> >>>>>>>> Best regards, Ilya Maximets. >>>>>>>> >>>>>>> >>>>> >>> >