On 07/06/2021 22:27, Ilya Maximets wrote:
On 6/3/21 1:29 PM, Ivan Malov wrote:
On 03/06/2021 12:29, Ilya Maximets wrote:
On 6/2/21 9:35 PM, Ivan Malov wrote:
On 02/06/2021 20:35, Ilya Maximets wrote:
(Dropped Broadcom folks from CC.  Mail server refuses to accept their
emails for some reason: "Recipient address rejected: Domain not found."
Please, try to ad them back on reply.)

On 6/2/21 6:26 PM, Andrew Rybchenko wrote:
On 6/2/21 3:46 PM, Ilya Maximets wrote:
On 6/1/21 4:28 PM, Ivan Malov wrote:
Hi Ilya,

Thank you for reviewing the proposal at such short notice. I'm afraid that prior 
discussions overlook the simple fact that the whole problem is not limited to just VF 
representors. Action PORT_ID is also used with respect to the admin PF's ethdev, which 
"represents itself" (and by no means it represents the underlying 
physical/network port). In this case, one cannot state that the application treats it as 
a physical port, just like one states that the application perceives representors as VFs 
themselves.


I don't think that it was overlooked.  If device is in a switchdev mode than
there is a PF representor and VF representors.  Application typically works
only with representors in this case is it doesn't make much sense to have
representor and the upstream port attached to the same application at the
same time.  Configuration that is applied by application to the representor
(PF or VF, it doesn't matter) applies to the corresponding upstream port
(actual PF or VF) by default.

PF is not necessarily associated with a network port. It
could  be many PFs and just one network port on NIC.
Extra PFs are like VFs in this case. These PFs may be
passed to a VM in a similar way. So, we can have PF
representors similar to VF representors. I.e. it is
incorrect to say that PF in the case of switchdev is
a representor of a network port.

If we prefer to talk in representors terminology, we
need 4 types of prepresentors:
    - PF representor for PCIe physical function
    - VF representor for PCIe virtual function
    - SF representor for PCIe sub-function (PASID)
    - network port representor
In fact above is PCIe oriented, but there are
other buses and ways to deliver traffic to applications.
Basically representor for any virtual port in virtual
switch which DPDK app can control using transfer rules.

Exactly same thing here with PORT_ID action.  You have a packet and action
to send it to the port, but it's not specified if HW needs to send it to
the representor or the upstream port (again, VF or PF, it doesn't matter).
Since there is no extra information, HW should send it to the upstream
port by default.  The same as configuration applies by default to the
upstream port.

Let's look at some workflow examples:

         DPDK Application
           |         |
           |         |
      +--PF-rep------VF-rep---+
      |                       |
      |    NIC (switchdev)    |
      |                       |
      +---PF---------VF-------+
          |          |
          |          |
      External       VM or whatever
      Network

See above. PF <-> External Network is incorrect above
since it not always the case. It should be
"NP <-> External network" and "NP-rep" above (NP -
network port). Sometimes PF is an NP-rep, but sometimes
it is not. It is just a question of default rules in
switchdev on what to do with traffic incoming from
network port.

A bit more complicated picture is:

       +----------------------------------------+
       |            DPDK Application            |
       +----+---------+---------+---------+-----+
            |PF0      |PF1      |         |
            |         |         |         |
       +--NP1-rep---NP2-rep---PF2-rep---VF-rep--+
       |                                        |
       |             NIC (switchdev)            |
       |                                        |
       +---NP1-------NP2-------PF2--------VF----+
            |         |         |         |
            |         |         |         |
        External   External    VM or     VM or
       Network 1  Network 2  whatever   whatever

So, sometimes PF plays network port representor role (PF0,
PF1), sometimes it requires representor itself (PF2).
What to do if PF2 itself is attached to application?
Can we route traffic to it using PORT_ID action?
It has DPDK ethdev port. It is one of arguments why
plain PORT_ID should route DPDK application.

OK.  This is not very different from my understanding.  The key
is that there is a pair of interfaces, one is more visible than
the other one.


Of course, some applications would like to see it as
(simpler is better):

       +----------------------------------------+
       |            DPDK Application            |
       |                                        |
       +---PF0-------PF1------PF2-rep---VF-rep--+
            |         |         |         |
            |         |         |         |
        External   External    VM or     VM or
       Network 1  Network 2  whatever   whatever

but some, I believe, require full picture. For examples,
I'd really like to know how much traffic goes via all 8
switchdev ports and running rte_eth_stats_get(0, ...)
(i.e. DPDK port 0 attached to PF0) I'd like to get
NP1-rep stats (not NP1 stats). It will match exactly
what I see in DPDK application. It is an argument why
plain PORT_ID should be treated as a DPDK ethdev port,
not a represented (upstream) entity.

The point is that if application doesn't require full picture,
it should not care.  If application requires the full picture,
it could take extra steps by setting extra bits.  I don't
understand why we need to force all applications to care about
the full picture if we can avoid that?


a. Workflow for "DPDK Application" to set MAC to VF:

1. "DPDK Application" calls rte_set_etheraddr("VF-rep", new_mac);
2.  DPDK sets MAC for "VF".

b. Workflow for "DPDK Application" to set MAC to PF:

1. "DPDK Application" calls rte_set_etheraddr("PF-rep", new_mac);
2.  DPDK sets MAC for "PF".

c. Workflow for "DPDK Application" to send packet to the external network:

1. "DPDK Application" calls rte_eth_tx_burst("PF-rep", packet);
2. NIC receives the packet from "PF-rep" and sends it to "PF".
3. packet egresses to the external network from "PF".

d. Workflow for "DPDK Application" to send packet to the "VM or whatever":

1. "DPDK Application" calls rte_eth_tx_burst("VF-rep", packet);
2. NIC receives the packet from "VF-rep" and sends it to "VF".
3. "VM or whatever" receives the packet from "VF".

In two workflows above there is no rte_flow processing on step 2, i.e.,
NIC does not perform any lookups/matches/actions, because it's not possible
to configure actions for packets received from "PF-rep" or
"VF-rep" as these ports doesn't own a port id and all the configuration
and rte_flow actions translated and applied for the devices that these
ports represents ("PF" and "VF") and not representors themselves ("PF-rep"
or "VF-rep").

e. Workflow for the packet received on PF and PORT_ID action:

1. "DPDK Application" configures rte_flow for all packets from "PF-rep"
      to execute PORT_ID "VF-rep".
2. NIC receives packet on "PF".
3. NIC executes 'PORT_ID "VF-rep"' action by sending packet to "VF".
4. "VM or whatever" receives the packet from "VF".

f. Workflow for the packet received on VF and PORT_ID action:

1. "DPDK Application" configures rte_flow for all packets from "VF-rep"
      to execute 'PORT_ID "PF-rep"'.
2. NIC receives packet on "VF".
3. NIC executes 'PORT_ID "PF-rep"' action by sending packet to "PF".
4. Packet egresses from the "PF" to the external network.

Above is what, IMHO, the logic should look like and this matches with
the overall switchdev design in kernel.

I understand that this logic could seem flipped-over from the HW point
of view, but it's perfectly logical from the user's perspective, because
user should not care if the application works with representors or
some real devices.  If application configures that all packets from port
A should be sent to port B, user will expect that these packets will
egress from port B once received from port A.  That will be highly
inconvenient if the packet will ingress from port B back to the
application instead.

         DPDK Application
           |          |
           |          |
        port A     port B
           |          |
         *****MAGIC*****
           |          |
      External      Another Network
      Network       or VM or whatever

It should not matter if there is an extra layer between ports A and B
and the external network and VM.  Everything should work in exactly the
same way, transparently for the application.

The point of hardware offloading, and therefore rte_flow API, is to take
what user does in software and make this "magically" work in hardware in
the exactly same way.  And this will be broken if user will have to
use different logic based on the mode the hardware works in, i.e. based on
the fact if the application works with ports or their representors.

If some specific use case requires application to know if it's an
upstream port or the representor and demystify the internals of the switchdev
NIC, there should be a different port id for the representor itself that
could be used in all DPDK APIs including rte_flow API or a special bit for
that matter.  IIRC, there was an idea to add a bit directly to the port_id
for that purpose that will flip over behavior in all the workflow scenarios
that I described above.

As I understand we're basically on the same page, but just
fighting for defaults in DPDK.

Yep.



Given these facts, it would not be quite right to just align the documentation 
with the de-facto action meaning assumed by OvS.

It's not a "meaning assumed by OvS", it's the original design and the
main idea of a switchdev based on a common sense.

If so, common sense is not that common :)
My "common sense" says me that PORT_ID action
should route traffic to DPDK ethdev port to be
received by the DPDK application.

By this logic rte_eth_tx_burst("VF-rep", packet) should send a packet
to "VF-rep", i.e. this packet will be received back by the application
on this same interface.  But that is counter-intuitive and this is not
how it works in linux kernel if you're opening socket and sending a
packet to the "VF-rep" network interface.

And if rte_eth_tx_burst("VF-rep", packet) sends packet to "VF" and not
to "VF-rep", than I don't understand why PORT_ID action should work in
the opposite way.

There's no contradiction here.

In rte_eth_tx_burst(X, packet) example, "X" is the port which the application 
sits on and from where it sends the packet. In other words, it's the point where the 
packet originates from, and not where it goes to.

At the same time, flow *action* PORT_ID (ID = "X") is clearly the opposite: it specifies 
where the packet will go. Port ID is the characteristic of a DPDK ethdev. So the packet goes *to* 
an ethdev with the given ID ("X").

Perhaps consider action PHY_PORT: the index is the characteristic of the network port. 
The packet goes *to* network through this NP. And not the opposite way. Hopefully, nobody 
is going to claim that action PHY_PORT should mean re-injecting the packet back to the HW 
flow engine "as if it just came from the network port". Then why does one try 
to skew the PORT_ID meaning this way? PORT_ID points to an ethdev - the packet goes *to* 
the ethdev. Isn't that simple?

It's not simple.  And PHY_PORT action would be hard to use from the
application that doesn't really need to know how underlying hardware
structured.

I'm not trying to suggest using PHY_PORT. I provide it as an example to point 
out the inconsistency between these actions and action PORT_ID in its de-facto 
sense assumed by apps like OvS. Sorry.




Application receives a packet from port A and puts it to the port B.
TC rule to forward packets from port A to port B will provide same result.
So, why the similar rte_flow should do the opposite and send the packet
back to the application?

Please see above. Action VF sends the packet *to* VF and *not* to the upstream entity 
which this VF is connected to. Action PHY_PORT sends the packet *to* network and does 
*not* make it appear as if it entered the NIC from the network side. Action QUEUE sends 
the packet *to* the Rx queue and does *not* make it appear as if it just egressed from 
the Tx queue with the same index. Action PORT_ID sends the packet *to* an ethdev with the 
given ID and *not* to the upstream entity which this ethdev is connected to. It's just 
that transparent. It's just "do what the name suggests".

Yes, an application (say, OvS) might have a high level design which perceives the "high-level" ports plugged to it as a 
"patch-panel" of sorts. Yes, when a high-level mechanism/logic of such application invokes a *datapath-unaware* wrapper 
to offload a rule and request that the packet be delivered to the given "high-level" port, it therefore requests that 
the packet be delivered to the opposite end of the wire. But then the lower-level datapath-specific (DPDK) handler kicks in. 
Since it's DPDK-specific, it knows *everything* about the underlying flow library it works with. In particular it knows that 
action PORT_ID delivers the packet to an *ethdev*, at the same time, it knows that the upper caller (high-level logic) for sure 
wants the opposite, so it (the lower-level DPDK component) sets the "upstream" bit when translating the higher-level 
port action to an RTE action "PORT_ID".

I don't understand that.  DPDK user is the application and DPDK
doesn't translate anything, application creates PORT_ID action
directly and passes it to DPDK.  So, you're forcing the *end user*
(a.k.a. application) to know *everything* about the hardware the
application runs on.  Of course, it gets this information about
the hardware from the DPDK library (otherwise this would be
completely ridiculous), but this doesn't change the fact that it's
the application that needs to think about the structure of the
underlying hardware while it's absolutely not necessary in vast
majority of cases.

Me forcing the application to know everything about the hardware? I do no such thing. What I mean 
is very simple: the application developer is expected to read DPDK documentation carefully before 
using a flow primitive like PORT_ID. If they understood it, they would realise the unfitness of the 
existing action behaviour for their needs. The developer would then raise the issue and extend the 
action in one way ("upstream" bit) or another ("egress" bit). And only *then* 
they would have everything in place to finally put the action into usage by the application. But 
what we have in reality is the opposite. Let's just face it.


Then the resulting action is correct, and the packet indeed doesn't end up in 
the ethdev but goes
to the opposite end of the wire. That's it.

I have an impression that for some reason people are tempted to ignore the two nominal 
"layers" in such applications (generic, or high-level one and DPDK-specific 
one) thus trying to align DPDK logic with high-level logic of the applications. That's 
simply not right. What I'm trying to point out is that it *is* the true job of 
DPDK-specific data path handler in such application - to properly translate generic flow 
actions to DPDK-specific ones. It's the duty of DPDK component in such applications to be 
aware of the genuine meaning of action PORT_ID.

The reason is very simple: if application don't need to know the
full picture (how the hardware structured inside) it shouldn't
care and it's a duty of DPDK to abstract the hardware and provide
programming interfaces that could be easily used by application
developers who are not experts in the architecture of a hardware
that they want to use (basically, application developer should not
care at all in most cases on which hardware application will work).
It's basically in almost every single DPDK API, EAL means environment
*abstraction* layer, not an environment *proxy/passthrough* layer.
We can't assume that DPDK-specific layers in applications are always
written by hardware experts and, IMHO, DPDK should not force users
to learn underlying structures of switchdev devices.  They might not
even have such devices for testing, so the application that works
on simple NICs should be able to run correctly on switchdev-capable
NICs too.

I think that "***MAGIC***" abstraction (see one of my previous ascii
graphics) is very important here.

Look. Imagine that the hypothetical "observer" sits inside the application. The application receives packets on one port X and sends them using Tx burst API on 
port Y. Now it wants to offload this work and use ID=Y to reference location where the packet should go "as if it was sent using Tx burst on port Y". OK. But 
the application uses attribute "transfer". And that really makes difference because it "teleports" (or "transfers") the 
"observer" from the application down to the HW eSwitch level. Yes, the application doesn't know anything about HW eSwitch structure but it at leas knows about 
its very existence; otherwise, it wouldn't use attribute "transfer" at all. And now, when the "observer" sits inside the HW eSwitch, from that 
standpoint, PORT_ID ID=Y looks like delivery to the ethdev port. The "observer"'s location changes the perceived direction. In this case, correct default 
behaviour is delivery to the ethdev. So. if the application developer was capable to correctly
put attribute "transfer" in use, why wouldn't they be capable to also put flag 
"upstream" of the action PORT_ID in use, too?

I don't agree with the statement that "transfer" changes the point
of view from the application to the point inside the eSwitch.
For me it simply means "move", therefore means moving of the packet
from one port to another and I would assume that this happens in the
same way as it happens if application receives the packet and then
sends it to a different port.  So this argument is not valid.

I hate to say it, but simply disagreeing with this argument is not helpful. Doing so doesn't make the argument invalid unless the documentation is fixed to say super-clearly what in fact "transfer" attribute is and what point of view (application or eSwitch) prevails.

Here's what existing documentation says:

"Instead of simply matching the properties of traffic as it would appear on a given DPDK port ID, enabling this attribute transfers a flow rule to the lowest possible level of any device endpoints found in the pattern".

You talk about "moving a packet" while this paragraph talks about "transferring a rule" - it's all vague, confusing, and, after all, it is the reason why we have the problem with action PORT_ID semantics and the likes. The documentation should not use weasel words but rather write out in full what point of view is, and where things start to become HW-specific. If action PORT_ID is not for delivery of packets to an ethdev, one should have explained this in the comment.

If I understand correctly, not only does DPDK intend to be helpful to end users, it also tries to be clear and concise to NIC drivers. So, for sure, one should admit the problem existence. Secondly, all parties should finally agree on the prevailing point of view (end user or PMD/eSwitch) with respect to attribute "transfer" and action PORT_ID. And all of that should result in documentation fix/update or semantics change (whatever is more correct).

Simply ignoring the problem or assuming undocumented meanings (like this "end user" point of view) while leaving the documentation poor doesn't bear fruit. The raised issues are not just some petty concerns but rather real significant issues, and pretending that they don’t exist is not right.

Moving away from opinion-driven approach and fixing the documentation is what we all might benefit from.


For me having an "observer" inside the eSwitch makes no sense if it's
not a real switch.  And it's a not a real switch just because it has
too many shortcuts and implicit rules that can not be controlled by
the user, e.g. the thing that packet sent to VF-rep bypasses the
rte_flow processing and goes directly out from the VF port.

Unless it's a fully programmable switch (which it is not), I prefer
to have it fully hidden from end users as a magic box.  Making it
fully programmable will make applications very complex, so I'd prefer
to not deal with this too.




This way, mixing up the two meanings is ruled out.

Looking closer to how tc flower rules configured I noticed that
'mirred' action requires user to specify the direction in which
the packet will appear on the destination port.  And I suppose
this will solve your issues with PORT_ID action without exposing
the "full picture" of the architecture of an underlying hardware.

It looks something like this:

    tc filter add dev A ... action mirred egress redirect dev B
                                          ^^^^^^

Direction could be 'ingress' or 'egress', so the packet will
ingress from the port B back to application/kernel or it will
egress from this port to the external network.  Same thing
could be implemented in rte_flow like this:

    flow create A ingress transfer pattern eth / end
        action port_id id B egress / end

So, application that needs to receive the packet from the port B
will specify 'ingress', others that just want to send packet from
the port B will specify 'egress'.  Will that work for you?

(BTW, 'ingress' seems to be not implemented in TC and that kind
of suggests that it's not very useful at least for kernel use cases)

One might say that it's actually the same what is proposed in
this RFC, but I will argue that 'ingress/egress' schema doesn't
break the "***MAGIC***" abstraction because user is not obligated
to know the structure of the underlying hardware, while 'upstream'
flag is something very unclear from that perspective and makes
no sense for plane ports (non-representors).

Perhaps indeed "upstream" is barely the best choice, and I'd be tempted to accept the "egress" 
proposal, but I strongly disagree with the suggestion that "upstream" purportedly implies some HW knowledge. 
It simply means that the packet goes to the location where the wire leads to and not back to the application's own end 
of the wire. Must the port be connected to some location. And for non-representors, like admin's PF (isn't that what 
you call a plane port?) it perfectly makes sense because saying "PORT_ID id 0 upstream" means (not to the 
application, but to us) that the packet will be sent to network through the network port connected to this PF. That's 
it.






On 01/06/2021 15:10, Ilya Maximets wrote:
On 6/1/21 1:14 PM, Ivan Malov wrote:
By its very name, action PORT_ID means that packets hit an ethdev with the
given DPDK port ID. At least the current comments don't state the opposite.
That said, since port representors had been adopted, applications like OvS
have been misusing the action. They misread its purpose as sending packets
to the opposite end of the "wire" plugged to the given ethdev, for example,
redirecting packets to the VF itself rather than to its representor ethdev.
Another example: OvS relies on this action with the admin PF's ethdev port
ID specified in it in order to send offloaded packets to the physical port.

Since there might be applications which use this action in its valid sense,
one can't just change the documentation to greenlight the opposite meaning.
This patch adds an explicit bit to the action configuration which will let
applications, depending on their needs, leverage the two meanings properly.
Applications like OvS, as well as PMDs, will have to be corrected when the
patch has been applied. But the improved clarity of the action is worth it.

The proposed change is not the only option. One could avoid changes in OvS
and PMDs if the new configuration field had the opposite meaning, with the
action itself meaning delivery to the represented port and not to DPDK one.
Alternatively, one could define a brand new action with the said behaviour.

We had already very similar discussions regarding the understanding of what
the representor really is from the DPDK API's point of view, and the last
time, IIUC, it was concluded by a tech. board that representor should be
a "ghost of a VF", i.e. DPDK APIs should apply configuration by default to
VF and not to the representor device:
      
https://patches.dpdk.org/project/dpdk/cover/20191029185051.32203-1-tho...@monjalon.net/#104376
This wasn't enforced though, IIUC, for existing code and semantics is still 
mixed.

I still think that configuration should be applied to VF, and the same applies
to rte_flow API.  IMHO, average application should not care if device is
a VF itself or its representor.  Everything should work exactly the same.
I think this matches with the original idea/design of the switchdev 
functionality
in the linux kernel and also matches with how the average user thinks about
representor devices.

If some specific use-case requires to distinguish VF from the representor,
there should probably be a separate special API/flag for that.

Best regards, Ilya Maximets.






--
Ivan M

Reply via email to