On Fri, 23 Sep 2016 07:23:26 -0700, John Fastabend wrote:
> On 16-09-23 05:55 AM, Jakub Kicinski wrote:
> > On Fri, 23 Sep 2016 11:06:09 +0200, Jiri Benc wrote:
> >> On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:
> >>> So if I understand that correctly, this would need some "shared
> >>> netdev" which would effectively serve only as a sink for all port
> >>> netdevices to tx packets to. On RX, this would be completely
> >>> avoided. This lower device looks like half zombie to me.
> >>
> >> Looks more like a quarter zombie. Even tx would not be allowed
> >> unless going through one of the ports, as all skbs without a
> >> METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would
> >> be possible to attach a qdisc to the "lower" netdevice and it
> >> would actually have an effect. On rx this netdevice would be
> >> ignored completely. This is very weird behavior.
> >>
> >>> I don't like it :( I wonder if the solution would not be possible
> >>> without this lower netdev.
> >>
> >> I agree. This approach doesn't sound correct. The skbs should not
> >> be requeued.
> >
> > Thanks for the responses!
>
> Nice timing, we were just thinking about this.
>
> > I think SR-IOV NICs are coming at this problem from a different
> > angle: we already have big, feature-full per-port netdevs, and now
> > we want to spawn representors for the VFs to handle the fallback
> > traffic. This patch would help us mux VFR traffic onto all the
> > queues of the physical port netdevs (the ones which were already
> > present in legacy mode; those are the lower devices).
>
> Yep, I like the idea in general. I had a slightly different approach
> in mind, though. If you look at __dev_queue_xmit() there is a void
> accel_priv pointer (I gather you found this, based on your commit
> note). My take was that we could extend it a bit so it can be used
> by the VFR devices, and they could do a dev_queue_xmit_accel(). That
> way there is no need to touch net/core/{filter,dst,ip_tunnel}.c etc.
> Maybe the accel logic needs to be extended to push the priv pointer
> all the way through to the xmit routine of the target netdev,
> though. This should look a lot like the macvlan accelerated xmit
> device path, minus the switching logic.
>
> Of course maybe the name would have to be changed to
> dev_queue_xmit_extended() or something.
>
> So the flow on ingress would be:
>
>  1. packet is received by the PF netdev;
>  2. the PF netdev reads a tag off the packet/descriptor and sets the
>     correct skb->dev field -- this is needed so the stack "sees"
>     packets coming from the correct VF ports;
>  3. packet is passed up the stack.
>
> I guess it is a bit "zombie"-like on the receive path, because the
> packet is never actually handled by VF netdev code per se, and on
> egress it can traverse both the VFR's and the PF netdev's qdiscs.
> But on the other hand the VFR netdevs and PF netdevs are all in the
> same driver. Plus using a queue per VFR is a bit of a waste, as it's
> not needed, and the hardware may not have any mechanism to push VF
> traffic onto a specific rx queue anyway.
>
> On egress:
>
>  1. the VFR xmit is called;
>  2. the VFR xmit calls dev_queue_xmit_accel() with some metadata for
>     the lower netdev, if needed;
>  3. the lower netdev sends out the packet.
>
> Again, we don't waste any queues on the VFRs, and the VFRs can be
> LLTX devices. In this scheme I think you avoid much of the changes
> in your patch and keep it all contained in the driver. Any thoughts?
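Let me make sure I'm reading the egress side right. Below is a rough
(and obviously untested) sketch of what I understand you are
proposing. dev_queue_xmit_accel() exists today, but accel_priv only
feeds queue selection, so the part where the priv pointer reaches the
PF's xmit routine is the extension you mention; struct vfr_priv and
all the vfr_* names are made up for illustration:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct vfr_priv {
        struct net_device *pf_netdev;   /* the "lower" PF port netdev */
        u32 vf_id;                      /* which VF we represent */
};

static netdev_tx_t vfr_ndo_start_xmit(struct sk_buff *skb,
                                      struct net_device *dev)
{
        struct vfr_priv *priv = netdev_priv(dev);
        int ret;

        /* Steer the frame at the PF netdev; the VFR itself owns no
         * HW queues and can be LLTX.
         */
        skb->dev = priv->pf_netdev;

        /* accel_priv identifies the VF this frame is a fallback for,
         * so the PF xmit can set the right tag in the TX descriptor
         * -- assuming the priv pointer gets pushed all the way
         * through to the target netdev's xmit, which is not the case
         * today.
         */
        ret = dev_queue_xmit_accel(skb, priv);
        if (unlikely(ret != NET_XMIT_SUCCESS && ret != NET_XMIT_CN))
                dev->stats.tx_dropped++;

        return NETDEV_TX_OK;
}

If that's roughly what you have in mind, then yes, it keeps
everything in the driver, but see below for where it gets hairy for
us.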
Goes without saying that you have a much better understanding of
packet scheduling than I do, so please bear with me :)

My target model is that I have n_cpus x "n_tc/prio" queues on the PF,
and I want to transmit the fallback traffic over those same queues,
i.e. no new HW queues are used for the VFRs at all. This is the
reverse of the macvlan offload, which AFAICT has "bastard HW queues"
which actually TX for a separate software device.

My understanding was that I could rework this model to have software
queues for the VFRs (#sw queues == #PF queues + #VFRs) but no extra
HW queues (#hw queues == #PF queues). But then, when the driver sees
a packet on a SW-only VFR queue, it has to pick one of the PF queues
(which one?), lock the PF software queue to own it, and only then can
it transmit. With the dst_metadata there is no need for the extra
locking or queue selection (sketch of what I mean at the end of this
mail).

> To address 'I wonder if the solution can be done without this lower
> netdev': I think it can be, but it creates two issues which I'm not
> sure have a good solution.
>
> Without a lowerdev we either (a) give each VFR its own queue, which
> I don't like because it complicates management and uses resources,
> or (b) we implicitly share queues. The latter could be fine, it just
> looks a bit cleaner IMO to make the sharing explicit.
>
> With regard to VF-PF flow rules: if we allow matching on the ingress
> port, can all your flow rules be pushed through the PF netdevices,
> or do you need any of the VFR netdevs? After all, I expect the flow
> rule table is actually a shared resource between all attached ports.

With the VF-PF forwarding rules I was just inching towards re-opening
the discussion on whether there should be a CPU port netdev. I guess
there are good reasons why there isn't one, so maybe let's not go
there :) The meaning of PF netdevs in SR-IOV switchdev mode is
"external ports", AFAICT, which could make it cumbersome to reach the
host.
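And the promised sketch of the metadata_dst side, roughly what the
VFR xmit looks like with my patch. Take the names with a grain of
salt: metadata_dst_alloc() growing a type argument,
METADATA_HW_PORT_MUX itself, and the u.port_info fields are all new
in the RFC and may well change, and whether the core or the driver
re-points skb->dev at the lower netdev is a detail I'm eliding here:

#include <linux/netdevice.h>
#include <net/dst_metadata.h>

struct vfr_priv {
        struct metadata_dst *md_dst;    /* allocated once, at init */
};

/* One md_dst per representor, set up when the VFR is created. */
static struct metadata_dst *vfr_alloc_md_dst(struct net_device *lower,
                                             u32 port_id)
{
        struct metadata_dst *md_dst;

        md_dst = metadata_dst_alloc(0, METADATA_HW_PORT_MUX,
                                    GFP_KERNEL);
        if (!md_dst)
                return NULL;

        md_dst->u.port_info.port_id = port_id;  /* VF id for TX desc */
        md_dst->u.port_info.lower_dev = lower;  /* the PF netdev */
        return md_dst;
}

static netdev_tx_t vfr_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct vfr_priv *priv = netdev_priv(dev);

        /* Tag the skb with the port it logically egresses through... */
        skb_dst_drop(skb);
        dst_hold(&priv->md_dst->dst);
        skb_dst_set(skb, &priv->md_dst->dst);

        /* ...and let it ride the PF's existing queues. The PF xmit
         * reads port_info.port_id off the dst to fill in the TX
         * descriptor, so: no per-VFR queues, no extra queue
         * selection, no extra locking.
         */
        skb->dev = priv->md_dst->u.port_info.lower_dev;

        /* dev_queue_xmit() consumes the skb even on error, so always
         * report success to the stack here.
         */
        dev_queue_xmit(skb);
        return NETDEV_TX_OK;
}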