On Sun, Apr 13, 2014 at 09:28:51PM -0400, Steven Galgano wrote:
> On 04/13/2014 10:14 AM, Michael S. Tsirkin wrote:
> >
> > Steven, Brian,
> >
> > thanks for reporting this issue.
> > Please see my comments below.
> >
> > On Fri, Apr 11, 2014 at 12:41:42PM -0400, Brian Adamson wrote:
> >> To weigh in on the desire to have support (at least as an optional behavior) for the legacy flow control behavior, there are many existing uses of it. Many of these are related to experimental purposes where the tuntap driver can be used (with a little user-space code) as a surrogate for a network interface type that may not even exist yet. In some cases these experimental purposes have had utility for actual deployment (e.g. disaster-relief wireless networks where the TAP device has provided some intermediate assistance for routing or other protocols, even an underwater acoustic sensor network proposed for reef monitoring, where a TAP device provides a network interface and the sound card is used as a modem on an embedded system). Some of these networks have low data rates, or packet loss and delays, that make TCP (which provides flow control as part of its usual reliable transport for more typical networking purposes) not an ideal protocol to use, and so UDP or other alternatives are used. To keep this short, I'll list a few use cases here that I know of (I was involved with the implementation of some), with links where I know them:
> >>
> >> 1) CORE network emulation tool (http://code.google.com/p/coreemu/)
> >>
> >> 2) EMANE network emulation tool (https://github.com/adjacentlink/emane)
> >>
> >> (likely other network emulation tools exist that have used tuntap as a surrogate for real physical interfaces and expect the same backpressure to sockets and queues that physical interfaces provide)
> >>
> >> 3) I don't have a link to it, but I implemented an experimental IP interface / MAC protocol called SLIDE (serial-link internet daemon) that implemented a user-space CSMA MAC protocol where an underwater acoustic modem was connected to the serial port and TAP was used to present a virtual network interface to the IP stack. Because of the low data rates involved, the back-pressure flow control to application sockets (and to the protocol daemons and qdiscs applied) was important.
> >>
> >> 4) The user-space implementation of Simplified Multicast Forwarding (SMF) of RFC 6621 has a "device" option that establishes TAP interfaces to perform distributed "backpressure"-based flow control (and potentially routing) for MANET wireless networks. (http://www.nrl.navy.mil/itd/ncs/products/smf)
> >>
> >> There are probably more, among the more esoteric wireless and other special networking communities, where hosts (or routing/gateway/proxy non-hosts, e.g. special embedded-system devices based on Linux such as sensors) have a first-hop network attachment that is _not_ the typical Ethernet, and may be using tuntap along with a sort of user-space "driver" to present an IP interface to the network stack. Some of this stuff, especially on embedded systems, tends to lag behind with respect to kernel versions, and this behavior change in Linux may be as yet undiscovered even though the change was made a couple of years ago.
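[For readers unfamiliar with the user-space "driver" pattern Brian describes, a minimal sketch follows: create a TAP interface, then shuttle frames between its file descriptor and whatever backend stands in for the physical medium (serial port, sound card, emulation bus). The interface name and the abbreviated error handling are illustrative only, not taken from any of the tools listed above.]

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <net/if.h>
#include <linux/if_tun.h>

/* Create a TAP interface and return its fd; read()/write() on the fd
 * then carry whole Ethernet frames to/from the kernel network stack. */
static int tap_open(const char *name)
{
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0)
                return -1;

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;    /* raw frames, no packet info header */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}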
> >>
> >> Several of these are implemented across multiple platforms, and, for example, tuntap on BSD-based systems provides the same flow control behavior. Even if it was never formally documented, I think this behavior was fairly well known (at least for these sorts of experimental purposes) and used. I understand the concern that a single badly behaving flow can possibly block the flow of others unless traffic control queuing disciplines are applied (as is done for other network interfaces). For the purposes I'm aware of, I think having this behavior as _optional_ is probably OK ... If accepted, and something is implemented here, it may be a good opportunity to have it documented (along with the pros and cons of its use) for the more general Linux community.
> >
> > Yes, a UDP socket with sufficiently deep qdisc and tun queues would previously get slowed down so that it matches the speed of the interface.
> >
> > But IIUC this was not really designed to be a flow control measure, so depending on what else is in the qdisc you could easily get into a setup where it behaves exactly as it does now. For example, have several UDP sockets send data out a single interface.
> >
> > Another problem is that this depends on userspace being well-behaved and consuming packets in a timely manner: a misbehaving userspace operating a tun device can cause other tun devices and/or sockets to get blocked forever and prevent them from communicating with all destinations (not just the misbehaving one) as their wmem limit is exhausted.
> >
> > It should be possible to reproduce this with an old kernel and your userspace drivers, too - just stop the daemon temporarily. I realize that your daemon normally is well-behaved, and simply moves all incoming packets to the backend without delay, but I'd like to find a solution that addresses this without trusting userspace to be responsive.
> >
> > At the moment, for this use case it is possible to limit the speed of the tun interface using a non-work-conserving qdisc. Make that match the speed of the backend device, and you get back basically the old behaviour without the security problem, in that the rate is ensured by the kernel and all packets queued are eventually consumed.
> >
> > Does this solve the problem for you?

> I do not believe this solves the problem.
>
> The issue is with what happens when a queue is full. When multiqueue support was introduced, the tun went from stopping the queue to dropping packets.

This actually happened previously too, on qdisc queue overrun.

> In the back pressure use case, the application code using the character device side of the tun interface stops reading frames in order to cause the tun queue to back up and eventually stop. This leads to tx socket buffer backup, which provides back pressure to the applications sending traffic over the tun interface.
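[To make the mechanism Steven describes concrete, here is a rough sketch of the daemon side, assuming the legacy (pre-3.8) semantics; backend_congested() and backend_send() are hypothetical stand-ins for the real media driver. Under the legacy behavior, simply not reading from the tun fd lets the tx queue fill and stop, and the senders then block in their sockets rather than losing packets.]

#include <unistd.h>

/* Hypothetical backend hooks: congestion state of the physical or
 * emulated medium, and a primitive to transmit one frame on it. */
extern int backend_congested(void);
extern void backend_send(const char *frame, ssize_t len);

static void relay_loop(int tun_fd)
{
        char frame[2048];
        ssize_t len;

        for (;;) {
                if (backend_congested()) {
                        /* Apply back pressure: stop draining the tun
                         * queue so the kernel pushes back on senders. */
                        usleep(10000);
                        continue;
                }

                len = read(tun_fd, frame, sizeof(frame));  /* blocks when idle */
                if (len > 0)
                        backend_send(frame, len);
        }
}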
> The determination as to when to stop reading frames, or when to throttle back from reading frames, is dynamic. It can be based on many things: congestion on the backend media, power level, data rate, etc.
>
> And the application response to back pressure can be as simple as just blocking on a send(), or more complex, like dynamically throttling data rates or changing compression ratios to reduce overall network load, so that when congestion clears there is less of a strain on the network.
>
> Once the tun started dropping packets instead of stopping the queue, it removed the ability of the application controlling the tun to apply its own queue management policy: read frames and possibly discard, or stop reading frames. That is the problem: the fixed discard policy.
>
> I'll be posting another version of the patch that allows you to specify the flow control behavior on a queue-by-queue basis. This would let the application decide what should happen once a queue is full: discard, or netif_tx_stop_queue().
>
> I'm not hopeful this solution will be accepted, but it might be helpful to someone else.

> >> BTW, when I initially noticed this issue, it _seemed_ that even the default interface pfifo_fast priority bands were not being properly enforced for the tap interface without the old flow control behavior. I need to do a little more "old vs. new" comparison testing in this regard.
> >>
> >> best regards,
> >>
> >> Brian

> > I think that as the tun is never stopped, nothing is ever queued in the qdisc.
> >
> >> On Apr 10, 2014, at 9:42 PM, Steven Galgano <sgalg...@adjacentlink.com> wrote:
> >>
> >>> On 04/10/2014 06:29 AM, Michael S. Tsirkin wrote:
> >>>> On Wed, Apr 09, 2014 at 10:19:40PM -0400, Steven Galgano wrote:
> >>>>> Add tuntap flow control support for use by back pressure routing protocols. Setting the new TUNSETIFF flag IFF_FLOW_CONTROL will signal resources as unavailable when the tx queue limit is reached, by issuing a netif_tx_stop_all_queues() rather than discarding frames. A netif_tx_wake_all_queues() is issued after reading a frame from the queue, to signal resource availability.
> >>>>>
> >>>>> Back pressure capability was previously supported by the legacy tun default mode. This change restores that functionality, which was last present in v3.7.
> >>>>>
> >>>>> Reported-by: Brian Adamson <brian.adam...@nrl.navy.mil>
> >>>>> Tested-by: Joseph Giovatto <jgiova...@adjacentlink.com>
> >>>>> Signed-off-by: Steven Galgano <sgalg...@adjacentlink.com>
> >>>>
> >>>> I don't think it's a good idea.
> >>>>
> >>>> This trivial flow control really created more problems than it was worth.
> >>>>
> >>>> In particular, this blocks all flows, so it's trivially easy for one flow to block and starve all others: just send a bunch of packets to loopback destinations that get queued all over the place.
> >>>>
> >>>> Luckily it was never documented, so we changed the default and nothing seems to break; but we won't be so lucky if we add an explicit API.
> >>>>
> >>>> One way to implement this would be with the ubuf_info callback; this is already invoked in most places where a packet might get stuck for a long time. It's still incomplete though: this will prevent head-of-queue blocking literally forever, but a single bad flow can still degrade performance significantly.
> >>>>
> >>>> Another alternative is to try to isolate the flows that we can handle, and throttle them.
> >>>>
> >>>> It's all fixable, but we really need to fix the issues *before* exposing the interface to userspace.
> >>>
> >>> It was only after recent upgrades that we picked up a newer kernel and discovered the change to the default tun mode.
> >>>
> >>> The new default behavior has broken applications that depend on the legacy behavior. Although not documented, the legacy behavior was well known, at least to those working in the back pressure research community. The default legacy mode was/is a valid use case, although I am not sure it fits with the recent multiqueue support.
> >>>
> >>> When back pressure protocols are running over a tun interface, they require legacy flow control in order to communicate congestion detected on the physical media they are using. Multiqueue does not apply here. These protocols only use one queue, so netif_tx_stop_all_queues() is the necessary behavior.
> >>>
> >>> I'm not tied to the idea of IFF_FLOW_CONTROL. I was aiming for a middle ground where an application controlling the tun interface can state that it wants the legacy flow control behavior, understanding its limitations concerning multiple queues.
> >>>
> >>> What if we resurrect IFF_ONE_QUEUE and use that as the mechanism to indicate legacy flow control? A tun instance initially configured with IFF_ONE_QUEUE would not be allowed to attach or detach queues with TUNSETQUEUE, and any additional opens with the same device name would fail. This mode would use the netif_tx_stop_all_queues()/netif_tx_wake_all_queues() flow control mechanism.
> >>>
> >>> If a tun application wants the current default behavior with only a single queue, it would not set the IFF_ONE_QUEUE flag. Not setting IFF_MULTI_QUEUE would not imply IFF_ONE_QUEUE.
> >>>
> >>> I'd be happy to implement this change if it is an acceptable solution. This would allow multiqueue tun development to advance while still supporting use cases dependent on legacy flow control.
> >>>
> >>> -steve
> >>>
> >>>>> ---
> >>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>> index ee328ba..268130c 100644
> >>>>> --- a/drivers/net/tun.c
> >>>>> +++ b/drivers/net/tun.c
> >>>>> @@ -783,8 +783,19 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>  	 * number of queues.
> >>>>>  	 */
> >>>>>  	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue) * numqueues
> >>>>> -	    >= dev->tx_queue_len)
> >>>>> -		goto drop;
> >>>>> +	    >= dev->tx_queue_len) {
> >>>>> +		if (tun->flags & TUN_FLOW_CONTROL) {
> >>>>> +			/* Resources unavailable, stop transmissions */
> >>>>> +			netif_tx_stop_all_queues(dev);
> >>>>> +
> >>>>> +			/* We won't see all dropped packets individually, so
> >>>>> +			 * an overrun error is more appropriate.
> >>>>> +			 */
> >>>>> +			dev->stats.tx_fifo_errors++;
> >>>>> +		} else {
> >>>>> +			goto drop;
> >>>>> +		}
> >>>>> +	}
> >>>>>
> >>>>>  	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> >>>>>  		goto drop;
> >>>>>
> >>>>> @@ -1362,6 +1373,9 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> >>>>>  			continue;
> >>>>>  		}
> >>>>>
> >>>>> +		/* Wake in case resources previously signaled unavailable */
> >>>>> +		netif_tx_wake_all_queues(tun->dev);
> >>>>> +
> >>>>>  		ret = tun_put_user(tun, tfile, skb, iv, len);
> >>>>>  		kfree_skb(skb);
> >>>>>  		break;
> >>>>> @@ -1550,6 +1564,9 @@ static int tun_flags(struct tun_struct *tun)
> >>>>>  	if (tun->flags & TUN_PERSIST)
> >>>>>  		flags |= IFF_PERSIST;
> >>>>>
> >>>>> +	if (tun->flags & TUN_FLOW_CONTROL)
> >>>>> +		flags |= IFF_FLOW_CONTROL;
> >>>>> +
> >>>>>  	return flags;
> >>>>>  }
> >>>>>
> >>>>> @@ -1732,6 +1749,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>>  	else
> >>>>>  		tun->flags &= ~TUN_TAP_MQ;
> >>>>>
> >>>>> +	if (ifr->ifr_flags & IFF_FLOW_CONTROL)
> >>>>> +		tun->flags |= TUN_FLOW_CONTROL;
> >>>>> +	else
> >>>>> +		tun->flags &= ~TUN_FLOW_CONTROL;
> >>>>> +
> >>>>>  	/* Make sure persistent devices do not get stuck in
> >>>>>  	 * xoff state.
> >>>>>  	 */
> >>>>> @@ -1900,7 +1922,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>>  		 * This is needed because we never checked for invalid flags on
> >>>>>  		 * TUNSETIFF. */
> >>>>>  		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
> >>>>> -				IFF_VNET_HDR | IFF_MULTI_QUEUE,
> >>>>> +				IFF_VNET_HDR | IFF_MULTI_QUEUE |
> >>>>> +				IFF_FLOW_CONTROL,
> >>>>>  				(unsigned int __user*)argp);
> >>>>>  	} else if (cmd == TUNSETQUEUE)
> >>>>>  		return tun_set_queue(file, &ifr);
> >>>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >>>>> index e9502dd..bcf2790 100644
> >>>>> --- a/include/uapi/linux/if_tun.h
> >>>>> +++ b/include/uapi/linux/if_tun.h
> >>>>> @@ -36,6 +36,7 @@
> >>>>>  #define TUN_PERSIST	0x0100
> >>>>>  #define TUN_VNET_HDR	0x0200
> >>>>>  #define TUN_TAP_MQ	0x0400
> >>>>> +#define TUN_FLOW_CONTROL	0x0800
> >>>>>
> >>>>>  /* Ioctl defines */
> >>>>>  #define TUNSETNOCSUM	_IOW('T', 200, int)
> >>>>> @@ -70,6 +71,7 @@
> >>>>>  #define IFF_MULTI_QUEUE	0x0100
> >>>>>  #define IFF_ATTACH_QUEUE	0x0200
> >>>>>  #define IFF_DETACH_QUEUE	0x0400
> >>>>> +#define IFF_FLOW_CONTROL	0x0010
> >>>>>  /* read-only flag */
> >>>>>  #define IFF_PERSIST	0x0800
> >>>>>  #define IFF_NOFILTER	0x1000
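[For completeness, a sketch of how a tun application would request the proposed flag, assuming the patch above is applied; this was never merged, so IFF_FLOW_CONTROL is defined locally with the value from the patch. Note that, as the quoted __tun_chr_ioctl() comment says, TUNSETIFF historically never checked for invalid flags, so on an unpatched kernel this call would succeed silently without any flow control taking effect.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <net/if.h>
#include <linux/if_tun.h>

#ifndef IFF_FLOW_CONTROL
#define IFF_FLOW_CONTROL 0x0010         /* value from the patch; not in mainline */
#endif

int main(void)
{
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_FLOW_CONTROL;
        strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);

        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                perror("TUNSETIFF");
                close(fd);
                return 1;
        }

        /* With the patch applied, a full tx queue now stops the device
         * instead of dropping, and each read() here wakes it again. */

        close(fd);
        return 0;
}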