Re: [ovs-dev] [flow-compat 6/7] ofproto-dpif: Factor NetFlow active timeouts out of flow expiration.

Ben Pfaff Tue, 22 Nov 2011 16:46:09 -0800

On Tue, Nov 15, 2011 at 05:19:16PM -0800, Ben Pfaff wrote:
> On Tue, Nov 08, 2011 at 11:07:58AM -0800, Jesse Gross wrote:
> > On Mon, Nov 7, 2011 at 4:08 PM, Ben Pfaff <b...@nicira.com> wrote:
> > > On Mon, Nov 07, 2011 at 03:54:40PM -0800, Jesse Gross wrote:
> > >> On Mon, Nov 7, 2011 at 1:41 PM, Ben Pfaff <b...@nicira.com> wrote:
> > >> > On Mon, Nov 07, 2011 at 11:18:04AM -0800, Jesse Gross wrote:
> > >> >> On Mon, Nov 7, 2011 at 9:24 AM, Ben Pfaff <b...@nicira.com> wrote:
> > >> >> > On Sun, Nov 06, 2011 at 09:56:10PM -0800, Jesse Gross wrote:
> > >> >> >> On Fri, Nov 4, 2011 at 4:43 PM, Ben Pfaff <b...@nicira.com> wrote:
> > >> >> >> > NetFlow active timeouts were only mixed in with flow expiration 
> > >> >> >> > for
> > >> >> >> > convenience: both processes need to iterate all the facets. ??But
> > >> >> >> > an upcoming commit will change flow expiration to work in terms 
> > >> >> >> > of
> > >> >> >> > a new "subfacet" entity, so they will no longer fit together 
> > >> >> >> > well.
> > >> >> >> >
> > >> >> >> > This change could be seen as an optimization, since NetFlow 
> > >> >> >> > active
> > >> >> >> > timeouts don't ordinarily have to run as often as flow 
> > >> >> >> > expiration,
> > >> >> >> > especially when the flow expiration rate is stepped up due to a
> > >> >> >> > large volume of flows.
> > >> >> >>
> > >> >> >> This has a pretty significant effect on the accuracy of the 
> > >> >> >> timeouts
> > >> >> >> that I'm not sure is intended. ??Currently, active timeouts are 
> > >> >> >> done on
> > >> >> >> a per-flow basis starting from time of first use. ??However, this
> > >> >> >> essentially starts a per-bridge timer on first configuration that 
> > >> >> >> must
> > >> >> >> first expire in order to check the per-flow timer. ??So with the
> > >> >> >> default timeout of 10 minutes, the first active timeout will occur
> > >> >> >> somewhere between 10 and 20 minutes after first use. ??This only
> > >> >> >> happens for the first one though since they will tend to 
> > >> >> >> synchronize.
> > >> >> >> However, I think that there is a potential for the two timers to
> > >> >> >> desynchronize, resulting in apparently random doubling of 
> > >> >> >> intervals.
> > >> >> >> For example, netflow_run() is also called from gen_netflow_rec() 
> > >> >> >> when
> > >> >> >> it fills up a packet but does not check the return code, skipping 
> > >> >> >> the
> > >> >> >> active timeout if a timer tick occurred in that window. ??Finally, 
> > >> >> >> the
> > >> >> >> current active timeout code distributes reporting over a large 
> > >> >> >> span of
> > >> >> >> time but this concentrates all of them at once, which could cause a
> > >> >> >> load spike in the collector if a number of switches are brought up 
> > >> >> >> at
> > >> >> >> the same time.
> > >> >> >
> > >> >> > Hmm.
> > >> >> >
> > >> >> > Maybe I should just do NetFlow reporting once a second (as it was
> > >> >> > before). ??What do you think?
> > >> >>
> > >> >> I think either that or actually tracking when the next timeout will
> > >> >> occur are the only real solutions. ??However, I think the only
> > >> >> efficient way to do correct timeouts is to again combine this with the
> > >> >> flow expiration code, which gets us back to where we were before.
> > >> >
> > >> > I don't understand. ??NetFlow active timeouts are essentially
> > >> > independent of flow expiration, except to the extent that if a flow
> > >> > expires then it doesn't need active timeouts.
> > >>
> > >> Sorry, when I said correct timeouts I meant calculating the timeout
> > >> for the next flow. ??I think this needs to be integrated with flow
> > >> expiration because if a flow expires from inactivity then you have to
> > >> check whether it was the cause of the next active timeout interval and
> > >> if so calculate a new one.
> > >
> > > Off-hand, I don't think it would pay to be *that* precise about the
> > > interval. ??I think that would require something like a min-heap or
> > > other priority queue data structure.
> > 
> > I agree that it's not important to be completely precise, especially
> > since it seems likely the most common cause of error is actually that
> > we check for active timeouts for flows that have already expired.
> > 
> > I do worry about the cost of maintaining a priority queue data
> > structure given that there will be a lot of churn for short lived
> > flows.  I guess you probably end up binning them together but then it
> > makes it difficult to remove unused bins and you probably end up
> > checking for flows with active timeouts once a second anyways.
> > 
> > >> >> When you say do reporting once a second do you mean essentially the
> > >> >> same as in this patch but use 1 second instead of the active timeout
> > >> >> interval or go back to the original version?
> > >> >
> > >> > The same as in this patch but go back to 1 second, which is the
> > >> > minimum rate at which we call the main "expire()" function in
> > >> > ofproto-dpif.c that actually runs the loop above.
> > >>
> > >> So that doubles the number of times that we are iterating over the
> > >> facets. ??Do you think that will be a problem for large numbers of
> > >> flows?
> > >
> > > I guess that it is strictly less expensive that what we have now for
> > > large numbers of flows, since in the limit we currently run both flow
> > > expiration and NetFlow active timeouts 10 times/s, whereas if we drop
> > > NetFlow active timeouts back to once a second then we iterate only
> > > 1/10th as often.
> > 
> > I think that checking to see whether a flow has an expired active
> > timeout has very little cost once you've already gone through the
> > effort of looking at all the flows and the number of flow expiration
> > events actually generated is the same either way so that cost is the
> > same there.
> > 
> > I don't know whether any of this actually matters, I just doubt that
> > it will be more efficient than what we have currently.
> > 
> > > For small numbers of flows, I don't think it matters.
> > 
> > Sure.
> > 
> > > I guess that the average flow expires long before its first active
> > > timeout with the default settings (10 minutes is a long time). ??The
> > > default socket buffer size is about 128 kB. ??We can fit about 2,600
> > > NetFlow v5 expirations in that socket buffer by my
> > > back-of-the-envelope calculation. ??So I think that we could probably
> > > increase the NetFlow active timeout interval well beyond a second and
> > > still have a hard time losing any of them.
> > 
> > I was mostly concerned about getting timely and consistent flow
> > information to the collector and less about filling up the socket
> > buffer.
> 
> This patch is now in the "vlan splinters" series that I just posted:
>       http://openvswitch.org/pipermail/dev/2011-November/012959.html
> I haven't addressed your comments yet but I will.


I decided that the easiest fix is to just use 1-second timeouts:

diff --git a/ofproto/netflow.c b/ofproto/netflow.c
index a128c5f..6e2ddb8 100644
--- a/ofproto/netflow.c
+++ b/ofproto/netflow.c
@@ -234,7 +234,7 @@ netflow_run(struct netflow *nf)
     }
 
     if (nf->active_timeout && time_msec() >= nf->next_timeout) {
-        nf->next_timeout = time_msec() + nf->active_timeout;
+        nf->next_timeout = time_msec() + 1000;
         return true;
     } else {
         return false;

We can do better but I question the value.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Re: [ovs-dev] [flow-compat 6/7] ofproto-dpif: Factor NetFlow active timeouts out of flow expiration.

Reply via email to