On Tue, Nov 15, 2011 at 05:19:16PM -0800, Ben Pfaff wrote: > On Tue, Nov 08, 2011 at 11:07:58AM -0800, Jesse Gross wrote: > > On Mon, Nov 7, 2011 at 4:08 PM, Ben Pfaff <b...@nicira.com> wrote: > > > On Mon, Nov 07, 2011 at 03:54:40PM -0800, Jesse Gross wrote: > > >> On Mon, Nov 7, 2011 at 1:41 PM, Ben Pfaff <b...@nicira.com> wrote: > > >> > On Mon, Nov 07, 2011 at 11:18:04AM -0800, Jesse Gross wrote: > > >> >> On Mon, Nov 7, 2011 at 9:24 AM, Ben Pfaff <b...@nicira.com> wrote: > > >> >> > On Sun, Nov 06, 2011 at 09:56:10PM -0800, Jesse Gross wrote: > > >> >> >> On Fri, Nov 4, 2011 at 4:43 PM, Ben Pfaff <b...@nicira.com> wrote: > > >> >> >> > NetFlow active timeouts were only mixed in with flow expiration > > >> >> >> > for > > >> >> >> > convenience: both processes need to iterate all the facets. ??But > > >> >> >> > an upcoming commit will change flow expiration to work in terms > > >> >> >> > of > > >> >> >> > a new "subfacet" entity, so they will no longer fit together > > >> >> >> > well. > > >> >> >> > > > >> >> >> > This change could be seen as an optimization, since NetFlow > > >> >> >> > active > > >> >> >> > timeouts don't ordinarily have to run as often as flow > > >> >> >> > expiration, > > >> >> >> > especially when the flow expiration rate is stepped up due to a > > >> >> >> > large volume of flows. > > >> >> >> > > >> >> >> This has a pretty significant effect on the accuracy of the > > >> >> >> timeouts > > >> >> >> that I'm not sure is intended. ??Currently, active timeouts are > > >> >> >> done on > > >> >> >> a per-flow basis starting from time of first use. ??However, this > > >> >> >> essentially starts a per-bridge timer on first configuration that > > >> >> >> must > > >> >> >> first expire in order to check the per-flow timer. ??So with the > > >> >> >> default timeout of 10 minutes, the first active timeout will occur > > >> >> >> somewhere between 10 and 20 minutes after first use. ??This only > > >> >> >> happens for the first one though since they will tend to > > >> >> >> synchronize. > > >> >> >> However, I think that there is a potential for the two timers to > > >> >> >> desynchronize, resulting in apparently random doubling of > > >> >> >> intervals. > > >> >> >> For example, netflow_run() is also called from gen_netflow_rec() > > >> >> >> when > > >> >> >> it fills up a packet but does not check the return code, skipping > > >> >> >> the > > >> >> >> active timeout if a timer tick occurred in that window. ??Finally, > > >> >> >> the > > >> >> >> current active timeout code distributes reporting over a large > > >> >> >> span of > > >> >> >> time but this concentrates all of them at once, which could cause a > > >> >> >> load spike in the collector if a number of switches are brought up > > >> >> >> at > > >> >> >> the same time. > > >> >> > > > >> >> > Hmm. > > >> >> > > > >> >> > Maybe I should just do NetFlow reporting once a second (as it was > > >> >> > before). ??What do you think? > > >> >> > > >> >> I think either that or actually tracking when the next timeout will > > >> >> occur are the only real solutions. ??However, I think the only > > >> >> efficient way to do correct timeouts is to again combine this with the > > >> >> flow expiration code, which gets us back to where we were before. > > >> > > > >> > I don't understand. ??NetFlow active timeouts are essentially > > >> > independent of flow expiration, except to the extent that if a flow > > >> > expires then it doesn't need active timeouts. > > >> > > >> Sorry, when I said correct timeouts I meant calculating the timeout > > >> for the next flow. ??I think this needs to be integrated with flow > > >> expiration because if a flow expires from inactivity then you have to > > >> check whether it was the cause of the next active timeout interval and > > >> if so calculate a new one. > > > > > > Off-hand, I don't think it would pay to be *that* precise about the > > > interval. ??I think that would require something like a min-heap or > > > other priority queue data structure. > > > > I agree that it's not important to be completely precise, especially > > since it seems likely the most common cause of error is actually that > > we check for active timeouts for flows that have already expired. > > > > I do worry about the cost of maintaining a priority queue data > > structure given that there will be a lot of churn for short lived > > flows. I guess you probably end up binning them together but then it > > makes it difficult to remove unused bins and you probably end up > > checking for flows with active timeouts once a second anyways. > > > > >> >> When you say do reporting once a second do you mean essentially the > > >> >> same as in this patch but use 1 second instead of the active timeout > > >> >> interval or go back to the original version? > > >> > > > >> > The same as in this patch but go back to 1 second, which is the > > >> > minimum rate at which we call the main "expire()" function in > > >> > ofproto-dpif.c that actually runs the loop above. > > >> > > >> So that doubles the number of times that we are iterating over the > > >> facets. ??Do you think that will be a problem for large numbers of > > >> flows? > > > > > > I guess that it is strictly less expensive that what we have now for > > > large numbers of flows, since in the limit we currently run both flow > > > expiration and NetFlow active timeouts 10 times/s, whereas if we drop > > > NetFlow active timeouts back to once a second then we iterate only > > > 1/10th as often. > > > > I think that checking to see whether a flow has an expired active > > timeout has very little cost once you've already gone through the > > effort of looking at all the flows and the number of flow expiration > > events actually generated is the same either way so that cost is the > > same there. > > > > I don't know whether any of this actually matters, I just doubt that > > it will be more efficient than what we have currently. > > > > > For small numbers of flows, I don't think it matters. > > > > Sure. > > > > > I guess that the average flow expires long before its first active > > > timeout with the default settings (10 minutes is a long time). ??The > > > default socket buffer size is about 128 kB. ??We can fit about 2,600 > > > NetFlow v5 expirations in that socket buffer by my > > > back-of-the-envelope calculation. ??So I think that we could probably > > > increase the NetFlow active timeout interval well beyond a second and > > > still have a hard time losing any of them. > > > > I was mostly concerned about getting timely and consistent flow > > information to the collector and less about filling up the socket > > buffer. > > This patch is now in the "vlan splinters" series that I just posted: > http://openvswitch.org/pipermail/dev/2011-November/012959.html > I haven't addressed your comments yet but I will.
I decided that the easiest fix is to just use 1-second timeouts: diff --git a/ofproto/netflow.c b/ofproto/netflow.c index a128c5f..6e2ddb8 100644 --- a/ofproto/netflow.c +++ b/ofproto/netflow.c @@ -234,7 +234,7 @@ netflow_run(struct netflow *nf) } if (nf->active_timeout && time_msec() >= nf->next_timeout) { - nf->next_timeout = time_msec() + nf->active_timeout; + nf->next_timeout = time_msec() + 1000; return true; } else { return false; We can do better but I question the value. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev