On Mon, 7 Jul 2008, Robert Watson wrote:
On Mon, 7 Jul 2008, Bruce Evans wrote:
(1) sendto() to a specific address and port on a socket that has been bound
to INADDR_ANY and a specific port.
(2) sendto() to a specific address and port on a socket that has been bound
to a specific IP address (not INADDR_ANY) and a specific port.
(3) send() on a socket that has been connect()'d to a specific IP address
and a specific port, and bound to INADDR_ANY and a specific port.
(4) send() on a socket that has been connect()'d to a specific IP address
and a specific port, and bound to a specific IP address (not INADDR_ANY)
and a specific port.
The last of these should really be quite a bit faster than the first of
these, but I'd be interested in seeing specific measurements for each if
that's possible!
Not sure if I understand networking well enough to set these up quickly.
Does netrate use one of (3) or (4) now?
(3) and (4) are effectively the same thing, I think, since connect(2) should
force the selection of a source IP address, but I think it's not a bad idea
to confirm that. :-)
The structure of the desired micro-benchmark here is basically:
...
I hacked netblast.c to do this:
% --- /usr/src/tools/tools/netrate/netblast/netblast.c Fri Dec 16 17:02:44 2005
% +++ netblast.c Mon Jul 14 21:26:52 2008
% @@ -44,9 +44,11 @@
% {
%
% - fprintf(stderr, "netblast [ip] [port] [payloadsize] [duration]\n");
% - exit(-1);
% + fprintf(stderr, "netblast ip port payloadsize duration bind connect\n");
% + exit(1);
% }
%
% +static int gconnected;
% static int global_stop_flag;
% +static struct sockaddr_in *gsin;
%
% static void
% @@ -116,6 +118,13 @@
% counter++;
% }
% - if (send(s, packet, packet_len, 0) < 0)
% + if (gconnected && send(s, packet, packet_len, 0) < 0) {
% send_errors++;
% + usleep(1000);
% + }
% + if (!gconnected && sendto(s, packet, packet_len, 0,
% + (struct sockaddr *)gsin, sizeof(*gsin)) < 0) {
% + send_errors++;
% + usleep(1000);
% + }
% send_calls++;
% }
% @@ -146,9 +155,10 @@
% struct sockaddr_in sin;
% char *dummy, *packet;
% - int s;
% + int bind_desired, connect_desired, s;
%
% - if (argc != 5)
% + if (argc != 7)
% usage();
%
% + gsin = &sin;
% bzero(&sin, sizeof(sin));
% sin.sin_len = sizeof(sin);
% @@ -176,4 +186,7 @@
% usage();
%
% + bind_desired = (strcmp(argv[5], "b") == 0);
% + connect_desired = (strcmp(argv[6], "c") == 0);
% +
% packet = malloc(payloadsize);
% if (packet == NULL) {
% @@ -189,7 +202,19 @@
% }
%
% - if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
% - perror("connect");
% - return (-1);
% + if (bind_desired) {
% + struct sockaddr_in osin;
% +
% + osin = sin;
% + if (inet_aton("0", &sin.sin_addr) == 0)
% + perror("inet_aton(0)");
% + if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% + err(-1, "bind");
% + sin = osin;
% + }
% +
% + if (connect_desired) {
% + if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% + err(-1, "connect");
% + gconnected = 1;
% }
%
This also fixes some bugs in usage() (bogus [] around non-optional args and
bogus exit code) and adds a sleep after send failure. Without the sleep,
netblast distorts the measurements by taking 100% CPU. This depends on
kernel queues having enough buffering to not run dry during the sleep
time (rounded up to a tick boundary). I use ifq_maxlen =
DRIVER_TX_RING_CNT + imax(2 * tick / 4, 10000) = 10512 for DRIVER = bge
and HZ = 100. This is actually wrong now. The magic 2 is to round up to
a tick boundary and the magic 4 is for bge taking a minimum of 4 usec per
packet on old hardware, but bge actually takes about 1.5 usec on the test
hardware and I'd like it to take 0.66 usec. The queues rarely run dry in
practice, but running dry just a few times for a few msec each would
explain some anomalies. Old SGI ttcp uses a select timeout of 18 msec here.
nttcp and netsend use more sophisticated methods that don't work unless HZ
is too small. It's just impossible for a program to schedule its sleeps
with a fine enough resolution to ensure waking up before the queue runs
dry, unless HZ is too small or the queue is too large. select() for
writing doesn't work for the queue part of socket i/o.
Results:
~5.2 sendto (1): 630 kpps 98% CPU 11 cm/p (cache misses/packet (min))
-cur sendto: 590 kpps 100% CPU 10 cm/p (July 8 -current)
(2): no significant difference - see below
~5.2 send (3): 620 kpps 75% CPU 9.5 cm/p
-cur send: 520 kpps 60% CPU 8 cm/p
(4): no significant difference - see below
send() has lower CPU overheads as expected. For some reason, send() gets
lower throughput than sendto(). I think the reason is just that the
queue runs dry due to the lower CPU overhead making it possible for
the userland sender to outrun the hardware -- userland sees more ENOBUFS
and sleeps more often, so it sometimes sleeps too long due to my out of
date hack for increasing the queue length. For some reason, this affects
-current much more than ~5.2 (the bge drivers in each have lots of
modifications which are supposed to be equivalent here). Probably the
same reason. sendto() still has 5-10% higher overhead in -current than in
~5.2 and runs out of CPU. It runs out under ~5.2 when testing ttcp too.
If you look at the design of the higher performance UDP applications, they
will generally bind a specific IP (perhaps every IP on the host with its own
socket), and if they do sustained communication to a specific endpoint they
will use connect(2) rather than providing an address for each send(2) system
call to the kernel.
I couldn't see any effect from binding. I'm only testing sending, and it
doesn't seem to be possible to bind to anything except local addresses
(0.0.0.0, the NIC's address and 127.0.0.1) but these seem to be equivalent
(with no extra work for translation on every packet?) and seem to be used
by default anyway. In the above, sin.sin_addr has to be set to the
receiver's IP from the command line (else it defaults to a local address),
and the patch temporarily sets it back to 0.0.0.0 so as to use the same
sin for the local bind().
udp_output(2) makes the trade-offs there fairly clear: with the most recent
rev, the optimal case is one connect(2) has been called, allowing a single
inpcb read lock and no global data structure access, vs. an application
calling sendto(2) for each system call and the local binding remaining
INADDR_ANY. Middle ground applications, such as named(8) will force a local
binding using bind(2), but then still have to pass an address to each
sendto(2). In the future, this case will be further optimized in our code by
using a global read lock rather than a global write lock: we have to check
for collisions, but we don't actually have to reserve the new 4-tuple for the
UDP socket as it's an ephemeral association rather than a connect(2).
The July 8 -current should have this rev. Note that I'm not testing
SMP or stressing locking, or nontrivial routing tables, or forwarding,
and don't plan to. UP with a direct connection is hard enough and
short of CPU enough to understand and make efficient. Locking barely
shows up in older tests, only partly because it is mostly inline.
Bruce
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net