On Mon, 7 Jul 2008, Robert Watson wrote:
On Mon, 7 Jul 2008, Bruce Evans wrote:
(1) sendto() to a specific address and port on a socket that has been bound
to INADDR_ANY and a specific port.
(2) sendto() to a specific address and port on a socket that has been bound
to a specific IP address (not INADDR_ANY) and a specific port.
(3) send() on a socket that has been connect()'d to a specific IP address
and a specific port, and bound to INADDR_ANY and a specific port.
(4) send() on a socket that has been connect()'d to a specific IP address
and a specific port, and bound to a specific IP address (not INADDR_ANY)
and a specific port.
The last of these should really be quite a bit faster than the first of
these, but I'd be interested in seeing specific measurements for each if
that's possible!
Not sure if I understand networking well enough to set these up quickly.
Does netrate use one of (3) or (4) now?
(3) and (4) are effectively the same thing, I think, since connect(2) should
force the selection of a source IP address, but I think it's not a bad idea
to confirm that. :-)
The structure of the desired micro-benchmark here is basically:
...
I hacked netblast.c to do this:
% --- /usr/src/tools/tools/netrate/netblast/netblast.c Fri Dec 16 17:02:44 2005
% +++ netblast.c Mon Jul 14 21:26:52 2008
% @@ -44,9 +44,11 @@
% {
%
% - fprintf(stderr, "netblast [ip] [port] [payloadsize] [duration]\n");
% - exit(-1);
% + fprintf(stderr, "netblast ip port payloadsize duration bind connect\n");
% + exit(1);
% }
%
% +static int gconnected;
% static int global_stop_flag;
% +static struct sockaddr_in *gsin;
%
% static void
% @@ -116,6 +118,13 @@
% counter++;
% }
% - if (send(s, packet, packet_len, 0) < 0)
% + if (gconnected && send(s, packet, packet_len, 0) < 0) {
% send_errors++;
% + usleep(1000);
% + }
% + if (!gconnected && sendto(s, packet, packet_len, 0,
% + (struct sockaddr *)gsin, sizeof(*gsin)) < 0) {
% + send_errors++;
% + usleep(1000);
% + }
% send_calls++;
% }
% @@ -146,9 +155,10 @@
% struct sockaddr_in sin;
% char *dummy, *packet;
% - int s;
% + int bind_desired, connect_desired, s;
%
% - if (argc != 5)
% + if (argc != 7)
% usage();
%
% + gsin = &sin;
% bzero(&sin, sizeof(sin));
% sin.sin_len = sizeof(sin);
% @@ -176,4 +186,7 @@
% usage();
%
% + bind_desired = (strcmp(argv[5], "b") == 0);
% + connect_desired = (strcmp(argv[6], "c") == 0);
% +
% packet = malloc(payloadsize);
% if (packet == NULL) {
% @@ -189,7 +202,19 @@
% }
%
% - if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
% - perror("connect");
% - return (-1);
% + if (bind_desired) {
% + struct sockaddr_in osin;
% +
% + osin = sin;
% + if (inet_aton("0", &sin.sin_addr) == 0)
% + perror("inet_aton(0)");
% + if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% + err(-1, "bind");
% + sin = osin;
% + }
% +
% + if (connect_desired) {
% + if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% + err(-1, "connect");
% + gconnected = 1;
% }
%
This also fixes some bugs in usage() (bogus [] around non-optional args and
bogus exit code) and adds a sleep after send failure. Without the sleep,
netblast distorts the measurements by taking 100% CPU. This depends on
kernel queues having enough buffering to not run dry during the sleep
time (rounded up to a tick boundary). I use ifq_maxlen =
DRIVER_TX_RING_CNT + imax(2 * tick / 4, 10000) = 10512 for DRIVER = bge
and HZ = 100. This is actually wrong now. The magic 2 is to round up to
a tick boundary and the magic 4 is for bge taking a minimum of 4 usec per
packet on old hardware, but bge actually takes about 1.5 usec on the test
hardware and I'd like it to take 0.66 usec. The queues rarely run dry in
practice, but running dry just a few times for a few msec each would
explain some anomalies. Old SGI ttcp uses a select timeout of 18 msec here.
nttcp and netsend use more sophisticated methods that don't work unless HZ
is too small. It's just impossible for a program to schedule its sleeps
with a fine enough resolution to ensure waking up before the queue runs
dry, unless HZ is too small or the queue is too large. select() for
writing doesn't work for the queue part of socket i/o.
Results:
~5.2 sendto (1): 630 kpps 98% CPU 11 cm/p (cache misses/packet (min))
-cur sendto: 590 kpps 100% CPU 10 cm/p (July 8 -current)
(2): no significant difference - see below
~5.2 send (3): 620 kpps 75% CPU 9.5 cm/p
-cur send: 520 kpps 60% CPU 8 cm/p
(4): no significant difference - see below
send() has lower CPU overheads as expected. For some reason, send() gets
lower throughput than sendto(). I think the reason is just that the
queue runs dry due to the lower CPU overhead making it possible for
the userland sender to outrun the hardware -- userland sees more ENOBUFS
and sleeps more often, so it sometimes sleeps too long due to my out of
date hack for increasing the queue length. For some reason, this affects
-current much more than ~5.2 (the bge drivers in each have lots of
modifications which are supposed to be equivalent here). Probably the
same reason. sendto() still has 5-10% higher overhead in -current than in
~5.2 and runs out of CPU. It runs out under ~5.2 when testing ttcp too.
If you look at the design of the higher performance UDP applications, they
will generally bind a specific IP (perhaps every IP on the host with its own
socket), and if they do sustained communication to a specific endpoint they
will use connect(2) rather than providing an address for each send(2) system
call to the kernel.
I couldn't see any effect from binding. I'm only testing sending, and it
doesn't seem to be possible to bind to anything except local addresses
(0.0.0.0, the NIC's address and 127.0.0.1) but these seem to be equivalent
(with no extra work for translation on every packet?) and seem to be used
by default anyway. In the above, sin.sin_addr has to be set to the
receiver's IP from the command line (else it defaults to a local address),
and the patch temporarily sets it back to 0.0.0.0 so as to use the same
sin for the local bind().
udp_output(2) makes the trade-offs there fairly clear: with the most recent
rev, the optimal case is one connect(2) has been called, allowing a single
inpcb read lock and no global data structure access, vs. an application
calling sendto(2) for each system call and the local binding remaining
INADDR_ANY. Middle ground applications, such as named(8) will force a local
binding using bind(2), but then still have to pass an address to each
sendto(2). In the future, this case will be further optimized in our code by
using a global read lock rather than a global write lock: we have to check
for collisions, but we don't actually have to reserve the new 4-tuple for the
UDP socket as it's an ephemeral association rather than a connect(2).
The July 8 -current should have this rev. Note that I'm not testing
SMP or stressing locking, or nontrivial routing tables, or forwarding,
and don't plan to. UP with a direct connection is hard enough and
short of CPU enough to understand and make efficient. Locking barely
shows up in older tests, only partly because it is mostly inline.
Bruce
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net