On Mon, 17 Dec 2018, Andrew Gallatin wrote:
On 12/17/18 2:08 PM, Bruce Evans wrote:
On Mon, 17 Dec 2018, Andrew Gallatin wrote:
On 12/5/18 9:20 AM, Slava Shwartsman wrote:
Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578
Log:
  mlx5en: Remove the DRBR and associated logic in the transmit path.

  The hardware queues are deep enough currently and using the DRBR and
  associated callbacks only leads to more task switching in the TX path.
  There is also a race setting the queue_state which can lead to hung TX
  rings.
The point of DRBR in the tx path is not simply to provide a software ring
for queuing excess packets.  Rather it provides a mechanism to
avoid lock contention by shoving a packet into the software ring, where
it will later be found & processed, rather than blocking the caller on
a mtx lock.  I'm concerned you may have introduced a performance
regression for use cases where you have N:1 or N:M lock contention where
many threads on different cores are contending for the same tx queue.
The state of the art for this is no longer DRBR, but mp_ring,
as used by both cxgbe and iflib.
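To make the contention-avoidance pattern concrete, here is a rough sketch
(not the mlx5en or cxgbe code; my_txq, my_if_transmit and my_hw_encap are
invented names, while drbr_*() and mtx_trylock() are the stock KPIs):

/*
 * Sketch only.  A drbr-style if_transmit() always enqueues and only the
 * thread that wins the trylock pushes packets to the hardware; contended
 * callers return immediately instead of sleeping on the tx mtx.
 */
static int my_hw_encap(struct my_txq *, struct mbuf *);	/* push one mbuf to hw */

struct my_txq {
	struct mtx	 lock;
	struct buf_ring	*br;		/* the DRBR software ring */
};

static int
my_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct my_txq *txq = if_getsoftc(ifp);	/* assume a single queue */
	struct mbuf *next;
	int error;

	/* Always enqueue; never block the caller on the tx lock. */
	error = drbr_enqueue(ifp, txq->br, m);
	if (error != 0)
		return (error);		/* drbr_enqueue() already freed m */

	/* Only the thread that wins the trylock drains the ring. */
	if (mtx_trylock(&txq->lock)) {
		while ((next = drbr_peek(ifp, txq->br)) != NULL) {
			if (my_hw_encap(txq, next) != 0) {
				drbr_putback(ifp, txq->br, next);
				break;	/* hardware ring full; retry later */
			}
			drbr_advance(ifp, txq->br);
		}
		mtx_unlock(&txq->lock);
	}
	/* Losers return at once; their packet waits in the DRBR. */
	return (0);
}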
iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.  On fast machines, it attempts to do 1 context switch per
packet.
Bruce Evans didn't write that exactly as quoted: some mail program
converted the 2-space sentence breaks to \xc2\xa0.
This can happen even w/o contention when "abdicate" is enabled in mp
ring. I complained about this as well, and the default was changed in
mp ring to not always "abdicate" (e.g., switch to the tq to handle the
packet). Abdication substantially pessimizes uncontended Netflix-style web
workloads, but it generally helps small packet forwarding.
It is interesting that you see the opposite. I should try benchmarking
with just a single ring.
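For anyone who hasn't read mp_ring: the policy difference boils down to
roughly the following toy model (the names and fields are invented, not
the real struct ifmp_ring; see sys/net/mp_ring.c and the patch quoted
further down for the real thing):

enum toy_flags { TOY_IDLE, TOY_BUSY, TOY_ABDICATED };

struct toy_ring {
	enum toy_flags	flags;
	int		pending;
};

static void
toy_enqueue(struct toy_ring *r, int abdicate)
{
	r->pending++;			/* the packet is now in the ring */
	if (abdicate) {
		/*
		 * Never drain from the enqueueing thread: flag the ring and
		 * let the tx taskqueue thread do the work, which costs a
		 * wakeup and a context switch per batch (or per packet when
		 * the load is light).
		 */
		if (r->flags == TOY_IDLE)
			r->flags = TOY_ABDICATED;
		return;
	}
	/*
	 * Old behaviour: the first producer turns itself into the consumer
	 * and drains inline; later producers see TOY_BUSY and just leave
	 * their packets behind.
	 */
	if (r->flags != TOY_BUSY) {
		r->flags = TOY_BUSY;
		/* ... drain r->pending packets to the NIC, then go TOY_IDLE */
	}
}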
Hmm, I didn't remember "abdicated" and never knew about the sysctl for it
(the sysctl is newer), but I noticed the slowdown from near the first
commit for it (r323954) and already used the following workaround for it:
XX Index: iflib.c
XX ===================================================================
XX --- iflib.c (revision 332488)
XX +++ iflib.c (working copy)
XX @@ -1,3 +1,5 @@
XX +int bde_oldnet = 1;
XX +
XX  /*-
XX   * Copyright (c) 2014-2018, Matthew Macy <mm...@mattmacy.io>
XX   * All rights reserved.
XX @@ -3650,9 +3652,17 @@
XX  		IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id);
XX  		return;
XX  	}
XX +if (bde_oldnet) {
XX  	if (txq->ift_db_pending)
XX  		ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE);
XX +	else
XX +		ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX +} else {
XX +	if (txq->ift_db_pending)
XX +		ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE);
XX  	ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX +}
XX +	ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX  	if (ctx->ifc_flags & IFC_LEGACY)
XX  		IFDI_INTR_ENABLE(ctx);
XX  	else {
XX @@ -3862,8 +3872,11 @@
XX  	DBG_COUNTER_INC(tx_seen);
XX  	err = ifmp_ring_enqueue(txq->ift_br, (void **)&m, 1, TX_BATCH_SIZE);
XX 
XX +if (!bde_oldnet)
XX  	GROUPTASK_ENQUEUE(&txq->ift_task);
XX  	if (err) {
XX +if (bde_oldnet)
XX +	GROUPTASK_ENQUEUE(&txq->ift_task);
XX  		/* support forthcoming later */
XX  #ifdef DRIVER_BACKPRESSURE
XX  		txq->ift_closed = TRUE;
XX @@ -3870,6 +3883,9 @@
XX  #endif
XX  		ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX  		m_freem(m);
XX +	} else if (TXQ_AVAIL(txq) < (txq->ift_size >> 1)) {
XX +if (bde_oldnet)
XX +		GROUPTASK_ENQUEUE(&txq->ift_task);
XX  	}
XX 
XX  	return (err);
XX Index: mp_ring.c
XX ===================================================================
XX --- mp_ring.c (revision 332488)
XX +++ mp_ring.c (working copy)
XX @@ -1,3 +1,11 @@
XX +#include "opt_pci.h"
XX +
XX +#ifdef DEV_PCI
XX +extern int bde_oldnet;
XX +#else
XX +#define bde_oldnet 0
XX +#endif
XX +
XX  /*-
XX   * Copyright (c) 2014 Chelsio Communications, Inc.
XX   * All rights reserved.
XX @@ -454,12 +462,25 @@
XX  	do {
XX  		os.state = ns.state = r->state;
XX  		ns.pidx_tail = pidx_stop;
XX +if (bde_oldnet)
XX +		ns.flags = BUSY;
XX +else {
XX  		if (os.flags == IDLE)
XX  			ns.flags = ABDICATED;
XX +}
XX  	} while (atomic_cmpset_rel_64(&r->state, os.state, ns.state) == 0);
XX  	critical_exit();
XX  	counter_u64_add(r->enqueues, n);
XX 
XX +if (bde_oldnet) {
XX +	/*
XX +	 * Turn into a consumer if some other thread isn't active as a consumer
XX +	 * already.
XX +	 */
XX +	if (os.flags != BUSY)
XX +		drain_ring_lockless(r, ns, os.flags, budget);
XX +}
XX +
XX  	return (0);
XX  }
XX  #endif
XX @@ -470,10 +491,15 @@
XX  	union ring_state os, ns;
XX 
XX  	os.state = r->state;
XX +if (bde_oldnet) {
XX +	if (os.flags != STALLED || os.pidx_head != os.pidx_tail || r->can_drain(r) == 0)
XX +		return;
XX +} else {
XX  	if ((os.flags != STALLED && os.flags != ABDICATED) ||	// Only continue in STALLED and ABDICATED
XX  	    os.pidx_head != os.pidx_tail ||			// Require work to be available
XX  	    (os.flags != ABDICATED && r->can_drain(r) == 0))	// Can either drain, or everyone left
XX  		return;
XX +}
XX 
XX  	MPASS(os.cidx != os.pidx_tail);	/* implied by STALLED */
XX  	ns.state = os.state;
This essentially just adds back the previous code with a flag to check
both versions. Hopefully the sysctl can do the same thing.
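Something like the following would do for a global knob (a sketch only;
the OID and name are invented here, and whatever knob iflib ends up with
may well be per-device and shaped differently):

/*
 * Sketch of a sysctl/tunable in place of the hard-coded bde_oldnet flag.
 */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_net);

static int bde_oldnet = 1;	/* 1 = pre-r323954 behaviour (drain inline) */
SYSCTL_INT(_net, OID_AUTO, iflib_oldnet, CTLFLAG_RWTUN, &bde_oldnet, 0,
    "Use the pre-abdication tx path (drain the mp_ring inline)");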
The log message for r323954 claims an improvement of 30% for forwarding
small packets, but I saw only lots more context switches, more CPU use to
do them, and a small regression for send(2)ing packets.
My hack reduces the context switches but doesn't change the rate much.
I don't understand forwarding but think it is usually entirely in the
kernel and doesn't do as much blasting as netblast (just bursts of
received packets with rate and burst size limiting by interrupt
moderation, and with no syscall slowness between the packets?). This
is a very different load.
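For concreteness, the send(2) blasting load is essentially a tight
sendto() loop in the spirit of tools/tools/netrate/netblast, i.e. one
syscall per packet (this is a stripped-down stand-in, not the actual
netblast source; the target address is made up):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <err.h>

int
main(void)
{
	struct sockaddr_in sin;
	char payload[18];	/* 18 bytes of UDP payload -> 64-byte frames */
	int s;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		err(1, "socket");
	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_port = htons(9);			/* discard port */
	sin.sin_addr.s_addr = inet_addr("192.168.0.2");	/* example target */
	memset(payload, 0, sizeof(payload));

	/* One syscall per packet; the only pacing is how fast send() runs. */
	for (;;)
		(void)sendto(s, payload, sizeof(payload), 0,
		    (struct sockaddr *)&sin, sizeof(sin));
}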
Ping (-fq) latency increased a while ago, but I forget when, and my hack
doesn't affect it much. The increase was much larger for an I218V than for a
PRO1000 (fairly low end). This might be due to timing problems increased
by non-synchronous transmitting.
...
I think part of the problem with iflib is just its sheer size, but
some of that is hard to avoid due to it being an intermediate shim
layer & having to translate packets.
Hmm, the only thing I like about it is that it seems to be less sheer
than a vendor driver :-) (1.5MB for e1000 and only 175K for iflib.c).
One thing I think could help is to do the conversion from mbufs
to iflib's packet info data structure at entry into iflib, and not
at xmit time. This is the stuff that parses
the mbufs, does the virt to phys, and basically will take cache
misses if it runs on a different CPU, but seems less likely to
take cache misses if run just after ether_output() (which has
likely already taken misses). I've been trying to find the
time to make this change for a while. It would be interesting
to see if it helps your workload too.
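If I understand the idea, it amounts to something like the following
sketch (struct tx_slot and parse_tx_mbuf() are invented names, not iflib's
if_pkt_info / iflib_parse_header(), and it is a kernel-side fragment that
needs sys/mbuf.h, net/ethernet.h and netinet/ip.h): parse the headers
while they are still hot in the submitting CPU's cache and carry the
result through the ring instead of the bare mbuf.

struct tx_slot {
	struct mbuf	*m;
	uint16_t	etype;		/* ethertype */
	uint8_t		ehdrlen;	/* l2 header length */
	uint8_t		iphlen;		/* l3 header length */
	uint32_t	csum_flags;	/* offload requests from the mbuf */
};

static int
parse_tx_mbuf(struct mbuf *m, struct tx_slot *slot)
{
	struct ether_header *eh;
	struct ip *ip;

	if (m->m_len < ETHER_HDR_LEN)
		return (EINVAL);	/* sketch: punt instead of pulling up */
	/* Headers are still cache-hot here, right after ether_output(). */
	eh = mtod(m, struct ether_header *);
	slot->m = m;
	slot->etype = ntohs(eh->ether_type);
	slot->ehdrlen = ETHER_HDR_LEN;	/* this sketch ignores VLAN tags */
	slot->csum_flags = m->m_pkthdr.csum_flags;
	if (slot->etype == ETHERTYPE_IP &&
	    m->m_len >= slot->ehdrlen + (int)sizeof(struct ip)) {
		ip = (struct ip *)((char *)eh + slot->ehdrlen);
		slot->iphlen = ip->ip_hl << 2;
	} else
		slot->iphlen = 0;
	return (0);
}

The transmit entry point would fill a tx_slot and enqueue that on the
mp_ring instead of the bare mbuf; the drain side then only copies the
precomputed fields (and, in a fuller version, the bus_dma segments) into
the hardware descriptors.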
I forgot about cache misses. Separate threads must make them worse,
especially for user threads and with spectres breaking caching. I
think L2+ caches are shared across CPUs for x86, but some spectre
cases flush the TLB which is not quite as bad as flushing caches.
In my old work on bge, -current (6-8 at the time) seemed to be slower
than my version in ~5.2 with more than half of the difference due to
1 or 2 more cache misses in the send[to]() path. I never located these
exactly, and got nowhere with prefetch instructions on old Athlons.
send() takes about 1600 nsec on an original Athlon64 ~2GHz, and though
the main memory latency was much lower than now (42.7 nsec according
to lmbench), adding a couple of cache misses to 1600 nsec means that
the CPU saturates before the NIC+PCI33.
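Back-of-the-envelope, using the numbers above (my arithmetic, not a
measurement):

#include <stdio.h>

int
main(void)
{
	double send_ns = 1600.0;	/* send() cost quoted above */
	double miss_ns = 42.7;		/* lmbench main memory latency above */
	double misses = 2.0;		/* extra misses vs. the ~5.2 kernel */

	/* 1e9 ns/sec divided by the per-packet cost, printed in kpps */
	printf("without the extra misses: %.0f kpps\n", 1e6 / send_ns);
	printf("with the extra misses:    %.0f kpps\n",
	    1e6 / (send_ns + misses * miss_ns));
	return (0);
}

At these numbers the misses alone cost only a few per cent, but the loss
lands on the path that is already the bottleneck, and the NIC side does
not get any slower.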
CPU affinity doesn't have much effect on the results of my benchmarks.
I think threads should never be bound to CPUs like iflib does, unless
the CPUs can be dedicated to a single thread. Schedulers usually do a
good job of picking the best CPU to run on (I fixed my version of 4BSD
to do this slightly better than ULE for HTT pairs). Binding is home
made scheduling which cannot adapt to changing loads, and it prevents
the scheduler from picking the best CPU in some cases.
One case is if previous activity resulted in all CPUs being non-idle.
Then if CPU0 and CPU1 but no others become idle, threads on the other CPUs
shouldn't be switched to CPU0 or CPU1, even when HTT sharing means one of
the threads now on CPU[2-3] would run faster after migrating to CPU0 -- it
just takes too long to migrate. Then if a bound
thread needs to run on CPU6, it either has to wait or the thread on
CPU6 must be migrated to CPU0 or CPU1. The thread should be unbound so
that it can run immediately on any available CPU (CPU0 or CPU1 here).
Bruce