from:"Andrew Gallatin"

git: 1f628be888b7 - main - tcp_ratelimit: provide an api for drivers to release ratesets at detach

2024-08-05 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=1f628be888b74f1219b3ea7ccea1e7a3d1db77a2

commit 1f628be888b74f1219b3ea7ccea1e7a3d1db77a2
Author: Andrew Gallatin 
AuthorDate: 2024-08-05 15:45:42 +
Commit: Andrew Gallatin 
CommitDate: 2024-08-05 16:51:35 +

tcp_ratelimit: provide an api for drivers to release ratesets at detach

When the kernel is compiled with options RATELIMIT, the
mlx5en driver cannot detach. It gets stuck waiting for all
kernel users of its rates to drop to zero before finally calling
ether_ifdetach.

The tcp ratelimit code has an eventhandler for ifnet departure
which causes rates to be released. However, this is called as an
ifnet departure eventhandler, which is invoked as part of
ifdetach(), via ether_ifdetach(). This means that the tcp
ratelimit code holds down many hw rates when the mlx5en driver
is waiting for the rate count to go to 0. Thus devctl detach
will deadlock on mlx5 with this stack:
mi_switch+0xcf sleepq_timedwait+0x2f _sleep+0x1a3 pause_sbt+0x77 
mlx5e_destroy_ifp+0xaf mlx5_remove_device+0xa7 mlx5_unregister_device+0x78 
mlx5_unload_one+0x10a remove_one+0x1e linux_pci_detach_device+0x36 
linux_pci_detach+0x24 device_detach+0x180 devctl2_ioctl+0x3dc devfs_ioctl+0xbb 
vn_ioctl+0xca devfs_ioctl_f+0x1e kern_ioctl+0x1c3 sys_ioctl+0x10a

To fix this, provide an explicit API for a driver to call the tcp
ratelimit code telling it to detach itself from an ifnet. This
allows the mlx5 driver to unload cleanly. I considered adding an
ifnet pre-departure eventhandler. However, that would need to be
invoked by the driver, so a simple function call seemed better.

The mlx5en driver has been updated to call this function.
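
The ordering problem described above can be modeled in userland C. This is only a sketch under stated assumptions, not the kernel code: `fake_ifnet`, `fake_rl_release()`, `fake_ifdetach()` and the two `fake_detach_*()` helpers are hypothetical names, and a bounded polling loop stands in for the driver's sleep so that the example terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ifnet whose hardware rates are in use. */
struct fake_ifnet {
	int rate_refs;	/* rates held by the modeled TCP ratelimit code */
	bool detached;	/* set once the modeled ether_ifdetach() has run */
};

/* Models the ratelimit code releasing every rate tied to this ifnet. */
static void
fake_rl_release(struct fake_ifnet *ifp)
{
	assert(ifp != NULL);
	ifp->rate_refs = 0;
}

/* Models ether_ifdetach(): the departure handler only runs from here. */
static void
fake_ifdetach(struct fake_ifnet *ifp)
{
	fake_rl_release(ifp);
	ifp->detached = true;
}

/*
 * Old ordering: poll until the rate count reaches zero, then detach.
 * The bounded loop replaces the driver's sleep so the sketch terminates;
 * a false return models the hang.
 */
static bool
fake_detach_old(struct fake_ifnet *ifp, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (ifp->rate_refs == 0) {
			fake_ifdetach(ifp);
			return (true);
		}
	}
	return (false);
}

/* New ordering: ask for the rates to be released first, then wait. */
static bool
fake_detach_new(struct fake_ifnet *ifp, int max_polls)
{
	fake_rl_release(ifp);	/* models tcp_rl_release_ifnet(ifp) */
	return (fake_detach_old(ifp, max_polls));
}
```

With three rates outstanding, `fake_detach_old()` never reaches the detach step, because the only code that would release the rates runs inside that step. `fake_detach_new()` breaks the cycle by releasing them up front.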

Reviewed by: kib, rrs

Differential Revision:  https://reviews.freebsd.org/D46221
Sponsored by: Netflix
---
 sys/dev/mlx5/mlx5_en/mlx5_en_main.c | 8 +++-
 sys/netinet/tcp_ratelimit.c | 6 ++
 sys/netinet/tcp_ratelimit.h | 9 +
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
index ccbdf11a1dd5..a80235f0f347 100644
--- a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
+++ b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
@@ -36,6 +36,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 
@@ -4876,7 +4877,12 @@ mlx5e_destroy_ifp(struct mlx5_core_dev *mdev, void *vpriv)
 
 #ifdef RATELIMIT
/*
-* The kernel can have reference(s) via the m_snd_tag's into
+* Tell the TCP ratelimit code to release the rate-sets attached
+* to our ifnet.
+*/
+   tcp_rl_release_ifnet(ifp);
+   /*
+* The kernel can still have reference(s) via the m_snd_tag's into
 * the ratelimit channels, and these must go away before
 * detaching:
 */
diff --git a/sys/netinet/tcp_ratelimit.c b/sys/netinet/tcp_ratelimit.c
index 1834c702c493..22bdf707fa89 100644
--- a/sys/netinet/tcp_ratelimit.c
+++ b/sys/netinet/tcp_ratelimit.c
@@ -1298,6 +1298,12 @@ tcp_rl_ifnet_departure(void *arg __unused, struct ifnet *ifp)
NET_EPOCH_EXIT(et);
 }
 
+void
+tcp_rl_release_ifnet(struct ifnet *ifp)
+{
+   tcp_rl_ifnet_departure(NULL, ifp);
+}
+
 static void
 tcp_rl_shutdown(void *arg __unused, int howto __unused)
 {
diff --git a/sys/netinet/tcp_ratelimit.h b/sys/netinet/tcp_ratelimit.h
index cd540d1164e1..0ce42dea0d90 100644
--- a/sys/netinet/tcp_ratelimit.h
+++ b/sys/netinet/tcp_ratelimit.h
@@ -94,6 +94,8 @@ CK_LIST_HEAD(head_tcp_rate_set, tcp_rate_set);
 #ifndef ETHERNET_SEGMENT_SIZE
 #define ETHERNET_SEGMENT_SIZE 1514
 #endif
+struct tcpcb;
+
 #ifdef RATELIMIT
 #define DETAILED_RATELIMIT_SYSCTL 1/*
 * Undefine this if you don't want
@@ -131,6 +133,9 @@ tcp_get_pacing_burst_size_w_divisor(struct tcpcb *tp, uint64_t bw, uint32_t segs
 void
 tcp_rl_log_enobuf(const struct tcp_hwrate_limit_table *rte);
 
+void
+tcp_rl_release_ifnet(struct ifnet *ifp);
+
 #else
 static inline const struct tcp_hwrate_limit_table *
 tcp_set_pacing_rate(struct tcpcb *tp, struct ifnet *ifp,
@@ -218,6 +223,10 @@ tcp_rl_log_enobuf(const struct tcp_hwrate_limit_table *rte)
 {
 }
 
+static inline void
+tcp_rl_release_ifnet(struct ifnet *ifp)
+{
+}
 #endif
 
 /*

git: 13a5a46c49d0 - main - Fix new users of MAXPHYS and hide it from the kernel namespace

2024-04-30 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=13a5a46c49d0ec3e10e5476ad763947f165052e2

commit 13a5a46c49d0ec3e10e5476ad763947f165052e2
Author: Andrew Gallatin 
AuthorDate: 2024-04-29 23:11:56 +
Commit: Andrew Gallatin 
CommitDate: 2024-04-30 19:29:06 +

Fix new users of MAXPHYS and hide it from the kernel namespace

In cd8537910406, kib made maxphys a load-time tunable.  This made
the #define MAXPHYS in sys/param.h  almost entirely obsolete, as
it could now be overridden by kern.maxphys at boot time, or by
opt_maxphys.h.

However, decades of tradition have led to several new, incorrect, uses
of MAXPHYS in other parts of the kernel, mostly by seasoned
developers.  I've corrected those uses here in a mechanical fashion,
and verified that it fixes a bug in the md driver that I was
experiencing.

Since using MAXPHYS is such an easy mistake to make, it is best to
hide it from the kernel namespace.  So I've moved its definition to
_maxphys.h, which is now included in param.h only for userspace.

That brings up the fact that lots of userspace programs use MAXPHYS
for different reasons, most of them probably wrong.  Userspace consumers
that really need to know the value of maxphys should probably be
changed to use the kern.maxphys sysctl.  But that's outside the scope
of this change.
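
As a rough illustration of why a stale use of the constant matters, the sketch below compares a limit check against a compile-time value with one against the runtime value. All names and sizes here are hypothetical (`MODEL_MAXPHYS`, `model_maxphys`, `fits_constant()`, `fits_tunable()`), and the sketch does not claim to reproduce the md bug mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical compile-time default, standing in for MAXPHYS. */
#define MODEL_MAXPHYS	(128UL * 1024)

/* Hypothetical runtime limit, standing in for the maxphys variable. */
static unsigned long model_maxphys = MODEL_MAXPHYS;

/* Models a boot-time override such as kern.maxphys. */
static void
model_set_maxphys(unsigned long value)
{
	model_maxphys = value;
}

/* Stale check: still compares against the compile-time constant. */
static bool
fits_constant(size_t len)
{
	return (len <= MODEL_MAXPHYS);
}

/* Corrected check: compares against the limit actually in effect. */
static bool
fits_tunable(size_t len)
{
	return (len <= model_maxphys);
}
```

Once the tunable is raised above the compiled-in default, the two checks disagree for any length between the two values, which is the kind of mismatch the mechanical MAXPHYS to maxphys conversion removes.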

Reviewed by: imp, jkim, kib, markj
Fixes: 30038a8b4efc ("md: Get rid of the pbuf zone")
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D44986
---
 sys/compat/linux/linux_socket.c |  2 +-
 sys/dev/md/md.c |  6 +++---
 sys/dev/rtsx/rtsx.c |  4 ++--
 sys/kern/subr_param.c   |  1 +
 sys/sys/_maxphys.h  | 10 ++
 sys/sys/param.h |  8 +---
 6 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/sys/compat/linux/linux_socket.c b/sys/compat/linux/linux_socket.c
index 36cffc979802..15431bf3127c 100644
--- a/sys/compat/linux/linux_socket.c
+++ b/sys/compat/linux/linux_socket.c
@@ -2468,7 +2468,7 @@ sendfile_fallback(struct thread *td, struct file *fp, l_int out,
out_offset = 0;
 
flags = FOF_OFFSET | FOF_NOUPDATE;
-   bufsz = min(count, MAXPHYS);
+   bufsz = min(count, maxphys);
buf = malloc(bufsz, M_LINUX, M_WAITOK);
bytes_sent = 0;
while (bytes_sent < count) {
diff --git a/sys/dev/md/md.c b/sys/dev/md/md.c
index 27e63363767c..241517898ad4 100644
--- a/sys/dev/md/md.c
+++ b/sys/dev/md/md.c
@@ -965,7 +965,7 @@ unmapped_step:
PAGE_MASK;
iolen = min(ptoa(npages) - (ma_offs & PAGE_MASK), len);
KASSERT(iolen > 0, ("zero iolen"));
-   KASSERT(npages <= atop(MAXPHYS + PAGE_SIZE),
+   KASSERT(npages <= atop(maxphys + PAGE_SIZE),
("npages %d too large", npages));
pmap_qenter(sc->kva, &bp->bio_ma[atop(ma_offs)], npages);
aiov.iov_base = (void *)(sc->kva + (ma_offs & PAGE_MASK));
@@ -1487,7 +1487,7 @@ mdcreate_vnode(struct md_s *sc, struct md_req *mdr, struct thread *td)
goto bad;
}
 
-   sc->kva = kva_alloc(MAXPHYS + PAGE_SIZE);
+   sc->kva = kva_alloc(maxphys + PAGE_SIZE);
return (0);
 bad:
VOP_UNLOCK(nd.ni_vp);
@@ -1547,7 +1547,7 @@ mddestroy(struct md_s *sc, struct thread *td)
if (sc->uma)
uma_zdestroy(sc->uma);
if (sc->kva)
-   kva_free(sc->kva, MAXPHYS + PAGE_SIZE);
+   kva_free(sc->kva, maxphys + PAGE_SIZE);
 
LIST_REMOVE(sc, list);
free_unr(md_uh, sc->unit);
diff --git a/sys/dev/rtsx/rtsx.c b/sys/dev/rtsx/rtsx.c
index 464a155e66c2..a2f124f6c30d 100644
--- a/sys/dev/rtsx/rtsx.c
+++ b/sys/dev/rtsx/rtsx.c
@@ -311,7 +311,7 @@ static int  rtsx_resume(device_t dev);
 #define RTSX_DMA_ALIGN  4
 #define RTSX_HOSTCMD_MAX 256
 #define RTSX_DMA_CMD_BIFSIZE (sizeof(uint32_t) * RTSX_HOSTCMD_MAX)
-#define RTSX_DMA_DATA_BUFSIZE   MAXPHYS
+#define RTSX_DMA_DATA_BUFSIZE   maxphys
 
 #define ISSET(t, f) ((t) & (f))
 
@@ -2762,7 +2762,7 @@ rtsx_xfer(struct rtsx_softc *sc, struct mmc_command *cmd)
  (unsigned long)cmd->data->len, (unsigned long)cmd->data->xfer_len);
 
if (cmd->data->len > RTSX_DMA_DATA_BUFSIZE) {
-   device_printf(sc->rtsx_dev, "rtsx_xfer() length too large: %ld > %d\n",
+   device_printf(sc->rtsx_dev, "rtsx_xfer() length too large: %ld > %ld\n",
  (unsigned long)cmd->data->len, RTSX_DMA_DATA_BUFSIZE);
cmd->error = MMC_ERR_INVALID;

git: 530c2c30b0c7 - main - ip6_output: Reduce cache misses on pktopts

2024-03-20 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=530c2c30b0c75f1a71df637ae1e09b352f8256cb

commit 530c2c30b0c75f1a71df637ae1e09b352f8256cb
Author: Andrew Gallatin 
AuthorDate: 2024-03-20 19:46:01 +
Commit: Andrew Gallatin 
CommitDate: 2024-03-20 19:50:57 +

ip6_output: Reduce cache misses on pktopts

When profiling an IP6 heavy workload, I noticed that we were
getting a lot of cache misses in ip6_output() around
ip6_pktopts. This was happening because the TCP stack passes
inp->in6p_outputopts even if all options are unused. So in the
common case of no options present, pkt_opts is not null, and is
checked repeatedly for different options. Since ip6_pktopts is
large (4 cachelines), and every field is checked, we take 4
cache misses (2 of which tend to be hidden by the adjacent line
prefetcher).

To fix this common case, I introduced a new flag in ip6_pktopts
(ip6po_valid) which tracks which options have been set. In the
common case where nothing is set, this causes just a single
cache miss to load. It also eliminates a test for some options
(if (opt != NULL && opt->val >= const) vs. if ((optvalid & flag) != 0)).

To keep the struct the same size in 64-bit kernels, and to keep
the integer values (like ip6po_hlim, ip6po_tclass, etc) on the
same cacheline, I moved them to the top.
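
The idea can be sketched in plain C. This is a simplified model under stated assumptions, not the actual ip6_pktopts layout: `model_pktopts`, the `MODEL_VALID_*` bits and both helper functions are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real names live in ip6_var.h. */
#define MODEL_VALID_HLIM	0x0001u
#define MODEL_VALID_TCLASS	0x0002u
#define MODEL_VALID_HBH		0x0004u

/* Simplified options block: the flag word and small integers come first. */
struct model_pktopts {
	uint32_t valid;	/* which fields below have been set */
	int hlim;
	int tclass;
	void *hbh;	/* stands in for the pointer-valued options */
};

/* Setting an option records it in the flag word as well. */
static void
model_set_tclass(struct model_pktopts *opt, int tclass)
{
	opt->tclass = tclass;
	opt->valid |= MODEL_VALID_TCLASS;
}

/* One load of `valid` answers every "is this option set?" question. */
static int
model_effective_tclass(const struct model_pktopts *opt, int dflt)
{
	uint32_t valid = (opt != NULL) ? opt->valid : 0;

	if ((valid & MODEL_VALID_TCLASS) != 0)
		return (opt->tclass);
	return (dflt);
}
```

When no option is set, the caller reads only the flag word, so the rest of the structure does not need to be pulled into the cache.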

As suggested by zlei, the null check in MAKE_EXTHDR() becomes
redundant, and can be removed.

For our web server workload (with the ip6po_tclass option set),
this drops the CPI from 2.9 to 2.4 for ip6_output.

Differential Revision: https://reviews.freebsd.org/D44204
Reviewed by: bz, glebius, zlei
No Objection from: melifaro
Sponsored by: Netflix Inc.
---
 sys/netinet6/ip6_output.c | 67 ---
 sys/netinet6/ip6_var.h| 56 +--
 2 files changed, 83 insertions(+), 40 deletions(-)

diff --git a/sys/netinet6/ip6_output.c b/sys/netinet6/ip6_output.c
index a2c3efad749b..530f86c36689 100644
--- a/sys/netinet6/ip6_output.c
+++ b/sys/netinet6/ip6_output.c
@@ -159,14 +159,12 @@ static int copypktopts(struct ip6_pktopts *, struct ip6_pktopts *, int);
  */
 #define MAKE_EXTHDR(hp, mp, _ol) \
 do {   \
-   if (hp) {   \
-   struct ip6_ext *eh = (struct ip6_ext *)(hp);\
-   error = ip6_copyexthdr((mp), (caddr_t)(hp), \
-   ((eh)->ip6e_len + 1) << 3); \
-   if (error)  \
-   goto freehdrs;  \
-   (_ol) += (*(mp))->m_len;\
-   }   \
+   struct ip6_ext *eh = (struct ip6_ext *)(hp);\
+   error = ip6_copyexthdr((mp), (caddr_t)(hp), \
+   ((eh)->ip6e_len + 1) << 3); \
+   if (error)  \
+   goto freehdrs;  \
+   (_ol) += (*(mp))->m_len;\
 } while (/*CONSTCOND*/ 0)
 
 /*
@@ -431,6 +429,7 @@ ip6_output(struct mbuf *m0, struct ip6_pktopts *opt,
uint32_t fibnum;
struct m_tag *fwd_tag = NULL;
uint32_t id;
+   uint32_t optvalid;
 
NET_EPOCH_ASSERT();
 
@@ -491,14 +490,17 @@ ip6_output(struct mbuf *m0, struct ip6_pktopts *opt,
 * Keep the length of the unfragmentable part for fragmentation.
 */
bzero(&exthdrs, sizeof(exthdrs));
-   optlen = 0;
+   optlen = optvalid = 0;
unfragpartlen = sizeof(struct ip6_hdr);
if (opt) {
+   optvalid = opt->ip6po_valid;
+
/* Hop-by-Hop options header. */
-   MAKE_EXTHDR(opt->ip6po_hbh, &exthdrs.ip6e_hbh, optlen);
+   if ((optvalid & IP6PO_VALID_HBH) != 0)
+   MAKE_EXTHDR(opt->ip6po_hbh, &exthdrs.ip6e_hbh, optlen);
 
/* Destination options header (1st part). */
-   if (opt->ip6po_rthdr) {
+   if ((optvalid & IP6PO_VALID_RHINFO) != 0) {
 #ifndef RTHDR_SUPPORT_IMPLEMENTED
/*
 * If there is a routing header, discard the packet
@@ -524,11 +526,13 @@ ip6_output(struct mbuf *m0, struct ip6_pktopts *opt,
 * options, which might automatically be inserted in
 * the kernel.
 */
-   MAKE_EXTHDR(opt->ip6po_dest1, &

git: b50abe6bd45d - main - namei: Treat non-tied KLDs as if they had INVARIANTS enabled

2022-03-18 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=b50abe6bd45dde2baac130d4c4da097598c3b9c0

commit b50abe6bd45dde2baac130d4c4da097598c3b9c0
Author: Andrew Gallatin 
AuthorDate: 2022-03-18 14:14:14 +
Commit: Andrew Gallatin 
CommitDate: 2022-03-18 14:14:14 +

namei: Treat non-tied KLDs as if they had INVARIANTS enabled

When working with a vendor to debug their kernel module,
I found that a non-tied kld which uses NDINIT will panic
due to "namei: bad debugflags " on a kernel compiled with
INVARIANTS because non-tied KLDs do not pick up the
initialization that is done in NDINIT_DBG/NDREINIT_DBG().

Fix this by making this initialization happen for non-KLD_TIED
modules as well as for INVARIANTS kernels.

Reviewed by: mjg
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34588
---
 sys/sys/namei.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sys/sys/namei.h b/sys/sys/namei.h
index 23718dde5bed..98cbc2ca6ed9 100644
--- a/sys/sys/namei.h
+++ b/sys/sys/namei.h
@@ -228,8 +228,11 @@ int cache_fplookup(struct nameidata *ndp, enum cache_fpl_status *status,
 
 /*
  * Note the constant pattern may *hide* bugs.
+ * Note also that we enable debug checks for non-TIED KLDs
+ * so that they can run on an INVARIANTS kernel without tripping over
+ * assertions on ni_debugflags state.
  */
-#ifdef INVARIANTS
+#if defined(INVARIANTS) || (defined(KLD_MODULE) && !defined(KLD_TIED))
 #define NDINIT_PREFILL(arg) memset(arg, 0xff, offsetof(struct nameidata, \
 ni_dvp_seqc))
 #define NDINIT_DBG(arg) { (arg)->ni_debugflags = NAMEI_DBG_INITED; }

git: a2fc8ade1057 - main - isci: use maxphys rather than 128KB to size s/g list

2021-01-07 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=a2fc8ade10577cd35a6000fdb6e7dd7c570852d6

commit a2fc8ade10577cd35a6000fdb6e7dd7c570852d6
Author: Andrew Gallatin 
AuthorDate: 2021-01-07 17:45:46 +
Commit: Andrew Gallatin 
CommitDate: 2021-01-07 17:45:46 +

isci: use maxphys rather than 128KB to size s/g list

When maxphys was converted into a tunable, the size of the
s/g list used by the driver was changed to be based on a
hardcoded size of 128k rather than maxphys.  This caused
performance problems for us.  Revert this to use the maxphys
tunable.

Note that this constant is used to size dynamically allocated
things, and not static data structs, so this is safe.
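
The sizing arithmetic in the macro below can be worked through with a small helper. This is a sketch only: `sg_elements()` is a hypothetical function that mirrors the two macros in the diff, and the page and I/O sizes in the usage note are example values rather than statements about any particular system.

```c
#include <stddef.h>

/*
 * Mirror of __MAXPHYS_ELEMENTS followed by
 * SCI_MAX_SCATTER_GATHER_ELEMENTS: one element per page of the largest
 * I/O, plus one, then rounded up to an even count because elements are
 * posted to hardware in pairs.
 */
static size_t
sg_elements(size_t max_io, size_t page_size)
{
	size_t elems = (max_io / page_size) + 1;

	return ((elems + 1) & ~(size_t)0x1);
}
```

With 4 KiB pages, a 128 KiB limit gives 32 + 1 = 33 elements, rounded up to 34. A 1 MiB limit gives 256 + 1 = 257, rounded up to 258, so the list grows along with the tunable.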

Reviewed By:imp, kib, mav
Tested By: dhw
Differential Revision: https://reviews.freebsd.org/D28023
Sponsored by: Netflix
---
 sys/dev/isci/scil/sci_controller_constants.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/dev/isci/scil/sci_controller_constants.h b/sys/dev/isci/scil/sci_controller_constants.h
index 40f6b983601d..47712c531986 100644
--- a/sys/dev/isci/scil/sci_controller_constants.h
+++ b/sys/dev/isci/scil/sci_controller_constants.h
@@ -157,7 +157,7 @@ extern "C" {
  * posted to hardware always contain pairs of elements (with second
  * element set to zeroes if not needed).
  */
-#define __MAXPHYS_ELEMENTS ((128 * 1024 / PAGE_SIZE) + 1)
+#define __MAXPHYS_ELEMENTS ((maxphys / PAGE_SIZE) + 1)
 #define SCI_MAX_SCATTER_GATHER_ELEMENTS  ((__MAXPHYS_ELEMENTS + 1) & ~0x1)
 #endif
 
___
dev-commits-src-main@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/dev-commits-src-main
To unsubscribe, send any mail to "dev-commits-src-main-unsubscr...@freebsd.org"

git: 52cd25eb1aa7 - main - mbuf: enable ext_pgs ("unmapped") mbufs by default

2021-01-08 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=52cd25eb1aa75a28f6d3c3eb4757242c1f55d6cc

commit 52cd25eb1aa75a28f6d3c3eb4757242c1f55d6cc
Author: Andrew Gallatin 
AuthorDate: 2021-01-08 18:18:42 +
Commit: Andrew Gallatin 
CommitDate: 2021-01-08 18:43:30 +

mbuf: enable ext_pgs ("unmapped") mbufs by default

Ext_pg mbufs allow carrying multiple pages per mbuf. This
reduces mbuf linked list traversals, especially in socket
buffers, thereby reducing cache misses and CPU use for
applications using sendfile.  Note that ext_pages use
unmapped pages, eliminating KVA mapping costs on 32-bit
platforms.

Ext_pg mbufs are also required for ktls (KERN_TLS), and having
them disabled by default is a stumbling block for those
wishing to enable ktls.

Reviewed-by:jhb, glebius
Sponsored by:   Netflix
---
 sys/kern/kern_mbuf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/kern/kern_mbuf.c b/sys/kern/kern_mbuf.c
index 84e068424427..a46c576bad90 100644
--- a/sys/kern/kern_mbuf.c
+++ b/sys/kern/kern_mbuf.c
@@ -116,7 +116,7 @@ int nmbjumbop;  /* limits number of page size jumbo clusters */
 int nmbjumbo9; /* limits number of 9k jumbo clusters */
 int nmbjumbo16; /* limits number of 16k jumbo clusters */
 
-bool mb_use_ext_pgs;   /* use M_EXTPG mbufs for sendfile & TLS */
+bool mb_use_ext_pgs = true;/* use M_EXTPG mbufs for sendfile & TLS */
 SYSCTL_BOOL(_kern_ipc, OID_AUTO, mb_use_ext_pgs, CTLFLAG_RWTUN,
 &mb_use_ext_pgs, 0,
 "Use unmapped mbufs for sendfile(2) and TLS offload");

git: 7eaea04a5bb1 - main - amd64: compare TLB shootdown target to all_cpus

2021-01-11 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=7eaea04a5bb1dc86c43ce046311e1c1a042994d3

commit 7eaea04a5bb1dc86c43ce046311e1c1a042994d3
Author: Andrew Gallatin 
AuthorDate: 2021-01-12 01:03:37 +
Commit: Andrew Gallatin 
CommitDate: 2021-01-12 01:09:32 +

amd64: compare TLB shootdown target to all_cpus

On amd64, the pmap code passes all_cpus to
smp_targeted_tlb_shootdown() when unmapping from the
kernel pmap.  This function has an optimized path to send IPIs
to all but itself, which it intends to do when the target
is all cpus.  However, we need to compare the target cpu mask
with all_cpus, rather than using CPU_ISFULLSET().  Comparing with
CPU_ISFULLSET() will only work when we have MAXCPU cpus active in
the system; otherwise, we'll be sending repeated IPIs, rather than
a single IPI to all CPUs but ourselves.

Fixing this should reduce the time spent in native_lapic_ipi_wait()
as we will be sending IPIs in parallel, rather than one-by-one.
This is confirmed by dtrace.
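
The difference between the two tests can be shown with a 64-bit model of a CPU set. This is a sketch with hypothetical names (`model_cpuset_t`, `model_all_cpus()`, `model_isfullset()`, `model_targets_all()`); the kernel's cpuset_t is a larger bit array and the real code uses the CPU_*() macros shown in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-CPU model; the real cpuset_t is a larger bit array. */
#define MODEL_MAXCPU	64
typedef uint64_t model_cpuset_t;

/* Set with one bit per CPU actually present, like all_cpus. */
static model_cpuset_t
model_all_cpus(int ncpus)
{
	if (ncpus >= MODEL_MAXCPU)
		return (UINT64_MAX);
	return ((UINT64_C(1) << ncpus) - 1);
}

/* Models CPU_ISFULLSET(): true only if every possible bit is set. */
static bool
model_isfullset(model_cpuset_t mask)
{
	return (mask == UINT64_MAX);
}

/* Models the new test: does the target equal the set of present CPUs? */
static bool
model_targets_all(model_cpuset_t mask, model_cpuset_t all_cpus)
{
	return (mask == all_cpus);
}
```

On a modeled 4-CPU system the target mask is 0xF, which is not a full set of 64 bits, so the full-set test never selects the broadcast path even though every present CPU is targeted.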

Reviewed by: alc, jhb, kib, markj
Sponsored by:   Netflix
Differential Revision:  https://reviews.freebsd.org/D28102
---
 sys/amd64/amd64/mp_machdep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/amd64/amd64/mp_machdep.c b/sys/amd64/amd64/mp_machdep.c
index 63777014e151..794a11bf1276 100644
--- a/sys/amd64/amd64/mp_machdep.c
+++ b/sys/amd64/amd64/mp_machdep.c
@@ -673,7 +673,7 @@ smp_targeted_tlb_shootdown(cpuset_t mask, pmap_t pmap, vm_offset_t addr1,
/*
 * Check for other cpus.  Return if none.
 */
-   if (CPU_ISFULLSET(&mask)) {
+   if (!CPU_CMP(&mask, &all_cpus)) {
if (mp_ncpus <= 1)
goto local_cb;
} else {
@@ -719,7 +719,7 @@ smp_targeted_tlb_shootdown(cpuset_t mask, pmap_t pmap, vm_offset_t addr1,
 * (zeroing slot) and reading from it below (wait for
 * acknowledgment).
 */
-   if (CPU_ISFULLSET(&mask)) {
+   if (!CPU_CMP(&mask, &all_cpus)) {
ipi_all_but_self(IPI_INVLOP);
other_cpus = all_cpus;
CPU_CLR(PCPU_GET(cpuid), &other_cpus);

Re: git: 7eaea04a5bb1 - main - amd64: compare TLB shootdown target to all_cpus

2021-01-12 Thread Andrew Gallatin


On 1/12/21 12:59 AM, Mateusz Guzik wrote:

> This makes my 2 core vm crash on boot:
>
> Launching APs: 1
> Timecounter "TSC-low" frequency 1346899854 Hz quality 1000
> panic: IPI scoreboard is zero, initiator 1 target 1



Ugh, sorry for the breakage & thanks for the fix.  That's what I get
for not testing enough.

Drew


git: efa9c21bca98 - main - KTLS: Enable KERN_TLS in GENERIC on amd64

2021-01-18 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=efa9c21bca9873af9c9660f5aeffda9d5ae1dfb7

commit efa9c21bca9873af9c9660f5aeffda9d5ae1dfb7
Author: Andrew Gallatin 
AuthorDate: 2021-01-14 17:44:06 +
Commit: Andrew Gallatin 
CommitDate: 2021-01-18 18:29:10 +

KTLS: Enable KERN_TLS in GENERIC on amd64

Based on discussions on freebsd-arch@, enable KERN_TLS in
GENERIC on amd64, but leave it disabled via the
sysctl kern.ipc.tls.enable.  Users wishing to enable
ktls must set kern.ipc.tls.enable=1

While here, fix wording in NOTES to mention that KERN_TLS
also does receive now.

Sponsored by:   Netflix

Reviewed by:allanjude
Differential Revision:  https://reviews.freebsd.org/D28163
---
 sys/amd64/conf/GENERIC | 1 +
 sys/conf/NOTES | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
index c9ab23bb91b5..9f55a935f8a5 100644
--- a/sys/amd64/conf/GENERIC
+++ b/sys/amd64/conf/GENERIC
@@ -37,6 +37,7 @@ options   TCP_BLACKBOX   # Enhanced TCP event logging
 options   TCP_HHOOK      # hhook(9) framework for TCP
 options   TCP_RFC7413    # TCP Fast Open
 options   SCTP_SUPPORT   # Allow kldload of SCTP
+options   KERN_TLS       # TLS transmit & receive offload
 options   FFS            # Berkeley Fast Filesystem
 options   SOFTUPDATES    # Enable FFS soft updates support
 options   UFS_ACL        # Support for access control lists
diff --git a/sys/conf/NOTES b/sys/conf/NOTES
index 1a8059a2e5c0..b4202bb65618 100644
--- a/sys/conf/NOTES
+++ b/sys/conf/NOTES
@@ -666,8 +666,8 @@ options IPSEC_SUPPORT
 #options   IPSEC_DEBUG #debug for IP security
 
 
-# TLS framing and encryption of data transmitted over TCP sockets.
-options   KERN_TLS   # TLS transmit offload
+# TLS framing and encryption/decryption of data over TCP sockets.
+options   KERN_TLS   # TLS transmit and receive offload
 
 #
 # SMB/CIFS requester

git: 0c864213ef1e - main - iflib: Fix a NULL pointer deref

2021-01-21 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=0c864213ef1ee440411e3bb6437ecc04273db86b

commit 0c864213ef1ee440411e3bb6437ecc04273db86b
Author: Andrew Gallatin 
AuthorDate: 2021-01-21 14:45:15 +
Commit: Andrew Gallatin 
CommitDate: 2021-01-21 14:47:06 +

iflib: Fix a NULL pointer deref

rxd_frag_to_sd() can be called with the pf_rv parameter set to NULL
in the current code. This patch fixes the NULL pointer dereference
in that case, thus avoiding a possible panic.
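
The pattern behind the fix is the usual guard on an optional out-parameter. A minimal sketch, with the hypothetical names `model_report_verdict()` and `MODEL_PASS` standing in for the iflib code and PFIL_PASS:

```c
#include <stddef.h>

/* Hypothetical verdict value standing in for PFIL_PASS. */
#define MODEL_PASS	0

/*
 * Store a verdict through an optional out-parameter. Callers that do not
 * care about the verdict pass NULL, so the store must be guarded.
 */
static void
model_report_verdict(int *verdict_out, int verdict)
{
	if (verdict_out != NULL)
		*verdict_out = verdict;
}
```

A caller that wants the verdict passes the address of a local variable; a caller that does not passes NULL and the store is skipped.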

Submitted by: rajesh1.kumar at amd.com
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D28115
---
 sys/net/iflib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sys/net/iflib.c b/sys/net/iflib.c
index 4b4952122d1e..ea2c5789a7b5 100644
--- a/sys/net/iflib.c
+++ b/sys/net/iflib.c
@@ -2654,7 +2654,8 @@ rxd_frag_to_sd(iflib_rxq_t rxq, if_rxd_frag_t irf, bool unload, if_rxsd_t sd,
}
} else {
fl->ifl_sds.ifsd_m[cidx] = NULL;
-   *pf_rv = PFIL_PASS;
+   if (pf_rv != NULL)
+   *pf_rv = PFIL_PASS;
}
 
if (unload && irf->irf_len != 0)

git: 3183d0b68072 - main - iflib: initialize LRO unconditionally

2021-04-23 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=3183d0b68072dda0e80bb6e03c970625f2823e97

commit 3183d0b68072dda0e80bb6e03c970625f2823e97
Author: Andrew Gallatin 
AuthorDate: 2021-04-23 09:51:22 +
Commit: Andrew Gallatin 
CommitDate: 2021-04-23 09:55:20 +

iflib: initialize LRO unconditionally

Changes to the LRO code have exposed a bug in iflib where devices
which are not capable of doing LRO are still calling
tcp_lro_flush_all(), even when they have not initialized the LRO
context. This used to be mostly harmless, but the LRO code now sets
the VNET based on the ifp in the lro context and will try to access it
through a NULL ifp resulting in a panic at boot.

To fix this, we unconditionally initialize LRO so that we have a
valid LRO context when calling tcp_lro_flush_all(). One alternative is
to check the device capabilities before calling tcp_lro_flush_all() or
adding a new state flag in the ctx. However, it seems unwise to add an
extra, mostly useless test for higher performance devices when we can
just initialize LRO for all devices.

Reviewed by: erj, hselasky, markj, olivier
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29928
---
 sys/net/iflib.c | 22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/sys/net/iflib.c b/sys/net/iflib.c
index 6dbaff556a15..fc0814d0fc19 100644
--- a/sys/net/iflib.c
+++ b/sys/net/iflib.c
@@ -5891,15 +5891,13 @@ iflib_rx_structures_setup(if_ctx_t ctx)
 
for (q = 0; q < ctx->ifc_softc_ctx.isc_nrxqsets; q++, rxq++) {
 #if defined(INET6) || defined(INET)
-   if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_LRO) {
-   err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp,
-   TCP_LRO_ENTRIES, min(1024,
-   ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset]));
-   if (err != 0) {
-   device_printf(ctx->ifc_dev,
-   "LRO Initialization failed!\n");
-   goto fail;
-   }
+   err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp,
+   TCP_LRO_ENTRIES, min(1024,
+   ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset]));
+   if (err != 0) {
+   device_printf(ctx->ifc_dev,
+   "LRO Initialization failed!\n");
+   goto fail;
}
 #endif
IFDI_RXQ_SETUP(ctx, rxq->ifr_id);
@@ -5914,8 +5912,7 @@ fail:
 */
rxq = ctx->ifc_rxqs;
for (i = 0; i < q; ++i, rxq++) {
-   if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_LRO)
-   tcp_lro_free(&rxq->ifr_lc);
+   tcp_lro_free(&rxq->ifr_lc);
}
return (err);
 #endif
@@ -5938,8 +5935,7 @@ iflib_rx_structures_free(if_ctx_t ctx)
iflib_dma_free(&rxq->ifr_ifdi[j]);
iflib_rx_sds_free(rxq);
 #if defined(INET6) || defined(INET)
-   if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_LRO)
-   tcp_lro_free(&rxq->ifr_lc);
+   tcp_lro_free(&rxq->ifr_lc);
 #endif
}
free(ctx->ifc_rxqs, M_IFLIB);

git: 086a35562f47 - main - tcp: enter network epoch when calling tfb_tcp_fb_fini

2021-05-25 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=086a35562f47917a516d30acc8b78a4884e31a4f

commit 086a35562f47917a516d30acc8b78a4884e31a4f
Author: Andrew Gallatin 
AuthorDate: 2021-05-25 17:45:37 +
Commit: Andrew Gallatin 
CommitDate: 2021-05-25 17:45:37 +

tcp: enter network epoch when calling tfb_tcp_fb_fini

We need to enter the network epoch when calling into
tfb_tcp_fb_fini.  I noticed this when I hit an assert
running the latest rack.

Differential Revision: https://reviews.freebsd.org/D30407
Reviewed by: rrs, tuexen
Sponsored by: Netflix
---
 sys/netinet/tcp_usrreq.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/sys/netinet/tcp_usrreq.c b/sys/netinet/tcp_usrreq.c
index 4f418f8809a7..caef798772ea 100644
--- a/sys/netinet/tcp_usrreq.c
+++ b/sys/netinet/tcp_usrreq.c
@@ -1818,11 +1818,14 @@ tcp_ctloutput(struct socket *so, struct sockopt *sopt)
 * new one already.
 */
if (tp->t_fb->tfb_tcp_fb_fini) {
+   struct epoch_tracker et;
/*
 * Tell the stack to cleanup with 0 i.e.
 * the tcb is not going away.
 */
+   NET_EPOCH_ENTER(et);
(*tp->t_fb->tfb_tcp_fb_fini)(tp, 0);
+   NET_EPOCH_EXIT(et);
}
 #ifdef TCPHPTS
/* Assure that we are not on any hpts */

git: df8437a93dd5 - main - cxgbe: fix enabling lro & rxtimestamps

2021-05-26 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=df8437a93dd5268e5bfd06411c01a5cbdb38c6ac

commit df8437a93dd5268e5bfd06411c01a5cbdb38c6ac
Author: Andrew Gallatin 
AuthorDate: 2021-05-26 13:54:26 +
Commit: Andrew Gallatin 
CommitDate: 2021-05-26 14:00:07 +

cxgbe: fix enabling lro & rxtimestamps

A recent change caused iq flags, like LRO, to be set before
init_iq(). However, init_iq() clears those flags, so they
became effectively impossible to set.   This change moves
the initialization of these flags to after the call to init_iq().
This fixes LRO.
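
The ordering bug can be reproduced in a few lines of standalone C. This is a sketch under stated assumptions: `model_iq`, `model_init_iq()` and the `MODEL_*` flag bits are hypothetical stand-ins, and the only property carried over from the description above is that the init routine resets the flags word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for IQ_LRO_ENABLED and friends. */
#define MODEL_LRO_ENABLED	0x1u
#define MODEL_RX_TIMESTAMP	0x2u

struct model_iq {
	uint32_t flags;
	int qsize;
};

/* Models an init routine that resets the flags word. */
static void
model_init_iq(struct model_iq *iq, int qsize)
{
	iq->flags = 0;
	iq->qsize = qsize;
}

/* Broken ordering: flags set first are wiped by the init call. */
static uint32_t
model_setup_flags_first(bool want_lro)
{
	struct model_iq iq = { 0, 0 };

	if (want_lro)
		iq.flags |= MODEL_LRO_ENABLED;
	model_init_iq(&iq, 1024);
	return (iq.flags);
}

/* Fixed ordering: initialize, then apply the capability flags. */
static uint32_t
model_setup_init_first(bool want_lro)
{
	struct model_iq iq = { 0, 0 };

	model_init_iq(&iq, 1024);
	if (want_lro)
		iq.flags |= MODEL_LRO_ENABLED;
	return (iq.flags);
}
```

With LRO requested, the first helper returns 0 and the second returns the LRO bit, which matches the symptom of flags that could never be set.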

Differential Revision: https://reviews.freebsd.org/D30460
Reviewed by: np, rrs
Sponsored by: Netflix
Fixes: 43bbae19483fbde0a91e61acad8a6e71e334c8b8
<https://reviews.freebsd.org/R10:43bbae19483fbde0a91e61acad8a6e71e334c8b8>
---
 sys/dev/cxgbe/t4_sge.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/sys/dev/cxgbe/t4_sge.c b/sys/dev/cxgbe/t4_sge.c
index 4b685129193e..0b429c602a91 100644
--- a/sys/dev/cxgbe/t4_sge.c
+++ b/sys/dev/cxgbe/t4_sge.c
@@ -3938,12 +3938,7 @@ alloc_rxq(struct vi_info *vi, struct sge_rxq *rxq, int idx, int intr_idx,
if (rc != 0)
return (rc);
MPASS(rxq->lro.ifp == ifp); /* also indicates LRO init'ed */
-
-   if (ifp->if_capenable & IFCAP_LRO)
-   rxq->iq.flags |= IQ_LRO_ENABLED;
 #endif
-   if (ifp->if_capenable & IFCAP_HWRXTSTMP)
-   rxq->iq.flags |= IQ_RX_TIMESTAMP;
rxq->ifp = ifp;
 
snprintf(name, sizeof(name), "%d", idx);
@@ -3953,6 +3948,12 @@ alloc_rxq(struct vi_info *vi, struct sge_rxq *rxq, int idx, int intr_idx,
 
init_iq(&rxq->iq, sc, vi->tmr_idx, vi->pktc_idx, vi->qsize_rxq,
intr_idx, tnl_cong(vi->pi, cong_drop));
+#if defined(INET) || defined(INET6)
+   if (ifp->if_capenable & IFCAP_LRO)
+   rxq->iq.flags |= IQ_LRO_ENABLED;
+#endif
+   if (ifp->if_capenable & IFCAP_HWRXTSTMP)
+   rxq->iq.flags |= IQ_RX_TIMESTAMP;
snprintf(name, sizeof(name), "%s rxq%d-fl",
device_get_nameunit(vi->dev), idx);
init_fl(sc, &rxq->fl, vi->qsize_rxq / 8, maxp, name);

git: ed5e13cfc268 - main - ktls: Fix interaction with RATELIMIT

2021-06-14 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=ed5e13cfc2689049ce415dad5057923bc7214a41

commit ed5e13cfc2689049ce415dad5057923bc7214a41
Author: Andrew Gallatin 
AuthorDate: 2021-06-14 14:46:13 +
Commit: Andrew Gallatin 
CommitDate: 2021-06-14 14:51:16 +

ktls: Fix interaction with RATELIMIT

uipc_ktls.c was missing opt_ratelimit.h, so it was
never noticing that RATELIMIT was enabled.  Once it was
enabled, it failed to compile as  ktls_modify_txrtlmt()
had accrued a compilation error when it was not being
compiled in.
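
The failure mode is easy to miss because nothing breaks at build time. A minimal sketch, in which `MODEL_RATELIMIT` and `model_ratelimit_compiled_in()` are hypothetical names and the `#define` stands in for the generated option header:

```c
/*
 * Stand-in for the generated option header. Removing this line models a
 * source file that forgot to include it: the build still succeeds, but
 * the guarded code silently disappears.
 */
#define MODEL_RATELIMIT 1

/* Returns 1 when the rate-limit path was compiled in, 0 otherwise. */
static int
model_ratelimit_compiled_in(void)
{
#ifdef MODEL_RATELIMIT
	return (1);
#else
	return (0);
#endif
}
```

Because the guarded block was never compiled, errors inside it could accumulate unnoticed until the header was finally included.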

Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index b0d7ea8016dd..2ab2ef18446b 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -30,6 +30,7 @@ __FBSDID("$FreeBSD$");
 
 #include "opt_inet.h"
 #include "opt_inet6.h"
+#include "opt_ratelimit.h"
 #include "opt_rss.h"
 
 #include 
@@ -1399,7 +1400,6 @@ ktls_modify_txrtlmt(struct ktls_session *tls, uint64_t max_pacing_rate)
};
struct m_snd_tag *mst;
struct ifnet *ifp;
-   int error;
 
/* Can't get to the inp, but it should be locked. */
/* INP_LOCK_ASSERT(inp); */

git: 517a7adb1160 - main - Make hwpmc work for userspace binaries again

2021-12-15 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=517a7adb1160850746227e4cc30d4bcc3ff04d7d

commit 517a7adb1160850746227e4cc30d4bcc3ff04d7d
Author: Andrew Gallatin 
AuthorDate: 2021-12-15 13:38:36 +
Commit: Andrew Gallatin 
CommitDate: 2021-12-15 13:38:36 +

Make hwpmc work for userspace binaries again

hwpmc has been utterly broken for userspace binaries, and has been
labeling all samples from userspace binaries as dubious frames. The
issues are that:

-The check for ph.p_offset & (-ph.p_align) == 0 was mostly bogus. The
 intent was to ignore all executable segments other than the first,
 which when using BFD appeared in the first page, but with current LLD
 a read-only data segment appears before the executable segment,
 pushing the latter into the second page or later. This meant no
 executable segment was ever found, and thus pi_vaddr remained
 0. Instead of relying on BFD's layout, track whether we've seen an
 executable segment explicitly with a local bool.

-Shared libraries were not parsing the segments to calculate pi_vaddr,
 resulting in it always being 0. Again, when using BFD, the executable
 segment started at the first page, and so pi_vaddr was genuinely
 meant to be 0, but not with LLD's current layout. This meant that
 pmcstat_image_link's offset calculation gave the base address of the
 segment in memory, rather than the base address of the whole library
 in memory, and so when adding that to pi_start/pi_end to get the
 range of the executable sections in memory it double-counted the
 offset of the first executable segment within the library. Thus we
 need to do the exact same parsing for ET_DYN as we do for ET_EXEC,
 which is simpler to write as special-casing ET_REL to not look for
 segments. Note that, whilst PT_INTERP isn't needed for shared
 libraries, it will be for PIEs, which pmcstat still fails to handle
 due to not knowing the base address of the PIE; we get the base
 address for libraries by MAP_IN events, and for rtld by virtue of the
 process's entry address being rtld's, but have no equivalent for the
 executable.
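
The first bullet's fix can be sketched in userland C.  This is a minimal
illustration, not the libpmcstat code: struct demo_phdr, the DEMO_*
constants and both function names are hypothetical stand-ins for the
libelf types used by pmcstat_image_get_elf_params().  It contrasts the old
"segment must start in the first page" test with the explicit
first_exec_segment flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for ELF program header fields. */
#define DEMO_PT_LOAD	1u
#define DEMO_PF_X	0x1u

struct demo_phdr {
	uint32_t p_type;
	uint32_t p_flags;
	uint64_t p_offset;
	uint64_t p_vaddr;
	uint64_t p_align;	/* assumed to be a power of two */
};

/*
 * Return the aligned-down vaddr of the first executable PT_LOAD segment,
 * or 0 if there is none.  Later executable segments are ignored.
 */
uint64_t
demo_first_exec_vaddr(const struct demo_phdr *ph, size_t nph)
{
	uint64_t vaddr = 0;
	bool first_exec_segment = true;

	assert(ph != NULL || nph == 0);
	for (size_t i = 0; i < nph; i++) {
		if (ph[i].p_type != DEMO_PT_LOAD)
			continue;
		if ((ph[i].p_flags & DEMO_PF_X) != 0 && first_exec_segment) {
			/* -p_align equals ~(p_align - 1) for powers of two. */
			vaddr = ph[i].p_vaddr & (-ph[i].p_align);
			first_exec_segment = false;
		}
	}
	return (vaddr);
}

/* The old test: accept an executable segment only in the first page. */
bool
demo_old_check(const struct demo_phdr *ph)
{
	return ((ph->p_flags & DEMO_PF_X) != 0 &&
	    (ph->p_offset & (-ph->p_align)) == 0);
}
```

With an LLD-style layout, where a read-only segment occupies the first
page, demo_old_check() rejects the executable segment while
demo_first_exec_vaddr() still finds it.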

Fixes courtesy of jrtc27@.

Reviewed by: jrtc27, jhb (earlier version)
Differential Revision: https://reviews.freebsd.org/D33055
Sponsored by: Netflix
---
 lib/libpmcstat/libpmcstat_image.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/libpmcstat/libpmcstat_image.c b/lib/libpmcstat/libpmcstat_image.c
index 9ee7097e95ec..97109f203806 100644
--- a/lib/libpmcstat/libpmcstat_image.c
+++ b/lib/libpmcstat/libpmcstat_image.c
@@ -43,6 +43,7 @@ __FBSDID("$FreeBSD$");
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -295,6 +296,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
size_t i, nph, nsh;
const char *path, *elfbase;
char *p, *endp;
+   bool first_exec_segment;
uintfptr_t minva, maxva;
Elf *e;
Elf_Scn *scn;
@@ -384,7 +386,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
 * loaded.  Additionally, for dynamically linked executables,
 * save the pathname to the runtime linker.
 */
-   if (eh.e_type == ET_EXEC) {
+   if (eh.e_type != ET_REL) {
if (elf_getphnum(e, &nph) == 0) {
warnx(
 "WARNING: Could not determine the number of program headers in \"%s\": %s.",
@@ -392,6 +394,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
elf_errmsg(-1));
goto done;
}
+   first_exec_segment = true;
for (i = 0; i < eh.e_phnum; i++) {
if (gelf_getphdr(e, i, &ph) != &ph) {
warnx(
@@ -416,8 +419,10 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
break;
case PT_LOAD:
if ((ph.p_flags & PF_X) != 0 &&
-   (ph.p_offset & (-ph.p_align)) == 0)
+   first_exec_segment) {
image->pi_vaddr = ph.p_vaddr & 
(-ph.p_align);
+   first_exec_segment = false;
+   }
break;
}
}

git: 588f03ec9b9e - main - bectl: Improve error message when ZFS root is not found.

2023-03-31 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=588f03ec9b9ebd3c17b3e978140ff3f3e4bcad73

commit 588f03ec9b9ebd3c17b3e978140ff3f3e4bcad73
Author: Andrew Gallatin 
AuthorDate: 2023-03-30 21:57:26 +
Commit: Andrew Gallatin 
CommitDate: 2023-03-31 14:27:38 +

bectl: Improve error message when ZFS root is not found.

When recovering a system that is unbootable due to some
problem with the active BE, it is likely you'll be booted
from a rescue image running UFS.  In this case, bectl
needs help finding the zpool root that you want to operate
on.  When no root was specified, improve the error message to
suggest specifying one, rather than just emitting a generic
error message that might imply, to the naive user, that
there is a ZFS compatibility issue between the rescue
image and the on-disk ZFS pool.

Reviewed by: imp, kevans
Sponsored by: Netflix
Differential Revision:  https://reviews.freebsd.org/D39346
---
 sbin/bectl/bectl.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/sbin/bectl/bectl.c b/sbin/bectl/bectl.c
index 2b7af4e55419..814b98ba8a8a 100644
--- a/sbin/bectl/bectl.c
+++ b/sbin/bectl/bectl.c
@@ -584,9 +584,13 @@ main(int argc, char *argv[])
}
 
if ((be = libbe_init(root)) == NULL) {
-   if (!cmd->silent)
+   if (!cmd->silent) {
fprintf(stderr, "libbe_init(\"%s\") failed.\n",
root != NULL ? root : "");
+   if (root == NULL)
+   fprintf(stderr,
+   "Try specifying ZFS root using -r.\n");
+   }
return (-1);
}

git: 8b0dafdb2f18 - main - vm: implement vm_page_reclaim_contig_domain_ext()

2023-05-09 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=8b0dafdb2f18b9bdc464a4ddbcfd749c3d3875f1

commit 8b0dafdb2f18b9bdc464a4ddbcfd749c3d3875f1
Author: Andrew Gallatin 
AuthorDate: 2023-05-08 13:25:40 +
Commit: Andrew Gallatin 
CommitDate: 2023-05-09 17:09:34 +

vm: implement vm_page_reclaim_contig_domain_ext()

Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple
contiguous regions at once.  This makes reclamation more efficient
for users that need multiple contiguous regions.

This is needed because callers like ktls may need to reclaim many
contiguous regions, and each scan of physical memory can take
multiple seconds on a large memory machine (order of 100GB of
RAM).  Rather than modifying the core algorithm, I extended
vm_page_reclaim_contig_domain() to take a "desired_runs" argument to
allow the caller to request that it reclaim more than just a single
run. There is no functional change intended for all existing
callers.

The first user for this interface is the ktls code
(https://reviews.freebsd.org/D39421). By reclaiming multiple runs,
ktls goes from consuming hours of CPU to refill its buffer zone to
just seconds or minutes.
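
One detail visible in the diff below is that the run array is now sized
NRUNS + desired_runs - 1, which is generally not a power of two, so
RUN_INDEX() switches from a bit mask to a modulo.  The following is a
minimal userland sketch of that indexing scheme only; the demo_* names are
hypothetical and the ring holds plain ints rather than vm_page_t runs.

```c
#include <assert.h>
#include <stddef.h>

/* Modulo indexing works for any ring size, not only powers of two. */
#define DEMO_RUN_INDEX(count, nruns)	((count) % (nruns))

/*
 * Store "value" as the count'th item in a ring of nruns slots, overwriting
 * the oldest entry once the ring is full.  Returns the new count.
 */
unsigned long
demo_ring_put(int *ring, unsigned long nruns, unsigned long count, int value)
{
	assert(ring != NULL && nruns > 0);
	ring[DEMO_RUN_INDEX(count, nruns)] = value;
	return (count + 1);
}

/* Fetch the most recently stored item; count must be greater than zero. */
int
demo_ring_newest(const int *ring, unsigned long nruns, unsigned long count)
{
	assert(ring != NULL && nruns > 0 && count > 0);
	return (ring[DEMO_RUN_INDEX(count - 1, nruns)]);
}
```

A mask of (nruns - 1) would only be equivalent when nruns is a power of
two; for a ring of 5 slots, 5 & 4 is 4 while 5 % 5 is 0.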

Differential Revision: https://reviews.freebsd.org/D39739
Sponsored by:   Netflix
Reviewed by:alc, jhb, markj
---
 sys/vm/vm_page.c | 69 +++-
 sys/vm/vm_page.h |  3 +++
 2 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index 90413f235ec0..4b967a94aa1f 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2995,9 +2995,7 @@ unlock:
 
 #define NRUNS   16
 
-CTASSERT(powerof2(NRUNS));
-
-#define RUN_INDEX(count)        ((count) & (NRUNS - 1))
+#define RUN_INDEX(count, nruns) ((count) % (nruns))
 
 #define MIN_RECLAIM 8
 
@@ -3025,19 +3023,42 @@ CTASSERT(powerof2(NRUNS));
  * must be a power of two.
  */
 bool
-vm_page_reclaim_contig_domain(int domain, int req, u_long npages,
-vm_paddr_t low, vm_paddr_t high, u_long alignment, vm_paddr_t boundary)
+vm_page_reclaim_contig_domain_ext(int domain, int req, u_long npages,
+vm_paddr_t low, vm_paddr_t high, u_long alignment, vm_paddr_t boundary,
+int desired_runs)
 {
struct vm_domain *vmd;
vm_paddr_t curr_low;
-   vm_page_t m_run, m_runs[NRUNS];
+   vm_page_t m_run, _m_runs[NRUNS], *m_runs;
u_long count, minalign, reclaimed;
-   int error, i, options, req_class;
+   int error, i, min_reclaim, nruns, options, req_class;
+   bool ret;
 
KASSERT(npages > 0, ("npages is 0"));
KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
 
+   ret = false;
+
+   /*
+* If the caller wants to reclaim multiple runs, try to allocate
+* space to store the runs.  If that fails, fall back to the old
+* behavior of just reclaiming MIN_RECLAIM pages.
+*/
+   if (desired_runs > 1)
+   m_runs = malloc((NRUNS + desired_runs) * sizeof(*m_runs),
+   M_TEMP, M_NOWAIT);
+   else
+   m_runs = NULL;
+
+   if (m_runs == NULL) {
+   m_runs = _m_runs;
+   nruns = NRUNS;
+   } else {
+   nruns = NRUNS + desired_runs - 1;
+   }
+   min_reclaim = MAX(desired_runs * npages, MIN_RECLAIM);
+
/*
 * The caller will attempt an allocation after some runs have been
 * reclaimed and added to the vm_phys buddy lists.  Due to limitations
@@ -3066,7 +3087,7 @@ vm_page_reclaim_contig_domain(int domain, int req, u_long npages,
if (count < npages + vmd->vmd_free_reserved || (count < npages +
vmd->vmd_interrupt_free_min && req_class == VM_ALLOC_SYSTEM) ||
(count < npages && req_class == VM_ALLOC_INTERRUPT))
-   return (false);
+   goto done;
 
/*
 * Scan up to three times, relaxing the restrictions ("options") on
@@ -3085,27 +3106,29 @@ vm_page_reclaim_contig_domain(int domain, int req, u_long npages,
if (m_run == NULL)
break;
curr_low = VM_PAGE_TO_PHYS(m_run) + ptoa(npages);
-   m_runs[RUN_INDEX(count)] = m_run;
+   m_runs[RUN_INDEX(count, nruns)] = m_run;
count++;
}
 
/*
 * Reclaim the highest runs in LIFO (descending) order until
 * the number of reclaimed pages, "reclaimed", is at least
-* MIN_RECLAIM.  Reset "reclaimed" each time because

git: 198558523361 - main - ktls: re-work alloc thread

2023-05-09 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=198558523361a654409b6d3f8d63c12ba3f72ae5

commit 198558523361a654409b6d3f8d63c12ba3f72ae5
Author: Andrew Gallatin 
AuthorDate: 2023-05-08 13:38:59 +
Commit: Andrew Gallatin 
CommitDate: 2023-05-09 17:09:34 +

ktls: re-work alloc thread

When the ktls_buffer zone needs to expand, it may fail due
to a lack of physically contiguous memory.  We tried to rectify
that by introducing an alloc thread to provide a context where
it is harmless to sleep, and letting that thread repopulate
the ktls_buffer zone.

However, it turns out that M_WAITOK is not enough, and we
must call vm_page_reclaim_contig_domain() to reclaim contig
memory. Worse, M_WAITOK results in the allocation essentially
busy-looping around vm_domain_alloc_fail() returning EAGAIN,
causing vm_page_alloc_noobj_contig_domain() to loop and resulting
in the alloc thread consuming 100% CPU.

To fix this, we change the alloc thread to call
vm_page_reclaim_contig_domain_ext().

In order to prevent the busy loop around vm_domain_alloc_fail(), we
must change the uma_zalloc flags to M_NORECLAIM | M_NOWAIT.  However,
once that is done, these allocations become no different than the
allocations done in the critical path in ktls_buffer_alloc(), so it's
best to just eliminate them.

Since we're no longer doing allocations but just calling
vm_page_reclaim_contig_domain_ext(), the name has changed to the ktls
reclaim thread.

Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D39421
---
 sys/kern/uipc_ktls.c | 82 ++--
 1 file changed, 34 insertions(+), 48 deletions(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 4639355b1558..1e892dde9022 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -88,9 +88,9 @@ struct ktls_wq {
int lastallocfail;
 } __aligned(CACHE_LINE_SIZE);
 
-struct ktls_alloc_thread {
+struct ktls_reclaim_thread {
uint64_t wakeups;
-   uint64_t allocs;
+   uint64_t reclaims;
struct thread *td;
int running;
 };
@@ -98,7 +98,7 @@ struct ktls_alloc_thread {
 struct ktls_domain_info {
int count;
int cpu[MAXCPU];
-   struct ktls_alloc_thread alloc_td;
+   struct ktls_reclaim_thread reclaim_td;
 };
 
 struct ktls_domain_info ktls_domains[MAXMEMDOM];
@@ -154,10 +154,10 @@ SYSCTL_BOOL(_kern_ipc_tls, OID_AUTO, sw_buffer_cache, CTLFLAG_RDTUN,
 &ktls_sw_buffer_cache, 1,
 "Enable caching of output buffers for SW encryption");
 
-static int ktls_max_alloc = 128;
-SYSCTL_INT(_kern_ipc_tls, OID_AUTO, max_alloc, CTLFLAG_RWTUN,
-&ktls_max_alloc, 128,
-"Max number of 16k buffers to allocate in thread context");
+static int ktls_max_reclaim = 1024;
+SYSCTL_INT(_kern_ipc_tls, OID_AUTO, max_reclaim, CTLFLAG_RWTUN,
+&ktls_max_reclaim, 128,
+"Max number of 16k buffers to reclaim in thread context");
 
 static COUNTER_U64_DEFINE_EARLY(ktls_tasks_active);
 SYSCTL_COUNTER_U64(_kern_ipc_tls, OID_AUTO, tasks_active, CTLFLAG_RD,
@@ -303,7 +303,7 @@ static MALLOC_DEFINE(M_KTLS, "ktls", "Kernel TLS");
 static void ktls_reset_receive_tag(void *context, int pending);
 static void ktls_reset_send_tag(void *context, int pending);
 static void ktls_work_thread(void *ctx);
-static void ktls_alloc_thread(void *ctx);
+static void ktls_reclaim_thread(void *ctx);
 
 static u_int
 ktls_get_cpu(struct socket *so)
@@ -454,12 +454,12 @@ ktls_init(void)
continue;
if (CPU_EMPTY(&cpuset_domain[domain]))
continue;
-   error = kproc_kthread_add(ktls_alloc_thread,
+   error = kproc_kthread_add(ktls_reclaim_thread,
&ktls_domains[domain], &ktls_proc,
-   &ktls_domains[domain].alloc_td.td,
-   0, 0, "KTLS", "alloc_%d", domain);
+   &ktls_domains[domain].reclaim_td.td,
+   0, 0, "KTLS", "reclaim_%d", domain);
if (error) {
-   printf("Can't add KTLS alloc thread %d error %d\n",
+   printf("Can't add KTLS reclaim thread %d error %d\n",
domain, error);
return (error);
}
@@ -2702,9 +2702,9 @@ ktls_buffer_alloc(struct ktls_wq *wq, struct mbuf *m)
 * see an old value of running == true.
 */
if (!VM_DOMAIN_EMPTY(domain)) {
-

git: fd96685a4a57 - main - Revert "When stopping powerd, set the CPU frequency back to its maximum value"

2023-05-25 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=fd96685a4a579fc84031e8e66d8f8b1ce8cdf1e5

commit fd96685a4a579fc84031e8e66d8f8b1ce8cdf1e5
Author: Andrew Gallatin 
AuthorDate: 2023-05-22 00:47:28 +
Commit: Andrew Gallatin 
CommitDate: 2023-05-25 13:40:26 +

Revert "When stopping powerd, set the CPU frequency back to its maximum value"

This reverts commit 1dcb6ad173e57b489a859ea59ed6eaa733bdb5bc.

As of "8cb16fdbea6b Restore original frequency on exit.", powerd
restores the original frequency itself.

Further, if the original frequency is not the same as the
first frequency found in the frequency list, then the restoration
done by powerd_poststop() will restore the wrong frequency.
This can happen on Intel machines where Turbo is not enabled,
but the turbo frequency is first in the list of frequencies.
In this case, turbo will be enabled when the user did not want
it to be.
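
What the removed rc.d hook computed can be sketched in C: it took the
text before the first '/' of dev.cpu.0.freq_levels and wrote it back to
dev.cpu.0.freq.  The function below is a hypothetical userland
illustration of that parsing step; the sample strings in the note after
it are made up, not captured from a real machine.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return the first frequency (in MHz) from a freq_levels style string of
 * space-separated "freq/power" pairs, or -1 if no such pair is found.
 */
long
demo_first_freq_level(const char *levels)
{
	char *end;
	long freq;

	assert(levels != NULL);
	freq = strtol(levels, &end, 10);
	if (end == levels || *end != '/')
		return (-1);
	return (freq);
}
```

For an assumed list such as "2401/35000 2400/35000 2200/30000", where the
first entry is the turbo level, this yields 2401 even if the system was
running at 2400 before powerd started, which is the mismatch described
above.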

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D40197
Reviewed by: imp, mav
---
 libexec/rc/rc.d/powerd | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/libexec/rc/rc.d/powerd b/libexec/rc/rc.d/powerd
index 2fc783a627e9..6f63bb96ff42 100755
--- a/libexec/rc/rc.d/powerd
+++ b/libexec/rc/rc.d/powerd
@@ -14,13 +14,6 @@ name="powerd"
 desc="Modify the power profile based on AC line state"
 rcvar="powerd_enable"
 command="/usr/sbin/${name}"
-stop_postcmd=powerd_poststop
-
-powerd_poststop()
-{
-   sysctl dev.cpu.0.freq=`sysctl -n dev.cpu.0.freq_levels |
-   sed -e 's:/.*::'` > /dev/null
-}
 
 load_rc_config $name
 run_rc_command "$1"

git: 8de48df35c3b - main - ixgbe: Do not count L3/L4 checksum errors as input errors

2023-02-02 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=8de48df35c3bf4800176b7aa54c75a01864d458b

commit 8de48df35c3bf4800176b7aa54c75a01864d458b
Author: Andrew Gallatin 
AuthorDate: 2023-02-02 15:02:44 +
Commit: Andrew Gallatin 
CommitDate: 2023-02-02 15:14:12 +

ixgbe: Do not count L3/L4 checksum errors as input errors

NIC input errors have traditionally indicated problems at the link
level (CRC errors, runts, etc).  People tend to build monitoring
infrastructure around such errors in order to monitor for bad network
hardware. When L3/L4 checksum errors are included in the category of
input errors, it breaks such monitoring, as these errors can originate
anywhere on the internet, and do not necessarily indicate faulty
local network hardware.
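
As a rough userland sketch of the resulting aggregation: the struct below
is a hypothetical subset whose field names mirror those in the diff, with
mpc reduced to a single counter instead of the driver's array.  The
checksum error counter (xec) is kept in the struct but deliberately left
out of the sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of per-port counters, named after the diff's fields. */
struct demo_rx_stats {
	uint64_t crcerrs;	/* CRC errors */
	uint64_t illerrc;	/* illegal byte errors */
	uint64_t xec;		/* L3/L4 checksum errors */
	uint64_t mpc;		/* missed packets */
	uint64_t rlec;		/* length errors */
	uint64_t ruc;		/* undersized packets */
	uint64_t rfc;		/* fragmented packets */
	uint64_t roc;		/* oversized packets */
	uint64_t rjc;		/* jabber */
};

/* Sum link-level input errors only; xec is intentionally excluded. */
uint64_t
demo_input_errors(const struct demo_rx_stats *s)
{
	assert(s != NULL);
	return (s->crcerrs + s->illerrc + s->mpc + s->rlec + s->ruc +
	    s->rfc + s->roc + s->rjc);
}
```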

Reviewed by: erj, glebius
Differential Revision: https://reviews.freebsd.org/D38346
Sponsored by: Netflix
---
 sys/dev/ixgbe/if_ix.c | 5 -
 sys/dev/ixgbe/ixgbe.h | 1 -
 2 files changed, 6 deletions(-)

diff --git a/sys/dev/ixgbe/if_ix.c b/sys/dev/ixgbe/if_ix.c
index 4f6faeec4296..8df0e59a8346 100644
--- a/sys/dev/ixgbe/if_ix.c
+++ b/sys/dev/ixgbe/if_ix.c
@@ -1577,19 +1577,14 @@ ixgbe_update_stats_counters(struct ixgbe_softc *sc)
 * Aggregate following types of errors as RX errors:
 * - CRC error count,
 * - illegal byte error count,
-* - checksum error count,
 * - missed packets count,
 * - length error count,
 * - undersized packets count,
 * - fragmented packets count,
 * - oversized packets count,
 * - jabber count.
-*
-* Ignore XEC errors for 82599 to workaround errata about
-* UDP frames with zero checksum.
 */
IXGBE_SET_IERRORS(sc, stats->crcerrs + stats->illerrc +
-   (hw->mac.type != ixgbe_mac_82599EB ? stats->xec : 0) +
stats->mpc[0] + stats->rlec + stats->ruc + stats->rfc + stats->roc +
stats->rjc);
 } /* ixgbe_update_stats_counters */
diff --git a/sys/dev/ixgbe/ixgbe.h b/sys/dev/ixgbe/ixgbe.h
index 0f81a0a2c2da..83a51b4d15e7 100644
--- a/sys/dev/ixgbe/ixgbe.h
+++ b/sys/dev/ixgbe/ixgbe.h
@@ -507,7 +507,6 @@ struct ixgbe_softc {
 "\nSum of the following RX errors counters:\n" \
 " * CRC errors,\n" \
 " * illegal byte error count,\n" \
-" * checksum error count,\n" \
 " * missed packet count,\n" \
 " * length error count,\n" \
 " * undersized packets count,\n" \

git: c0e4090e3d43 - main - ktls: Accurately track if ifnet ktls is enabled

2023-02-09 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=c0e4090e3d43eeb86270dd35835862660b045c26

commit c0e4090e3d43eeb86270dd35835862660b045c26
Author: Andrew Gallatin 
AuthorDate: 2023-02-08 20:37:08 +
Commit: Andrew Gallatin 
CommitDate: 2023-02-09 17:44:44 +

ktls: Accurately track if ifnet ktls is enabled

This allows us to avoid spurious calls to ktls_disable_ifnet().

When we implemented ifnet kTLS, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS.  This flag meant that
now, or in the past, ifnet ktls was active on a socket.  Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection.  Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcpcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380
---
 sys/kern/uipc_ktls.c  | 145 +-
 sys/netinet/tcp_output.c  |   2 +-
 sys/netinet/tcp_ratelimit.c   |   4 +-
 sys/netinet/tcp_stacks/bbr.c  |   2 +-
 sys/netinet/tcp_stacks/rack.c |  14 +---
 sys/netinet/tcp_var.h |   3 +
 sys/sys/ktls.h|   3 +-
 sys/sys/sockbuf.h |   2 +-
 8 files changed, 126 insertions(+), 49 deletions(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index ac55268728e9..b3895aee9249 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -222,6 +222,11 @@ static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_ok);
 SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_ok, CTLFLAG_RD,
 &ktls_ifnet_disable_ok, "TLS sessions able to switch to SW from ifnet");
 
+static COUNTER_U64_DEFINE_EARLY(ktls_destroy_task);
+SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, destroy_task, CTLFLAG_RD,
+&ktls_destroy_task,
+"Number of times ktls session was destroyed via taskqueue");
+
 SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, sw, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
 "Software TLS session stats");
 SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, ifnet, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
@@ -619,10 +624,14 @@ ktls_create_session(struct socket *so, struct tls_enable *en,
counter_u64_add(ktls_offload_active, 1);
 
refcount_init(&tls->refcount, 1);
-   if (direction == KTLS_RX)
+   if (direction == KTLS_RX) {
TASK_INIT(&tls->reset_tag_task, 0, ktls_reset_receive_tag, tls);
-   else
+   } else {
TASK_INIT(&tls->reset_tag_task, 0, ktls_reset_send_tag, tls);
+   tls->inp = so->so_pcb;
+   in_pcbref(tls->inp);
+   tls->tx = true;
+   }
 
tls->wq_index = ktls_get_cpu(so);
 
@@ -757,12 +766,16 @@ ktls_clone_session(struct ktls_session *tls, int direction)
counter_u64_add(ktls_offload_active, 1);
 
refcount_init(&tls_new->refcount, 1);
-   if (direction == KTLS_RX)
+   if (direction == KTLS_RX) {
TASK_INIT(&tls_new->reset_tag_task, 0, ktls_reset_receive_tag,
tls_new);
-   else
+   } else {
TASK_INIT(&tls_new->reset_tag_task, 0, ktls_reset_send_tag,
tls_new);
+   tls_new->inp = tls->inp;
+   tls_new->tx = true;
+   in_pcbref(tls_new->inp);
+   }
 
/* Copy fields from existing session. */
tls_new->params = tls->params;
@@ -1272,6 +1285,7 @@ ktls_enable_tx(struct socket *so, struct tls_enable *en)
 {
struct ktls_session *tls;
struct inpcb *inp;
+   struct tcpcb *tp;
int error;
 
if (!ktls_offload_enable)
@@ -1336,8 +1350,13 @@ ktls_enable_tx(struct socket *so, struct tls_enable *en)
SOCKBUF_LOCK(&so->so_snd);
so->so_snd.sb_tls_seqno = be64dec(en->rec_seq);
so->so_snd.sb_tls_info = tls;
-   if (tls->mode != TCP_TLS_MODE

git: d24b032bec1b - main - ktls: Fix comments & whitespace issues with c0e4090e3d43

2023-02-09 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=d24b032bec1b868b99fd1f3f23ec8116cd719e94

commit d24b032bec1b868b99fd1f3f23ec8116cd719e94
Author: Andrew Gallatin 
AuthorDate: 2023-02-09 19:09:05 +
Commit: Andrew Gallatin 
CommitDate: 2023-02-09 19:11:24 +

ktls: Fix comments & whitespace issues with c0e4090e3d43

Address some last minute review feedback on c0e4090e3d43
by fixing spacing around comments, and clarifying that the
newly added destroy_task is not related to tls 1.0.
No functional change intended.

Pointed out by: jhb
Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 3 ++-
 sys/sys/ktls.h   | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index b3895aee9249..cb2e3f272774 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -1478,6 +1478,7 @@ ktls_set_tx_mode(struct socket *so, int mode)
/* Don't allow enabling ifnet ktls multiple times */
if (tp->t_nic_ktls_xmit)
return (EALREADY);
+
/*
 * Don't enable ifnet ktls if we disabled it due to an
 * excessive retransmission rate
@@ -1850,7 +1851,6 @@ ktls_destroy(struct ktls_session *tls)
 * know that we don't hold the inp rlock, and
 * can safely take the wlock
 */
-
if (curthread->td_rw_rlocks == 0) {
INP_WLOCK(inp);
} else {
@@ -3335,6 +3335,7 @@ ktls_disable_ifnet(void *arg)
SOCK_UNLOCK(so);
return;
}
+
/*
 * note that t_nic_ktls_xmit_dis is never cleared; disabling
 * ifnet can only be done once per connection, so we never want
diff --git a/sys/sys/ktls.h b/sys/sys/ktls.h
index 909d5347bc47..549ce3ee869d 100644
--- a/sys/sys/ktls.h
+++ b/sys/sys/ktls.h
@@ -201,6 +201,8 @@ struct ktls_session {
/* Only used for TLS 1.0. */
uint64_t next_seqno;
STAILQ_HEAD(, mbuf) pending_records;
+
+   /* Used to destroy any kTLS session */
struct task destroy_task;
 } __aligned(CACHE_LINE_SIZE);

git: abba58766fdd - main - LRO: Add missing checks for invalid IP addresses

2023-03-25 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=abba58766fdd7f9720761aba39c2b9653eb4fbd3

commit abba58766fdd7f9720761aba39c2b9653eb4fbd3
Author: Andrew Gallatin 
AuthorDate: 2023-03-25 15:51:51 +
Commit: Andrew Gallatin 
CommitDate: 2023-03-25 15:56:02 +

LRO: Add missing checks for invalid IP addresses

LRO bypasses normal ip_input()/tcp_input() and lacks several checks
that are present in the normal path.  Without these checks, it
is possible to trigger assertions added in b0ccf53f2455.
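
The added checks can be approximated in userland with the standard
<netinet/in.h> definitions.  This is only a sketch of the address tests;
the demo_* function names are hypothetical, and the kernel's
__predict_false() branch hints and the surrounding parser are omitted.

```c
#include <sys/types.h>
#include <netinet/in.h>

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if neither IPv4 address is the unspecified address 0.0.0.0. */
bool
demo_ip4_addrs_valid(struct in_addr src, struct in_addr dst)
{
	/* INADDR_ANY is 0, so byte order does not affect the comparison. */
	return (src.s_addr != INADDR_ANY && dst.s_addr != INADDR_ANY);
}

/* True if neither IPv6 address is the unspecified address "::". */
bool
demo_ip6_addrs_valid(const struct in6_addr *src, const struct in6_addr *dst)
{
	assert(src != NULL && dst != NULL);
	return (!IN6_IS_ADDR_UNSPECIFIED(src) &&
	    !IN6_IS_ADDR_UNSPECIFIED(dst));
}
```

In the diff below, a failed check makes the parser give up on the packet,
via break in the IPv4 case and return (NULL) in the IPv6 case.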

Reviewed by: glebius, rrs
Sponsored by: Netflix
---
 sys/netinet/tcp_lro.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/sys/netinet/tcp_lro.c b/sys/netinet/tcp_lro.c
index bde8fadbc05b..908f9cdd7ea4 100644
--- a/sys/netinet/tcp_lro.c
+++ b/sys/netinet/tcp_lro.c
@@ -292,6 +292,10 @@ tcp_lro_low_level_parser(void *ptr, struct lro_parser *parser, bool update_data,
/* .. and the packet is not fragmented. */
if (parser->ip4->ip_off & htons(IP_MF|IP_OFFMASK))
break;
+   /* .. and the packet has valid src/dst addrs */
+   if (__predict_false(parser->ip4->ip_src.s_addr == INADDR_ANY ||
+   parser->ip4->ip_dst.s_addr == INADDR_ANY))
+   break;
ptr = (uint8_t *)ptr + (parser->ip4->ip_hl << 2);
mlen -= sizeof(struct ip);
if (update_data) {
@@ -339,6 +343,10 @@ tcp_lro_low_level_parser(void *ptr, struct lro_parser *parser, bool update_data,
parser->ip6 = ptr;
if (__predict_false(mlen < sizeof(struct ip6_hdr)))
return (NULL);
+   /* Ensure the packet has valid src/dst addrs */
+   if (__predict_false(IN6_IS_ADDR_UNSPECIFIED(&parser->ip6->ip6_src) ||
+   IN6_IS_ADDR_UNSPECIFIED(&parser->ip6->ip6_dst)))
+   return (NULL);
ptr = (uint8_t *)ptr + sizeof(*parser->ip6);
if (update_data) {
parser->data.s_addr.v6 = parser->ip6->ip6_src;

git: 2c6ff1d6320d - main - LRO: fix BPF filters for lagg in the hpts path

2022-08-13 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=2c6ff1d6320d57a9d0dc62c10c83145ed49a51dd

commit 2c6ff1d6320d57a9d0dc62c10c83145ed49a51dd
Author: Andrew Gallatin 
AuthorDate: 2022-08-13 00:15:46 +
Commit: Andrew Gallatin 
CommitDate: 2022-08-13 21:33:36 +

LRO: fix BPF filters for lagg in the hpts path

When in the hpts path, we need to handle BPF filters since aggregated
packets do not pass up the stack in the normal way. This is already
done for most interfaces, but lagg needs special handling. This is
because packets received via a lagg are passed up the stack with
the leaf interface's ifp stored in m_pkthdr.rcvif.

To handle lagg packets, we must identify that the passed rcvif is
currently a lagg port by checking for IFT_IEEE8023ADLAG or
IFT_INFINIBANDLAG (since lagg changes the lagg port's type to that
when an interface becomes a lagg member). Then we need to find the
lagg's ifp, and handle any BPF listeners on the lagg.

Note: It is possible to have multiple BPF filters, one on a member
port and one on the lagg itself. That is why we have to have 2
checks and 2 ETHER_BPF_MTAPs.
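
The two-check structure can be modelled with a small userland sketch.
Everything here is hypothetical: struct demo_ifnet is a toy stand-in for
the kernel's ifnet, and a counter increment stands in for
ETHER_BPF_MTAP().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified interface model; not the kernel ifnet. */
enum demo_iftype { DEMO_IFT_ETHER, DEMO_IFT_LAGG_PORT };

struct demo_ifnet {
	enum demo_iftype type;
	bool has_bpf_listener;
	struct demo_ifnet *lagg_parent;	/* set only for lagg member ports */
	unsigned taps;			/* packets handed to BPF */
};

/* Tap one packet on the receiving port and, if applicable, on its lagg. */
void
demo_tap_packet(struct demo_ifnet *rcvif)
{
	struct demo_ifnet *lagg_ifp = NULL;

	assert(rcvif != NULL);
	if (rcvif->type == DEMO_IFT_LAGG_PORT)
		lagg_ifp = rcvif->lagg_parent;

	/* Check 1: listeners attached to the member port itself. */
	if (rcvif->has_bpf_listener)
		rcvif->taps++;
	/* Check 2: listeners attached to the lagg interface. */
	if (lagg_ifp != NULL && lagg_ifp->has_bpf_listener)
		lagg_ifp->taps++;
}
```

A packet arriving on a member port with listeners on both interfaces is
therefore tapped twice, once per interface.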

Reviewed by: jhb, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D36136
---
 sys/netinet/tcp_lro.c | 30 ++
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/sys/netinet/tcp_lro.c b/sys/netinet/tcp_lro.c
index 2633ccd12afc..fcde002bac53 100644
--- a/sys/netinet/tcp_lro.c
+++ b/sys/netinet/tcp_lro.c
@@ -53,6 +53,11 @@ __FBSDID("$FreeBSD$");
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -85,7 +90,8 @@ static int tcp_lro_rx_common(struct lro_ctrl *lc, struct mbuf *m,
 
 #ifdef TCPHPTS
 static bool do_bpf_strip_and_compress(struct inpcb *, struct lro_ctrl *,
-   struct lro_entry *, struct mbuf **, struct mbuf **, struct mbuf **, bool *, bool);
+   struct lro_entry *, struct mbuf **, struct mbuf **, struct mbuf **,
+   bool *, bool, bool, struct ifnet *);
 
 #endif
 
@@ -1283,7 +1289,8 @@ tcp_lro_flush_tcphpts(struct lro_ctrl *lc, struct lro_entry *le)
struct inpcb *inp;
struct tcpcb *tp;
struct mbuf **pp, *cmp, *mv_to;
-   bool bpf_req, should_wake;
+   struct ifnet *lagg_ifp;
+   bool bpf_req, lagg_bpf_req, should_wake;
 
/* Check if packet doesn't belongs to our network interface. */
if ((tcplro_stacks_wanting_mbufq == 0) ||
@@ -1341,13 +1348,25 @@ tcp_lro_flush_tcphpts(struct lro_ctrl *lc, struct lro_entry *le)
should_wake = true;
/* Check if packets should be tapped to BPF. */
bpf_req = bpf_peers_present(lc->ifp->if_bpf);
+   lagg_bpf_req = false;
+   lagg_ifp = NULL;
+   if (lc->ifp->if_type == IFT_IEEE8023ADLAG ||
+   lc->ifp->if_type == IFT_INFINIBANDLAG) {
+   struct lagg_port *lp = lc->ifp->if_lagg;
+   struct lagg_softc *sc = lp->lp_softc;
+
+   lagg_ifp = sc->sc_ifp;
+   if (lagg_ifp != NULL)
+   lagg_bpf_req = bpf_peers_present(lagg_ifp->if_bpf);
+   }
 
/* Strip and compress all the incoming packets. */
cmp = NULL;
for (pp = &le->m_head; *pp != NULL; ) {
mv_to = NULL;
if (do_bpf_strip_and_compress(inp, lc, le, pp,
-&cmp, &mv_to, &should_wake, bpf_req ) == false) {
+   &cmp, &mv_to, &should_wake, bpf_req,
+   lagg_bpf_req, lagg_ifp) == false) {
/* Advance to next mbuf. */
pp = &(*pp)->m_nextpkt;
} else if (mv_to != NULL) {
@@ -1593,7 +1612,7 @@ build_ack_entry(struct tcp_ackent *ae, struct tcphdr *th, struct mbuf *m,
 static bool
 do_bpf_strip_and_compress(struct inpcb *inp, struct lro_ctrl *lc,
 struct lro_entry *le, struct mbuf **pp, struct mbuf **cmp, struct mbuf **mv_to,
-bool *should_wake, bool bpf_req)
+bool *should_wake, bool bpf_req, bool lagg_bpf_req, struct ifnet *lagg_ifp)
 {
union {
void *ptr;
@@ -1619,6 +1638,9 @@ do_bpf_strip_and_compress(struct inpcb *inp, struct lro_ctrl *lc,
if (__predict_false(bpf_req))
ETHER_BPF_MTAP(lc->ifp, m);
 
+   if (__predict_false(lagg_bpf_req))
+   ETHER_BPF_MTAP(lagg_ifp, m);
+
tcp_hdr_offset = m->m_pkthdr.lro_tcp_h_off;
lro_type = le->inner.data.lro_type;
switch (lro_type) {

git: 8b19898a78d5 - main - Fix a panic on boot introduced by 555a861d6826

2022-11-01 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=8b19898a78d52b351f4d7a6ad1d8b074d037e3b7

commit 8b19898a78d52b351f4d7a6ad1d8b074d037e3b7
Author: Andrew Gallatin 
AuthorDate: 2022-11-01 17:44:39 +
Commit: Andrew Gallatin 
CommitDate: 2022-11-01 17:44:39 +

Fix a panic on boot introduced by 555a861d6826

First, an sbuf_new() in device_get_path() shadows the sb
passed in by dev_wired_cache_add(), leaving its sb in an
unfinished state, leading to a failed KASSERT().  Fixing this
is as simple as removing the sbuf_new() from device_get_path().

Second, we cannot simply take a pointer to the sbuf memory and
store it in the device location cache, because that sbuf
is freed immediately after we add data to the cache, leading
to a use-after-free and eventually a double-free.  Fixing this
requires allocating memory for the path.

After a discussion with jhb, we decided that one malloc was
better than two in dev_wired_cache_add, which is why it changed
so much.
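
The single-allocation layout described above can be sketched in userland C. This is only an illustration under stated assumptions: struct loc_node and loc_node_new() are hypothetical stand-ins for struct device_location_node and dev_wired_cache_add(), and calloc() replaces the kernel's malloc(9) with M_WAITOK | M_ZERO, so unlike the kernel version it can return NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct device_location_node. */
struct loc_node {
	const char *locator;
	const char *path;
};

/*
 * Allocate the node and both strings in one block:
 *   [ struct loc_node ][ locator\0 ][ path\0 ]
 * The node owns private copies of both strings, and a single
 * free() of the node releases everything.
 */
static struct loc_node *
loc_node_new(const char *locator, const char *path)
{
	size_t loclen = strlen(locator) + 1;
	size_t pathlen = strlen(path) + 1;
	struct loc_node *n;
	char *strs;

	n = calloc(1, sizeof(*n) + loclen + pathlen);
	if (n == NULL)
		return (NULL);
	strs = (char *)(n + 1);		/* string area follows the struct */
	memcpy(strs, locator, loclen);
	memcpy(strs + loclen, path, pathlen);
	n->locator = strs;
	n->path = strs + loclen;
	assert(n->path[pathlen - 1] == '\0');
	return (n);
}
```

Because the path is copied into the node's own allocation, the caller is free to destroy its sbuf right after the call, which is the use-after-free the commit describes.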

Reviewed by: jhb
Sponsored by: Netflix
MFC after: 14 days
---
 sys/kern/subr_bus.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/sys/kern/subr_bus.c b/sys/kern/subr_bus.c
index 5c165419af2d..2fcf650b0289 100644
--- a/sys/kern/subr_bus.c
+++ b/sys/kern/subr_bus.c
@@ -5310,7 +5310,7 @@ device_get_path(device_t dev, const char *locator, struct 
sbuf *sb)
device_t parent;
int error;
 
-   sb = sbuf_new(NULL, NULL, 0, SBUF_AUTOEXTEND | SBUF_INCLUDENUL);
+   KASSERT(sb != NULL, ("sb is NULL"));
parent = device_get_parent(dev);
if (parent == NULL) {
error = sbuf_printf(sb, "/");
@@ -5663,8 +5663,6 @@ dev_wired_cache_fini(device_location_cache_t *dcp)
struct device_location_node *dln, *tdln;
 
TAILQ_FOREACH_SAFE(dln, &dcp->dlc_list, dln_link, tdln) {
-   /* Note: one allocation for both node and locator, but not path 
*/
-   free(__DECONST(void *, dln->dln_path), M_BUS);
free(dln, M_BUS);
}
free(dcp, M_BUS);
@@ -5687,12 +5685,15 @@ static struct device_location_node *
 dev_wired_cache_add(device_location_cache_t *dcp, const char *locator, const 
char *path)
 {
struct device_location_node *dln;
-   char *l;
-
-   dln = malloc(sizeof(*dln) + strlen(locator) + 1, M_BUS, M_WAITOK | 
M_ZERO);
-   dln->dln_locator = l = (char *)(dln + 1);
-   memcpy(l, locator, strlen(locator) + 1);
-   dln->dln_path = path;
+   size_t loclen, pathlen;
+
+   loclen = strlen(locator) + 1;
+   pathlen = strlen(path) + 1;
+   dln = malloc(sizeof(*dln) + loclen + pathlen, M_BUS, M_WAITOK | M_ZERO);
+   dln->dln_locator = (char *)(dln + 1);
+   memcpy(__DECONST(char *, dln->dln_locator), locator, loclen);
+   dln->dln_path = dln->dln_locator + loclen;
+   memcpy(__DECONST(char *, dln->dln_path), path, pathlen);
TAILQ_INSERT_HEAD(&dcp->dlc_list, dln, dln_link);
 
return (dln);

git: 17859d538c23 - main - ixl: silence runtime warning when PCI_IOV is not enabled

2022-12-06 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=17859d538c23d6faa5a5512262d678377130e591

commit 17859d538c23d6faa5a5512262d678377130e591
Author: Andrew Gallatin 
AuthorDate: 2022-12-06 16:35:18 +
Commit: Andrew Gallatin 
CommitDate: 2022-12-06 16:35:18 +

ixl: silence runtime warning when PCI_IOV is not enabled

When PCI_IOV is not enabled, do not attempt to call
iflib_softirq_alloc_generic(...IFLIB_INTR_IOV), as it results
in boot-time warnings similar to:
taskqgroup_attach_cpu: qid not found for iov cpu=2
ixl2: taskqgroup_attach_cpu failed 22
Instead, make it conditional on PCI_IOV like the other
SR-IOV related code.

Reviewed by: erj
Sponsored by: Netflix
Differential Revision:  https://reviews.freebsd.org/D37609
---
 sys/dev/ixl/if_ixl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/sys/dev/ixl/if_ixl.c b/sys/dev/ixl/if_ixl.c
index cb3ce72a95ed..352a35d95512 100644
--- a/sys/dev/ixl/if_ixl.c
+++ b/sys/dev/ixl/if_ixl.c
@@ -1064,8 +1064,11 @@ ixl_if_msix_intr_assign(if_ctx_t ctx, int msix)
"Failed to register Admin Que handler");
return (err);
}
+
+#ifdef PCI_IOV
/* Create soft IRQ for handling VFLRs */
iflib_softirq_alloc_generic(ctx, NULL, IFLIB_INTR_IOV, pf, 0, "iov");
+#endif
 
/* Now set up the stations */
for (i = 0, vector = 1; i < vsi->shared->isc_nrxqsets; i++, vector++, 
rx_que++) {

git: c4a4b2633d97 - main - allocate inpcb aligned to cachelines

2022-12-14 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=c4a4b2633d975bd0813afca6b8e23ead29d80e82

commit c4a4b2633d975bd0813afca6b8e23ead29d80e82
Author: Andrew Gallatin 
AuthorDate: 2022-12-14 19:19:35 +
Commit: Andrew Gallatin 
CommitDate: 2022-12-14 19:19:35 +

allocate inpcb aligned to cachelines

The inpcb struct is one of the most heavily utilized in the kernel
on a busy network server.  By aligning it to a cacheline
boundary, we can ensure that closely related fields in the inpcb
and tcpcb can be predictably located on the same cacheline.  rrs
has already done a lot of this work to put related fields on the
same line for the tcpcb.

In combination with a forthcoming patch to align the start of the tcpcb,
we see a roughly 3% reduction in CPU use on a busy web server serving
traffic over roughly 50,000 TCP connections.
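
As a rough userland illustration of why the allocation alignment matters, the sketch below allocates a structure on a cache-line boundary so that the fields grouped at its start are guaranteed to share the first line. Everything here is an assumption made for illustration: struct conn, conn_alloc() and the 64-byte CACHE_LINE are hypothetical, and aligned_alloc() stands in for a UMA zone created with UMA_ALIGN_CACHE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64	/* assumed line size; the kernel uses CACHE_LINE_SIZE */

/* Hypothetical control block with hot fields grouped at the front. */
struct conn {
	uint32_t snd_nxt;
	uint32_t rcv_nxt;
	uint32_t flags;
	char     cold[100];
};

/* Field grouping only helps if the hot fields fit in one line. */
static_assert(offsetof(struct conn, flags) + sizeof(uint32_t) <= CACHE_LINE,
    "hot fields must fit in the first cache line");

/* Allocate a conn whose first byte sits on a cache-line boundary. */
static struct conn *
conn_alloc(void)
{
	/* aligned_alloc() wants a size that is a multiple of the alignment. */
	size_t sz = (sizeof(struct conn) + CACHE_LINE - 1) /
	    CACHE_LINE * CACHE_LINE;
	struct conn *c = aligned_alloc(CACHE_LINE, sz);

	assert(c == NULL || (uintptr_t)c % CACHE_LINE == 0);
	return (c);
}
```

With pointer alignment only, the same three hot fields could straddle two lines depending on where the allocator happened to place the object.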

Reviewed by: glebius, markj, tuexen
Differential Revision: https://reviews.freebsd.org/D37687
Sponsored by: Netflix
---
 sys/netinet/in_pcb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/netinet/in_pcb.c b/sys/netinet/in_pcb.c
index 3a83682b711f..e7f425f8593a 100644
--- a/sys/netinet/in_pcb.c
+++ b/sys/netinet/in_pcb.c
@@ -552,7 +552,7 @@ in_pcbstorage_init(void *arg)
 
pcbstor->ips_zone = uma_zcreate(pcbstor->ips_zone_name,
pcbstor->ips_size, NULL, inpcb_dtor, pcbstor->ips_pcbinit,
-   inpcb_fini, UMA_ALIGN_PTR, UMA_ZONE_SMR);
+   inpcb_fini, UMA_ALIGN_CACHE, UMA_ZONE_SMR);
pcbstor->ips_portzone = uma_zcreate(pcbstor->ips_portzone_name,
sizeof(struct inpcbport), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
uma_zone_set_smr(pcbstor->ips_portzone,

git: 1cac76c93fb7 - main - vm: reduce lock contention when processing vm batchqueues

2022-12-14 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=1cac76c93fb7f627fd9e304cbd99e8c8a2b8fce8

commit 1cac76c93fb7f627fd9e304cbd99e8c8a2b8fce8
Author: Andrew Gallatin 
AuthorDate: 2022-12-14 19:34:07 +
Commit: Andrew Gallatin 
CommitDate: 2022-12-14 19:34:07 +

vm: reduce lock contention when processing vm batchqueues

Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for our busy sendfile() based web workload.
It also greatly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.

So that the system does not lose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
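
The policy described above (keep batching below half full, try the lock opportunistically from half full onward, and block only when the batch is completely full) can be sketched in userland C. This is a simplified illustration, not the kernel code: struct batch, struct shared_queue and batch_submit() are hypothetical names, BATCH_SIZE is an arbitrary value, and a pthread mutex stands in for the vm pagequeue mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define BATCH_SIZE 8	/* hypothetical; the commit uses VM_BATCHQUEUE_SIZE */

struct batch {
	int items[BATCH_SIZE];
	int cnt;
};

struct shared_queue {
	pthread_mutex_t lock;
	long total;		/* protected by lock */
};

/* Drain the batch into the shared queue; caller holds q->lock. */
static void
batch_flush(struct shared_queue *q, struct batch *b)
{
	for (int i = 0; i < b->cnt; i++)
		q->total += b->items[i];
	b->cnt = 0;
}

/* Add one item to the batch. Returns true if the batch was flushed. */
static bool
batch_submit(struct shared_queue *q, struct batch *b, int item)
{
	int slots_remaining;

	assert(b->cnt < BATCH_SIZE);
	b->items[b->cnt++] = item;
	slots_remaining = BATCH_SIZE - b->cnt;

	if (slots_remaining > BATCH_SIZE / 2)
		return (false);		/* keep building the batch */
	if (slots_remaining > 0) {
		/* Opportunistic: flush only if the lock is free. */
		if (pthread_mutex_trylock(&q->lock) != 0)
			return (false);
	} else {
		/* Batch is full: we have to wait for the lock. */
		pthread_mutex_lock(&q->lock);
	}
	batch_flush(q, b);
	pthread_mutex_unlock(&q->lock);
	return (true);
}
```

When the lock is uncontended, this sketch flushes every BATCH_SIZE / 2 items, which is why doubling the batch size preserves the old effective batch size.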

This has been run for several months on a busy Netflix server, as well
as on my personal desktop.

Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305
---
 sys/amd64/include/vmparam.h   |  2 +-
 sys/powerpc/include/vmparam.h |  2 +-
 sys/vm/vm_page.c  | 17 +++--
 sys/vm/vm_pageout.c   |  2 +-
 sys/vm/vm_pagequeue.h | 12 +++-
 5 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/sys/amd64/include/vmparam.h b/sys/amd64/include/vmparam.h
index fc88296f754c..205848489644 100644
--- a/sys/amd64/include/vmparam.h
+++ b/sys/amd64/include/vmparam.h
@@ -293,7 +293,7 @@
  * Use a fairly large batch size since we expect amd64 systems to have lots of
  * memory.
  */
-#define VM_BATCHQUEUE_SIZE  31
+#define VM_BATCHQUEUE_SIZE  63
 
 /*
  * The pmap can create non-transparent large page mappings.
diff --git a/sys/powerpc/include/vmparam.h b/sys/powerpc/include/vmparam.h
index 77457717a3fd..1b9873aede4a 100644
--- a/sys/powerpc/include/vmparam.h
+++ b/sys/powerpc/include/vmparam.h
@@ -263,7 +263,7 @@ extern  int vm_level_0_order;
  * memory.
  */
 #ifdef __powerpc64__
-#define VM_BATCHQUEUE_SIZE  31
+#define VM_BATCHQUEUE_SIZE  63
 #endif
 
 /*
diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index 2b7bc6a5b66e..797207205f42 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -3662,19 +3662,32 @@ vm_page_pqbatch_submit(vm_page_t m, uint8_t queue)
 {
struct vm_batchqueue *bq;
struct vm_pagequeue *pq;
-   int domain;
+   int domain, slots_remaining;
 
KASSERT(queue < PQ_COUNT, ("invalid queue %d", queue));
 
domain = vm_page_domain(m);
critical_enter();
bq = DPCPU_PTR(pqbatch[domain][queue]);
-   if (vm_batchqueue_insert(bq, m)) {
+   slots_remaining = vm_batchqueue_insert(bq, m);
+   if (slots_remaining > (VM_BATCHQUEUE_SIZE >> 1)) {
+   /* keep building the bq */
+   critical_exit();
+   return;
+   } else if (slots_remaining > 0 ) {
+   /* Try to process the bq if we can get the lock */
+   pq = &VM_DOMAIN(domain)->vmd_pagequeues[queue];
+   if (vm_pagequeue_trylock(pq)) {
+   vm_pqbatch_process(pq, bq, queue);
+   vm_pagequeue_unlock(pq);
+   }
critical_exit();
return;
}
critical_exit();
 
+   /* if we make it here, the bq is full so wait for the lock */
+
pq = &VM_DOMAIN(domain)->vmd_pagequeues[queue];
vm_pagequeue_lock(pq);
critical_enter();
diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index bb12a7e335d5..2945b53835c6 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1405,7 +1405,7 @@ vm_pageout_reinsert_inactive(struct scan_state *ss, 
struct vm_batchqueue *bq,
pq = ss->pq;
 
if (m != NULL) {
-   if (vm_batchqueue_insert(bq, m))
+   if (vm_batchqueue_insert(bq, m) != 0)
return;
vm_pagequeue_lock(pq);
delta += vm_pageout_reinsert_inactive_page(pq, marker, m);
diff --git a/sys/vm/vm_pagequeue.h b/sys/vm/vm_pagequeue.h
index a9d4c920e5be..268d53a391db 100644
--- a/sys/vm/vm_pagequeue.h
+++ b/sys/vm/vm_pagequeue.h
@@ -75,7 +75,7 @@ struct vm_pagequeue {
 } __aligned(CACHE_LINE_SIZE);
 
 #ifndef VM_BATCHQUEUE_SIZE
-#define VM_BATCHQUEUE_SIZE  7
+#define VM_BATCHQUEUE_SIZE  15
 #endif
 
 struct vm_batchqueue {
@@ -356,15 +356,17 @@ vm_batchqueue_init(struct vm_batchqueue *bq)
bq->bq_cnt = 0;
 }
 
-static inline bool
+static inline int
 vm_batchqueue_insert(struct vm_batchqueue *bq, vm_page_t m)
 {
+   int s

git: ac4e3a27ab49 - main - Unbreak the build when MAC is not defined

2022-12-14 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=ac4e3a27ab499d3401e8810c6a11713e6ed6f76b

commit ac4e3a27ab499d3401e8810c6a11713e6ed6f76b
Author: Andrew Gallatin 
AuthorDate: 2022-12-14 22:33:30 +
Commit: Andrew Gallatin 
CommitDate: 2022-12-14 22:39:25 +

Unbreak the build when MAC is not defined

7a2c93b86ef7 removed the use of "error" when MAC was not
defined, resulting in an unused variable error.

Sponsored by: Netflix
Reviewed by: jhb
---
 sys/kern/sys_socket.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sys/kern/sys_socket.c b/sys/kern/sys_socket.c
index 5cfb366c150b..2ad76b15cee6 100644
--- a/sys/kern/sys_socket.c
+++ b/sys/kern/sys_socket.c
@@ -145,7 +145,8 @@ soo_write(struct file *fp, struct uio *uio, struct ucred 
*active_cred,
if (error)
return (error);
 #endif
-   return (sousrsend(so, NULL, uio, NULL, 0, NULL));
+   error = sousrsend(so, NULL, uio, NULL, 0, NULL);
+   return (error);
 }
 
 static int

git: 8ea418299548 - main - tcp: Build RACK and BBR stacks as a part of LINT

2023-01-10 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=8ea41829954831d345c3aef58488adf0fc8dbb42

commit 8ea41829954831d345c3aef58488adf0fc8dbb42
Author: Andrew Gallatin 
AuthorDate: 2023-01-10 21:09:00 +
Commit: Andrew Gallatin 
CommitDate: 2023-01-10 21:16:43 +

tcp: Build RACK and BBR stacks as a part of LINT

When RACK and BBR were added to the kernel, they were put
behind 'WITH_EXTRA_TCP_STACKS=1'.   Unfortunately that was
never added to any NOTES file, so RACK & BBR were not compiled
with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels.
This led to the stacks sometimes being broken.

This change:

- Fixes RACK so that it compiles with the various LINT-NO* kernels
- Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that
   RACK and BBR are compile tested regularly

Sponsored by: Netflix
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D37903
---
 sys/conf/NOTES|  1 +
 sys/netinet/tcp_stacks/rack.c | 23 ---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/sys/conf/NOTES b/sys/conf/NOTES
index a1c0e71551ae..6cea39d27ad6 100644
--- a/sys/conf/NOTES
+++ b/sys/conf/NOTES
@@ -677,6 +677,7 @@ options TCP_OFFLOAD # TCP offload support.
 options TCP_RFC7413 # TCP Fast Open
 
 options TCPHPTS
+makeoptions WITH_EXTRA_TCP_STACKS=1 # RACK and BBR TCP kernel modules
 
 # In order to enable IPSEC you MUST also add device crypto to 
 # your kernel configuration
diff --git a/sys/netinet/tcp_stacks/rack.c b/sys/netinet/tcp_stacks/rack.c
index 6070ad5dc17a..dafe8184a8fd 100644
--- a/sys/netinet/tcp_stacks/rack.c
+++ b/sys/netinet/tcp_stacks/rack.c
@@ -32,6 +32,7 @@ __FBSDID("$FreeBSD$");
 #include "opt_ipsec.h"
 #include "opt_ratelimit.h"
 #include "opt_kern_tls.h"
+#if defined(INET) || defined(INET6)
 #include 
 #include 
 #include 
@@ -12347,6 +12348,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack 
*rack)
  ip6, rack->r_ctl.fsb.th);
} else
 #endif /* INET6 */
+#ifdef INET
{
rack->r_ctl.fsb.tcp_ip_hdr_len = sizeof(struct tcpiphdr);
ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr;
@@ -12366,6 +12368,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack 
*rack)
  tp->t_port,
  ip, rack->r_ctl.fsb.th);
}
+#endif
rack->r_fsb_inited = 1;
 }
 
@@ -15611,7 +15614,7 @@ rack_fast_rsm_output(struct tcpcb *tp, struct tcp_rack 
*rack, struct rack_sendma
struct tcpopt to;
u_char opt[TCP_MAXOLEN];
uint32_t hdrlen, optlen;
-   int32_t slot, segsiz, max_val, tso = 0, error, ulen = 0;
+   int32_t slot, segsiz, max_val, tso = 0, error = 0, ulen = 0;
uint16_t flags;
uint32_t if_hw_tsomaxsegcount = 0, startseq;
uint32_t if_hw_tsomaxsegsize;
@@ -15935,8 +15938,6 @@ rack_fast_rsm_output(struct tcpcb *tp, struct tcp_rack 
*rack, struct rack_sendma
   &inp->inp_route6,
   0, NULL, NULL, inp);
}
-#endif
-#if defined(INET) && defined(INET6)
else
 #endif
 #ifdef INET
@@ -16102,7 +16103,9 @@ rack_fast_output(struct tcpcb *tp, struct tcp_rack 
*rack, uint64_t ts_val,
 * the max-burst). We have how much to send and all the info we
 * need to just send.
 */
+#ifdef INET
struct ip *ip = NULL;
+#endif
struct udphdr *udp = NULL;
struct tcphdr *th = NULL;
struct mbuf *m, *s_mb;
@@ -16133,8 +16136,10 @@ rack_fast_output(struct tcpcb *tp, struct tcp_rack 
*rack, uint64_t ts_val,
} else
 #endif /* INET6 */
{
+#ifdef INET
ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr;
hdrlen = sizeof(struct tcpiphdr);
+#endif
}
if (tp->t_port && (V_tcp_udp_tunneling_port == 0)) {
m = NULL;
@@ -16281,8 +16286,10 @@ again:
else
 #endif
{
+#ifdef INET
ip->ip_tos &= ~IPTOS_ECN_MASK;
ip->ip_tos |= ect;
+#endif
}
}
tcp_set_flags(th, flags);
@@ -18346,7 +18353,9 @@ send:
ip6 = (struct ip6_hdr *)rack->r_ctl.fsb.tcp_ip_hdr;
else
 #endif /* INET6 */
+#ifdef INET
ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr;
+#endif
th = rack->r_ctl.fsb.th;
udp = rack->r_ctl.fsb.udp;
if (udp) {
@@ -18375,6 +18384,7 @@ send:
} else
 #endif /* INET6

git: 9cb6ba29cb70 - main - vm: centralize VM_BATCHQUEUE_SIZE definition

2023-01-21 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=9cb6ba29cb704c180d5b82f409e280377a641a28

commit 9cb6ba29cb704c180d5b82f409e280377a641a28
Author: Andrew Gallatin 
AuthorDate: 2023-01-21 19:26:25 +
Commit: Andrew Gallatin 
CommitDate: 2023-01-21 19:30:00 +

vm: centralize VM_BATCHQUEUE_SIZE definition

Remove the platform-specific definitions of VM_BATCHQUEUE_SIZE
for amd64 and powerpc64, and instead treat all 64-bit platforms
identically.  This has the effect of increasing the arm64
and riscv VM_BATCHQUEUE_SIZE to match that of other platforms.
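
A minimal sketch of selecting the size by word width rather than by architecture is shown below. BATCHQUEUE_SIZE and batchqueue_size() are illustrative names, not kernel identifiers; __SIZEOF_LONG__ is a macro predefined by GCC and Clang, so the sketch assumes one of those compilers.

```c
#include <assert.h>

/* One definition for every platform, keyed on the width of long. */
#if __SIZEOF_LONG__ == 8
#define BATCHQUEUE_SIZE 63
#else
#define BATCHQUEUE_SIZE 15
#endif

/* The predefined macro should agree with what the compiler reports. */
static_assert(__SIZEOF_LONG__ == sizeof(long),
    "__SIZEOF_LONG__ must match sizeof(long)");

static int
batchqueue_size(void)
{
	return (BATCHQUEUE_SIZE);
}
```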

Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37707
---
 sys/amd64/include/vmparam.h   | 6 --
 sys/powerpc/include/vmparam.h | 8 
 sys/vm/vm_pagequeue.h | 4 +++-
 3 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/sys/amd64/include/vmparam.h b/sys/amd64/include/vmparam.h
index 205848489644..880c46bba84d 100644
--- a/sys/amd64/include/vmparam.h
+++ b/sys/amd64/include/vmparam.h
@@ -289,12 +289,6 @@
 
 #define ZERO_REGION_SIZE (2 * 1024 * 1024)   /* 2MB */
 
-/*
- * Use a fairly large batch size since we expect amd64 systems to have lots of
- * memory.
- */
-#define VM_BATCHQUEUE_SIZE  63
-
 /*
  * The pmap can create non-transparent large page mappings.
  */
diff --git a/sys/powerpc/include/vmparam.h b/sys/powerpc/include/vmparam.h
index 1b9873aede4a..0f3321379b47 100644
--- a/sys/powerpc/include/vmparam.h
+++ b/sys/powerpc/include/vmparam.h
@@ -258,14 +258,6 @@ extern int vm_level_0_order;
 #define ZERO_REGION_SIZE (64 * 1024) /* 64KB */
 #endif
 
-/*
- * Use a fairly large batch size since we expect ppc64 systems to have lots of
- * memory.
- */
-#ifdef __powerpc64__
-#define VM_BATCHQUEUE_SIZE  63
-#endif
-
 /*
  * On 32-bit OEA, the only purpose for which sf_buf is used is to implement
  * an opaque pointer required by the machine-independent parts of the kernel.
diff --git a/sys/vm/vm_pagequeue.h b/sys/vm/vm_pagequeue.h
index 268d53a391db..9624d31a75b7 100644
--- a/sys/vm/vm_pagequeue.h
+++ b/sys/vm/vm_pagequeue.h
@@ -74,7 +74,9 @@ struct vm_pagequeue {
uint64_t pq_pdpages;
 } __aligned(CACHE_LINE_SIZE);
 
-#ifndef VM_BATCHQUEUE_SIZE
+#if __SIZEOF_LONG__ == 8
+#define VM_BATCHQUEUE_SIZE  63
+#else
 #define VM_BATCHQUEUE_SIZE  15
 #endif

git: da81cc6035f8 - main - dtrace: conditionally load the systrace_linux klds when loading dtrace.

2023-01-23 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=da81cc6035f8283b6adda1ef466977e8c1c5389e

commit da81cc6035f8283b6adda1ef466977e8c1c5389e
Author: Andrew Gallatin 
AuthorDate: 2023-01-24 01:27:17 +
Commit: Andrew Gallatin 
CommitDate: 2023-01-24 01:36:24 +

dtrace: conditionally load the systrace_linux klds when loading dtrace.

When dtrace starts, it tries to detect if the dtrace klds are loaded,
and if not, it loads them by loading the dtraceall kld. This module
depends on most dtrace modules, including systrace for the native
freebsd and freebsd32 ABIs. However, it does not depend on the
systrace_linux klds, as they in turn depend on the linux ABI klds, and
we don't want to load an ABI module that the user has not explicitly
requested. This can leave a naive user in a state where they think all
syscall providers have been loaded, yet linux ABI syscalls are
"invisible" to dtrace.

To fix this, check to see if the linux ABI modules are loaded. If they
are, then load their systrace klds.

Reviewed by: markj, (emaste & jhb, earlier versions)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37986
---
 cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c 
b/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c
index 867259b5d77c..e11cdc954683 100644
--- a/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c
+++ b/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c
@@ -1115,6 +1115,15 @@ dt_vopen(int version, int flags, int *errp,
 */
if (err == ENOENT && modfind("dtraceall") < 0) {
kldload("dtraceall"); /* ignore the error */
+#if __SIZEOF_LONG__ == 8
+   if (modfind("linux64elf") >= 0)
+   kldload("systrace_linux");
+   if (modfind("linuxelf") >= 0)
+   kldload("systrace_linux32");
+#else
+   if (modfind("linuxelf") >= 0) {
+   kldload("systrace_linux");
+#endif
dtfd = open("/dev/dtrace/dtrace", O_RDWR | O_CLOEXEC);
err = errno;
}

Re: git: ecdd0b48cbf4 - main - dtrace: remove stray {

2023-01-24 Thread Andrew Gallatin


On 1/24/23 03:36, Kristof Provost wrote:

The branch main has been updated by kp:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=ecdd0b48cbf4dd5acc3fc14625de6dc25cf354ce

commit ecdd0b48cbf4dd5acc3fc14625de6dc25cf354ce
Author: Kristof Provost 
AuthorDate: 2023-01-24 07:39:37 +
Commit: Kristof Provost 
CommitDate: 2023-01-24 07:39:37 +

 dtrace: remove stray {
 


Thanks so much.  I can't believe I did that! :(

I had done several make universe's on the previous version, and thought 
I'd done one on this one as well.  Obviously I was wrong.




Drew

git: 9ba117960e17 - main - Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues

2022-01-27 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=9ba117960e1755a693f9361e4d076630dfe13dba

commit 9ba117960e1755a693f9361e4d076630dfe13dba
Author: Andrew Gallatin 
AuthorDate: 2022-01-27 15:28:15 +
Commit: Andrew Gallatin 
CommitDate: 2022-01-27 15:34:34 +

Fix a memory leak when ip_output_send() returns EAGAIN due to send tag 
issues

When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to  if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and led to leaked mbufs.
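
The ownership rule can be illustrated with a small userland sketch. All names are hypothetical: struct buf stands in for an mbuf, buf_free() for m_freem(), driver_output() for if_output(), and output_send() for ip_output_send(). The live_bufs counter exists only so that a leak is observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical packet buffer; live_bufs counts outstanding allocations. */
struct buf {
	int len;
};
static int live_bufs;

static struct buf *
buf_alloc(int len)
{
	struct buf *b = malloc(sizeof(*b));

	if (b != NULL) {
		b->len = len;
		live_bufs++;
	}
	return (b);
}

static void
buf_free(struct buf *b)
{
	assert(live_bufs > 0);
	live_bufs--;
	free(b);
}

/* Stand-in for the driver: it always consumes the buffer. */
static int
driver_output(struct buf *b)
{
	buf_free(b);
	return (0);
}

/*
 * Wrapper with the same contract as the call it wraps: once called,
 * it owns b on every path, including early error returns.
 */
static int
output_send(struct buf *b, int tag_ok)
{
	if (!tag_ok) {
		buf_free(b);	/* the fix: free before bailing out */
		return (EAGAIN);
	}
	return (driver_output(b));
}
```

Callers written against the original contract never free the buffer after the call, so any error return that skips the free is a leak by construction.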

This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When lots of NIC output drops triggered
ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due to ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.

Many thanks to jhb, rrs, hselasky and markj for their help in
debugging this.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks
---
 sys/netinet/ip_output.c   | 2 ++
 sys/netinet6/ip6_output.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index e30339f8c4aa..f203bc165e61 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -239,6 +239,7 @@ ip_output_send(struct inpcb *inp, struct ifnet *ifp, struct 
mbuf *m,
 * packet.
 */
if (mst == NULL) {
+   m_freem(m);
error = EAGAIN;
goto done;
}
@@ -263,6 +264,7 @@ ip_output_send(struct inpcb *inp, struct ifnet *ifp, struct 
mbuf *m,
KASSERT(m->m_pkthdr.rcvif == NULL,
("trying to add a send tag to a forwarded packet"));
if (mst->ifp != ifp) {
+   m_freem(m);
error = EAGAIN;
goto done;
}
diff --git a/sys/netinet6/ip6_output.c b/sys/netinet6/ip6_output.c
index 848ec6694398..406776bdb5a4 100644
--- a/sys/netinet6/ip6_output.c
+++ b/sys/netinet6/ip6_output.c
@@ -336,6 +336,7 @@ ip6_output_send(struct inpcb *inp, struct ifnet *ifp, 
struct ifnet *origifp,
 * packet.
 */
if (mst == NULL) {
+   m_freem(m);
error = EAGAIN;
goto done;
}
@@ -360,6 +361,7 @@ ip6_output_send(struct inpcb *inp, struct ifnet *ifp, 
struct ifnet *origifp,
KASSERT(m->m_pkthdr.rcvif == NULL,
("trying to add a send tag to a forwarded packet"));
if (mst->ifp != ifp) {
+   m_freem(m);
error = EAGAIN;
goto done;
}

git: 8a7404b2aeeb - main - tcp: fix leaks in tcp_chg_pacing_rate error paths

2022-01-27 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=8a7404b2aeeb6345bd82c13c432e56d8cbfba869

commit 8a7404b2aeeb6345bd82c13c432e56d8cbfba869
Author: Andrew Gallatin 
AuthorDate: 2022-01-27 15:35:03 +
Commit: Andrew Gallatin 
CommitDate: 2022-01-27 15:35:03 +

tcp: fix leaks in tcp_chg_pacing_rate error paths

tcp_chg_pacing_rate() is expected to release the hw rate limit table,
but failed to do so in several error cases, leading to ever
increasing counts of flows using the rate.
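
The accounting problem can be sketched with C11 atomics. This is a simplified illustration under stated assumptions: struct rate_set, rate_set_release() and change_rate() are hypothetical names, the destroyed flag stands in for rs_defer_destroy(), and the sketch omits the rs_mtx lock and network epoch that the real code takes before checking whether the rate set is dead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct tcp_rate_set. */
struct rate_set {
	atomic_uint_fast64_t flows_using;
	bool is_dead;		/* the owning interface has gone away */
	bool destroyed;		/* stands in for deferred destruction */
};

static void
rate_set_acquire(struct rate_set *rs)
{
	atomic_fetch_add(&rs->flows_using, 1);
}

/* Drop one reference; the last user of a dead set tears it down. */
static void
rate_set_release(struct rate_set *rs)
{
	uint_fast64_t pre = atomic_fetch_sub(&rs->flows_using, 1);

	assert(pre > 0);
	if (pre == 1 && rs->is_dead)
		rs->destroyed = true;
}

/*
 * Take a reference and try to program the hardware. Every failure
 * path must give the reference back, or the count only ever grows.
 */
static int
change_rate(struct rate_set *rs, bool hw_ok)
{
	rate_set_acquire(rs);
	if (!hw_ok) {
		rate_set_release(rs);
		return (-1);
	}
	return (0);
}
```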

This patch was mostly done by rrs

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34058
Reviewed by: hselasky, rrs,  jhb (initial version, outside of Differential)
---
 sys/netinet/tcp_ratelimit.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/sys/netinet/tcp_ratelimit.c b/sys/netinet/tcp_ratelimit.c
index 96a38b6afd54..2f36cea4faed 100644
--- a/sys/netinet/tcp_ratelimit.c
+++ b/sys/netinet/tcp_ratelimit.c
@@ -1411,6 +1411,7 @@ tcp_chg_pacing_rate(const struct tcp_hwrate_limit_table 
*crte,
 * tags if it didn't allocate one when an
 * existing rate was present, so ignore.
 */
+   tcp_rel_pacing_rate(crte, tp);
if (error)
*error = EOPNOTSUPP;
return (NULL);
@@ -1419,6 +1420,7 @@ tcp_chg_pacing_rate(const struct tcp_hwrate_limit_table 
*crte,
 #endif
if (tp->t_inpcb->inp_snd_tag == NULL) {
/* Wrong interface */
+   tcp_rel_pacing_rate(crte, tp);
if (error)
*error = EINVAL;
return (NULL);
@@ -1457,10 +1459,29 @@ tcp_chg_pacing_rate(const struct tcp_hwrate_limit_table 
*crte,
 #endif
err = in_pcbmodify_txrtlmt(tp->t_inpcb, nrte->rate);
if (err) {
+   struct tcp_rate_set *lrs;
+   uint64_t pre;
+
rl_decrement_using(nrte);
+   lrs = __DECONST(struct tcp_rate_set *, rs);
+   pre = atomic_fetchadd_64(&lrs->rs_flows_using, -1);
/* Do we still have a snd-tag attached? */
if (tp->t_inpcb->inp_snd_tag)
in_pcbdetach_txrtlmt(tp->t_inpcb);
+
+   if (pre == 1) {
+   struct epoch_tracker et;
+
+   NET_EPOCH_ENTER(et);
+   mtx_lock(&rs_mtx);
+   /*
+* Is it dead?
+*/
+   if (lrs->rs_flags & RS_IS_DEAD)
+   rs_defer_destroy(lrs);
+   mtx_unlock(&rs_mtx);
+   NET_EPOCH_EXIT(et);
+   }
if (error)
*error = err;
return (NULL);

Re: git: b1f7154cb125 - main - gitignore: ignore vim swap files & .rej/.orig

2022-02-10 Thread Andrew Gallatin


On 1/17/22 04:35, Alexander V. Chernikov wrote:

The branch main has been updated by melifaro:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=b1f7154cb12517162a51d19ae19ec3f2dee88e11

commit b1f7154cb12517162a51d19ae19ec3f2dee88e11
Author: Alexander V. Chernikov 
AuthorDate: 2022-01-08 16:14:47 +
Commit: Alexander V. Chernikov 
CommitDate: 2022-01-17 09:35:15 +

 gitignore: ignore vim swap files & .rej/.orig
 
 Reviewed by:cem, avg

 MFC after:  2 weeks



Hi,

I was wondering if you might consider reverting this change?
Alternatively, can you teach me how to override this file
locally without carrying a diff?

I'm asking because this makes life painful for my workflow.

Having git clean be able to handle .orig and .rej is incredibly
handy when applying large patch sets.  It makes finding a rejected
patch as simple as 'git clean -n | grep rej'.

Also, when directories are removed *AND* they have an errant
.orig or .rej file remaining in them, git rm will not garbage
collect the directory.  This causes the build to fail.  I used
to use the trick of 'git clean -nd' to find such directories, but
now they are hidden.   This scenario just cost me hours of parsing
make output, trying to figure out why my build had failed.

Thanks,

Drew

git: 28d0a740dd9a - main - ktls: auto-disable ifnet (inline hw) kTLS

2021-07-06 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=28d0a740dd9a67e4a4fa9fda5bb39b5963316f35

commit 28d0a740dd9a67e4a4fa9fda5bb39b5963316f35
Author: Andrew Gallatin 
AuthorDate: 2021-07-06 14:17:33 +
Commit: Andrew Gallatin 
CommitDate: 2021-07-06 14:28:32 +

ktls: auto-disable ifnet (inline hw) kTLS

Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.

This breaks down when transmits are out of order (eg,
TCP retransmits).  In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted.  This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmit rate.

This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct).
TCP connections whose retransmit rate exceeds this percentage are
switched to SW kTLS so as to conserve PCIe bandwidth.
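
A threshold check of this kind might look like the sketch below. It is an assumption-laden illustration, not the kernel logic: should_disable_hw_tls() and its byte counters are hypothetical, and the only value taken from this commit is the default of 2 percent for ktls_ifnet_max_rexmit_pct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a connection has retransmitted more than max_pct
 * percent of the bytes it has sent.
 */
static bool
should_disable_hw_tls(uint64_t bytes_sent, uint64_t bytes_rexmit,
    unsigned int max_pct)
{
	assert(bytes_rexmit <= bytes_sent);
	if (bytes_sent == 0)
		return (false);
	/*
	 * Cross-multiply instead of dividing so that small counts do
	 * not round down to zero percent. Overflow of the products is
	 * ignored in this sketch.
	 */
	return (bytes_rexmit * 100 > bytes_sent * max_pct);
}
```

With the 2 percent default, a connection that has sent 1,000,000 bytes stays on the NIC at 20,000 retransmitted bytes and is moved to software kTLS at 20,001.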

Reviewed by: hselasky, markj, rrs
Sponsored by: Netflix
Differential Revision:  https://reviews.freebsd.org/D30908
---
 sys/kern/uipc_ktls.c  | 107 ++
 sys/netinet/tcp_var.h |  13 +-
 sys/sys/ktls.h|  15 ++-
 3 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 7e87e7c740e3..88e29157289d 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -30,6 +30,7 @@ __FBSDID("$FreeBSD$");
 
 #include "opt_inet.h"
 #include "opt_inet6.h"
+#include "opt_kern_tls.h"
 #include "opt_ratelimit.h"
 #include "opt_rss.h"
 
@@ -121,6 +122,11 @@ SYSCTL_INT(_kern_ipc_tls_stats, OID_AUTO, threads, 
CTLFLAG_RD,
 &ktls_number_threads, 0,
 "Number of TLS threads in thread-pool");
 
+unsigned int ktls_ifnet_max_rexmit_pct = 2;
+SYSCTL_UINT(_kern_ipc_tls, OID_AUTO, ifnet_max_rexmit_pct, CTLFLAG_RWTUN,
+&ktls_ifnet_max_rexmit_pct, 2,
+"Max percent bytes retransmitted before ifnet TLS is disabled");
+
 static bool ktls_offload_enable;
 SYSCTL_BOOL(_kern_ipc_tls, OID_AUTO, enable, CTLFLAG_RWTUN,
 &ktls_offload_enable, 0,
@@ -184,6 +190,14 @@ static COUNTER_U64_DEFINE_EARLY(ktls_switch_failed);
 SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, switch_failed, CTLFLAG_RD,
 &ktls_switch_failed, "TLS sessions unable to switch between SW and ifnet");
 
+static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_fail);
+SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_failed, 
CTLFLAG_RD,
+&ktls_ifnet_disable_fail, "TLS sessions unable to switch to SW from 
ifnet");
+
+static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_ok);
+SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_ok, CTLFLAG_RD,
+&ktls_ifnet_disable_ok, "TLS sessions able to switch to SW from ifnet");
+
 SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, sw, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
 "Software TLS session stats");
 SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, ifnet, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
@@ -2187,3 +2201,96 @@ ktls_work_thread(void *ctx)
}
}
 }
+
+static void
+ktls_disable_ifnet_help(void *context, int pending __unused)
+{
+   struct ktls_session *tls;
+   struct inpcb *inp;
+   struct tcpcb *tp;
+   struct socket *so;
+   int err;
+
+   tls = context;
+   inp = tls->inp;
+   if (inp == NULL)
+   return;
+   INP_WLOCK(inp);
+   so = inp->inp_socket;
+   MPASS(so != NULL);
+   if ((inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) ||
+   (inp->inp_flags2 & INP_FREED)) {
+   goto out;
+   }
+
+   if (so->so_snd.sb_tls_info != NULL)
+   err = ktls_set_tx_mode(so, TCP_TLS_MODE_SW);
+   else
+   err = ENXIO;
+   if (err == 0) {
+   counter_u64_add(ktls_ifnet_disable_ok, 1);
+   /* ktls_set_tx_mode() drops inp wlock, so recheck flags */
+   if ((inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) == 0 &&
+   (inp->inp_flags2 & INP_FREED) == 0 &&
+   (tp = intotcpcb(inp)) != NULL &&
+   tp->t_fb->tfb_hwtls_change != NULL)
+   (*tp->t_fb->tfb_hwtls_change)(tp, 0);
+   } else {
+   counter_u64_add(ktls_ifnet_disable_fail, 1);
+   }
+
+out:
+   SOCK_LOCK(so);
+   sorele(so);
+   if (!in_pcbrele_wlocked(inp))
+

git: 4150a5a87ed6 - main - ktls: fix NOINET build

2021-07-07 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=4150a5a87ed6757cb6fd0118b4058eef77f735f7

commit 4150a5a87ed6757cb6fd0118b4058eef77f735f7
Author: Andrew Gallatin 
AuthorDate: 2021-07-07 14:38:57 +
Commit: Andrew Gallatin 
CommitDate: 2021-07-07 14:40:02 +

ktls: fix NOINET build

Reported by: mjguzik
Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 88e29157289d..5f7dde325740 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -2202,6 +2202,7 @@ ktls_work_thread(void *ctx)
}
 }
 
+#if defined(INET) || defined(INET6)
 static void
 ktls_disable_ifnet_help(void *context, int pending __unused)
 {
@@ -2294,3 +2295,4 @@ ktls_disable_ifnet(void *arg)
TASK_INIT(&tls->disable_ifnet_task, 0, ktls_disable_ifnet_help, tls);
(void)taskqueue_enqueue(taskqueue_thread, &tls->disable_ifnet_task);
 }
+#endif
___
dev-commits-src-main@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/dev-commits-src-main
To unsubscribe, send any mail to "dev-commits-src-main-unsubscr...@freebsd.org"

Re: git: 28d0a740dd9a - main - ktls: auto-disable ifnet (inline hw) kTLS

2021-07-07 Thread Andrew Gallatin


On 7/7/21 7:00 AM, Mateusz Guzik wrote:

This breaks NOIP kernel builds.


Thanks for pointing this out; it should be fixed in 4150a5a87ed.


On 7/6/21, Andrew Gallatin  wrote:

The branch main has been updated by gallatin:

URL:
https://cgit.FreeBSD.org/src/commit/?id=28d0a740dd9a67e4a4fa9fda5bb39b5963316f35

commit 28d0a740dd9a67e4a4fa9fda5bb39b5963316f35
Author: Andrew Gallatin 
AuthorDate: 2021-07-06 14:17:33 +
Commit: Andrew Gallatin 
CommitDate: 2021-07-06 14:28:32 +

 ktls: auto-disable ifnet (inline hw) kTLS

 Ifnet (inline) hw kTLS NICs typically keep state within
 a TLS record, so that when transmitting in-order,
 they can continue encryption on each segment sent without
 DMA'ing extra state from the host.

 This breaks down when transmits are out of order (eg,
 TCP retransmits).  In this case, the NIC must re-DMA
 the entire TLS record up to and including the segment
 being retransmitted.  This means that when re-transmitting
 the last 1448 byte segment of a TLS record, the NIC will
 have to re-DMA the entire 16KB TLS record. This can lead
 to the NIC running out of PCIe bus bandwidth well before
 it saturates the network link if a lot of TCP connections have
 a high retransmit rate.

 This change introduces a new sysctl
 (kern.ipc.tls.ifnet_max_rexmit_pct); TCP connections whose
 retransmit rate exceeds this percentage will be
 switched to SW kTLS so as to conserve PCIe bandwidth.
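
As a rough illustration of the threshold check described above, here is a user-space sketch that computes the percentage of bytes retransmitted and compares it with the limit. The arithmetic has the same shape as the expression in tcp_account_for_send() that appears in a later diff in this thread; the struct and function names (conn_stats, rexmit_percent, should_disable_ifnet_tls) are hypothetical stand-ins and not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection byte counters (stand-ins for tcpcb fields). */
struct conn_stats {
	uint64_t snd_bytes;     /* bytes sent for the first time */
	uint64_t snd_rxt_bytes; /* bytes retransmitted */
};

/* Integer percentage of all bytes sent that were retransmissions. */
static uint64_t
rexmit_percent(const struct conn_stats *cs)
{
	uint64_t total = cs->snd_rxt_bytes + cs->snd_bytes;

	assert(total != 0); /* this sketch requires that something was sent */
	/* Same shape as the kernel expression: (1000 * rxt) / (10 * total). */
	return ((1000ULL * cs->snd_rxt_bytes) / (10ULL * total));
}

/* True when the connection exceeds the limit and should fall back to SW kTLS. */
static bool
should_disable_ifnet_tls(const struct conn_stats *cs, unsigned int max_pct)
{
	return (rexmit_percent(cs) > max_pct);
}
```

With the default limit of 2, a connection that has retransmitted 30 of 1000 bytes (3%) would be switched, while one at exactly 2% would not, because the comparison is strictly greater-than.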

 Reviewed by:hselasky, markj, rrs
 Sponsored by:   Netflix
 Differential Revision:  https://reviews.freebsd.org/D30908
---
  sys/kern/uipc_ktls.c  | 107 ++
  sys/netinet/tcp_var.h |  13 +-
  sys/sys/ktls.h|  15 ++-
  3 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 7e87e7c740e3..88e29157289d 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -30,6 +30,7 @@ __FBSDID("$FreeBSD$");

  #include "opt_inet.h"
  #include "opt_inet6.h"
+#include "opt_kern_tls.h"
  #include "opt_ratelimit.h"
  #include "opt_rss.h"

@@ -121,6 +122,11 @@ SYSCTL_INT(_kern_ipc_tls_stats, OID_AUTO, threads, CTLFLAG_RD,
  &ktls_number_threads, 0,
  "Number of TLS threads in thread-pool");

+unsigned int ktls_ifnet_max_rexmit_pct = 2;
+SYSCTL_UINT(_kern_ipc_tls, OID_AUTO, ifnet_max_rexmit_pct, CTLFLAG_RWTUN,
+&ktls_ifnet_max_rexmit_pct, 2,
+"Max percent bytes retransmitted before ifnet TLS is disabled");
+
  static bool ktls_offload_enable;
  SYSCTL_BOOL(_kern_ipc_tls, OID_AUTO, enable, CTLFLAG_RWTUN,
  &ktls_offload_enable, 0,
@@ -184,6 +190,14 @@ static COUNTER_U64_DEFINE_EARLY(ktls_switch_failed);
  SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, switch_failed, CTLFLAG_RD,
  &ktls_switch_failed, "TLS sessions unable to switch between SW and ifnet");

+static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_fail);
+SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_failed, CTLFLAG_RD,
+&ktls_ifnet_disable_fail, "TLS sessions unable to switch to SW from ifnet");
+
+static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_ok);
+SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_ok, CTLFLAG_RD,
+&ktls_ifnet_disable_ok, "TLS sessions able to switch to SW from ifnet");
+
  SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, sw, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
  "Software TLS session stats");
  SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, ifnet, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
@@ -2187,3 +2201,96 @@ ktls_work_thread(void *ctx)
}
}
  }
+
+static void
+ktls_disable_ifnet_help(void *context, int pending __unused)
+{
+   struct ktls_session *tls;
+   struct inpcb *inp;
+   struct tcpcb *tp;
+   struct socket *so;
+   int err;
+
+   tls = context;
+   inp = tls->inp;
+   if (inp == NULL)
+   return;
+   INP_WLOCK(inp);
+   so = inp->inp_socket;
+   MPASS(so != NULL);
+   if ((inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) ||
+   (inp->inp_flags2 & INP_FREED)) {
+   goto out;
+   }
+
+   if (so->so_snd.sb_tls_info != NULL)
+   err = ktls_set_tx_mode(so, TCP_TLS_MODE_SW);
+   else
+   err = ENXIO;
+   if (err == 0) {
+   counter_u64_add(ktls_ifnet_disable_ok, 1);
+   /* ktls_set_tx_mode() drops inp wlock, so recheck flags */
+   if ((inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) == 0 &&
+   (inp->inp_flags2

git: b1e806c0ed96 - main - tcp: fix alternate stack build with LINT-NO{INET, INET6, IP}

2021-07-07 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=b1e806c0ed960e1eb9ee889c7d0df3c168290c4f

commit b1e806c0ed960e1eb9ee889c7d0df3c168290c4f
Author: Andrew Gallatin 
AuthorDate: 2021-07-07 17:02:08 +
Commit: Andrew Gallatin 
CommitDate: 2021-07-07 17:02:08 +

tcp: fix alternate stack build with LINT-NO{INET,INET6,IP}

When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.
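
A minimal sketch of the preprocessor pattern used in this kind of fix, assuming both options are enabled for the example. The `else` before `#endif` must always be followed by a braced block, so that the code still parses when INET6 is compiled out and the block stays valid (possibly empty) when INET is compiled out. The function name and the header-length macros are hypothetical; the real code uses sizeof(struct ip6_hdr) and sizeof(struct ip).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed build options for this sketch; a NOINET6 kernel would omit INET6. */
#define INET
#define INET6

/* Hypothetical stand-ins for sizeof(struct ip6_hdr) and sizeof(struct ip). */
#define V6_HDR_LEN 40
#define V4_HDR_LEN 20

static size_t
ip_header_overhead(int is_v6)
{
	size_t seg_oh = 0;

	assert(is_v6 == 0 || is_v6 == 1);
	(void)is_v6; /* otherwise unused when INET6 is not defined */
#ifdef INET6
	if (is_v6) {
		seg_oh += V6_HDR_LEN;
	} else
#endif
	{
		/* Braces keep this a valid statement with any option mix. */
#ifdef INET
		seg_oh += V4_HDR_LEN;
#endif
	}
	return (seg_oh);
}
```

Removing either `#define` at the top still leaves a function that compiles, which is the property the LINT-NO{INET,INET6,IP} configurations exercise.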

Reviewed by:rrs, tuexen
Differential Revision:  https://reviews.freebsd.org/D31094
Sponsored by: Netflix
---
 sys/netinet/tcp_stacks/bbr.c |  9 ---
 sys/netinet/tcp_stacks/rack.c| 45 
 sys/netinet/tcp_stacks/rack_bbr_common.c |  6 -
 3 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/sys/netinet/tcp_stacks/bbr.c b/sys/netinet/tcp_stacks/bbr.c
index 8969e4e47ba1..c96fec07b6c9 100644
--- a/sys/netinet/tcp_stacks/bbr.c
+++ b/sys/netinet/tcp_stacks/bbr.c
@@ -3515,13 +3515,16 @@ bbr_get_header_oh(struct tcp_bbr *bbr)
if (bbr->r_ctl.rc_inc_ip_oh) {
/* Do we include IP overhead? */
 #ifdef INET6
-   if (bbr->r_is_v6)
+   if (bbr->r_is_v6) {
seg_oh += sizeof(struct ip6_hdr);
-   else
+   } else
 #endif
+   {
+
 #ifdef INET
seg_oh += sizeof(struct ip);
 #endif
+   }
}
if (bbr->r_ctl.rc_inc_enet_oh) {
/* Do we include the ethernet overhead?  */
@@ -11956,7 +11959,7 @@ bbr_output_wtime(struct tcpcb *tp, const struct timeval *tv)
uint32_t tot_len = 0;
uint32_t rtr_cnt = 0;
uint32_t maxseg, pace_max_segs, p_maxseg;
-   int32_t csum_flags;
+   int32_t csum_flags = 0;
int32_t hw_tls;
 #if defined(IPSEC) || defined(IPSEC_SUPPORT)
unsigned ipsec_optlen = 0;
diff --git a/sys/netinet/tcp_stacks/rack.c b/sys/netinet/tcp_stacks/rack.c
index 75287282cf3e..f417f8a4ee7f 100644
--- a/sys/netinet/tcp_stacks/rack.c
+++ b/sys/netinet/tcp_stacks/rack.c
@@ -12043,7 +12043,9 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack)
 #ifdef INET
struct ip *ip = NULL;
 #endif
+#if defined(INET) || defined(INET6)
struct udphdr *udp = NULL;
+#endif
 
/* Ok lets fill in the fast block, it can only be used with no IP options! */
 #ifdef INET6
@@ -12067,6 +12069,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack)
  ip6, rack->r_ctl.fsb.th);
} else
 #endif /* INET6 */
+#ifdef INET
{
rack->r_ctl.fsb.tcp_ip_hdr_len = sizeof(struct tcpiphdr);
ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr;
@@ -12086,6 +12089,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack)
  tp->t_port,
  ip, rack->r_ctl.fsb.th);
}
+#endif
rack->r_fsb_inited = 1;
 }
 
@@ -15226,7 +15230,7 @@ rack_fast_rsm_output(struct tcpcb *tp, struct tcp_rack *rack, struct rack_sendma
struct tcpopt to;
u_char opt[TCP_MAXOLEN];
uint32_t hdrlen, optlen;
-   int32_t slot, segsiz, max_val, tso = 0, error, flags, ulen = 0;
+   int32_t slot, segsiz, max_val, tso = 0, error = 0, flags, ulen = 0;
uint32_t us_cts;
uint32_t if_hw_tsomaxsegcount = 0, startseq;
uint32_t if_hw_tsomaxsegsize;
@@ -15706,7 +15710,7 @@ rack_fast_output(struct tcpcb *tp, struct tcp_rack *rack, uint64_t ts_val,
u_char opt[TCP_MAXOLEN];
uint32_t hdrlen, optlen;
int cnt_thru = 1;
-   int32_t slot, segsiz, len, max_val, tso = 0, sb_offset, error, flags, ulen = 0;
+   int32_t slot, segsiz, len, max_val, tso = 0, sb_offset, error = 0, flags, ulen = 0;
uint32_t us_cts, s_soff;
uint32_t if_hw_tsomaxsegcount = 0, startseq;
uint32_t if_hw_tsomaxsegsize;
@@ -16119,9 +16123,9 @@ rack_output(struct tcpcb *tp)
long tot_len_this_send = 0;
 #ifdef INET
struct ip *ip = NULL;
-#endif
 #ifdef TCPDEBUG
struct ipovly *ipov = NULL;
+#endif
 #endif
struct udphdr *udp = NULL;
struct tcp_rack *rack;
@@ -16130,7 +16134,10 @@ rack_output(struct tcpcb *tp)
uint8_t mark = 0;
uint8_t wanted_cookie = 0;
u_char opt[TCP_MAXOLEN];
-   unsigned ipoptlen, optlen, hdrlen, ulen=0;
+   unsigned ipoptlen, optlen, hdrlen;
+#if defined(INET) || defined(INET6)
+   unsigned ulen=0;
+#endif
uint32_t rack_seq;
 
 #if defined(IPSEC) || defined(IPSEC_SUPPORT)
@@ -17830,21 +17837,29 @@ send:
 #endif
if ((ipoptlen == 0) && (rack->r_ctl.fsb.tcp_ip_hdr) &&  rack->r_fsb_inited) {
 #ifdef INET6
-   if (isipv6)
+

git: 0756bdf19c5c - main - ktls: make ktls_disable_ifnet() shim static

2021-07-07 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=0756bdf19c5c97fabf4090e844f8df9505fbd566

commit 0756bdf19c5c97fabf4090e844f8df9505fbd566
Author: Andrew Gallatin 
AuthorDate: 2021-07-07 19:05:49 +
Commit: Andrew Gallatin 
CommitDate: 2021-07-07 19:08:13 +

ktls: make ktls_disable_ifnet() shim static

A user reported that when compiling without KERN_TLS, and
with -O0, the kernel failed to link due to ktls_disable_ifnet()
being undefined.   Making the shim static works around this issue.
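
The link failure is consistent with C99 inline semantics: a plain `inline` definition in a header does not provide an external definition, so when the compiler emits a real call instead of inlining (as is typical at -O0), the linker has no symbol to resolve. The following sketch shows the `static inline` shim pattern under that assumption. HAVE_FEATURE, feature_disable and account_for_send are hypothetical names standing in for KERN_TLS, ktls_disable_ifnet() and its caller.

```c
#include <assert.h>
#include <stddef.h>

#define HAVE_FEATURE 0 /* hypothetical switch standing in for KERN_TLS */

#if HAVE_FEATURE
void feature_disable(void *arg); /* real implementation lives elsewhere */
#else
/*
 * No-op shim. "static inline" gives every translation unit its own
 * private copy, so the linker never needs an external definition,
 * even when the compiler declines to inline the call.
 */
static inline void
feature_disable(void *arg)
{
	(void)arg;
}
#endif

/* Hypothetical caller: builds and links whether or not the feature is on. */
static int
account_for_send(int is_rxt, void *conn)
{
	assert(conn != NULL);
	if (is_rxt)
		feature_disable(conn);
	return (is_rxt);
}
```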

Reported by: Gary Jennejohn
Sponsored by: Netflix
---
 sys/sys/ktls.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/sys/ktls.h b/sys/sys/ktls.h
index 7fd8831878b4..a4156eb10395 100644
--- a/sys/sys/ktls.h
+++ b/sys/sys/ktls.h
@@ -238,7 +238,7 @@ extern unsigned int ktls_ifnet_max_rexmit_pct;
 void ktls_disable_ifnet(void *arg);
 #else
 #define ktls_ifnet_max_rexmit_pct 1
-inline void
+static inline void
 ktls_disable_ifnet(void *arg __unused)
 {
 }

git: 98215005b747 - main - ktls: start a thread to keep the 16k ktls buffer zone populated

2021-08-05 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=98215005b747fef67f44794ca64abd473b98bade

commit 98215005b747fef67f44794ca64abd473b98bade
Author: Andrew Gallatin 
AuthorDate: 2021-08-05 14:15:09 +
Commit: Andrew Gallatin 
CommitDate: 2021-08-05 14:19:12 +

ktls: start a thread to keep the 16k ktls buffer zone populated

Ktls recently received an optimization where we allocate 16k
physically contiguous crypto destination buffers. This provides a
large (more than 5%) reduction in CPU use in our
workload. However, after several days of uptime, the performance
benefit disappears because we have frequent allocation failures
from the ktls buffer zone.

It turns out that when load drops off, the ktls buffer zone is
trimmed, and some 16k buffers are freed back to the OS. When load
picks back up again, re-allocating those 16k buffers fails after
some number of days of uptime because physical memory has become
fragmented. This causes allocations to fail, because they are
intentionally done with M_NORECLAIM, so as to avoid pausing
the ktls crypto work thread while the VM system defragments
memory.

To work around this, this change starts one thread per VM domain
to allocate ktls buffers without M_NORECLAIM, as we don't care if
this thread is paused while memory is defragged. The thread then
frees the buffers back into the ktls buffer zone, thus allowing
future allocations to succeed.

Note that waking up the thread is intentionally racy, but neither
of the races really matters. In the worst case, we could have
either spurious wakeups or we could have to wait 1 second until
the next rate-limited allocation failure to wake up the thread.
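
The interaction between rate-limited allocation attempts and the racy wakeup can be modelled in user space roughly as follows. This is only a sketch inferred from the description above, not the kernel code: the struct names, the have_failed field, the HZ value and the wakeups_requested counter (which stands in for an actual wakeup call) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define HZ 1000u /* assumed ticks per second for this sketch */

/* Hypothetical per-worker state. */
struct worker {
	unsigned int last_alloc_fail; /* tick of the most recent failure */
	bool have_failed;             /* false until the first failure */
};

/* Hypothetical per-domain refill thread state. */
struct refill_thread {
	atomic_int running;             /* set by the refill thread itself */
	unsigned int wakeups_requested; /* stands in for a wakeup call */
};

/* Should the worker try the fast, non-blocking allocation right now? */
static bool
may_attempt_alloc(const struct worker *w, unsigned int now)
{
	/* Unsigned subtraction stays correct across tick wraparound. */
	return (!w->have_failed || (now - w->last_alloc_fail) >= HZ);
}

/*
 * Record a failed allocation and poke the refill thread if it looks idle.
 * The check of "running" is deliberately not protected by a lock: a stale
 * 0 causes a spurious wakeup, and a stale 1 delays the wakeup until the
 * next failure, which the rate limit places about one second later.
 */
static void
note_alloc_failure(struct worker *w, struct refill_thread *rt,
    unsigned int now)
{
	assert(w != NULL && rt != NULL);
	w->last_alloc_fail = now;
	w->have_failed = true;
	if (atomic_load(&rt->running) == 0)
		rt->wakeups_requested++;
}
```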

This patch has been in use at Netflix on a handful of servers,
and seems to fix the issue.

Differential Revision: https://reviews.freebsd.org/D31260
Reviewed by: jhb, markj (jtl, rrs, and dhw reviewed an earlier version)
Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 121 ++-
 1 file changed, 120 insertions(+), 1 deletion(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 5f7dde325740..17b87195fc50 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -78,6 +78,7 @@ __FBSDID("$FreeBSD$");
 #include 
 #include 
 #include 
+#include 
 
 struct ktls_wq {
struct mtx  mtx;
@@ -87,9 +88,17 @@ struct ktls_wq {
int lastallocfail;
 } __aligned(CACHE_LINE_SIZE);
 
+struct ktls_alloc_thread {
+   uint64_t wakeups;
+   uint64_t allocs;
+   struct thread *td;
+   int running;
+};
+
 struct ktls_domain_info {
int count;
int cpu[MAXCPU];
+   struct ktls_alloc_thread alloc_td;
 };
 
 struct ktls_domain_info ktls_domains[MAXMEMDOM];
@@ -142,6 +151,11 @@ SYSCTL_BOOL(_kern_ipc_tls, OID_AUTO, sw_buffer_cache, CTLFLAG_RDTUN,
 &ktls_sw_buffer_cache, 1,
 "Enable caching of output buffers for SW encryption");
 
+static int ktls_max_alloc = 128;
+SYSCTL_INT(_kern_ipc_tls, OID_AUTO, max_alloc, CTLFLAG_RWTUN,
+&ktls_max_alloc, 128,
+"Max number of 16k buffers to allocate in thread context");
+
 static COUNTER_U64_DEFINE_EARLY(ktls_tasks_active);
 SYSCTL_COUNTER_U64(_kern_ipc_tls, OID_AUTO, tasks_active, CTLFLAG_RD,
 &ktls_tasks_active, "Number of active tasks");
@@ -278,6 +292,7 @@ static void ktls_cleanup(struct ktls_session *tls);
 static void ktls_reset_send_tag(void *context, int pending);
 #endif
 static void ktls_work_thread(void *ctx);
+static void ktls_alloc_thread(void *ctx);
 
 #if defined(INET) || defined(INET6)
 static u_int
@@ -418,6 +433,32 @@ ktls_init(void *dummy __unused)
ktls_number_threads++;
}
 
+   /*
+* Start an allocation thread per-domain to perform blocking allocations
+* of 16k physically contiguous TLS crypto destination buffers.
+*/
+   if (ktls_sw_buffer_cache) {
+   for (domain = 0; domain < vm_ndomains; domain++) {
+   if (VM_DOMAIN_EMPTY(domain))
+   continue;
+   if (CPU_EMPTY(&cpuset_domain[domain]))
+   continue;
+   error = kproc_kthread_add(ktls_alloc_thread,
+   &ktls_domains[domain], &ktls_proc,
+   &ktls_domains[domain].alloc_td.td,
+   0, 0, "KTLS", "alloc_%d", domain);
+   if (error)
+   panic("Can't add KTLS alloc thread %d error %d",
+   domain, error);
+   CPU_COPY(&cpuset_domain[domain], &mask);
+   error = 
cpuset_se

Re: git: 98215005b747 - main - ktls: start a thread to keep the 16k ktls buffer zone populated

2021-08-05 Thread Andrew Gallatin


On 8/5/21 11:59 AM, Ed Maste wrote:

On Thu, 5 Aug 2021 at 10:22, Andrew Gallatin  wrote:


The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=98215005b747fef67f44794ca64abd473b98bade

commit 98215005b747fef67f44794ca64abd473b98bade
Author: Andrew Gallatin 
AuthorDate: 2021-08-05 14:15:09 +
Commit: Andrew Gallatin 
CommitDate: 2021-08-05 14:19:12 +

 ktls: start a thread to keep the 16k ktls buffer zone populated


My Cirrus-CI boot smoke test is now failing with:

Starting KTLS alloc thread for domain 0
panic: sleeping without a lock
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfeb20ae0
vpanic() at vpanic+0x187/frame 0xfeb20b40
panic() at panic+0x43/frame 0xfeb20ba0
_sleep() at _sleep+0x484/frame 0xfeb20c40
ktls_alloc_thread() at ktls_alloc_thread+0x1c4/frame 0xfeb20cf0
fork_exit() at fork_exit+0x80/frame 0xfeb20d30
fork_trampoline() at fork_trampoline+0xe/frame 0xfeb20d30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 2 tid 100027 ]
Stopped at kdb_enter+0x37: movq $0,0x127877e(%rip)
db> qemu-system-x86_64: terminating on signal 15 from pid 32579 (timeout)
Did not boot successfully, see /tmp/ci-qemu-test-boot.log



I'd thought that I'd tested this with INVARIANTS, but I guess I was 
wrong.  The assert is failing because I'm sleeping forever (sbt == 0).

I don't understand the point of the assert, but I've
reproduced the panic and am testing a workaround.

Drew

git: 2694c869ff9f - main - ktls: fix a panic with INVARIANTS

2021-08-05 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=2694c869ff9ff60fd8e3d4d7936b7dc61763c18a

commit 2694c869ff9ff60fd8e3d4d7936b7dc61763c18a
Author: Andrew Gallatin 
AuthorDate: 2021-08-05 17:05:00 +
Commit: Andrew Gallatin 
CommitDate: 2021-08-05 17:09:06 +

ktls: fix a panic with INVARIANTS

98215005b747fef67f44794ca64abd473b98bade introduced a new
thread that uses tsleep(..0) to sleep forever.  This hit
an assert due to sleeping with a 0 timeout.

So spell "forever" using SBT_MAX instead, which does not
trigger the assert.

Pointy hat to: gallatin
Pointed out by: emaste
Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 17b87195fc50..47815c27 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -2240,7 +2240,7 @@ ktls_alloc_thread(void *ctx)
nbufs = 0;
for (;;) {
atomic_store_int(&sc->running, 0);
-   tsleep(sc, PZERO, "waiting for work", 0);
+   tsleep_sbt(sc, PZERO, "waiting for work", SBT_MAX, SBT_1S, 0);
atomic_store_int(&sc->running, 1);
sc->wakeups++;
if (nbufs != ktls_max_alloc) {

Re: git: 2694c869ff9f - main - ktls: fix a panic with INVARIANTS

2021-08-05 Thread Andrew Gallatin


On 8/5/21 1:41 PM, Ian Lepore wrote:

  if (nbufs != ktls_max_alloc) {


Finding a different way to spell "forever" is not a valid way to fix a
problem where you're being warned that it is not safe to sleep forever.

The assert was warning you that the code was vulnerable to hanging
forever due to a missed wakeup.  The code is still vulnerable to that,
but now the problem is hidden and will be very difficult to find (more
so because the wait message also violates the convention of using a
short name that can be displayed by tools such as ps(1) and SIGINFO,
where the wait-channel display is currently likely to show as
"waitin").

I haven't looked at the code outside of the few lines shown in the
commit diff, but based on the names involved, I suspect the right fix
is to protect sc->running with a mutex and use msleep() instead of
trying to roll-your-own with atomics.  That would certainly be true if
the wakeup code is some form of "if (!sc->running) wakeup(sc);"

-- Ian



The code is a case where a missing wakeup does not matter.

The thread is woken up by allocation failures, which
are themselves rate-limited to one attempt per second
(since failures are expensive, and there is a less expensive
fallback).  So the worst thing that can happen is that we wait
at most an extra second.

Adding a mutex adds nothing except unneeded complexity.

Drew




git: 1b97a054f3ac - main - tsleep: Add a PNOLOCK flag

2021-08-05 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=1b97a054f3acaf13a5c8361b7b80e10ad16257b9

commit 1b97a054f3acaf13a5c8361b7b80e10ad16257b9
Author: Andrew Gallatin 
AuthorDate: 2021-08-05 21:16:30 +
Commit: Andrew Gallatin 
CommitDate: 2021-08-05 21:16:30 +

tsleep: Add a PNOLOCK flag

Add a PNOLOCK flag so that, in the rare circumstance where
wakeup races are externally mitigated, tsleep() can be
called with a sleep time of 0 without triggering
an assertion.
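
The relaxed condition can be modelled in user space as a simple predicate. This is a sketch of the assertion logic only, assuming the PNOLOCK value shown in the sys/sys/param.h hunk of this commit; sleep_args_ok and its boolean parameters are hypothetical and replace the kernel's mtx_owned(&Giant) and lock != NULL tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag value as it appears in the sys/sys/param.h hunk of this commit. */
#define PNOLOCK 0x400

/*
 * User-space model of the KASSERT condition in _sleep(): a sleep with no
 * timeout (sbt == 0) is accepted only if something guards against a lost
 * wakeup, or if the caller opts out explicitly with PNOLOCK.
 */
static bool
sleep_args_ok(int64_t sbt, bool giant_owned, bool have_lock, int priority)
{
	return (sbt != 0 || giant_owned || have_lock ||
	    (priority & PNOLOCK) != 0);
}
```

Under this model, a zero-timeout sleep with no lock passes only when PNOLOCK is OR'd into the priority argument.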

Reviewed by: jhb
Sponsored by: Netflix
---
 sys/kern/kern_synch.c | 3 ++-
 sys/sys/param.h   | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/sys/kern/kern_synch.c b/sys/kern/kern_synch.c
index 793c5309a30b..7bf5193fb7b1 100644
--- a/sys/kern/kern_synch.c
+++ b/sys/kern/kern_synch.c
@@ -148,7 +148,8 @@ _sleep(const void *ident, struct lock_object *lock, int priority,
 #endif
WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, lock,
"Sleeping on \"%s\"", wmesg);
-   KASSERT(sbt != 0 || mtx_owned(&Giant) || lock != NULL,
+   KASSERT(sbt != 0 || mtx_owned(&Giant) || lock != NULL ||
+   (priority & PNOLOCK) != 0,
("sleeping without a lock"));
KASSERT(ident != NULL, ("_sleep: NULL ident"));
KASSERT(TD_IS_RUNNING(td), ("_sleep: curthread not running"));
diff --git a/sys/sys/param.h b/sys/sys/param.h
index f842b344e9f9..8864063e3d9b 100644
--- a/sys/sys/param.h
+++ b/sys/sys/param.h
@@ -246,7 +246,8 @@
 #define PRIMASK 0x0ff
 #define PCATCH  0x100   /* OR'd with pri for tsleep to check signals */
 #define PDROP   0x200   /* OR'd with pri to stop re-entry of interlock mutex */
-#define PRILASTFLAG 0x200   /* Last flag defined above */
+#define PNOLOCK 0x400   /* OR'd with pri to allow sleeping w/o a lock */
+#define PRILASTFLAG 0x400   /* Last flag defined above */
 
 #define NZERO   0   /* default "nice" */
 

git: 09066b98663d - main - ktls: Use the new PNOLOCK flag

2021-08-05 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=09066b98663d92f4d129bab25105805adf0abaf7

commit 09066b98663d92f4d129bab25105805adf0abaf7
Author: Andrew Gallatin 
AuthorDate: 2021-08-05 21:19:12 +
Commit: Andrew Gallatin 
CommitDate: 2021-08-05 21:19:12 +

ktls: Use the new PNOLOCK flag

Use the new PNOLOCK flag to tsleep() to indicate that
we are managing potential races, and don't need to
sleep with a lock, or have a backstop timeout.

Reviewed by: jhb
Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 47815c27..1cc1f2e8b8c4 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -2240,7 +2240,7 @@ ktls_alloc_thread(void *ctx)
nbufs = 0;
for (;;) {
atomic_store_int(&sc->running, 0);
-   tsleep_sbt(sc, PZERO, "waiting for work", SBT_MAX, SBT_1S, 0);
+   tsleep(sc, PZERO | PNOLOCK, "-",  0);
atomic_store_int(&sc->running, 1);
sc->wakeups++;
if (nbufs != ktls_max_alloc) {

git: 739de953ecc1 - main - ktls: Move KERN_TLS ifdef to tcp_var.h

2021-08-05 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=739de953ecc13afa930e2f55b7ee2a04e41e3519

commit 739de953ecc13afa930e2f55b7ee2a04e41e3519
Author: Andrew Gallatin 
AuthorDate: 2021-08-05 23:17:35 +
Commit: Andrew Gallatin 
CommitDate: 2021-08-05 23:17:35 +

ktls: Move KERN_TLS ifdef to tcp_var.h

This allows us to remove stubs in ktls.h and allows us
to sort the function prototypes.

Reviewed by: jhb
Sponsored by: Netflix
---
 sys/netinet/tcp_var.h |  6 +++---
 sys/sys/ktls.h| 14 +++---
 2 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/sys/netinet/tcp_var.h b/sys/netinet/tcp_var.h
index 8cfd2c5417c2..64e954cf4ad5 100644
--- a/sys/netinet/tcp_var.h
+++ b/sys/netinet/tcp_var.h
@@ -1144,8 +1144,6 @@ static inline void
 tcp_account_for_send(struct tcpcb *tp, uint32_t len, uint8_t is_rxt,
 uint8_t is_tlp, int hw_tls)
 {
-   uint64_t rexmit_percent;
-
if (is_tlp) {
tp->t_sndtlppack++;
tp->t_sndtlpbyte += len;
@@ -1156,11 +1154,13 @@ tcp_account_for_send(struct tcpcb *tp, uint32_t len, uint8_t is_rxt,
else
tp->t_sndbytes += len;
 
+#ifdef KERN_TLS
if (hw_tls && is_rxt) {
-   rexmit_percent = (1000ULL * tp->t_snd_rxt_bytes) / (10ULL * (tp->t_snd_rxt_bytes + tp->t_sndbytes));
+   uint64_t rexmit_percent = (1000ULL * tp->t_snd_rxt_bytes) / (10ULL * (tp->t_snd_rxt_bytes + tp->t_sndbytes));
if (rexmit_percent > ktls_ifnet_max_rexmit_pct)
ktls_disable_ifnet(tp);
}
+#endif
 
 }
 #endif /* _KERNEL */
diff --git a/sys/sys/ktls.h b/sys/sys/ktls.h
index a4156eb10395..437e36f965ea 100644
--- a/sys/sys/ktls.h
+++ b/sys/sys/ktls.h
@@ -197,7 +197,10 @@ struct ktls_session {
bool disable_ifnet_pending;
 } __aligned(CACHE_LINE_SIZE);
 
+extern unsigned int ktls_ifnet_max_rexmit_pct;
+
 void ktls_check_rx(struct sockbuf *sb);
+void ktls_disable_ifnet(void *arg);
 int ktls_enable_rx(struct socket *so, struct tls_enable *en);
 int ktls_enable_tx(struct socket *so, struct tls_enable *en);
 void ktls_destroy(struct ktls_session *tls);
@@ -233,16 +236,5 @@ ktls_free(struct ktls_session *tls)
ktls_destroy(tls);
 }
 
-#ifdef KERN_TLS
-extern unsigned int ktls_ifnet_max_rexmit_pct;
-void ktls_disable_ifnet(void *arg);
-#else
-#define ktls_ifnet_max_rexmit_pct 1
-static inline void
-ktls_disable_ifnet(void *arg __unused)
-{
-}
-#endif
-
 #endif /* !_KERNEL */
 #endif /* !_SYS_KTLS_H_ */

git: 95c51fafa40d - main - ktls: Init reset tag task for cloned sessions

2021-08-11 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=95c51fafa40d56d0a32aff857261097acc65ec92

commit 95c51fafa40d56d0a32aff857261097acc65ec92
Author: Andrew Gallatin 
AuthorDate: 2021-08-11 18:06:43 +
Commit: Andrew Gallatin 
CommitDate: 2021-08-11 18:06:43 +

ktls: Init reset tag task for cloned sessions

When cloning a ktls session (which is needed when we need to
switch output NICs for a NIC TLS session), we need to also
init the reset task, like we do when creating a new tls session.

Reviewed by: jhb
Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 1cc1f2e8b8c4..79da902095b3 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -709,6 +709,7 @@ ktls_clone_session(struct ktls_session *tls)
counter_u64_add(ktls_offload_active, 1);
 
refcount_init(&tls_new->refcount, 1);
+   TASK_INIT(&tls_new->reset_tag_task, 0, ktls_reset_send_tag, tls_new);
 
/* Copy fields from existing session. */
tls_new->params = tls->params;

git: 43c72c45a185 - main - lacp: Remove racy kassert

2022-06-13 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=43c72c45a1856c6cdf25a22d259528d5a4040973

commit 43c72c45a1856c6cdf25a22d259528d5a4040973
Author: Andrew Gallatin 
AuthorDate: 2022-06-13 15:32:10 +
Commit: Andrew Gallatin 
CommitDate: 2022-06-13 15:32:10 +

lacp: Remove racy kassert

In lacp_select_tx_port_by_hash(), we assert that the selected port is
DISTRIBUTING. However, the port state is protected by the LACP_LOCK(),
which is not held around lacp_select_tx_port_by_hash().  So this
assertion is racy, and can result in a spurious panic when links
are flapping.

It is certainly possible to fix it by acquiring LACP_LOCK(),
but this seems like an early development assert, and it seems best
to just remove it, rather than add complexity inside an ifdef
INVARIANTS.
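
For context, the selection step that remains after the assertion is removed is a plain modulo index into the port map. The sketch below models it in user space; struct port and select_tx_port are hypothetical stand-ins, and the precondition it asserts (a non-empty map) is an assumption of the sketch. Unlike the removed KASSERT, that precondition depends only on the arguments and not on port state another thread may be changing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct port { int id; }; /* hypothetical stand-in for a lagg port */

static const struct port *
select_tx_port(const struct port *const *map, uint32_t count, uint32_t hash)
{
	assert(count > 0); /* depends only on arguments, so it cannot race */
	hash %= count;
	return (map[hash]);
}
```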

Sponsored by: Netflix
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D35396
---
 sys/net/ieee8023ad_lacp.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/sys/net/ieee8023ad_lacp.c b/sys/net/ieee8023ad_lacp.c
index 6656ebb2b400..65b3a337eedc 100644
--- a/sys/net/ieee8023ad_lacp.c
+++ b/sys/net/ieee8023ad_lacp.c
@@ -876,9 +876,6 @@ lacp_select_tx_port_by_hash(struct lagg_softc *sc, uint32_t hash,
hash %= count;
lp = map[hash];
 
-   KASSERT((lp->lp_state & LACP_STATE_DISTRIBUTING) != 0,
-   ("aggregated port is not distributing"));
-
return (lp->lp_lagg);
 }

git: 0aa150775179 - main - pmcstat: fix log analysis

2022-07-04 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=0aa150775179a4f683fade5f1d6325a47b5f695f

commit 0aa150775179a4f683fade5f1d6325a47b5f695f
Author: Andrew Gallatin 
AuthorDate: 2022-07-04 16:40:35 +
Commit: Andrew Gallatin 
CommitDate: 2022-07-04 16:42:39 +

pmcstat: fix log analysis

pmcstat has been broken for analyzing logs since D35342 / b6e28991bf3aadb.

This is because the pmc for the first CPU is not added when reading logs
because unlike its clones, its event id is not invalid. That causes us
to fail the assertion at lib/libpmcstat/libpmcstat_logging.c:293
when encountering samples from cpu0.

Fix this by removing the check that the PMC is invalid

Reviewed by: tsoome
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35709
---
 usr.sbin/pmcstat/pmcstat.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/usr.sbin/pmcstat/pmcstat.c b/usr.sbin/pmcstat/pmcstat.c
index 08e43d5d446a..f366e2175a25 100644
--- a/usr.sbin/pmcstat/pmcstat.c
+++ b/usr.sbin/pmcstat/pmcstat.c
@@ -1187,8 +1187,7 @@ main(int argc, char **argv)
 */
 
STAILQ_FOREACH(ev, &args.pa_events, ev_next) {
-   if (ev->ev_pmcid == PMC_ID_INVALID &&
-   pmc_allocate(ev->ev_spec, ev->ev_mode,
+   if (pmc_allocate(ev->ev_spec, ev->ev_mode,
ev->ev_flags, ev->ev_cpu, &ev->ev_pmcid,
ev->ev_count) < 0)
err(EX_OSERR,

git: 713ceb99b685 - main - lagg: fix lagg ifioctl after SIOCSIFCAPNV

2022-07-28 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=713ceb99b68568232bf9895bbe1811797bfde63c

commit 713ceb99b68568232bf9895bbe1811797bfde63c
Author: Andrew Gallatin 
AuthorDate: 2022-07-28 14:36:22 +
Commit: Andrew Gallatin 
CommitDate: 2022-07-28 14:39:00 +

lagg: fix lagg ifioctl after SIOCSIFCAPNV

Lagg was broken by SIOCSIFCAPNV when all underlying devices
support SIOCSIFCAPNV.  This change updates lagg to work with
SIOCSIFCAPNV and if_capabilities2.
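
The core of the capability computation is a bitwise intersection over all member ports, now carried out for two capability words instead of one. The sketch below is a user-space model under that reading; struct port_caps and common_enabled_caps are hypothetical names, and a plain array replaces the driver's port list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-port capability words (cf. if_capenable, if_capenable2). */
struct port_caps {
	int capenable;
	int capenable2;
};

static void
common_enabled_caps(const struct port_caps *ports, size_t nports,
    int *ena, int *ena2)
{
	assert(ena != NULL && ena2 != NULL);
	/* Start from all-ones and clear every bit some port lacks. */
	*ena = *ena2 = ~0;
	for (size_t i = 0; i < nports; i++) {
		*ena &= ports[i].capenable;
		*ena2 &= ports[i].capenable2;
	}
	/* With no ports the all-ones seed is meaningless; report no caps. */
	if (nports == 0)
		*ena = *ena2 = 0;
}
```

Testing for an empty port set explicitly, as the diff does with CK_SLIST_FIRST(), avoids inferring emptiness from the all-ones seed value; that reading of the change is an interpretation and is not stated in the commit message.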

Reviewed by: kib, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35865
---
 sys/net/if_lagg.c | 62 +++
 sys/net/if_lagg.h |  1 +
 2 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 3894b6d55cea..8e273c4ed391 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -157,7 +157,7 @@ static void lagg_ratelimit_query(struct ifnet *,
 #endif
 static int lagg_setmulti(struct lagg_port *);
 static int lagg_clrmulti(struct lagg_port *);
-static int lagg_setcaps(struct lagg_port *, int cap);
+static void    lagg_setcaps(struct lagg_port *, int cap, int cap2);
 static int lagg_setflag(struct lagg_port *, int, int,
int (*func)(struct ifnet *, int));
 static int lagg_setflags(struct lagg_port *, int status);
@@ -664,17 +664,20 @@ static void
 lagg_capabilities(struct lagg_softc *sc)
 {
struct lagg_port *lp;
-   int cap, ena, pena;
+   int cap, cap2, ena, ena2, pena, pena2;
uint64_t hwa;
struct ifnet_hw_tsomax hw_tsomax;
 
LAGG_XLOCK_ASSERT(sc);
 
/* Get common enabled capabilities for the lagg ports */
-   ena = ~0;
-   CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries)
+   ena = ena2 = ~0;
+   CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) {
ena &= lp->lp_ifp->if_capenable;
-   ena = (ena == ~0 ? 0 : ena);
+   ena2 &= lp->lp_ifp->if_capenable2;
+   }
+   if (CK_SLIST_FIRST(&sc->sc_ports) == NULL)
+   ena = ena2 = 0;
 
/*
 * Apply common enabled capabilities back to the lagg ports.
@@ -682,30 +685,36 @@ lagg_capabilities(struct lagg_softc *sc)
 */
do {
pena = ena;
+   pena2 = ena2;
CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) {
-   lagg_setcaps(lp, ena);
+   lagg_setcaps(lp, ena, ena2);
ena &= lp->lp_ifp->if_capenable;
+   ena2 &= lp->lp_ifp->if_capenable2;
}
-   } while (pena != ena);
+   } while (pena != ena || pena2 != ena2);
 
/* Get other capabilities from the lagg ports */
-   cap = ~0;
+   cap = cap2 = ~0;
hwa = ~(uint64_t)0;
memset(&hw_tsomax, 0, sizeof(hw_tsomax));
CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) {
cap &= lp->lp_ifp->if_capabilities;
+   cap2 &= lp->lp_ifp->if_capabilities2;
hwa &= lp->lp_ifp->if_hwassist;
if_hw_tsomax_common(lp->lp_ifp, &hw_tsomax);
}
-   cap = (cap == ~0 ? 0 : cap);
-   hwa = (hwa == ~(uint64_t)0 ? 0 : hwa);
+   if (CK_SLIST_FIRST(&sc->sc_ports) == NULL)
+   cap = cap2 = hwa = 0;
 
if (sc->sc_ifp->if_capabilities != cap ||
sc->sc_ifp->if_capenable != ena ||
+   sc->sc_ifp->if_capenable2 != ena2 ||
sc->sc_ifp->if_hwassist != hwa ||
if_hw_tsomax_update(sc->sc_ifp, &hw_tsomax) != 0) {
sc->sc_ifp->if_capabilities = cap;
+   sc->sc_ifp->if_capabilities2 = cap2;
sc->sc_ifp->if_capenable = ena;
+   sc->sc_ifp->if_capenable2 = ena2;
sc->sc_ifp->if_hwassist = hwa;
getmicrotime(&sc->sc_ifp->if_lastchange);
 
@@ -982,7 +991,7 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport)
 
if (lp->lp_detaching == 0) {
lagg_setflags(lp, 0);
-   lagg_setcaps(lp, lp->lp_ifcapenable);
+   lagg_setcaps(lp, lp->lp_ifcapenable, lp->lp_ifcapenable2);
if_setlladdr(ifp, lp->lp_lladdr, ifp->if_addrlen);
}
 
@@ -1038,6 +1047,7 @@ lagg_port_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
break;
 
case SIOCSIFCAP:
+   case SIOCSIFCAPNV:
if (lp->lp_ioctl == NULL) {
error = EINVAL;
break;
@@ -1690,6 +1700,7 @@ lagg_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
break;
 
case SIOCSIFCAP:
+   case SIOCSI

git: b2921fdc2330 - main - arm64: Implement bus_get_resource and bus_delete_resource.

2023-11-11 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=b2921fdc2330a5750f557fa321b94f972d5a7702

commit b2921fdc2330a5750f557fa321b94f972d5a7702
Author: Andrew Gallatin 
AuthorDate: 2023-11-11 17:54:19 +
Commit: Andrew Gallatin 
CommitDate: 2023-11-11 17:57:39 +

arm64: Implement bus_get_resource and bus_delete_resource.

These devmethods were not defined, leading to the surprising result
of using bus_set_resource(), and then immediately turning around
and getting zeros back from bus_get_resource().   These are now
simply passed through to the generic definitions, since there
is no need for them to be arm64 specific.

Note that jhb plans to replace most of the devmethods with
the generic versions.

Suggested by: jhb
Sponsored by: Netflix
---
 sys/arm64/arm64/nexus.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/sys/arm64/arm64/nexus.c b/sys/arm64/arm64/nexus.c
index b9871f0e9b3a..6ba73cd456ef 100644
--- a/sys/arm64/arm64/nexus.c
+++ b/sys/arm64/arm64/nexus.c
@@ -136,6 +136,8 @@ static device_method_t nexus_methods[] = {
DEVMETHOD(bus_adjust_resource,  nexus_adjust_resource),
DEVMETHOD(bus_alloc_resource,   nexus_alloc_resource),
DEVMETHOD(bus_deactivate_resource, nexus_deactivate_resource),
+   DEVMETHOD(bus_delete_resource, bus_generic_rl_delete_resource),
+   DEVMETHOD(bus_get_resource, bus_generic_rl_get_resource),
DEVMETHOD(bus_get_resource_list, nexus_get_reslist),
DEVMETHOD(bus_map_resource, nexus_map_resource),
DEVMETHOD(bus_release_resource, nexus_release_resource),

git: ab063ac4444e - main - ipmi_ssif: Fix typo in debug print

2023-11-13 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=ab063ace426759cb5d053e50e02fa078a3c6

commit ab063ace426759cb5d053e50e02fa078a3c6
Author: Andrew Gallatin 
AuthorDate: 2023-11-14 00:44:27 +
Commit: Andrew Gallatin 
CommitDate: 2023-11-14 00:46:56 +

ipmi_ssif: Fix typo in debug print

Fix a typo in a debug print that prevents compilation.

Sponsored by: Netflix
---
 sys/dev/ipmi/ipmi_ssif.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/dev/ipmi/ipmi_ssif.c b/sys/dev/ipmi/ipmi_ssif.c
index 3ac1e04c2eda..532d1f7f485c 100644
--- a/sys/dev/ipmi/ipmi_ssif.c
+++ b/sys/dev/ipmi/ipmi_ssif.c
@@ -200,7 +200,7 @@ read_start:
goto fail;
}
 #ifdef SSIF_DEBUG
-   device_printf("SSIF: READ_START: ok\n");
+   device_printf(dev, "SSIF: READ_START: ok\n");
 #endif
 
/*

git: ba0e4d7971e0 - main - smbios: handle smbios3 for arm64

2023-11-15 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=ba0e4d7971e05ee64281a4fc49a2fb408c8ad816

commit ba0e4d7971e05ee64281a4fc49a2fb408c8ad816
Author: Andrew Gallatin 
AuthorDate: 2023-11-15 16:11:53 +
Commit: Andrew Gallatin 
CommitDate: 2023-11-15 16:20:04 +

smbios: handle smbios3 for arm64

Get smbios working on arm64 where it seems to be
exclusively smbios version 3.x

The "interesting" thing here is that the smbios table seems to be
RAM in the EFI runtime services table. This makes it owned by "ram0",
and not io memory. That prevents bus_alloc_resource() from being able
to claim it, since ram0 already owns it. According to jhb, this is
how things are supposed to work.  That is, bus_alloc_resource() is meant
to be used with IO memory, not physical memory.  Following his
suggestion, I converted the driver to simply use pmap_mapbios().
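
Once the entry point has been mapped, the two entry-point formats still have
to be told apart. A minimal sketch, assuming the anchor strings given in the
SMBIOS specification ("_SM_" for the 2.x entry point, "_SM3_" for the 3.x
one); classify_eps and enum eps_kind are hypothetical names, not part of the
driver.

```c
#include <stddef.h>
#include <string.h>

enum eps_kind { EPS_NONE, EPS_SMBIOS2, EPS_SMBIOS3 };

/*
 * Classify a mapped entry point by its anchor string.  len is the
 * number of bytes known to be readable at p.
 */
enum eps_kind
classify_eps(const void *p, size_t len)
{
	if (len >= 5 && memcmp(p, "_SM3_", 5) == 0)
		return (EPS_SMBIOS3);
	if (len >= 4 && memcmp(p, "_SM_", 4) == 0)
		return (EPS_SMBIOS2);
	return (EPS_NONE);
}
```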

This is a prerequisite for getting IPMI to attach via the SSIF
attachment on arm64 servers, where all IPMI that I've seen
uses SSIF.

Note that this change is based on initial work by Allan Jude in
https://reviews.freebsd.org/D28739.

Reviewed by: imp
Sponsored by: Netflix, Ampere Computing LLC (D28739)
Differential Revision: https://reviews.freebsd.org/D42592
---
 sys/dev/smbios/smbios.c | 176 
 sys/dev/smbios/smbios.h |  15 +
 2 files changed, 132 insertions(+), 59 deletions(-)

diff --git a/sys/dev/smbios/smbios.c b/sys/dev/smbios/smbios.c
index b9dd8a40e9e4..7f89430226c8 100644
--- a/sys/dev/smbios/smbios.c
+++ b/sys/dev/smbios/smbios.c
@@ -57,41 +57,49 @@
 
 struct smbios_softc {
device_t dev;
-   struct resource *   res;
-   int rid;
-
-   struct smbios_eps * eps;
+   union {
+   struct smbios_eps * eps;
+   struct smbios3_eps *eps3;
+   };
+   bool is_eps3;
 };
 
-#define RES2EPS(res) ((struct smbios_eps *)rman_get_virtual(res))
-
 static void smbios_identify (driver_t *, device_t);
 static int smbios_probe(device_t);
 static int smbios_attach   (device_t);
 static int smbios_detach   (device_t);
 static int smbios_modevent (module_t, int, void *);
 
-static int smbios_cksum(struct smbios_eps *);
+static int smbios_cksum(void *);
+static bool smbios_eps3 (void *);
 
 static void
 smbios_identify (driver_t *driver, device_t parent)
 {
 #ifdef ARCH_MAY_USE_EFI
struct uuid efi_smbios = EFI_TABLE_SMBIOS;
+   struct uuid efi_smbios3 = EFI_TABLE_SMBIOS3;
void *addr_efi;
 #endif
struct smbios_eps *eps;
+   struct smbios3_eps *eps3;
+   void *ptr;
device_t child;
vm_paddr_t addr = 0;
+   size_t map_size = sizeof (*eps);
int length;
-   int rid;
 
if (!device_is_alive(parent))
return;
 
 #ifdef ARCH_MAY_USE_EFI
-   if (!efi_get_table(&efi_smbios, &addr_efi))
+   if (!efi_get_table(&efi_smbios3, &addr_efi)) {
addr = (vm_paddr_t)addr_efi;
+   map_size = sizeof (*eps3);
+   } else if (!efi_get_table(&efi_smbios, &addr_efi)) {
+   addr = (vm_paddr_t)addr_efi;
+   }
+
 #endif
 
 #if defined(__amd64__) || defined(__i386__)
@@ -101,28 +109,50 @@ smbios_identify (driver_t *driver, device_t parent)
 #endif
 
if (addr != 0) {
-   eps = pmap_mapbios(addr, 0x1f);
-   rid = 0;
-   length = eps->length;
-
-   if (length != 0x1f) {
+   ptr = pmap_mapbios(addr, map_size);
+   if (ptr == NULL)
+   return;
+   if (map_size == sizeof (*eps3)) {
+   eps3 = ptr;
+   length = eps3->length;
+   if (memcmp(eps3->anchor_string,
+   SMBIOS3_SIG, SMBIOS3_LEN) != 0) {
+   printf("smbios3: corrupt sig %s found\n",
+   eps3->anchor_string);
+   return;
+   }
+   } else {
+   eps = ptr;
+   length = eps->length;
+   if (memcmp(eps->anchor_string,
+   SMBIOS_SIG, SMBIOS_LEN) != 0) {
+   printf("smbios: corrupt sig %s found\n",
+   eps->anchor_string);
+   return;
+   }
+   }
+   if (length != map_size) {
u_int8_t major, minor;
 
major = eps->major_version;
minor = eps->minor_version;
 
/* SMBIOS

git: 6f38d2e7b059 - main - acpi: Add workaround for Altra I2C memory resource

2023-11-15 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=6f38d2e7b0599f9b61c04686eb9a7faf3264b8ec

commit 6f38d2e7b0599f9b61c04686eb9a7faf3264b8ec
Author: Andrew Gallatin 
AuthorDate: 2023-11-15 21:22:00 +
Commit: Andrew Gallatin 
CommitDate: 2023-11-15 21:25:00 +

acpi: Add workaround for Altra I2C memory resource

Submitted by: allanjude
Sponsored by: Ampere Computing LLC
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D28741
---
 sys/dev/acpica/acpi_resource.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/sys/dev/acpica/acpi_resource.c b/sys/dev/acpica/acpi_resource.c
index 373cc6da9820..b845fd146f67 100644
--- a/sys/dev/acpica/acpi_resource.c
+++ b/sys/dev/acpica/acpi_resource.c
@@ -517,6 +517,13 @@ acpi_parse_resources(device_t dev, ACPI_HANDLE handle,
 acpi_MatchHid(handle, "ARMHD620") != ACPI_MATCHHID_NOMATCH)
arc.ignore_producer_flag = true;
 
+/*
+ * The DesignWare I2C Controller on Ampere Altra sets ResourceProducer on
+ * memory resources.
+ */
+if (acpi_MatchHid(handle, "APMC0D0F") != ACPI_MATCHHID_NOMATCH)
+   arc.ignore_producer_flag = true;
+
 status = AcpiWalkResources(handle, "_CRS", acpi_parse_resource, &arc);
 if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
printf("can't fetch resources for %s - %s\n",

git: 5972ffde919a - main - ig4(4): Add an EMAG device type

2023-11-15 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=5972ffde919ab65ba29d4d51ccf735da18d52719

commit 5972ffde919ab65ba29d4d51ccf735da18d52719
Author: Andrew Gallatin 
AuthorDate: 2023-11-16 00:51:28 +
Commit: Andrew Gallatin 
CommitDate: 2023-11-16 00:53:21 +

ig4(4): Add an EMAG device type

Sponsored by: Ampere Computing LLC, Netflix
Submitted by: allanjude
Differential Revision: https://reviews.freebsd.org/D28746
Reviewed by: imp
---
 sys/dev/ichiic/ig4_acpi.c | 12 ++--
 sys/dev/ichiic/ig4_iic.c  |  3 +++
 sys/dev/ichiic/ig4_var.h  |  1 +
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/sys/dev/ichiic/ig4_acpi.c b/sys/dev/ichiic/ig4_acpi.c
index f88cca6cf13d..3f370ae7abb9 100644
--- a/sys/dev/ichiic/ig4_acpi.c
+++ b/sys/dev/ichiic/ig4_acpi.c
@@ -83,13 +83,21 @@ static int
 ig4iic_acpi_attach(device_t dev)
 {
ig4iic_softc_t  *sc;
+   char *str;
int error;
 
sc = device_get_softc(dev);
 
sc->dev = dev;
-   /* All the HIDs matched are Atom SOCs. */
-   sc->version = IG4_ATOM;
+   error = ACPI_ID_PROBE(device_get_parent(dev), dev, ig4iic_ids, &str);
+   if (error > 0)
+   return (error);
+   if (strcmp(str, "APMC0D0F") == 0) {
+   sc->version = IG4_EMAG;
+   } else {
+   /* All the other HIDs matched are Atom SOCs. */
+   sc->version = IG4_ATOM;
+   }
sc->regs_rid = 0;
sc->regs_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY,
  &sc->regs_rid, RF_ACTIVE);
diff --git a/sys/dev/ichiic/ig4_iic.c b/sys/dev/ichiic/ig4_iic.c
index 195bca62928a..3dc72c458b24 100644
--- a/sys/dev/ichiic/ig4_iic.c
+++ b/sys/dev/ichiic/ig4_iic.c
@@ -89,6 +89,9 @@
  * Ig4 hardware parameters except Haswell are taken from intel_lpss driver
  */
 static const struct ig4_hw ig4iic_hw[] = {
+   [IG4_EMAG] = {
+   .ic_clock_rate = 100,   /* MHz */
+   },
[IG4_HASWELL] = {
.ic_clock_rate = 100,   /* MHz */
.sda_hold_time = 90,/* nsec */
diff --git a/sys/dev/ichiic/ig4_var.h b/sys/dev/ichiic/ig4_var.h
index 964a610e7408..989cf23779a2 100644
--- a/sys/dev/ichiic/ig4_var.h
+++ b/sys/dev/ichiic/ig4_var.h
@@ -42,6 +42,7 @@
 #include "iicbus_if.h"
 
 enum ig4_vers {
+   IG4_EMAG,
IG4_HASWELL,
IG4_ATOM,
IG4_SKYLAKE,

git: 5cd08d9ecf52 - main - apei: Mark ReadAckRegister resource as shareable

2024-01-09 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=5cd08d9ecf52d37229f4888e38631cb91ce97eb9

commit 5cd08d9ecf52d37229f4888e38631cb91ce97eb9
Author: Andrew Gallatin 
AuthorDate: 2024-01-09 20:52:07 +
Commit: Andrew Gallatin 
CommitDate: 2024-01-09 21:07:34 +

apei: Mark ReadAckRegister resource as shareable

Work around vendors who use the same address for multiple
ReadAckRegisters in their ACPI HEST table.  This
allows apei to attach cleanly on Ampere Altra servers.
Note the issue is not specific to Ampere; I've run into
it with at least one other vendor (whose server is not
yet released).

Sponsored by: Netflix
Reviewed by: jhb
---
 sys/dev/acpica/acpi_apei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/dev/acpica/acpi_apei.c b/sys/dev/acpica/acpi_apei.c
index 6a3d9d10edd4..9cfd46c97430 100644
--- a/sys/dev/acpica/acpi_apei.c
+++ b/sys/dev/acpica/acpi_apei.c
@@ -711,7 +711,7 @@ apei_attach(device_t dev)
if (ge->v1.Header.Type == ACPI_HEST_TYPE_GENERIC_ERROR_V2) {
ge->res2_rid = rid++;
acpi_bus_alloc_gas(dev, &ge->res2_type, &ge->res2_rid,
-   &ge->v2.ReadAckRegister, &ge->res2, 0);
+   &ge->v2.ReadAckRegister, &ge->res2, RF_SHAREABLE);
if (ge->res2 == NULL)
device_printf(dev, "Can't allocate ack resource.\n");
}

git: be91b4797e2c - main - acpi_ged: Handle events directly

2023-10-12 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=be91b4797e2c8f3440f6fe3aac7e246886f9ebca

commit be91b4797e2c8f3440f6fe3aac7e246886f9ebca
Author: Andrew Gallatin 
AuthorDate: 2023-10-12 15:15:06 +
Commit: Andrew Gallatin 
CommitDate: 2023-10-12 15:27:44 +

acpi_ged:  Handle events directly

Handle ged interrupts directly from the interrupt handler,
while the interrupt source is masked, so as to conform
with the acpi spec, and avoid spurious interrupts and
lockups on boot.

When an acpi ged interrupt is encountered, the spec requires
the os (as stated in 5.6.4: General Purpose Event Handling)
to leave the interrupt source masked until it runs the
EOI handler.  This is not a good fit for our method of
queuing the work (including the EOI ack of the interrupt),
via the AcpiOsExecute() taskqueue mechanism.

Note this fixes a bug where an arm64 server could lock up if
it encountered a ged interrupt at boot.  The lockup was
due to running on a single core (due to arm64 not using
EARLY_AP_STARTUP), and due to that core encountering a
new interrupt each time the interrupt handler unmasked
the interrupt source, and having the EOI queued on a taskqueue
which never got a chance to run. This is also possible
on any platform when using just a single processor.
The symptom of this is a lockup at boot, with:
"AcpiOsExecute: failed to enqueue task, consider
increasing the debug.acpi.max_tasks tunable" scrolling
on console.

Similarly, spurious interrupts would occur when running
with multiple cores, because it was likely that the
interrupt would fire again immediately, before the
ged task could be run, and before an EOI could be sent
to lower the interrupt line.  I would typically see
3-5 copies of every ged event due to this issue.

This adds a tunable, debug.acpi.ged_defer, which can be
set to 1 to restore the old behavior.  This was done
because acpi is a complex system, and it is theoretically
possible that something the ged handler does may sleep
(though I cannot easily find anything by inspection).
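
The difference between the two modes can be modelled in userspace. This is
only a sketch under stated assumptions: struct ged_model, MAX_TASKS and the
two functions are hypothetical, and the counters stand in for the real
AcpiOsExecute() queue and the evaluation of the event method.

```c
#include <stdbool.h>

#define MAX_TASKS 4		/* hypothetical queue depth */

struct ged_model {
	bool defer;		/* models debug.acpi.ged_defer */
	int queued;		/* tasks waiting to run */
	int handled;		/* events whose method has been evaluated */
	int dropped;		/* enqueue failures */
};

/* Models the interrupt handler: run the event method now, or queue it. */
void
ged_intr(struct ged_model *m)
{
	if (!m->defer) {
		m->handled++;	/* evaluated before the handler returns */
		return;
	}
	if (m->queued < MAX_TASKS)
		m->queued++;
	else
		m->dropped++;	/* models "failed to enqueue task" */
}

/* Models the taskqueue thread finally getting a chance to run. */
void
ged_run_tasks(struct ged_model *m)
{
	m->handled += m->queued;
	m->queued = 0;
}
```

In the deferred mode, repeated calls to ged_intr() without an intervening
ged_run_tasks() fill the queue and then start dropping, which mirrors the
single-core boot lockup described above; in the direct mode nothing is ever
queued.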

MFC after: 1 month
Reviewed by: andrew, jhb, imp
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D42158
---
 sys/dev/acpica/acpi_ged.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/sys/dev/acpica/acpi_ged.c b/sys/dev/acpica/acpi_ged.c
index e7dcc1ffb0ac..23e125f277c5 100644
--- a/sys/dev/acpica/acpi_ged.c
+++ b/sys/dev/acpica/acpi_ged.c
@@ -81,6 +81,11 @@ static driver_t acpi_ged_driver = {
 DRIVER_MODULE(acpi_ged, acpi, acpi_ged_driver, 0, 0);
 MODULE_DEPEND(acpi_ged, acpi, 1, 1, 1);
 
+static int acpi_ged_defer;
+SYSCTL_INT(_debug_acpi, OID_AUTO, ged_defer, CTLFLAG_RWTUN,
+&acpi_ged_defer, 0,
+"Handle ACPI GED via a task, rather than in the ISR");
+
 static void
 acpi_ged_evt(void *arg)
 {
@@ -92,7 +97,12 @@ acpi_ged_evt(void *arg)
 static void
 acpi_ged_intr(void *arg)
 {
-   AcpiOsExecute(OSL_GPE_HANDLER, acpi_ged_evt, arg);
+   struct acpi_ged_event *evt = arg;
+
+   if (acpi_ged_defer)
+   AcpiOsExecute(OSL_GPE_HANDLER, acpi_ged_evt, arg);
+   else
+   AcpiEvaluateObject(evt->ah, NULL, &evt->args, NULL);
 }
 static int
 acpi_ged_probe(device_t dev)

git: fd67ff5c7a6c - main - Use the correct idle routine on recent AMD EPYC servers

2024-11-08 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=fd67ff5c7a6cd9a2e82e6a02ea249cec76a4c030

commit fd67ff5c7a6cd9a2e82e6a02ea249cec76a4c030
Author: Andrew Gallatin 
AuthorDate: 2024-11-08 21:37:32 +
Commit: Andrew Gallatin 
CommitDate: 2024-11-08 22:10:44 +

Use the correct idle routine on recent AMD EPYC servers

We have been incorrectly choosing the "hlt" idle method on modern AMD
EPYC servers for C1 idle. This is because AMD also uses the Functional
Fixed Hardware interface. Due to not parsing the table properly for
AMD, and due to a weird quirk where the mwait latency for C1 is
mis-interpreted as the latency for hlt, we wind up choosing hlt for
c1, which has a far higher wake up latency (similar to IO) of roughly
400us on my test system (AMD 7502P).

This patch fixes this by:

- Looking for AMD in addition to Intel in the FFH
 (Note the vendor id of "2" for AMD is not publicly documented, but
 AMD has confirmed they are using "2" and has promised to document it.)

- Using mwait on AMD when specified in the table, and when CPUID says
 it's supported

- Fixing a weird issue where we copy the contents of cx_ptr for C1 and
 when moving to C2, we do not reinitialize cx_ptr. This leads to
 mwait being selected, and ignoring the specified i/o halt method
 unless we clear mwait before looking at the table for C2.
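
The first two points above can be sketched as a single predicate. This is a
simplified userspace model, not the acpi_cpu code: the FFH_* constants mirror
the vendor and class values in the diff, while ffh_use_mwait and its
cpuid5_edx parameter (standing in for cpu_mon_mwait_edx) are hypothetical.

```c
#include <stdbool.h>

#define FFH_VENDOR_INTEL	1
#define FFH_VENDOR_AMD		2
#define FFH_CLASS_C1IO		1
#define FFH_CLASS_MWAIT		2

/* Decide whether a C-state described by an FFH entry should use mwait. */
bool
ffh_use_mwait(unsigned vendor, unsigned cls, unsigned cpuid5_edx)
{
	if (vendor != FFH_VENDOR_INTEL && vendor != FFH_VENDOR_AMD)
		return (false);
	if (cls != FFH_CLASS_MWAIT)
		return (false);
	/* On AMD, additionally require a nonzero CPUID leaf 5 EDX. */
	return (vendor == FFH_VENDOR_INTEL || cpuid5_edx != 0);
}
```

The third point is separate state handling: whatever this predicate decided
for C1 must not leak into C2, which is why the patch clears do_mwait once a
C2 I/O resource has been obtained.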

Differential Revision: https://reviews.freebsd.org/D47444
Reviewed by: dab, kib, vangyzen
Sponsored by: Netflix
---
 sys/dev/acpica/acpi_cpu.c | 9 +++--
 sys/x86/include/x86_var.h | 1 +
 sys/x86/x86/identcpu.c| 2 ++
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/sys/dev/acpica/acpi_cpu.c b/sys/dev/acpica/acpi_cpu.c
index 80855cf168e9..63e17de1ff28 100644
--- a/sys/dev/acpica/acpi_cpu.c
+++ b/sys/dev/acpica/acpi_cpu.c
@@ -131,6 +131,7 @@ struct acpi_cpu_device {
 #define PIIX4_PCNTRL_BST_EN(1<<10)
 
 #define CST_FFH_VENDOR_INTEL   1
+#define CST_FFH_VENDOR_AMD     2
 #define CST_FFH_INTEL_CL_C1IO  1
 #define CST_FFH_INTEL_CL_MWAIT 2
 #define CST_FFH_MWAIT_HW_COORD 0x0001
@@ -855,7 +856,8 @@ acpi_cpu_cx_cst(struct acpi_cpu_softc *sc)
acpi_cpu_cx_cst_free_plvlx(sc->cpu_dev, cx_ptr);
 #if defined(__i386__) || defined(__amd64__)
if (acpi_PkgFFH_IntelCpu(pkg, 0, &vendor, &class, &address,
- &accsize) == 0 && vendor == CST_FFH_VENDOR_INTEL) {
+ &accsize) == 0 &&
+   (vendor == CST_FFH_VENDOR_INTEL || vendor == CST_FFH_VENDOR_AMD)) {
if (class == CST_FFH_INTEL_CL_C1IO) {
/* C1 I/O then Halt */
cx_ptr->res_rid = sc->cpu_cx_count;
@@ -872,7 +874,9 @@ acpi_cpu_cx_cst(struct acpi_cpu_softc *sc)
  "degrading to C1 Halt", (int)address);
}
} else if (class == CST_FFH_INTEL_CL_MWAIT) {
-   acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize);
+   if (vendor == CST_FFH_VENDOR_INTEL ||
+   (vendor == CST_FFH_VENDOR_AMD && cpu_mon_mwait_edx != 0))
+   acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize);
}
}
 #endif
@@ -922,6 +926,7 @@ acpi_cpu_cx_cst(struct acpi_cpu_softc *sc)
acpi_PkgGas(sc->cpu_dev, pkg, 0, &cx_ptr->res_type,
&cx_ptr->res_rid, &cx_ptr->p_lvlx, RF_SHAREABLE);
if (cx_ptr->p_lvlx) {
+   cx_ptr->do_mwait = false;
ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 "acpi_cpu%d: Got C%d - %d latency\n",
 device_get_unit(sc->cpu_dev), cx_ptr->type,
diff --git a/sys/x86/include/x86_var.h b/sys/x86/include/x86_var.h
index f19c557e270b..6609871bf89e 100644
--- a/sys/x86/include/x86_var.h
+++ b/sys/x86/include/x86_var.h
@@ -62,6 +62,7 @@ extern char cpu_vendor[];
 extern char cpu_model[];
 extern u_int   cpu_vendor_id;
 extern u_int   cpu_mon_mwait_flags;
+extern u_int   cpu_mon_mwait_edx;
 extern u_int   cpu_mon_min_size;
 extern u_int   cpu_mon_max_size;
 extern u_int   cpu_maxphyaddr;
diff --git a/sys/x86/x86/identcpu.c b/sys/x86/x86/identcpu.c
index d3aec5b5e0c6..3f8f11fda011 100644
--- a/sys/x86/x86/identcpu.c
+++ b/sys/x86/x86/identcpu.c
@@ -117,6 +117,7 @@ u_int   cpu_stdext_feature3;/* %edx */
 uint64_t cpu_ia32_arch_caps;
 u_int  cpu_max_ext_state_size;
 u_int  cpu_mon_mwait_flags;/* MONITOR/MWAIT flags (CPUID.05H.ECX) */
+u_int  cpu_mon_mwait_edx;  /* MONITOR/MWAIT supported on AMD (CPUID.05H.EDX) */
 u_int  cpu_mon_min_size;   /* MONITOR minimum range size, bytes */
 u_int  cpu_mon_max_size;   /* MONITOR minimum range size

git: 49597c3e84c4 - main - mlx5e: Use M_WAITOK when allocating TLS tags

2024-10-23 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=49597c3e84c4a1cc35f2c392d93db8d0a1cccac2

commit 49597c3e84c4a1cc35f2c392d93db8d0a1cccac2
Author: Andrew Gallatin 
AuthorDate: 2024-10-23 19:56:14 +
Commit: Andrew Gallatin 
CommitDate: 2024-10-23 19:56:14 +

mlx5e:  Use M_WAITOK when allocating TLS tags

Now that it is clear we're in a sleepable context, use
M_WAITOK when allocating TLS tags.

Suggested by: kib
Sponsored by: Netflix
---
 sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
index c347de650250..b5caa3ba53dd 100644
--- a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
+++ b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
@@ -335,9 +335,7 @@ mlx5e_tls_snd_tag_alloc(if_t ifp,
return (EOPNOTSUPP);
 
/* allocate new tag from zone, if any */
-   ptag = uma_zalloc(priv->tls.zone, M_NOWAIT);
-   if (ptag == NULL)
-   return (ENOMEM);
+   ptag = uma_zalloc(priv->tls.zone, M_WAITOK);
 
/* sanity check default values */
MPASS(ptag->dek_index == 0);

git: 81dbc22ce8b6 - main - mlx5e: Immediately initialize TLS send tags

2024-10-23 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=81dbc22ce8b66759a9fc4ebdef5cfc7a6185af22

commit 81dbc22ce8b66759a9fc4ebdef5cfc7a6185af22
Author: Andrew Gallatin 
AuthorDate: 2024-10-23 19:16:19 +
Commit: Andrew Gallatin 
CommitDate: 2024-10-23 19:16:19 +

mlx5e: Immediately initialize TLS send tags

Under massive connection thrashing (web server restarting), we see
long periods where the web server blocks when enabling ktls offload
when NIC ktls offload is enabled.

It turns out the driver uses a single-threaded linux work queue to
serialize the commands that must be sent to the nic to allocate and
free tls resources. When freeing sessions, this work is handled
asynchronously. However, when allocating sessions, the work is handled
synchronously and the driver waits for the work to complete before
returning. When under massive connection thrashing, the work queue is
first filled by TLS sessions closing. Then when new sessions arrive,
the web server enables kTLS and blocks while the tens or hundreds of
thousands of queued session closes are processed by the NIC.

Rather than using the work queue to open a TLS session on the NIC,
switch to doing the open directly. This allows us to cut in front of
all those sessions that are waiting to close, and minimize the amount
of time the web server blocks. The risk is that the NIC may be out of
resources because it has not processed all of those session frees. So
if we fail to open a session directly, we fall back to using the work
queue.
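
The "try directly, fall back to the queue" pattern can be sketched in
userspace as follows. Everything here is a hypothetical model: struct
nic_model, nic_open_direct and tag_alloc are illustrative names, free_ctx
stands in for whatever NIC resources a session needs, and workq_len for the
backlog of pending work items.

```c
#include <errno.h>

struct nic_model {
	int free_ctx;	/* resources the NIC can hand out right now */
	int workq_len;	/* work items already waiting in the queue */
};

enum tag_state { TAG_READY, TAG_QUEUED };

/* Try to open a session immediately; fails if the NIC has no resources. */
int
nic_open_direct(struct nic_model *nic)
{
	if (nic->free_ctx == 0)
		return (ENOMEM);
	nic->free_ctx--;
	return (0);
}

/*
 * Allocate a tag: take the fast path when possible, otherwise queue
 * the open behind the work that is already pending.
 */
enum tag_state
tag_alloc(struct nic_model *nic)
{
	if (nic_open_direct(nic) == 0)
		return (TAG_READY);
	nic->workq_len++;
	return (TAG_QUEUED);
}
```

In this model the fast path never touches workq_len, so a long backlog of
closes does not delay an open unless the direct attempt fails.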

Differential Revision: https://reviews.freebsd.org/D47260
Sponsored by: Netflix
Reviewed by: kib
---
 sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c | 86 +--
 1 file changed, 52 insertions(+), 34 deletions(-)

diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
index a8522d68d5aa..c347de650250 100644
--- a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
+++ b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
@@ -213,54 +213,63 @@ mlx5e_tls_cleanup(struct mlx5e_priv *priv)
counter_u64_free(ptls->stats.arg[x]);
 }
 
+
+static int
+mlx5e_tls_st_init(struct mlx5e_priv *priv, struct mlx5e_tls_tag *ptag)
+{
+   int err;
+
+   /* try to open TIS, if not present */
+   if (ptag->tisn == 0) {
+   err = mlx5_tls_open_tis(priv->mdev, 0, priv->tdn,
+   priv->pdn, &ptag->tisn);
+   if (err) {
+   MLX5E_TLS_STAT_INC(ptag, tx_error, 1);
+   return (err);
+   }
+   }
+   MLX5_SET(sw_tls_cntx, ptag->crypto_params, progress.pd, ptag->tisn);
+
+   /* try to allocate a DEK context ID */
+   err = mlx5_encryption_key_create(priv->mdev, priv->pdn,
+   MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_TYPE_TLS,
+   MLX5_ADDR_OF(sw_tls_cntx, ptag->crypto_params, key.key_data),
+   MLX5_GET(sw_tls_cntx, ptag->crypto_params, key.key_len),
+   &ptag->dek_index);
+   if (err) {
+   MLX5E_TLS_STAT_INC(ptag, tx_error, 1);
+   return (err);
+   }
+
+   MLX5_SET(sw_tls_cntx, ptag->crypto_params, param.dek_index, ptag->dek_index);
+
+   ptag->dek_index_ok = 1;
+
+   MLX5E_TLS_TAG_LOCK(ptag);
+   if (ptag->state == MLX5E_TLS_ST_INIT)
+   ptag->state = MLX5E_TLS_ST_SETUP;
+   MLX5E_TLS_TAG_UNLOCK(ptag);
+   return (0);
+}
+
 static void
 mlx5e_tls_work(struct work_struct *work)
 {
struct mlx5e_tls_tag *ptag;
struct mlx5e_priv *priv;
-   int err;
 
ptag = container_of(work, struct mlx5e_tls_tag, work);
priv = container_of(ptag->tls, struct mlx5e_priv, tls);
 
switch (ptag->state) {
case MLX5E_TLS_ST_INIT:
-   /* try to open TIS, if not present */
-   if (ptag->tisn == 0) {
-   err = mlx5_tls_open_tis(priv->mdev, 0, priv->tdn,
-   priv->pdn, &ptag->tisn);
-   if (err) {
-   MLX5E_TLS_STAT_INC(ptag, tx_error, 1);
-   break;
-   }
-   }
-   MLX5_SET(sw_tls_cntx, ptag->crypto_params, progress.pd, ptag->tisn);
-
-   /* try to allocate a DEK context ID */
-   err = mlx5_encryption_key_create(priv->mdev, priv->pdn,
-   MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_TYPE_TLS,
-   MLX5_ADDR_OF(sw_tls_cntx, ptag->crypto_params, key.key_data),
-   MLX5_GET(sw_tls_cntx, ptag->crypto_params, key.key_len),
-   &ptag->dek_index);
-   if (err) {
-   MLX5E_TLS_STAT_INC(ptag, tx_error,

git: 4605a99b51ab - main - aio: remove write-only jobid & kernelinfo

2024-11-15 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=4605a99b51ab72351d7554fbadbb24985f4667b1

commit 4605a99b51ab72351d7554fbadbb24985f4667b1
Author: Andrew Gallatin 
AuthorDate: 2024-11-15 15:41:34 +
Commit: Andrew Gallatin 
CommitDate: 2024-11-15 15:47:46 +

aio: remove write-only jobid & kernelinfo

The jobid (which was stored in kernelinfo) was used to look up
jobs until 1ce9182407f6, where it became essentially write only.
Remove it to simplify the code and pave the way for future work
to make aio scale better.

Note this has been slated for removal "soon" for 18 years.

Suggested by: jhb
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D47583
---
 sys/kern/vfs_aio.c | 42 +-
 sys/sys/aio.h  |  2 +-
 2 files changed, 2 insertions(+), 42 deletions(-)

diff --git a/sys/kern/vfs_aio.c b/sys/kern/vfs_aio.c
index e7302f4b7a9e..eb08716fbeda 100644
--- a/sys/kern/vfs_aio.c
+++ b/sys/kern/vfs_aio.c
@@ -71,12 +71,6 @@
 #include 
 #include 
 
-/*
- * Counter for allocating reference ids to new jobs.  Wrapped to 1 on
- * overflow. (XXX will be removed soon.)
- */
-static u_long jobrefid;
-
 /*
  * Counter for aio_fsync.
  */
@@ -297,7 +291,6 @@ struct aiocb_ops {
long(*fetch_error)(struct aiocb *ujob);
int (*store_status)(struct aiocb *ujob, long status);
int (*store_error)(struct aiocb *ujob, long error);
-   int (*store_kernelinfo)(struct aiocb *ujob, long jobref);
int (*store_aiocb)(struct aiocb **ujobp, struct aiocb *ujob);
 };
 
@@ -418,7 +411,6 @@ aio_onceonly(void)
aiolio_zone = uma_zcreate("AIOLIO", sizeof(struct aioliojob), NULL,
NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
aiod_lifetime = AIOD_LIFETIME_DEFAULT;
-   jobrefid = 1;
p31b_setcfg(CTL_P1003_1B_ASYNCHRONOUS_IO, _POSIX_ASYNCHRONOUS_IO);
p31b_setcfg(CTL_P1003_1B_AIO_MAX, MAX_AIO_QUEUE);
p31b_setcfg(CTL_P1003_1B_AIO_PRIO_DELTA_MAX, 0);
@@ -1455,13 +1447,6 @@ aiocb_store_error(struct aiocb *ujob, long error)
return (suword(&ujob->_aiocb_private.error, error));
 }
 
-static int
-aiocb_store_kernelinfo(struct aiocb *ujob, long jobref)
-{
-
-   return (suword(&ujob->_aiocb_private.kernelinfo, jobref));
-}
-
 static int
 aiocb_store_aiocb(struct aiocb **ujobp, struct aiocb *ujob)
 {
@@ -1475,7 +1460,6 @@ static struct aiocb_ops aiocb_ops = {
.fetch_error = aiocb_fetch_error,
.store_status = aiocb_store_status,
.store_error = aiocb_store_error,
-   .store_kernelinfo = aiocb_store_kernelinfo,
.store_aiocb = aiocb_store_aiocb,
 };
 
@@ -1486,7 +1470,6 @@ static struct aiocb_ops aiocb_ops_osigevent = {
.fetch_error = aiocb_fetch_error,
.store_status = aiocb_store_status,
.store_error = aiocb_store_error,
-   .store_kernelinfo = aiocb_store_kernelinfo,
.store_aiocb = aiocb_store_aiocb,
 };
 #endif
@@ -1507,7 +1490,6 @@ aio_aqueue(struct thread *td, struct aiocb *ujob, struct aioliojob *lj,
int opcode;
int error;
int fd, kqfd;
-   int jid;
u_short evflags;
 
if (p->p_aioinfo == NULL)
@@ -1517,7 +1499,6 @@ aio_aqueue(struct thread *td, struct aiocb *ujob, struct aioliojob *lj,
 
ops->store_status(ujob, -1);
ops->store_error(ujob, 0);
-   ops->store_kernelinfo(ujob, -1);
 
if (num_queue_count >= max_queue_count ||
ki->kaio_count >= max_aio_queue_per_proc) {
@@ -1630,16 +1611,8 @@ aio_aqueue(struct thread *td, struct aiocb *ujob, struct aioliojob *lj,
job->fd_file = fp;
 
mtx_lock(&aio_job_mtx);
-   jid = jobrefid++;
job->seqno = jobseqno++;
mtx_unlock(&aio_job_mtx);
-   error = ops->store_kernelinfo(ujob, jid);
-   if (error) {
-   error = EINVAL;
-   goto err3;
-   }
-   job->uaiocb._aiocb_private.kernelinfo = (void *)(intptr_t)jid;
-
if (opcode == LIO_NOP) {
fdrop(fp, td);
MPASS(job->uiop == &job->uio || job->uiop == NULL);
@@ -2728,7 +2701,7 @@ filt_lio(struct knote *kn, long hint)
 struct __aiocb_private32 {
int32_t status;
int32_t error;
-   uint32_t kernelinfo;
+   uint32_t spare;
 };
 
 #ifdef COMPAT_FREEBSD6
@@ -2807,7 +2780,6 @@ aiocb32_copyin_old_sigevent(struct aiocb *ujob, struct kaiocb *kjob,
CP(job32, *kcb, aio_reqprio);
CP(job32, *kcb, _aiocb_private.status);
CP(job32, *kcb, _aiocb_private.error);
-   PTRIN_CP(job32, *kcb, _aiocb_private.kernelinfo);
return (convert_old_sigevent32(&job32.aio_sigevent,
&kcb->aio_sigevent));
 }
@@ -2844,7 +2816,6 @@ aiocb32_copyin(struct aiocb *ujob, struct kaioc

git: 194bb58b80c1 - main - x86: Fixes for nmi/pmi interrupt sharing

2025-02-05 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=194bb58b80c184b8230edef0ed7f292b4bf706b0

commit 194bb58b80c184b8230edef0ed7f292b4bf706b0
Author: Andrew Gallatin 
AuthorDate: 2025-02-04 22:04:57 +
Commit: Andrew Gallatin 
CommitDate: 2025-02-05 15:26:27 +

x86: Fixes for nmi/pmi interrupt sharing

- Fix a bug where the semantics of refcount_release() were
reversed.  This would lead to the nmi interrupt being prematurely
masked in the local apic, leading to an out-of-tree profiling
tool only getting results the first time it was run.

- Stop executing nmi handlers after one claims the interrupt.
The core2 hwpmc handler seems to be especially heavy, and running it
in addition to vtune's handler caused roughly 50% of the nmi interrupts
to be lost (and caused vtune to give worse results).
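
The two fixes above can be sketched in userland C. This is only a model under stated assumptions: model_refcount_release() is a simplified stand-in for refcount_release(9), which returns true only when the last reference is dropped, and the handler names and list layout are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-in for refcount_release(9): returns true only when the
 * last reference is dropped.  Not the kernel implementation.
 */
static bool
model_refcount_release(unsigned int *count)
{
	assert(*count > 0);
	return (--(*count) == 0);
}

/*
 * Models the fixed check in lapic_disable_pcint(): return early
 * while other users remain.  Returns true if the LVT entry would
 * be masked.
 */
static bool
model_disable_pcint(unsigned int *refcnt)
{
	if (!model_refcount_release(refcnt))
		return (false);	/* other users remain, stay unmasked */
	return (true);		/* last user gone, mask the entry */
}

/* Hypothetical handler list, much simpler than struct nmi_handler. */
struct model_handler {
	int (*func)(void);
	struct model_handler *next;
};

static int calls;	/* counts how many handlers actually ran */

static int ignore_nmi(void) { calls++; return (0); }
static int claim_nmi(void) { calls++; return (1); }

/* Walk the list and stop as soon as one handler claims the NMI. */
static bool
model_handle_nmi(struct model_handler *hp)
{
	for (; hp != NULL; hp = hp->next) {
		if (hp->func != NULL && hp->func() != 0)
			return (true);
	}
	return (false);
}
```

With two references held, the first model_disable_pcint() call leaves the interrupt unmasked and only the second masks it. With the pre-fix (inverted) test the mask would happen on the first release instead.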

Reviewed by: bojan
Sponsored by: Netflix
---
 sys/x86/x86/cpu_machdep.c | 11 ++++++++---
 sys/x86/x86/local_apic.c  |  2 +-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/sys/x86/x86/cpu_machdep.c b/sys/x86/x86/cpu_machdep.c
index 4df652f1f2a8..5b4abfe71642 100644
--- a/sys/x86/x86/cpu_machdep.c
+++ b/sys/x86/x86/cpu_machdep.c
@@ -65,6 +65,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -955,6 +956,7 @@ nmi_handle_intr(struct trapframe *frame)
 {
int (*func)(struct trapframe *);
struct nmi_handler *hp;
+   int rv;
bool handled;
 
 #ifdef SMP
@@ -965,13 +967,16 @@ nmi_handle_intr(struct trapframe *frame)
handled = false;
hp = (struct nmi_handler *)atomic_load_acq_ptr(
(uintptr_t *)&nmi_handlers_head);
-   while (hp != NULL) {
+   while (!handled && hp != NULL) {
func = hp->func;
if (func != NULL) {
atomic_add_int(&hp->running, 1);
-   if (func(frame) != 0)
-   handled = true;
+   rv = func(frame);
atomic_subtract_int(&hp->running, 1);
+   if (rv != 0) {
+   handled = true;
+   break;
+   }
}
hp = (struct nmi_handler *)atomic_load_acq_ptr(
(uintptr_t *)&hp->next);
diff --git a/sys/x86/x86/local_apic.c b/sys/x86/x86/local_apic.c
index 86cbe9a050dc..db9a1eb757de 100644
--- a/sys/x86/x86/local_apic.c
+++ b/sys/x86/x86/local_apic.c
@@ -895,7 +895,7 @@ lapic_disable_pcint(void)
maxlvt = (lapic_read32(LAPIC_VERSION) & APIC_VER_MAXLVT) >> MAXLVTSHIFT;
if (maxlvt < APIC_LVT_PMC)
return;
-   if (refcount_release(&pcint_refcnt))
+   if (!refcount_release(&pcint_refcnt))
return;
lvts[APIC_LVT_PMC].lvt_masked = 1;

git: 36fdc42c6a4c - main - mlx5en: Fix SIOCSIFCAPNV

2025-01-30 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=36fdc42c6a4c828d334471438c4f852e4b5a25e2

commit 36fdc42c6a4c828d334471438c4f852e4b5a25e2
Author: Andrew Gallatin 
AuthorDate: 2025-01-31 01:07:06 +0000
Commit: Andrew Gallatin
CommitDate: 2025-01-31 01:57:35 +0000

mlx5en: Fix SIOCSIFCAPNV

In 4cc5d081d8c23, a change was introduced that manipulated
drv_ioctl_data->reqcap using IFCAP2 bits.  This was noticed
when creating a mixed lagg with mce0 and ixl0 caused the
interfaces' txcsum caps to be disabled.
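
A minimal sketch of why masking the wrong word disables checksum offload. The bit positions below are assumptions chosen for illustration (low bits of the first word carrying RXCSUM/TXCSUM, second-word indices starting at 0); the struct and macro names are hypothetical stand-ins, not the definitions in net/if.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define MODEL_IFCAP_RXCSUM	0x1u	/* first-word capability bit */
#define MODEL_IFCAP_TXCSUM	0x2u	/* first-word capability bit */
#define MODEL_IFCAP2_RXTLS4	0	/* bit index within second word */
#define MODEL_IFCAP2_RXTLS6	1	/* bit index within second word */
#define MODEL_IFCAP2_BIT(x)	(1u << (x))

struct model_ioctl_data {
	uint32_t reqcap;	/* requested caps, first word */
	uint32_t reqcap2;	/* requested caps, second word */
};

/* Pre-fix behaviour: second-word bits are cleared in the first word. */
static void
model_clear_rxtls_buggy(struct model_ioctl_data *d)
{
	assert(d != NULL);
	d->reqcap &= ~(MODEL_IFCAP2_BIT(MODEL_IFCAP2_RXTLS4) |
	    MODEL_IFCAP2_BIT(MODEL_IFCAP2_RXTLS6));
}

/* Post-fix behaviour: the mask is applied to the word it belongs to. */
static void
model_clear_rxtls_fixed(struct model_ioctl_data *d)
{
	assert(d != NULL);
	d->reqcap2 &= ~(MODEL_IFCAP2_BIT(MODEL_IFCAP2_RXTLS4) |
	    MODEL_IFCAP2_BIT(MODEL_IFCAP2_RXTLS6));
}
```

Under these assumed values, the buggy variant turns a request for RXCSUM|TXCSUM into a request for nothing while leaving the RXTLS bits set, which is consistent with the lagg symptom described above.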

Fixes: 4cc5d081d8c23
Reviewed by: glebius
Sponsored by: Netflix
MFC After: 7 days
---
 sys/dev/mlx5/mlx5_en/mlx5_en_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
index 3e7df834d080..c17da50c1a5e 100644
--- a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
+++ b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
@@ -3557,12 +3557,12 @@ siocsifcap_driver:
IFCAP_TXTLS6);
}
if (!mlx5e_is_tlsrx_capable(priv->mdev)) {
-   drv_ioctl_data->reqcap &= ~(
+   drv_ioctl_data->reqcap2 &= ~(
IFCAP2_BIT(IFCAP2_RXTLS4) |
IFCAP2_BIT(IFCAP2_RXTLS6));
}
if (!mlx5e_is_ipsec_capable(priv->mdev)) {
-   drv_ioctl_data->reqcap &=
+   drv_ioctl_data->reqcap2 &=
~IFCAP2_BIT(IFCAP2_IPSEC_OFFLOAD);
}
if (!mlx5e_is_ratelimit_capable(priv->mdev)) {

git: cf9070746742 - main - Introduce the UMA_ZONE_NOTRIM uma zone type

2025-01-15 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=cf907074674206b1825f79c6864c4c4a32089ecc

commit cf907074674206b1825f79c6864c4c4a32089ecc
Author: Andrew Gallatin 
AuthorDate: 2025-01-15 17:11:51 +0000
Commit: Andrew Gallatin
CommitDate: 2025-01-15 17:23:00 +0000

Introduce the UMA_ZONE_NOTRIM uma zone type

The ktls buffer zone allocates 16k contiguous buffers, and often needs
to call vm_page_reclaim_contig_domain_ext() to free up contiguous
memory, which can be expensive.  Web servers which have a daily
pattern of peaks and troughs end up having UMA trim the
ktls_buffer_zone when they are in their trough, and end up re-building
it on the way to their peak.

Rather than calling vm_page_reclaim_contig_domain_ext() multiple times
on a daily basis, let's mark the ktls_buffer_zone with a new UMA flag,
UMA_ZONE_NOTRIM.  This disables UMA_RECLAIM_TRIM on the zone, but
allows UMA_RECLAIM_DRAIN* operations, so that if we become extremely
short of memory (vm_page_count_severe()), the uma reclaim worker can
still free up memory.

Note that UMA_ZONE_UNMANAGED already exists, but can never be drained
or trimmed, so it may hold on to memory during times of severe memory
pressure.  Using UMA_ZONE_NOTRIM rather than UMA_ZONE_UNMANAGED is an
attempt to keep this zone more reactive in the face of severe memory
pressure.
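
The resulting policy can be summarised as a small decision function. This is a userland sketch, not the UMA code: the struct, the enum and the UNMANAGED value are hypothetical stand-ins, and only the 0x1000 value for the new flag is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ZONE_UNMANAGED	0x0001u	/* assumed value, illustration only */
#define MODEL_ZONE_NOTRIM	0x1000u	/* value of the new flag in the diff */

/* Stand-ins for the UMA_RECLAIM_* request types. */
enum model_reclaim_req {
	MODEL_RECLAIM_DRAIN,
	MODEL_RECLAIM_DRAIN_CPU,
	MODEL_RECLAIM_TRIM
};

struct model_zone {
	uint32_t flags;
};

/*
 * Returns true if a reclaim request should be forwarded to the zone:
 * unmanaged zones are never touched, NOTRIM zones skip only TRIM.
 */
static bool
model_should_reclaim(const struct model_zone *zone,
    enum model_reclaim_req req)
{
	assert(zone != NULL);
	if ((zone->flags & MODEL_ZONE_UNMANAGED) != 0)
		return (false);
	if (req == MODEL_RECLAIM_TRIM &&
	    (zone->flags & MODEL_ZONE_NOTRIM) != 0)
		return (false);
	return (true);
}
```

A NOTRIM zone therefore ignores the periodic trim but still responds to drain requests, while an unmanaged zone ignores both.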

Sponsored by: Netflix
Reviewed by: jhb, kib, markj (via slack)
Differential Revision: https://reviews.freebsd.org/D48451
---
 sys/kern/uipc_ktls.c |  2 +-
 sys/vm/uma.h         |  1 +
 sys/vm/uma_core.c    | 11 ++++++++---
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index 881825bf1d9f..6815667594a4 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -495,7 +495,7 @@ ktls_init(void)
ktls_buffer_zone = uma_zcache_create("ktls_buffers",
roundup2(ktls_maxlen, PAGE_SIZE), NULL, NULL, NULL, NULL,
ktls_buffer_import, ktls_buffer_release, NULL,
-   UMA_ZONE_FIRSTTOUCH);
+   UMA_ZONE_FIRSTTOUCH | UMA_ZONE_NOTRIM);
}
 
/*
diff --git a/sys/vm/uma.h b/sys/vm/uma.h
index 38865df7ae02..4f2b143a2fae 100644
--- a/sys/vm/uma.h
+++ b/sys/vm/uma.h
@@ -252,6 +252,7 @@ uma_zone_t uma_zcache_create(const char *name, int size, uma_ctor ctor,
 #define UMA_ZONE_SECONDARY      0x0200  /* Zone is a Secondary Zone */
 #define UMA_ZONE_NOBUCKET       0x0400  /* Do not use buckets. */
 #define UMA_ZONE_MAXBUCKET      0x0800  /* Use largest buckets. */
+#define UMA_ZONE_NOTRIM         0x1000  /* Don't trim this zone */
 #define UMA_ZONE_CACHESPREAD    0x2000  /*
 * Spread memory start locations across
 * all possible cache lines.  May
diff --git a/sys/vm/uma_core.c b/sys/vm/uma_core.c
index e93c561d759a..4de850afcb66 100644
--- a/sys/vm/uma_core.c
+++ b/sys/vm/uma_core.c
@@ -1222,7 +1222,7 @@ zone_timeout(uma_zone_t zone, void *unused)
 
 trim:
/* Trim caches not used for a long time. */
-   if ((zone->uz_flags & UMA_ZONE_UNMANAGED) == 0) {
+   if ((zone->uz_flags & (UMA_ZONE_UNMANAGED | UMA_ZONE_NOTRIM)) == 0) {
for (int i = 0; i < vm_ndomains; i++) {
if (bucket_cache_reclaim_domain(zone, false, false, i) &&
(zone->uz_flags & UMA_ZFLAG_CACHE) == 0)
@@ -5306,8 +5306,13 @@ uma_reclaim_domain_cb(uma_zone_t zone, void *arg)
struct uma_reclaim_args *args;
 
args = arg;
-   if ((zone->uz_flags & UMA_ZONE_UNMANAGED) == 0)
-   uma_zone_reclaim_domain(zone, args->req, args->domain);
+   if ((zone->uz_flags & UMA_ZONE_UNMANAGED) != 0)
+   return;
+   if ((args->req == UMA_RECLAIM_TRIM) &&
+   (zone->uz_flags & UMA_ZONE_NOTRIM) !=0)
+   return;
+
+   uma_zone_reclaim_domain(zone, args->req, args->domain);
 }
 
 /* See uma.h */

git: 709348c21351 - main - ifconfig: fix reporting optics on most 100g interfaces

2025-02-25 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=709348c21351a783ff0025519d1f7cf884771077

commit 709348c21351a783ff0025519d1f7cf884771077
Author: Andrew Gallatin 
AuthorDate: 2025-02-25 19:17:14 +0000
Commit: Andrew Gallatin
CommitDate: 2025-02-25 19:26:07 +0000

ifconfig: fix reporting optics on most 100g interfaces

This fixes a bug where optics on 100G and faster NICs
were not properly reported.

The problem is that we pull the string from the correct
table in ifconfig_sfp_physical_spec only when sfp_eth_1040g
contains the SFP_ETH_1040G_EXTENDED bit.  However, we were
never saving that bit when it was encountered.  This change
records that bit into sfp_eth_1040g, allowing us to later
select the appropriate ID string.
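
The lost-bit problem can be modelled in a few lines. Everything here is a simplified sketch: the struct, the flag value 0x80, the extended code 0x02 and the strings are assumptions for illustration, not the libifconfig tables or the SFF register definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_EXT_FLAG	0x80u	/* assumed "extended compliance" bit */
#define MODEL_EXT_SR4	0x02u	/* assumed extended code for SR4 */

struct model_sfp_info {
	uint8_t eth_1040g;	/* 10/40G compliance byte */
	uint8_t eth_ext;	/* extended compliance code */
};

/*
 * Models the parsing step.  save_bit == false reproduces the old
 * behaviour, where the extended flag was seen but never recorded.
 */
static void
model_get_qsfp_info(uint8_t code, uint8_t ext_code, bool save_bit,
    struct model_sfp_info *sfp)
{
	assert(sfp != NULL);
	memset(sfp, 0, sizeof(*sfp));
	if ((code & MODEL_EXT_FLAG) != 0) {
		sfp->eth_ext = ext_code;
		if (save_bit)
			sfp->eth_1040g = code;	/* the one-line fix */
	} else {
		sfp->eth_1040g = code;
	}
}

/* Models the later lookup, which keys off the recorded flag. */
static const char *
model_physical_spec(const struct model_sfp_info *sfp)
{
	assert(sfp != NULL);
	if ((sfp->eth_1040g & MODEL_EXT_FLAG) != 0)
		return (sfp->eth_ext == MODEL_EXT_SR4 ?
		    "100GBASE-SR4 or 25GBASE-SR" : "Unknown");
	return (sfp->eth_1040g != 0 ? "10/40G class" : "Unknown");
}
```

Without the recorded flag the lookup never consults the extended code, so the model reports "Unknown" even though the extended code was read correctly.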

This should cause most 100G interfaces to stop being identified
as "unknown" in the "plugged" output of ifconfig -v, and to
start being identified as what they really are.

Example output from a Chelsio T6 with SR4 optics in one port
and DR1 optics in another:

Before:
plugged: QSFP28 Unknown (MPO 1x12 Parallel Optic)
plugged: QSFP28 Unknown (LC)

After:
plugged: QSFP28 100GBASE-SR4 or 25GBASE-SR (MPO 1x12 Parallel Optic)
plugged: QSFP28 100GBASE-DR (LC)

Reviewed by: kbowling, np
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D49127
MFC after: 7 days
---
 lib/libifconfig/libifconfig_sfp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/libifconfig/libifconfig_sfp.c b/lib/libifconfig/libifconfig_sfp.c
index 17f130606765..8292135d3e47 100644
--- a/lib/libifconfig/libifconfig_sfp.c
+++ b/lib/libifconfig/libifconfig_sfp.c
@@ -181,6 +181,7 @@ get_qsfp_info(struct i2c_info *ii, struct ifconfig_sfp_info *sfp)
if (code & SFF_8636_EXT_COMPLIANCE) {
read_i2c(ii, SFF_8436_BASE, SFF_8436_OPTIONS_START, 1,
&sfp->sfp_eth_ext);
+   sfp->sfp_eth_1040g = code;
} else {
/* Check 10/40G Ethernet class only */
sfp->sfp_eth_1040g =

git: 20e15e905c58 - main - mlx5: Decrease FW init timeout from 120 seconds to 5 seconds

2025-06-29 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=20e15e905c58e9e2020b2c3e40caa2e8406e5827

commit 20e15e905c58e9e2020b2c3e40caa2e8406e5827
Author: Andrew Gallatin 
AuthorDate: 2025-06-29 20:51:50 +0000
Commit: Andrew Gallatin
CommitDate: 2025-06-29 20:51:50 +0000

mlx5: Decrease FW init timeout from 120 seconds to 5 seconds

When encountering a failed NIC, the mlx5 driver will wait up to 120
secs for the firmware to respond.  This timeout is absurdly huge, and
leads to boot times of 40 minutes to over an hour on our servers when a
NIC fails.  This is because the driver will attempt to attach to the
failed NIC multiple times (once for each driver loaded after mlx5),
and wait 2 minutes on each attempt.  This happens because the mlx5
driver is still the best match for the device.  This delay then
triggers watchdog timeouts in our environment, rendering servers
with a failed NIC entirely unbootable without manual intervention.

Note that FW_INIT_WARN_MESSAGE_INTERVAL must also be decreased, as
it must be less than the init timeout.
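
A rough model of the timing, assuming a firmware that never responds. The polling loop is a simplified stand-in rather than the driver's wait routine, and the count of 20 attach attempts used in the note below is a hypothetical figure chosen to match the 40 minute lower bound quoted above.

```c
#include <assert.h>

struct model_wait_result {
	unsigned int elapsed_ms;	/* time spent before giving up */
	unsigned int warnings;		/* warning messages emitted */
};

/*
 * Poll a dead firmware every poll_ms until timeout_ms has passed,
 * emitting a warning every warn_ms.
 */
static struct model_wait_result
model_wait_dead_fw(unsigned int timeout_ms, unsigned int poll_ms,
    unsigned int warn_ms)
{
	struct model_wait_result r = { 0, 0 };
	unsigned int next_warn = warn_ms;

	assert(poll_ms > 0 && warn_ms > 0);
	while (r.elapsed_ms < timeout_ms) {
		r.elapsed_ms += poll_ms;	/* stands in for a sleep */
		if (r.elapsed_ms >= next_warn) {
			r.warnings++;
			next_warn += warn_ms;
		}
	}
	return (r);
}

/* Total boot delay in seconds for a number of failed attach attempts. */
static unsigned int
model_boot_delay_s(unsigned int attach_attempts, unsigned int timeout_ms)
{
	return (attach_attempts * (timeout_ms / 1000));
}
```

In this model, 20 attempts at 120000 ms each cost 2400 seconds (40 minutes), against 100 seconds at 5000 ms. It also shows why the warn interval has to shrink: with a 5000 ms timeout, a warn interval longer than the timeout (for example 20000 ms) never fires, while 2000 ms yields two warnings.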

Reviewed by: kib (initial version, before reducing warn interval)
Sponsored by: Netflix
---
 sys/dev/mlx5/device.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/dev/mlx5/device.h b/sys/dev/mlx5/device.h
index e6d46507a5d2..3e2c4f15a5cc 100644
--- a/sys/dev/mlx5/device.h
+++ b/sys/dev/mlx5/device.h
@@ -32,8 +32,8 @@
 
 #define FW_INIT_TIMEOUT_MILI            2000
 #define FW_INIT_WAIT_MS                 2
-#define FW_PRE_INIT_TIMEOUT_MILI        120000
-#define FW_INIT_WARN_MESSAGE_INTERVAL   20000
+#define FW_PRE_INIT_TIMEOUT_MILI        5000
+#define FW_INIT_WARN_MESSAGE_INTERVAL   2000
 
 #if defined(__LITTLE_ENDIAN)
 #define MLX5_SET_HOST_ENDIANNESS   0

git: 78bdaa57cfba - main - lagg: Fix if_hw_tsomax_update() not being called

2025-07-12 Thread Andrew Gallatin

The branch main has been updated by gallatin:

URL: 
https://cgit.FreeBSD.org/src/commit/?id=78bdaa57cfbac759a6d79ecad2fae570e294a4b3

commit 78bdaa57cfbac759a6d79ecad2fae570e294a4b3
Author: Andrew Gallatin 
AuthorDate: 2025-07-12 22:35:29 +0000
Commit: Andrew Gallatin
CommitDate: 2025-07-12 22:35:29 +0000

lagg: Fix if_hw_tsomax_update() not being called

In a mixed lagg, it's likely that ifcaps or hwassist may not
match between members.  If this is true, the logical OR will
be short-circuited and if_hw_tsomax_update() will not be called.

Fix this by calling it inside the body of the if as well.
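
The short-circuit can be reproduced in a small userland model. The struct and function names are hypothetical; model_tsomax_update() only imitates the relevant property of if_hw_tsomax_update(), namely that it both updates state and reports whether anything changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_ifnet {
	unsigned int tsomax;	/* stand-in for the TSO limits */
};

/* Updates the limit and returns nonzero only if it changed. */
static int
model_tsomax_update(struct model_ifnet *ifp, unsigned int new_max)
{
	assert(ifp != NULL);
	if (ifp->tsomax == new_max)
		return (0);
	ifp->tsomax = new_max;
	return (1);
}

/* Pre-fix shape: the update lives only in the || chain. */
static void
model_capabilities_buggy(struct model_ifnet *ifp, bool caps_differ,
    unsigned int new_max)
{
	if (caps_differ || model_tsomax_update(ifp, new_max) != 0) {
		/* ... copy capability words ... */
	}
}

/* Post-fix shape: the update is repeated inside the body. */
static void
model_capabilities_fixed(struct model_ifnet *ifp, bool caps_differ,
    unsigned int new_max)
{
	if (caps_differ || model_tsomax_update(ifp, new_max) != 0) {
		/* ... copy capability words ... */
		(void)model_tsomax_update(ifp, new_max);
	}
}
```

When caps_differ is true, C never evaluates the right-hand operand of ||, so the buggy variant leaves tsomax stale. The extra call in the fixed variant is harmless when the condition already ran it, since the second call finds nothing to change.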

Sponsored by: Netflix
MFC after: 2 weeks
---
 sys/net/if_lagg.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 9867a718e148..5b52bfa80e3b 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -718,6 +718,7 @@ lagg_capabilities(struct lagg_softc *sc)
sc->sc_ifp->if_capenable = ena;
sc->sc_ifp->if_capenable2 = ena2;
sc->sc_ifp->if_hwassist = hwa;
+   (void)if_hw_tsomax_update(sc->sc_ifp, &hw_tsomax);
getmicrotime(&sc->sc_ifp->if_lastchange);
 
if (sc->sc_ifflags & IFF_DEBUG)
