git: 1f628be888b7 - main - tcp_ratelimit: provide an api for drivers to release ratesets at detach
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=1f628be888b74f1219b3ea7ccea1e7a3d1db77a2

commit 1f628be888b74f1219b3ea7ccea1e7a3d1db77a2
Author: Andrew Gallatin
AuthorDate: 2024-08-05 15:45:42 +
Commit: Andrew Gallatin
CommitDate: 2024-08-05 16:51:35 +

    tcp_ratelimit: provide an api for drivers to release ratesets at detach

    When the kernel is compiled with options RATELIMIT, the mlx5en
    driver cannot detach. It gets stuck waiting for all kernel users of
    its rates to drop to zero before finally calling ether_ifdetach.

    The tcp ratelimit code has an eventhandler for ifnet departure
    which causes rates to be released. However, this is called as an
    ifnet departure eventhandler, which is invoked as part of
    ifdetach(), via ether_ifdetach(). This means that the tcp
    ratelimit code holds down many hw rates when the mlx5en driver is
    waiting for the rate count to go to 0. Thus devctl detach will
    deadlock on mlx5 with this stack:

    mi_switch+0xcf
    sleepq_timedwait+0x2f
    _sleep+0x1a3
    pause_sbt+0x77
    mlx5e_destroy_ifp+0xaf
    mlx5_remove_device+0xa7
    mlx5_unregister_device+0x78
    mlx5_unload_one+0x10a
    remove_one+0x1e
    linux_pci_detach_device+0x36
    linux_pci_detach+0x24
    device_detach+0x180
    devctl2_ioctl+0x3dc
    devfs_ioctl+0xbb
    vn_ioctl+0xca
    devfs_ioctl_f+0x1e
    kern_ioctl+0x1c3
    sys_ioctl+0x10a

    To fix this, provide an explicit API for a driver to call the tcp
    ratelimit code telling it to detach itself from an ifnet. This
    allows the mlx5 driver to unload cleanly. I considered adding an
    ifnet pre-departure eventhandler. However, that would need to be
    invoked by the driver, so a simple function call seemed better.

    The mlx5en driver has been updated to call this function.
Reviewed by: kib, rrs Differential Revision: https://reviews.freebsd.org/D46221 Sponsored by: Netflix --- sys/dev/mlx5/mlx5_en/mlx5_en_main.c | 8 +++- sys/netinet/tcp_ratelimit.c | 6 ++ sys/netinet/tcp_ratelimit.h | 9 + 3 files changed, 22 insertions(+), 1 deletion(-) diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c index ccbdf11a1dd5..a80235f0f347 100644 --- a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c +++ b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c @@ -36,6 +36,7 @@ #include #include +#include #include #include @@ -4876,7 +4877,12 @@ mlx5e_destroy_ifp(struct mlx5_core_dev *mdev, void *vpriv) #ifdef RATELIMIT /* -* The kernel can have reference(s) via the m_snd_tag's into +* Tell the TCP ratelimit code to release the rate-sets attached +* to our ifnet. +*/ + tcp_rl_release_ifnet(ifp); + /* +* The kernel can still have reference(s) via the m_snd_tag's into * the ratelimit channels, and these must go away before * detaching: */ diff --git a/sys/netinet/tcp_ratelimit.c b/sys/netinet/tcp_ratelimit.c index 1834c702c493..22bdf707fa89 100644 --- a/sys/netinet/tcp_ratelimit.c +++ b/sys/netinet/tcp_ratelimit.c @@ -1298,6 +1298,12 @@ tcp_rl_ifnet_departure(void *arg __unused, struct ifnet *ifp) NET_EPOCH_EXIT(et); } +void +tcp_rl_release_ifnet(struct ifnet *ifp) +{ + tcp_rl_ifnet_departure(NULL, ifp); +} + static void tcp_rl_shutdown(void *arg __unused, int howto __unused) { diff --git a/sys/netinet/tcp_ratelimit.h b/sys/netinet/tcp_ratelimit.h index cd540d1164e1..0ce42dea0d90 100644 --- a/sys/netinet/tcp_ratelimit.h +++ b/sys/netinet/tcp_ratelimit.h @@ -94,6 +94,8 @@ CK_LIST_HEAD(head_tcp_rate_set, tcp_rate_set); #ifndef ETHERNET_SEGMENT_SIZE #define ETHERNET_SEGMENT_SIZE 1514 #endif +struct tcpcb; + #ifdef RATELIMIT #define DETAILED_RATELIMIT_SYSCTL 1/* * Undefine this if you don't want @@ -131,6 +133,9 @@ tcp_get_pacing_burst_size_w_divisor(struct tcpcb *tp, uint64_t bw, uint32_t segs void tcp_rl_log_enobuf(const struct tcp_hwrate_limit_table 
*rte); +void +tcp_rl_release_ifnet(struct ifnet *ifp); + #else static inline const struct tcp_hwrate_limit_table * tcp_set_pacing_rate(struct tcpcb *tp, struct ifnet *ifp, @@ -218,6 +223,10 @@ tcp_rl_log_enobuf(const struct tcp_hwrate_limit_table *rte) { } +static inline void +tcp_rl_release_ifnet(struct ifnet *ifp) +{ +} #endif /*
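The ordering problem described in this commit message can be illustrated with a minimal userspace sketch. This is not the kernel code: `struct fake_ifnet`, `rl_hold()`, `rl_release_ifnet()`, and `driver_detach()` are hypothetical stand-ins, and the reference count is a plain integer rather than the driver's real accounting. It only shows that a count which is dropped by a handler running after the wait can never reach zero during the wait.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an ifnet with hardware rates in use. */
struct fake_ifnet {
	int rate_refs;		/* rates held by the TCP ratelimit code */
};

/* The ratelimit code takes a reference on a hardware rate. */
static void
rl_hold(struct fake_ifnet *ifp)
{
	ifp->rate_refs++;
}

/*
 * Explicit release entry point, analogous in spirit to
 * tcp_rl_release_ifnet(): drop every rate held on this ifnet.
 */
static void
rl_release_ifnet(struct fake_ifnet *ifp)
{
	ifp->rate_refs = 0;
}

/* The driver may only proceed to detach once no rates are held. */
static bool
driver_can_detach(const struct fake_ifnet *ifp)
{
	return (ifp->rate_refs == 0);
}

/*
 * Detach path: release first, then check.  Without the explicit
 * release the check stays false, which models the deadlock.
 */
static bool
driver_detach(struct fake_ifnet *ifp, bool call_release)
{
	if (call_release)
		rl_release_ifnet(ifp);
	return (driver_can_detach(ifp));
}
```

In the sketch, calling `driver_detach()` with `call_release` set to false corresponds to the old behaviour, where the release only happened later inside the departure handler.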
git: 13a5a46c49d0 - main - Fix new users of MAXPHYS and hide it from the kernel namespace
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=13a5a46c49d0ec3e10e5476ad763947f165052e2

commit 13a5a46c49d0ec3e10e5476ad763947f165052e2
Author: Andrew Gallatin
AuthorDate: 2024-04-29 23:11:56 +
Commit: Andrew Gallatin
CommitDate: 2024-04-30 19:29:06 +

    Fix new users of MAXPHYS and hide it from the kernel namespace

    In cd8537910406, kib made maxphys a load-time tunable. This made
    the #define MAXPHYS in sys/param.h almost entirely obsolete, as it
    could now be overridden by kern.maxphys at boot time, or by
    opt_maxphys.h.

    However, decades of tradition have led to several new, incorrect,
    uses of MAXPHYS in other parts of the kernel, mostly by seasoned
    developers. I've corrected those uses here in a mechanical fashion,
    and verified that it fixes a bug in the md driver that I was
    experiencing.

    Since using MAXPHYS is such an easy mistake to make, it is best to
    hide it from the kernel namespace. So I've moved its definition to
    _maxphys.h, which is now included in param.h only for userspace.

    That brings up the fact that lots of userspace programs use MAXPHYS
    for different reasons, most of them probably wrong. Userspace
    consumers that really need to know the value of maxphys should
    probably be changed to use the kern.maxphys sysctl. But that's
    outside the scope of this change.
Reviewed by: imp, jkim, kib, markj Fixes: 30038a8b4efc ("md: Get rid of the pbuf zone") Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D44986 --- sys/compat/linux/linux_socket.c | 2 +- sys/dev/md/md.c | 6 +++--- sys/dev/rtsx/rtsx.c | 4 ++-- sys/kern/subr_param.c | 1 + sys/sys/_maxphys.h | 10 ++ sys/sys/param.h | 8 +--- 6 files changed, 18 insertions(+), 13 deletions(-) diff --git a/sys/compat/linux/linux_socket.c b/sys/compat/linux/linux_socket.c index 36cffc979802..15431bf3127c 100644 --- a/sys/compat/linux/linux_socket.c +++ b/sys/compat/linux/linux_socket.c @@ -2468,7 +2468,7 @@ sendfile_fallback(struct thread *td, struct file *fp, l_int out, out_offset = 0; flags = FOF_OFFSET | FOF_NOUPDATE; - bufsz = min(count, MAXPHYS); + bufsz = min(count, maxphys); buf = malloc(bufsz, M_LINUX, M_WAITOK); bytes_sent = 0; while (bytes_sent < count) { diff --git a/sys/dev/md/md.c b/sys/dev/md/md.c index 27e63363767c..241517898ad4 100644 --- a/sys/dev/md/md.c +++ b/sys/dev/md/md.c @@ -965,7 +965,7 @@ unmapped_step: PAGE_MASK; iolen = min(ptoa(npages) - (ma_offs & PAGE_MASK), len); KASSERT(iolen > 0, ("zero iolen")); - KASSERT(npages <= atop(MAXPHYS + PAGE_SIZE), + KASSERT(npages <= atop(maxphys + PAGE_SIZE), ("npages %d too large", npages)); pmap_qenter(sc->kva, &bp->bio_ma[atop(ma_offs)], npages); aiov.iov_base = (void *)(sc->kva + (ma_offs & PAGE_MASK)); @@ -1487,7 +1487,7 @@ mdcreate_vnode(struct md_s *sc, struct md_req *mdr, struct thread *td) goto bad; } - sc->kva = kva_alloc(MAXPHYS + PAGE_SIZE); + sc->kva = kva_alloc(maxphys + PAGE_SIZE); return (0); bad: VOP_UNLOCK(nd.ni_vp); @@ -1547,7 +1547,7 @@ mddestroy(struct md_s *sc, struct thread *td) if (sc->uma) uma_zdestroy(sc->uma); if (sc->kva) - kva_free(sc->kva, MAXPHYS + PAGE_SIZE); + kva_free(sc->kva, maxphys + PAGE_SIZE); LIST_REMOVE(sc, list); free_unr(md_uh, sc->unit); diff --git a/sys/dev/rtsx/rtsx.c b/sys/dev/rtsx/rtsx.c index 464a155e66c2..a2f124f6c30d 100644 --- a/sys/dev/rtsx/rtsx.c +++ 
b/sys/dev/rtsx/rtsx.c @@ -311,7 +311,7 @@ static int rtsx_resume(device_t dev); #defineRTSX_DMA_ALIGN 4 #defineRTSX_HOSTCMD_MAX256 #defineRTSX_DMA_CMD_BIFSIZE(sizeof(uint32_t) * RTSX_HOSTCMD_MAX) -#defineRTSX_DMA_DATA_BUFSIZE MAXPHYS +#defineRTSX_DMA_DATA_BUFSIZE maxphys #defineISSET(t, f) ((t) & (f)) @@ -2762,7 +2762,7 @@ rtsx_xfer(struct rtsx_softc *sc, struct mmc_command *cmd) (unsigned long)cmd->data->len, (unsigned long)cmd->data->xfer_len); if (cmd->data->len > RTSX_DMA_DATA_BUFSIZE) { - device_printf(sc->rtsx_dev, "rtsx_xfer() length too large: %ld > %d\n", + device_printf(sc->rtsx_dev, "rtsx_xfer() length too large: %ld > %ld\n", (unsigned long)cmd->data->len, RTSX_DMA_DATA_BUFSIZE); cmd->error = MMC_ERR_INVALID;
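The difference between sizing against the compile-time constant and sizing against the tunable can be shown with a small userspace sketch. Everything here is hypothetical: `SKETCH_MAXPHYS`, `sketch_maxphys`, and the 128 KiB default are illustrative values, not the kernel's definitions. The sketch only mirrors the `min(count, MAXPHYS)` to `min(count, maxphys)` change in the hunks above.

```c
#include <stddef.h>

#define SKETCH_MAXPHYS	(128 * 1024)	/* hypothetical compile-time default */

/* Hypothetical runtime tunable; may be raised at boot. */
static unsigned long sketch_maxphys = SKETCH_MAXPHYS;

static size_t
min_size(size_t a, size_t b)
{
	return (a < b ? a : b);
}

/* Stale: ignores any boot-time override of the limit. */
static size_t
bufsz_constant(size_t count)
{
	return (min_size(count, SKETCH_MAXPHYS));
}

/* Follows the tunable, as the corrected call sites do. */
static size_t
bufsz_tunable(size_t count)
{
	return (min_size(count, sketch_maxphys));
}
```

Once the tunable is raised above the constant, the two functions disagree for any request larger than the constant, which is the kind of mismatch the commit message attributes to the stale uses.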
git: 530c2c30b0c7 - main - ip6_output: Reduce cache misses on pktopts
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=530c2c30b0c75f1a71df637ae1e09b352f8256cb

commit 530c2c30b0c75f1a71df637ae1e09b352f8256cb
Author: Andrew Gallatin
AuthorDate: 2024-03-20 19:46:01 +
Commit: Andrew Gallatin
CommitDate: 2024-03-20 19:50:57 +

    ip6_output: Reduce cache misses on pktopts

    When profiling an IPv6-heavy workload, I noticed that we were
    getting a lot of cache misses in ip6_output() around ip6_pktopts.
    This was happening because the TCP stack passes
    inp->in6p_outputopts even if all options are unused. So in the
    common case of no options present, pkt_opts is not null, and is
    checked repeatedly for different options. Since ip6_pktopts is
    large (4 cachelines), and every field is checked, we take 4 cache
    misses (2 of which tend to be hidden by the adjacent line
    prefetcher).

    To fix this common case, I introduced a new flag in ip6_pktopts
    (ip6po_valid) which tracks which options have been set. In the
    common case where nothing is set, this causes just a single cache
    miss to load. It also eliminates a test for some options:
    if (opt != NULL && opt->val >= const) becomes
    if ((optvalid & flag) != 0).

    To keep the struct the same size in 64-bit kernels, and to keep the
    integer values (like ip6po_hlim, ip6po_tclass, etc) on the same
    cacheline, I moved them to the top.

    As suggested by zlei, the null check in MAKE_EXTHDR() becomes
    redundant, and can be removed.

    For our web server workload (with the ip6po_tclass option set),
    this drops the CPI from 2.9 to 2.4 for ip6_output.

    Differential Revision: https://reviews.freebsd.org/D44204
    Reviewed by: bz, glebius, zlei
    No Objection from: melifaro
    Sponsored by: Netflix Inc.
--- sys/netinet6/ip6_output.c | 67 --- sys/netinet6/ip6_var.h| 56 +-- 2 files changed, 83 insertions(+), 40 deletions(-) diff --git a/sys/netinet6/ip6_output.c b/sys/netinet6/ip6_output.c index a2c3efad749b..530f86c36689 100644 --- a/sys/netinet6/ip6_output.c +++ b/sys/netinet6/ip6_output.c @@ -159,14 +159,12 @@ static int copypktopts(struct ip6_pktopts *, struct ip6_pktopts *, int); */ #defineMAKE_EXTHDR(hp, mp, _ol) \ do { \ - if (hp) { \ - struct ip6_ext *eh = (struct ip6_ext *)(hp);\ - error = ip6_copyexthdr((mp), (caddr_t)(hp), \ - ((eh)->ip6e_len + 1) << 3); \ - if (error) \ - goto freehdrs; \ - (_ol) += (*(mp))->m_len;\ - } \ + struct ip6_ext *eh = (struct ip6_ext *)(hp);\ + error = ip6_copyexthdr((mp), (caddr_t)(hp), \ + ((eh)->ip6e_len + 1) << 3); \ + if (error) \ + goto freehdrs; \ + (_ol) += (*(mp))->m_len;\ } while (/*CONSTCOND*/ 0) /* @@ -431,6 +429,7 @@ ip6_output(struct mbuf *m0, struct ip6_pktopts *opt, uint32_t fibnum; struct m_tag *fwd_tag = NULL; uint32_t id; + uint32_t optvalid; NET_EPOCH_ASSERT(); @@ -491,14 +490,17 @@ ip6_output(struct mbuf *m0, struct ip6_pktopts *opt, * Keep the length of the unfragmentable part for fragmentation. */ bzero(&exthdrs, sizeof(exthdrs)); - optlen = 0; + optlen = optvalid = 0; unfragpartlen = sizeof(struct ip6_hdr); if (opt) { + optvalid = opt->ip6po_valid; + /* Hop-by-Hop options header. */ - MAKE_EXTHDR(opt->ip6po_hbh, &exthdrs.ip6e_hbh, optlen); + if ((optvalid & IP6PO_VALID_HBH) != 0) + MAKE_EXTHDR(opt->ip6po_hbh, &exthdrs.ip6e_hbh, optlen); /* Destination options header (1st part). */ - if (opt->ip6po_rthdr) { + if ((optvalid & IP6PO_VALID_RHINFO) != 0) { #ifndef RTHDR_SUPPORT_IMPLEMENTED /* * If there is a routing header, discard the packet @@ -524,11 +526,13 @@ ip6_output(struct mbuf *m0, struct ip6_pktopts *opt, * options, which might automatically be inserted in * the kernel. */ - MAKE_EXTHDR(opt->ip6po_dest1, &
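The "valid bitmask" idea from this commit message can be sketched in userspace as follows. The struct, the `OPT_VALID_*` bits, and the helper names are hypothetical and much smaller than the real `struct ip6_pktopts`; the sketch only illustrates keeping a summary word that is updated whenever an option is set, so that the common "nothing set" path reads a single field.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real names live in ip6_var.h. */
#define OPT_VALID_HBH	0x1u
#define OPT_VALID_RTHDR	0x2u
#define OPT_VALID_DEST1	0x4u

/* Hypothetical option block: one summary word plus per-option pointers. */
struct pktopts {
	uint32_t valid;		/* bitmask of options that are set */
	void *hbh;
	void *rthdr;
	void *dest1;
};

/* Setting an option records it in the summary word as well. */
static void
pktopts_set_hbh(struct pktopts *po, void *hdr)
{
	po->hbh = hdr;
	if (hdr != NULL)
		po->valid |= OPT_VALID_HBH;
	else
		po->valid &= ~OPT_VALID_HBH;
}

/*
 * Count how many extension headers would be emitted.  Only po->valid
 * is read here, so the common "nothing set" case touches one field.
 */
static int
pktopts_count(const struct pktopts *po)
{
	uint32_t valid;
	int n = 0;

	if (po == NULL)
		return (0);
	valid = po->valid;
	if ((valid & OPT_VALID_HBH) != 0)
		n++;
	if ((valid & OPT_VALID_RTHDR) != 0)
		n++;
	if ((valid & OPT_VALID_DEST1) != 0)
		n++;
	return (n);
}
```

The cost of this design is that every setter has to keep the bitmask in sync with the pointers; the sketch does that in `pktopts_set_hbh()` only.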
git: b50abe6bd45d - main - namei: Treat non-tied KLDs as if they had INVARIANTS enabled
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=b50abe6bd45dde2baac130d4c4da097598c3b9c0

commit b50abe6bd45dde2baac130d4c4da097598c3b9c0
Author: Andrew Gallatin
AuthorDate: 2022-03-18 14:14:14 +
Commit: Andrew Gallatin
CommitDate: 2022-03-18 14:14:14 +

    namei: Treat non-tied KLDs as if they had INVARIANTS enabled

    When working with a vendor to debug their kernel module, I found
    that a non-tied kld which uses NDINIT will panic due to
    "namei: bad debugflags " on a kernel compiled with INVARIANTS
    because non-tied KLDs do not pick up the initialization that is
    done in NDINIT_DBG/NDREINIT_DBG().

    Fix this by making this initialization happen for non-KLD_TIED as
    well as INVARIANTS

    Reviewed by: mjg
    Sponsored by: Netflix
    Differential Revision: https://reviews.freebsd.org/D34588
---
 sys/sys/namei.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sys/sys/namei.h b/sys/sys/namei.h
index 23718dde5bed..98cbc2ca6ed9 100644
--- a/sys/sys/namei.h
+++ b/sys/sys/namei.h
@@ -228,8 +228,11 @@ int	cache_fplookup(struct nameidata *ndp, enum cache_fpl_status *status,
 /*
  * Note the constant pattern may *hide* bugs.
+ * Note also that we enable debug checks for non-TIED KLDs
+ * so that they can run on an INVARIANTS kernel without tripping over
+ * assertions on ni_debugflags state.
  */
-#ifdef INVARIANTS
+#if defined(INVARIANTS) || (defined(KLD_MODULE) && !defined(KLD_TIED))
 #define NDINIT_PREFILL(arg)	memset(arg, 0xff, offsetof(struct nameidata,	\
     ni_dvp_seqc))
 #define NDINIT_DBG(arg)		{ (arg)->ni_debugflags = NAMEI_DBG_INITED; }
git: a2fc8ade1057 - main - isci: use maxphys rather than 128KB to size s/g list
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=a2fc8ade10577cd35a6000fdb6e7dd7c570852d6

commit a2fc8ade10577cd35a6000fdb6e7dd7c570852d6
Author: Andrew Gallatin
AuthorDate: 2021-01-07 17:45:46 +
Commit: Andrew Gallatin
CommitDate: 2021-01-07 17:45:46 +

    isci: use maxphys rather than 128KB to size s/g list

    In the conversion of maxphys into a tunable, the size of the s/g
    list used by the driver was changed to be based on a hardcoded size
    of 128k rather than maxphys. This caused performance problems for
    us. Revert this to use the maxphys tunable.

    Note that this constant is used to size dynamically allocated
    things, and not static data structs, so this is safe.

    Reviewed By: imp, kib, mav
    Tested By: dhw
    Differential Revision: https://reviews.freebsd.org/D28023
    Sponsored by: Netflix
---
 sys/dev/isci/scil/sci_controller_constants.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/dev/isci/scil/sci_controller_constants.h b/sys/dev/isci/scil/sci_controller_constants.h
index 40f6b983601d..47712c531986 100644
--- a/sys/dev/isci/scil/sci_controller_constants.h
+++ b/sys/dev/isci/scil/sci_controller_constants.h
@@ -157,7 +157,7 @@ extern "C" {
  * posted to hardware always contain pairs of elements (with second
  * element set to zeroes if not needed).
  */
-#define __MAXPHYS_ELEMENTS ((128 * 1024 / PAGE_SIZE) + 1)
+#define __MAXPHYS_ELEMENTS ((maxphys / PAGE_SIZE) + 1)
 #define SCI_MAX_SCATTER_GATHER_ELEMENTS ((__MAXPHYS_ELEMENTS + 1) & ~0x1)
 #endif
___
dev-commits-src-main@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/dev-commits-src-main
To unsubscribe, send any mail to "dev-commits-src-main-unsubscr...@freebsd.org"
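The arithmetic behind the two macros in the diff can be worked through with a small function. `sg_elements()` is a hypothetical helper written for this illustration; it takes the I/O limit and page size as parameters instead of reading `maxphys` and `PAGE_SIZE`, and the sizes used in the usage note are example values only.

```c
#include <stddef.h>

/*
 * Number of s/g elements for a transfer of up to max_io bytes,
 * following the two macros: one element per page plus one extra,
 * then rounded up to an even count because elements are posted to
 * hardware in pairs.
 */
static size_t
sg_elements(size_t max_io, size_t page_size)
{
	size_t n = (max_io / page_size) + 1;

	return ((n + 1) & ~(size_t)1);
}
```

With 4096-byte pages, a 128 KiB limit gives 32 + 1 = 33, rounded up to 34 elements, while a 1 MiB limit gives 256 + 1 = 257, rounded up to 258. A list sized for the smaller limit cannot describe a single transfer of the larger size.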
git: 52cd25eb1aa7 - main - mbuf: enable ext_pgs ("unmapped") mbufs by default
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=52cd25eb1aa75a28f6d3c3eb4757242c1f55d6cc

commit 52cd25eb1aa75a28f6d3c3eb4757242c1f55d6cc
Author: Andrew Gallatin
AuthorDate: 2021-01-08 18:18:42 +
Commit: Andrew Gallatin
CommitDate: 2021-01-08 18:43:30 +

    mbuf: enable ext_pgs ("unmapped") mbufs by default

    Ext_pg mbufs allow carrying multiple pages per mbuf. This reduces
    mbuf linked list traversals, especially in socket buffers, thereby
    reducing cache misses and CPU use for applications using sendfile.
    Note that ext_pg mbufs use unmapped pages, eliminating KVA mapping
    costs on 32-bit platforms.

    Ext_pg mbufs are also required for ktls (KERN_TLS), and having them
    disabled by default is a stumbling block for those wishing to
    enable ktls.

    Reviewed-by: jhb, glebius
    Sponsored by: Netflix
---
 sys/kern/kern_mbuf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/kern/kern_mbuf.c b/sys/kern/kern_mbuf.c
index 84e068424427..a46c576bad90 100644
--- a/sys/kern/kern_mbuf.c
+++ b/sys/kern/kern_mbuf.c
@@ -116,7 +116,7 @@ int nmbjumbop;			/* limits number of page size jumbo clusters */
 int nmbjumbo9;			/* limits number of 9k jumbo clusters */
 int nmbjumbo16;			/* limits number of 16k jumbo clusters */
-bool mb_use_ext_pgs;		/* use M_EXTPG mbufs for sendfile & TLS */
+bool mb_use_ext_pgs = true;	/* use M_EXTPG mbufs for sendfile & TLS */
 SYSCTL_BOOL(_kern_ipc, OID_AUTO, mb_use_ext_pgs, CTLFLAG_RWTUN,
     &mb_use_ext_pgs, 0,
     "Use unmapped mbufs for sendfile(2) and TLS offload");
git: 7eaea04a5bb1 - main - amd64: compare TLB shootdown target to all_cpus
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=7eaea04a5bb1dc86c43ce046311e1c1a042994d3

commit 7eaea04a5bb1dc86c43ce046311e1c1a042994d3
Author: Andrew Gallatin
AuthorDate: 2021-01-12 01:03:37 +
Commit: Andrew Gallatin
CommitDate: 2021-01-12 01:09:32 +

    amd64: compare TLB shootdown target to all_cpus

    On amd64, the pmap code passes all_cpus to
    smp_targeted_tlb_shootdown() when unmapping from the kernel pmap.
    This function has an optimized path to send IPIs to all but
    itself, which it intends to do when the target is all cpus.
    However, we need to compare the target cpu mask with all_cpus,
    rather than using CPU_ISFULLSET(). Comparing with CPU_ISFULLSET()
    will only work when we have MAXCPU cpus active in the system;
    otherwise, we'll be sending repeated IPIs, rather than a single IPI
    to all CPUs but ourselves.

    Fixing this should reduce the time spent in
    native_lapic_ipi_wait() as we will be sending IPIs in parallel,
    rather than one-by-one. This is confirmed by dtrace.

    Reviewed by: alc, jhb, kib, markj
    Sponsored by: Netflix
    Differential Revision: https://reviews.freebsd.org/D28102
---
 sys/amd64/amd64/mp_machdep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/amd64/amd64/mp_machdep.c b/sys/amd64/amd64/mp_machdep.c
index 63777014e151..794a11bf1276 100644
--- a/sys/amd64/amd64/mp_machdep.c
+++ b/sys/amd64/amd64/mp_machdep.c
@@ -673,7 +673,7 @@ smp_targeted_tlb_shootdown(cpuset_t mask, pmap_t pmap, vm_offset_t addr1,
 	/*
 	 * Check for other cpus. Return if none.
 	 */
-	if (CPU_ISFULLSET(&mask)) {
+	if (!CPU_CMP(&mask, &all_cpus)) {
 		if (mp_ncpus <= 1)
 			goto local_cb;
 	} else {
@@ -719,7 +719,7 @@ smp_targeted_tlb_shootdown(cpuset_t mask, pmap_t pmap, vm_offset_t addr1,
 	 * (zeroing slot) and reading from it below (wait for
 	 * acknowledgment).
 	 */
-	if (CPU_ISFULLSET(&mask)) {
+	if (!CPU_CMP(&mask, &all_cpus)) {
 		ipi_all_but_self(IPI_INVLOP);
 		other_cpus = all_cpus;
 		CPU_CLR(PCPU_GET(cpuid), &other_cpus);
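The distinction this commit message draws between "every bit set" and "equal to the set of present CPUs" can be shown with a toy bitmask. The sketch below is userspace C with hypothetical names (`toy_cpuset_t`, `SKETCH_MAXCPU`, and the `toy_*` helpers); it does not use the kernel's cpuset(9) macros. Note that in the real API `CPU_CMP()` is nonzero when the sets differ, which is why the diff negates it.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAXCPU	64	/* hypothetical compile-time CPU limit */

/* A toy cpuset: bit i is set when CPU i is in the set. */
typedef uint64_t toy_cpuset_t;

/* True only if every one of the SKETCH_MAXCPU bits is set. */
static bool
toy_isfullset(toy_cpuset_t mask)
{
	return (mask == UINT64_MAX);
}

/* Set containing CPUs 0..ncpus-1, i.e. the CPUs actually present. */
static toy_cpuset_t
toy_all_cpus(unsigned ncpus)
{
	return (ncpus >= SKETCH_MAXCPU ? UINT64_MAX :
	    (((toy_cpuset_t)1 << ncpus) - 1));
}

/* The intended test: does the target equal the set of present CPUs? */
static bool
toy_targets_everyone(toy_cpuset_t mask, unsigned ncpus)
{
	return (mask == toy_all_cpus(ncpus));
}
```

On a 4-CPU system in this model, the "all CPUs" mask has only the low four bits set, so the full-set test is false even though every present CPU is targeted.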
Re: git: 7eaea04a5bb1 - main - amd64: compare TLB shootdown target to all_cpus
On 1/12/21 12:59 AM, Mateusz Guzik wrote:
> This makes my 2 core vm crash on boot:
>
> Launching APs: 1
> Timecounter "TSC-low" frequency 1346899854 Hz quality 1000
> panic: IPI scoreboard is zero, initiator 1 target 1

Ugh, sorry for the breakage & thanks for the fix. That's what I get for
not testing enough.

Drew
git: efa9c21bca98 - main - KTLS: Enable KERN_TLS in GENERIC on amd64
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=efa9c21bca9873af9c9660f5aeffda9d5ae1dfb7

commit efa9c21bca9873af9c9660f5aeffda9d5ae1dfb7
Author: Andrew Gallatin
AuthorDate: 2021-01-14 17:44:06 +
Commit: Andrew Gallatin
CommitDate: 2021-01-18 18:29:10 +

    KTLS: Enable KERN_TLS in GENERIC on amd64

    Based on discussions on freebsd-arch@, enable KERN_TLS in GENERIC
    on amd64, but leave it disabled via the sysctl
    kern.ipc.tls.enable. Users wishing to enable ktls must set
    kern.ipc.tls.enable=1.

    While here, fix wording in NOTES to mention that KERN_TLS also
    does receive now.

    Sponsored by: Netflix
    Reviewed by: allanjude
    Differential Revision: https://reviews.freebsd.org/D28163
---
 sys/amd64/conf/GENERIC | 1 +
 sys/conf/NOTES | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
index c9ab23bb91b5..9f55a935f8a5 100644
--- a/sys/amd64/conf/GENERIC
+++ b/sys/amd64/conf/GENERIC
@@ -37,6 +37,7 @@ options 	TCP_BLACKBOX		# Enhanced TCP event logging
 options 	TCP_HHOOK		# hhook(9) framework for TCP
 options 	TCP_RFC7413		# TCP Fast Open
 options 	SCTP_SUPPORT		# Allow kldload of SCTP
+options 	KERN_TLS		# TLS transmit & receive offload
 options 	FFS			# Berkeley Fast Filesystem
 options 	SOFTUPDATES		# Enable FFS soft updates support
 options 	UFS_ACL			# Support for access control lists
diff --git a/sys/conf/NOTES b/sys/conf/NOTES
index 1a8059a2e5c0..b4202bb65618 100644
--- a/sys/conf/NOTES
+++ b/sys/conf/NOTES
@@ -666,8 +666,8 @@ options 	IPSEC_SUPPORT
 #options 	IPSEC_DEBUG		#debug for IP security
-# TLS framing and encryption of data transmitted over TCP sockets.
-options 	KERN_TLS		# TLS transmit offload
+# TLS framing and encryption/decryption of data over TCP sockets.
+options 	KERN_TLS		# TLS transmit and receive offload
 #
 # SMB/CIFS requester
git: 0c864213ef1e - main - iflib: Fix a NULL pointer deref
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=0c864213ef1ee440411e3bb6437ecc04273db86b

commit 0c864213ef1ee440411e3bb6437ecc04273db86b
Author: Andrew Gallatin
AuthorDate: 2021-01-21 14:45:15 +
Commit: Andrew Gallatin
CommitDate: 2021-01-21 14:47:06 +

    iflib: Fix a NULL pointer deref

    rxd_frag_to_sd() can be called with a NULL pf_rv parameter in the
    current code. This patch guards the store through pf_rv in that
    case, thus avoiding a possible panic.

    Submitted by: rajesh1.kumar at amd.com
    Reviewed by: gallatin
    Differential Revision: https://reviews.freebsd.org/D28115
---
 sys/net/iflib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sys/net/iflib.c b/sys/net/iflib.c
index 4b4952122d1e..ea2c5789a7b5 100644
--- a/sys/net/iflib.c
+++ b/sys/net/iflib.c
@@ -2654,7 +2654,8 @@ rxd_frag_to_sd(iflib_rxq_t rxq, if_rxd_frag_t irf, bool unload, if_rxsd_t sd,
 		}
 	} else {
 		fl->ifl_sds.ifsd_m[cidx] = NULL;
-		*pf_rv = PFIL_PASS;
+		if (pf_rv != NULL)
+			*pf_rv = PFIL_PASS;
 	}
 	if (unload && irf->irf_len != 0)
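The fix above is an instance of the usual guard for an optional out-parameter. A minimal sketch, with the hypothetical names `frag_to_len()` and `SKETCH_PASS` standing in for the iflib function and `PFIL_PASS`:

```c
#include <stddef.h>

#define SKETCH_PASS	0	/* hypothetical "packet accepted" verdict */

/*
 * Report a verdict through an optional out-parameter.  Callers that do
 * not care about the verdict pass NULL, so the store must be guarded.
 */
static int
frag_to_len(int frag_len, int *verdict)
{
	if (verdict != NULL)
		*verdict = SKETCH_PASS;
	return (frag_len);
}
```

Both `frag_to_len(100, NULL)` and `frag_to_len(64, &v)` are valid calls in this sketch; only the second one writes a verdict.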
git: 3183d0b68072 - main - iflib: initialize LRO unconditionally
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=3183d0b68072dda0e80bb6e03c970625f2823e97

commit 3183d0b68072dda0e80bb6e03c970625f2823e97
Author: Andrew Gallatin
AuthorDate: 2021-04-23 09:51:22 +
Commit: Andrew Gallatin
CommitDate: 2021-04-23 09:55:20 +

    iflib: initialize LRO unconditionally

    Changes to the LRO code have exposed a bug in iflib where devices
    which are not capable of doing LRO are still calling
    tcp_lro_flush_all(), even when they have not initialized the LRO
    context. This used to be mostly harmless, but the LRO code now
    sets the VNET based on the ifp in the lro context and will try to
    access it through a NULL ifp resulting in a panic at boot.

    To fix this, we unconditionally initialize LRO so that we have a
    valid LRO context when calling tcp_lro_flush_all(). One
    alternative is to check the device capabilities before calling
    tcp_lro_flush_all() or adding a new state flag in the ctx.
    However, it seems unwise to add an extra, mostly useless test for
    higher performance devices when we can just initialize LRO for all
    devices.
Reviewed by: erj, hselasky, markj, olivier Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29928 --- sys/net/iflib.c | 22 +- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/sys/net/iflib.c b/sys/net/iflib.c index 6dbaff556a15..fc0814d0fc19 100644 --- a/sys/net/iflib.c +++ b/sys/net/iflib.c @@ -5891,15 +5891,13 @@ iflib_rx_structures_setup(if_ctx_t ctx) for (q = 0; q < ctx->ifc_softc_ctx.isc_nrxqsets; q++, rxq++) { #if defined(INET6) || defined(INET) - if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_LRO) { - err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp, - TCP_LRO_ENTRIES, min(1024, - ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset])); - if (err != 0) { - device_printf(ctx->ifc_dev, - "LRO Initialization failed!\n"); - goto fail; - } + err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp, + TCP_LRO_ENTRIES, min(1024, + ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset])); + if (err != 0) { + device_printf(ctx->ifc_dev, + "LRO Initialization failed!\n"); + goto fail; } #endif IFDI_RXQ_SETUP(ctx, rxq->ifr_id); @@ -5914,8 +5912,7 @@ fail: */ rxq = ctx->ifc_rxqs; for (i = 0; i < q; ++i, rxq++) { - if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_LRO) - tcp_lro_free(&rxq->ifr_lc); + tcp_lro_free(&rxq->ifr_lc); } return (err); #endif @@ -5938,8 +5935,7 @@ iflib_rx_structures_free(if_ctx_t ctx) iflib_dma_free(&rxq->ifr_ifdi[j]); iflib_rx_sds_free(rxq); #if defined(INET6) || defined(INET) - if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_LRO) - tcp_lro_free(&rxq->ifr_lc); + tcp_lro_free(&rxq->ifr_lc); #endif } free(ctx->ifc_rxqs, M_IFLIB); ___ dev-commits-src-main@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/dev-commits-src-main To unsubscribe, send any mail to "dev-commits-src-main-unsubscr...@freebsd.org"
git: 086a35562f47 - main - tcp: enter network epoch when calling tfb_tcp_fb_fini
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=086a35562f47917a516d30acc8b78a4884e31a4f

commit 086a35562f47917a516d30acc8b78a4884e31a4f
Author: Andrew Gallatin
AuthorDate: 2021-05-25 17:45:37 +
Commit: Andrew Gallatin
CommitDate: 2021-05-25 17:45:37 +

    tcp: enter network epoch when calling tfb_tcp_fb_fini

    We need to enter the network epoch when calling into
    tfb_tcp_fb_fini. I noticed this when I hit an assert running the
    latest rack

    Differential Revision: https://reviews.freebsd.org/D30407
    Reviewed by: rrs, tuexen
    Sponsored by: Netflix
---
 sys/netinet/tcp_usrreq.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/sys/netinet/tcp_usrreq.c b/sys/netinet/tcp_usrreq.c
index 4f418f8809a7..caef798772ea 100644
--- a/sys/netinet/tcp_usrreq.c
+++ b/sys/netinet/tcp_usrreq.c
@@ -1818,11 +1818,14 @@ tcp_ctloutput(struct socket *so, struct sockopt *sopt)
 			 * new one already.
 			 */
 			if (tp->t_fb->tfb_tcp_fb_fini) {
+				struct epoch_tracker et;
 				/*
 				 * Tell the stack to cleanup with 0 i.e.
 				 * the tcb is not going away.
 				 */
+				NET_EPOCH_ENTER(et);
 				(*tp->t_fb->tfb_tcp_fb_fini)(tp, 0);
+				NET_EPOCH_EXIT(et);
 			}
 #ifdef TCPHPTS
 			/* Assure that we are not on any hpts */
git: df8437a93dd5 - main - cxgbe: fix enabling lro & rxtimestamps
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=df8437a93dd5268e5bfd06411c01a5cbdb38c6ac

commit df8437a93dd5268e5bfd06411c01a5cbdb38c6ac
Author: Andrew Gallatin
AuthorDate: 2021-05-26 13:54:26 +
Commit: Andrew Gallatin
CommitDate: 2021-05-26 14:00:07 +

    cxgbe: fix enabling lro & rxtimestamps

    A recent change caused iq flags, like LRO, to be set before
    init_iq(). However, init_iq() clears those flags, so they became
    effectively impossible to set. This change moves the
    initialization of these flags to after the call to init_iq().
    This fixes LRO.

    Differential Revision: https://reviews.freebsd.org/D30460
    Reviewed by: np, rrs
    Sponsored by: Netflix
    Fixes: 43bbae19483fbde0a91e61acad8a6e71e334c8b8 <https://reviews.freebsd.org/R10:43bbae19483fbde0a91e61acad8a6e71e334c8b8>
---
 sys/dev/cxgbe/t4_sge.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/sys/dev/cxgbe/t4_sge.c b/sys/dev/cxgbe/t4_sge.c
index 4b685129193e..0b429c602a91 100644
--- a/sys/dev/cxgbe/t4_sge.c
+++ b/sys/dev/cxgbe/t4_sge.c
@@ -3938,12 +3938,7 @@ alloc_rxq(struct vi_info *vi, struct sge_rxq *rxq, int idx, int intr_idx,
 	if (rc != 0)
 		return (rc);
 	MPASS(rxq->lro.ifp == ifp);	/* also indicates LRO init'ed */
-
-	if (ifp->if_capenable & IFCAP_LRO)
-		rxq->iq.flags |= IQ_LRO_ENABLED;
 #endif
-	if (ifp->if_capenable & IFCAP_HWRXTSTMP)
-		rxq->iq.flags |= IQ_RX_TIMESTAMP;
 	rxq->ifp = ifp;
 	snprintf(name, sizeof(name), "%d", idx);
@@ -3953,6 +3948,12 @@ alloc_rxq(struct vi_info *vi, struct sge_rxq *rxq, int idx, int intr_idx,
 	init_iq(&rxq->iq, sc, vi->tmr_idx, vi->pktc_idx, vi->qsize_rxq,
 	    intr_idx, tnl_cong(vi->pi, cong_drop));
+#if defined(INET) || defined(INET6)
+	if (ifp->if_capenable & IFCAP_LRO)
+		rxq->iq.flags |= IQ_LRO_ENABLED;
+#endif
+	if (ifp->if_capenable & IFCAP_HWRXTSTMP)
+		rxq->iq.flags |= IQ_RX_TIMESTAMP;
 	snprintf(name, sizeof(name), "%s rxq%d-fl",
 	    device_get_nameunit(vi->dev), idx);
 	init_fl(sc, &rxq->fl, vi->qsize_rxq / 8, maxp, name);
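The ordering bug in the cxgbe commit can be reproduced in a few lines of userspace C. All names here (`struct toy_iq`, `toy_init_iq()`, the `Q_*` flags) are hypothetical, and the init routine is assumed to reset the whole structure with `memset()`; how the real `init_iq()` clears the flags is not shown in the diff.

```c
#include <string.h>

#define Q_LRO_ENABLED	0x1	/* hypothetical flag bits */
#define Q_RX_TIMESTAMP	0x2

struct toy_iq {
	int flags;
	int qsize;
};

/* Assumed behaviour: the init routine resets the whole structure. */
static void
toy_init_iq(struct toy_iq *iq, int qsize)
{
	memset(iq, 0, sizeof(*iq));
	iq->qsize = qsize;
}

/* Buggy order: the flag is set, then wiped by the init call. */
static int
setup_flags_before_init(int want_lro)
{
	struct toy_iq iq;

	iq.flags = want_lro ? Q_LRO_ENABLED : 0;
	toy_init_iq(&iq, 1024);
	return (iq.flags);
}

/* Fixed order: initialize first, then apply the capability flags. */
static int
setup_flags_after_init(int want_lro)
{
	struct toy_iq iq;

	toy_init_iq(&iq, 1024);
	if (want_lro)
		iq.flags |= Q_LRO_ENABLED;
	return (iq.flags);
}
```

In the buggy variant the requested flag is silently lost, which matches the "effectively impossible to set" symptom from the commit message.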
git: ed5e13cfc268 - main - ktls: Fix interaction with RATELIMIT
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=ed5e13cfc2689049ce415dad5057923bc7214a41

commit ed5e13cfc2689049ce415dad5057923bc7214a41
Author: Andrew Gallatin
AuthorDate: 2021-06-14 14:46:13 +
Commit: Andrew Gallatin
CommitDate: 2021-06-14 14:51:16 +

    ktls: Fix interaction with RATELIMIT

    uipc_ktls.c was missing opt_ratelimit.h, so it was never noticing
    that RATELIMIT was enabled. Once it was enabled, it failed to
    compile as ktls_modify_txrtlmt() had accrued a compilation error
    when it was not being compiled in.

    Sponsored by: Netflix
---
 sys/kern/uipc_ktls.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c
index b0d7ea8016dd..2ab2ef18446b 100644
--- a/sys/kern/uipc_ktls.c
+++ b/sys/kern/uipc_ktls.c
@@ -30,6 +30,7 @@ __FBSDID("$FreeBSD$");
 #include "opt_inet.h"
 #include "opt_inet6.h"
+#include "opt_ratelimit.h"
 #include "opt_rss.h"
 #include
@@ -1399,7 +1400,6 @@ ktls_modify_txrtlmt(struct ktls_session *tls, uint64_t max_pacing_rate)
 	};
 	struct m_snd_tag *mst;
 	struct ifnet *ifp;
-	int error;
 	/* Can't get to the inp, but it should be locked. */
 	/* INP_LOCK_ASSERT(inp); */
git: 517a7adb1160 - main - Make hwpmc work for userspace binaries again
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=517a7adb1160850746227e4cc30d4bcc3ff04d7d

commit 517a7adb1160850746227e4cc30d4bcc3ff04d7d
Author: Andrew Gallatin
AuthorDate: 2021-12-15 13:38:36 +
Commit: Andrew Gallatin
CommitDate: 2021-12-15 13:38:36 +

    Make hwpmc work for userspace binaries again

    hwpmc has been utterly broken for userspace binaries, and has been
    labeling all samples from userspace binaries as dubious frames. The
    issues are that:

    - The check for ph.p_offset & (-ph.p_align) == 0 was mostly bogus.
      The intent was to ignore all executable segments other than the
      first, which when using BFD appeared in the first page, but with
      current LLD a read-only data segment appears before the
      executable segment, pushing the latter into the second page or
      later. This meant no executable segment was ever found, and thus
      pi_vaddr remained 0. Instead of relying on BFD's layout, track
      whether we've seen an executable segment explicitly with a local
      bool.

    - Shared libraries were not parsing the segments to calculate
      pi_vaddr, resulting in it always being 0. Again, when using BFD,
      the executable segment started at the first page, and so
      pi_vaddr was genuinely meant to be 0, but not with LLD's current
      layout. This meant that pmcstat_image_link's offset calculation
      gave the base address of the segment in memory, rather than the
      base address of the whole library in memory, and so when adding
      that to pi_start/pi_end to get the range of the executable
      sections in memory it double-counted the offset of the first
      executable segment within the library. Thus we need to do the
      exact same parsing for ET_DYN as we do for ET_EXEC, which is
      simpler to write as special-casing ET_REL to not look for
      segments.
Note that, whilst PT_INTERP isn't needed for shared libraries, it will be for PIEs, which pmcstat still fails to handle due to not knowing the base address of the PIE; we get the base address for libraries by MAP_IN events, and for rtld by virtue of the process's entry address being rtld's, but have no equivalent for the executable. Fixes courtesy of jrtc27@. Reviewed by: jrtc27, jhb (earlier version) Differential Revision: https://reviews.freebsd.org/D33055 Sponsored by: Netflix --- lib/libpmcstat/libpmcstat_image.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/lib/libpmcstat/libpmcstat_image.c b/lib/libpmcstat/libpmcstat_image.c index 9ee7097e95ec..97109f203806 100644 --- a/lib/libpmcstat/libpmcstat_image.c +++ b/lib/libpmcstat/libpmcstat_image.c @@ -43,6 +43,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include @@ -295,6 +296,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image, size_t i, nph, nsh; const char *path, *elfbase; char *p, *endp; + bool first_exec_segment; uintfptr_t minva, maxva; Elf *e; Elf_Scn *scn; @@ -384,7 +386,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image, * loaded. Additionally, for dynamically linked executables, * save the pathname to the runtime linker. */ - if (eh.e_type == ET_EXEC) { + if (eh.e_type != ET_REL) { if (elf_getphnum(e, &nph) == 0) { warnx( "WARNING: Could not determine the number of program headers in \"%s\": %s.", @@ -392,6 +394,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image, elf_errmsg(-1)); goto done; } + first_exec_segment = true; for (i = 0; i < eh.e_phnum; i++) { if (gelf_getphdr(e, i, &ph) != &ph) { warnx( @@ -416,8 +419,10 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image, break; case PT_LOAD: if ((ph.p_flags & PF_X) != 0 && - (ph.p_offset & (-ph.p_align)) == 0) + first_exec_segment) { image->pi_vaddr = ph.p_vaddr & (-ph.p_align); + first_exec_segment = false; + } break; } }
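To illustrate the PT_LOAD change above in isolation, here is a minimal userland sketch of the "first executable segment" selection. The struct seg type and the first_exec_vaddr() name are hypothetical stand-ins for the GElf program header and the libpmcstat code; they are not part of the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an ELF program header. */
struct seg {
	bool		loadable;	/* segment type is PT_LOAD */
	bool		executable;	/* PF_X is set in p_flags */
	uint64_t	vaddr;		/* p_vaddr */
	uint64_t	align;		/* p_align, assumed a power of two */
};

/*
 * Return the aligned base address of the first executable loadable
 * segment, or 0 if no such segment exists.  Later executable segments
 * are ignored, mirroring the first_exec_segment flag in the diff.
 */
static uint64_t
first_exec_vaddr(const struct seg *segs, size_t nsegs)
{
	bool first_exec_segment = true;
	uint64_t vaddr = 0;

	for (size_t i = 0; i < nsegs; i++) {
		if (segs[i].loadable && segs[i].executable &&
		    first_exec_segment) {
			assert(segs[i].align != 0);
			/* -align equals ~(align - 1) for a power of two. */
			vaddr = segs[i].vaddr & -segs[i].align;
			first_exec_segment = false;
		}
	}
	return (vaddr);
}
```

With an LLD-style layout, where a read-only data segment precedes the executable one, the result is the page-aligned address of the executable segment rather than 0.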
git: 588f03ec9b9e - main - bectl: Improve error message when ZFS root is not found.
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=588f03ec9b9ebd3c17b3e978140ff3f3e4bcad73 commit 588f03ec9b9ebd3c17b3e978140ff3f3e4bcad73 Author: Andrew Gallatin AuthorDate: 2023-03-30 21:57:26 + Commit: Andrew Gallatin CommitDate: 2023-03-31 14:27:38 + bectl: Improve error message when ZFS root is not found. When recovering a system that is unbootable due to some problem with the active BE, it is likely you'll be booted from a rescue image running UFS. In this case, bectl needs help finding the zpool root that you want to operate on. In this case, improve the error message to suggest specifying a root, rather than just emitting a generic error message that might imply, to the naive user, that there is a ZFS compatibility issue between the rescue image and the on-disk ZFS pool. Reviewed by: imp, kevans Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D39346 --- sbin/bectl/bectl.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/sbin/bectl/bectl.c b/sbin/bectl/bectl.c index 2b7af4e55419..814b98ba8a8a 100644 --- a/sbin/bectl/bectl.c +++ b/sbin/bectl/bectl.c @@ -584,9 +584,13 @@ main(int argc, char *argv[]) } if ((be = libbe_init(root)) == NULL) { - if (!cmd->silent) + if (!cmd->silent) { fprintf(stderr, "libbe_init(\"%s\") failed.\n", root != NULL ? root : ""); + if (root == NULL) + fprintf(stderr, + "Try specifying ZFS root using -r.\n"); + } return (-1); }
git: 8b0dafdb2f18 - main - vm: implement vm_page_reclaim_contig_domain_ext()
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=8b0dafdb2f18b9bdc464a4ddbcfd749c3d3875f1 commit 8b0dafdb2f18b9bdc464a4ddbcfd749c3d3875f1 Author: Andrew Gallatin AuthorDate: 2023-05-08 13:25:40 + Commit: Andrew Gallatin CommitDate: 2023-05-09 17:09:34 + vm: implement vm_page_reclaim_contig_domain_ext() Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple contiguous regions at once. This makes reclamation more efficient for users that need multiple contiguous regions. This is needed because callers like ktls may need to reclaim many contiguous regions, and each scan of physical memory can take multiple seconds on a large memory machine (order of 100GB of RAM). Rather than modifying the core algorithm, I extended vm_page_reclaim_contig_domain() to take a "desired_runs" argument to allow the caller to request that it reclaim more than just a single run. There is no functional change intended for all existing callers. The first user for this interface is the ktls code (https://reviews.freebsd.org/D39421). By reclaiming multiple runs, ktls goes from consuming hours of CPU to refill its buffer zone to just seconds or minutes. Differential Revision: https://reviews.freebsd.org/D39739 Sponsored by: Netflix Reviewed by: alc, jhb, markj --- sys/vm/vm_page.c | 69 +++- sys/vm/vm_page.h | 3 +++ 2 files changed, 56 insertions(+), 16 deletions(-) diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index 90413f235ec0..4b967a94aa1f 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -2995,9 +2995,7 @@ unlock: #define NRUNS 16 -CTASSERT(powerof2(NRUNS)); - -#define RUN_INDEX(count) ((count) & (NRUNS - 1)) +#define RUN_INDEX(count, nruns) ((count) % (nruns)) #define MIN_RECLAIM 8 @@ -3025,19 +3023,42 @@ CTASSERT(powerof2(NRUNS)); * must be a power of two. 
*/ bool -vm_page_reclaim_contig_domain(int domain, int req, u_long npages, -vm_paddr_t low, vm_paddr_t high, u_long alignment, vm_paddr_t boundary) +vm_page_reclaim_contig_domain_ext(int domain, int req, u_long npages, +vm_paddr_t low, vm_paddr_t high, u_long alignment, vm_paddr_t boundary, +int desired_runs) { struct vm_domain *vmd; vm_paddr_t curr_low; - vm_page_t m_run, m_runs[NRUNS]; + vm_page_t m_run, _m_runs[NRUNS], *m_runs; u_long count, minalign, reclaimed; - int error, i, options, req_class; + int error, i, min_reclaim, nruns, options, req_class; + bool ret; KASSERT(npages > 0, ("npages is 0")); KASSERT(powerof2(alignment), ("alignment is not a power of 2")); KASSERT(powerof2(boundary), ("boundary is not a power of 2")); + ret = false; + + /* +* If the caller wants to reclaim multiple runs, try to allocate +* space to store the runs. If that fails, fall back to the old +* behavior of just reclaiming MIN_RECLAIM pages. +*/ + if (desired_runs > 1) + m_runs = malloc((NRUNS + desired_runs) * sizeof(*m_runs), + M_TEMP, M_NOWAIT); + else + m_runs = NULL; + + if (m_runs == NULL) { + m_runs = _m_runs; + nruns = NRUNS; + } else { + nruns = NRUNS + desired_runs - 1; + } + min_reclaim = MAX(desired_runs * npages, MIN_RECLAIM); + /* * The caller will attempt an allocation after some runs have been * reclaimed and added to the vm_phys buddy lists. 
Due to limitations @@ -3066,7 +3087,7 @@ vm_page_reclaim_contig_domain(int domain, int req, u_long npages, if (count < npages + vmd->vmd_free_reserved || (count < npages + vmd->vmd_interrupt_free_min && req_class == VM_ALLOC_SYSTEM) || (count < npages && req_class == VM_ALLOC_INTERRUPT)) - return (false); + goto done; /* * Scan up to three times, relaxing the restrictions ("options") on @@ -3085,27 +3106,29 @@ vm_page_reclaim_contig_domain(int domain, int req, u_long npages, if (m_run == NULL) break; curr_low = VM_PAGE_TO_PHYS(m_run) + ptoa(npages); - m_runs[RUN_INDEX(count)] = m_run; + m_runs[RUN_INDEX(count, nruns)] = m_run; count++; } /* * Reclaim the highest runs in LIFO (descending) order until * the number of reclaimed pages, "reclaimed", is at least -* MIN_RECLAIM. Reset "reclaimed" each time because
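The RUN_INDEX() change in the diff above swaps a power-of-two mask for a modulo, since nruns = NRUNS + desired_runs - 1 is no longer guaranteed to be a power of two. The following is a minimal userland sketch of that ring indexing only; run_index() and record_runs() are hypothetical names, and the stored values are plain sequence numbers rather than vm_page_t pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Ring slot for the count-th recorded run; nruns may be any positive size. */
static size_t
run_index(size_t count, size_t nruns)
{
	assert(nruns > 0);
	return (count % nruns);
}

/*
 * Record "total" candidate runs into a ring of "nruns" slots, overwriting
 * the oldest entries once the ring is full, and return how many of the
 * most recent runs remain available in the ring.
 */
static size_t
record_runs(size_t *ring, size_t nruns, size_t total)
{
	for (size_t count = 0; count < total; count++)
		ring[run_index(count, nruns)] = count;
	return (total < nruns ? total : nruns);
}
```

For example, with NRUNS of 16 and a desired_runs of 4 the ring has 19 slots, a size for which the old `count & (NRUNS - 1)` style mask would not address every slot.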
git: 198558523361 - main - ktls: re-work alloc thread
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=198558523361a654409b6d3f8d63c12ba3f72ae5 commit 198558523361a654409b6d3f8d63c12ba3f72ae5 Author: Andrew Gallatin AuthorDate: 2023-05-08 13:38:59 + Commit: Andrew Gallatin CommitDate: 2023-05-09 17:09:34 + ktls: re-work alloc thread When the ktls_buffer zone needs to expand, it may fail due to a lack of physically contiguous memory. We tried to rectify that by introducing an alloc thread to provide a context where it is harmless to sleep, and letting that thread repopulate the ktls_buffer zone. However, it turns out that M_WAITOK is not enough, and we must call vm_page_reclaim_contig_domain() to reclaim contig memory. Worse, M_WAITOK results in the allocation essentially busy-looping around vm_domain_alloc_fail() returning EAGAIN, causing vm_page_alloc_noobj_contig_domain() to loop and resulting in the alloc thread consuming 100% CPU. To fix this, we change the alloc thread to call vm_page_reclaim_contig_domain_ext(). In order to prevent the busy loop around vm_domain_alloc_fail(), we must change the uma_zalloc flags to M_NORECLAIM | M_NOWAIT. However, once that is done, these allocations become no different than the allocations done in the critical path in ktls_buffer_alloc(), so it's best to just eliminate them. Since we're no longer doing allocations but just calling vm_page_reclaim_contig_domain_ext(), the name has changed to the ktls reclaim thread. 
Reviewed by: jhb, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D39421 --- sys/kern/uipc_ktls.c | 82 ++-- 1 file changed, 34 insertions(+), 48 deletions(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 4639355b1558..1e892dde9022 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -88,9 +88,9 @@ struct ktls_wq { int lastallocfail; } __aligned(CACHE_LINE_SIZE); -struct ktls_alloc_thread { +struct ktls_reclaim_thread { uint64_t wakeups; - uint64_t allocs; + uint64_t reclaims; struct thread *td; int running; }; @@ -98,7 +98,7 @@ struct ktls_alloc_thread { struct ktls_domain_info { int count; int cpu[MAXCPU]; - struct ktls_alloc_thread alloc_td; + struct ktls_reclaim_thread reclaim_td; }; struct ktls_domain_info ktls_domains[MAXMEMDOM]; @@ -154,10 +154,10 @@ SYSCTL_BOOL(_kern_ipc_tls, OID_AUTO, sw_buffer_cache, CTLFLAG_RDTUN, &ktls_sw_buffer_cache, 1, "Enable caching of output buffers for SW encryption"); -static int ktls_max_alloc = 128; -SYSCTL_INT(_kern_ipc_tls, OID_AUTO, max_alloc, CTLFLAG_RWTUN, -&ktls_max_alloc, 128, -"Max number of 16k buffers to allocate in thread context"); +static int ktls_max_reclaim = 1024; +SYSCTL_INT(_kern_ipc_tls, OID_AUTO, max_reclaim, CTLFLAG_RWTUN, +&ktls_max_reclaim, 128, +"Max number of 16k buffers to reclaim in thread context"); static COUNTER_U64_DEFINE_EARLY(ktls_tasks_active); SYSCTL_COUNTER_U64(_kern_ipc_tls, OID_AUTO, tasks_active, CTLFLAG_RD, @@ -303,7 +303,7 @@ static MALLOC_DEFINE(M_KTLS, "ktls", "Kernel TLS"); static void ktls_reset_receive_tag(void *context, int pending); static void ktls_reset_send_tag(void *context, int pending); static void ktls_work_thread(void *ctx); -static void ktls_alloc_thread(void *ctx); +static void ktls_reclaim_thread(void *ctx); static u_int ktls_get_cpu(struct socket *so) @@ -454,12 +454,12 @@ ktls_init(void) continue; if (CPU_EMPTY(&cpuset_domain[domain])) continue; - error = kproc_kthread_add(ktls_alloc_thread, + error = 
kproc_kthread_add(ktls_reclaim_thread, &ktls_domains[domain], &ktls_proc, - &ktls_domains[domain].alloc_td.td, - 0, 0, "KTLS", "alloc_%d", domain); + &ktls_domains[domain].reclaim_td.td, + 0, 0, "KTLS", "reclaim_%d", domain); if (error) { - printf("Can't add KTLS alloc thread %d error %d\n", + printf("Can't add KTLS reclaim thread %d error %d\n", domain, error); return (error); } @@ -2702,9 +2702,9 @@ ktls_buffer_alloc(struct ktls_wq *wq, struct mbuf *m) * see an old value of running == true. */ if (!VM_DOMAIN_EMPTY(domain)) { -
git: fd96685a4a57 - main - Revert "When stopping powerd, set the CPU frequency back to its maximum value"
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=fd96685a4a579fc84031e8e66d8f8b1ce8cdf1e5 commit fd96685a4a579fc84031e8e66d8f8b1ce8cdf1e5 Author: Andrew Gallatin AuthorDate: 2023-05-22 00:47:28 + Commit: Andrew Gallatin CommitDate: 2023-05-25 13:40:26 + Revert "When stopping powerd, set the CPU frequency back to its maximum value" This reverts commit 1dcb6ad173e57b489a859ea59ed6eaa733bdb5bc. As of "8cb16fdbea6b Restore original frequency on exit.", powerd restores the original frequency itself. Further, if the original frequency is not the same as the first frequency found in the frequency list, then the restoration done by the powerd_poststop will restore the wrong frequency. This can happen on Intel machines where Turbo is not enabled, but the turbo frequency is first in the list of frequencies. In this case, turbo will be enabled when the user did not want it to be. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D40197 Reviewed by: imp, mav --- libexec/rc/rc.d/powerd | 7 --- 1 file changed, 7 deletions(-) diff --git a/libexec/rc/rc.d/powerd b/libexec/rc/rc.d/powerd index 2fc783a627e9..6f63bb96ff42 100755 --- a/libexec/rc/rc.d/powerd +++ b/libexec/rc/rc.d/powerd @@ -14,13 +14,6 @@ name="powerd" desc="Modify the power profile based on AC line state" rcvar="powerd_enable" command="/usr/sbin/${name}" -stop_postcmd=powerd_poststop - -powerd_poststop() -{ - sysctl dev.cpu.0.freq=`sysctl -n dev.cpu.0.freq_levels | - sed -e 's:/.*::'` > /dev/null -} load_rc_config $name run_rc_command "$1"
git: 8de48df35c3b - main - ixgbe: Do not count L3/L4 checksum errors as input errors
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=8de48df35c3bf4800176b7aa54c75a01864d458b commit 8de48df35c3bf4800176b7aa54c75a01864d458b Author: Andrew Gallatin AuthorDate: 2023-02-02 15:02:44 + Commit: Andrew Gallatin CommitDate: 2023-02-02 15:14:12 + ixgbe: Do not count L3/L4 checksum errors as input errors NIC input errors have traditionally indicated problems at the link level (crc errors, runts, etc). People tend to build monitoring infrastructure around such errors in order to monitor for bad network hardware. When L3/L4 checksum errors are included in the category of input errors, it breaks such monitoring, as these errors can originate anywhere on the internet, and do not necessarily indicate faulty local network hardware. Reviewed by: erj, glebius Differential Revision: https://reviews.freebsd.org/D38346 Sponsored by: Netflix --- sys/dev/ixgbe/if_ix.c | 5 - sys/dev/ixgbe/ixgbe.h | 1 - 2 files changed, 6 deletions(-) diff --git a/sys/dev/ixgbe/if_ix.c b/sys/dev/ixgbe/if_ix.c index 4f6faeec4296..8df0e59a8346 100644 --- a/sys/dev/ixgbe/if_ix.c +++ b/sys/dev/ixgbe/if_ix.c @@ -1577,19 +1577,14 @@ ixgbe_update_stats_counters(struct ixgbe_softc *sc) * Aggregate following types of errors as RX errors: * - CRC error count, * - illegal byte error count, -* - checksum error count, * - missed packets count, * - length error count, * - undersized packets count, * - fragmented packets count, * - oversized packets count, * - jabber count. -* -* Ignore XEC errors for 82599 to workaround errata about -* UDP frames with zero checksum. */ IXGBE_SET_IERRORS(sc, stats->crcerrs + stats->illerrc + - (hw->mac.type != ixgbe_mac_82599EB ? 
stats->xec : 0) + stats->mpc[0] + stats->rlec + stats->ruc + stats->rfc + stats->roc + stats->rjc); } /* ixgbe_update_stats_counters */ diff --git a/sys/dev/ixgbe/ixgbe.h b/sys/dev/ixgbe/ixgbe.h index 0f81a0a2c2da..83a51b4d15e7 100644 --- a/sys/dev/ixgbe/ixgbe.h +++ b/sys/dev/ixgbe/ixgbe.h @@ -507,7 +507,6 @@ struct ixgbe_softc { "\nSum of the following RX errors counters:\n" \ " * CRC errors,\n" \ " * illegal byte error count,\n" \ -" * checksum error count,\n" \ " * missed packet count,\n" \ " * length error count,\n" \ " * undersized packets count,\n" \
git: c0e4090e3d43 - main - ktls: Accurately track if ifnet ktls is enabled
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=c0e4090e3d43eeb86270dd35835862660b045c26 commit c0e4090e3d43eeb86270dd35835862660b045c26 Author: Andrew Gallatin AuthorDate: 2023-02-08 20:37:08 + Commit: Andrew Gallatin CommitDate: 2023-02-09 17:44:44 + ktls: Accurately track if ifnet ktls is enabled This allows us to avoid spurious calls to ktls_disable_ifnet(). When we implemented ifnet kTLS, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that now, or in the past, ifnet ktls was active on a socket. Later, I added code to switch ifnet ktls sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET to know if it needed to do math to calculate the retransmit ratio and potentially call into ktls_disable_ifnet(), it was doing unneeded work long after a session was moved to software. This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcpcb, and because TCP is the most frequent accessor of this state, it made sense to move this from the socket buffer flags to the tcpcb. Because we now need reliable access to the tcpcb, we take a ref on the inp when creating a tx ktls session. While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends. This change reduces spurious calls to ktls_disable_ifnet() by 95% or so in a Netflix CDN environment. 
Reviewed by: markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D38380 --- sys/kern/uipc_ktls.c | 145 +- sys/netinet/tcp_output.c | 2 +- sys/netinet/tcp_ratelimit.c | 4 +- sys/netinet/tcp_stacks/bbr.c | 2 +- sys/netinet/tcp_stacks/rack.c | 14 +--- sys/netinet/tcp_var.h | 3 + sys/sys/ktls.h| 3 +- sys/sys/sockbuf.h | 2 +- 8 files changed, 126 insertions(+), 49 deletions(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index ac55268728e9..b3895aee9249 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -222,6 +222,11 @@ static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_ok); SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_ok, CTLFLAG_RD, &ktls_ifnet_disable_ok, "TLS sessions able to switch to SW from ifnet"); +static COUNTER_U64_DEFINE_EARLY(ktls_destroy_task); +SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, destroy_task, CTLFLAG_RD, +&ktls_destroy_task, +"Number of times ktls session was destroyed via taskqueue"); + SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, sw, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "Software TLS session stats"); SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, ifnet, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, @@ -619,10 +624,14 @@ ktls_create_session(struct socket *so, struct tls_enable *en, counter_u64_add(ktls_offload_active, 1); refcount_init(&tls->refcount, 1); - if (direction == KTLS_RX) + if (direction == KTLS_RX) { TASK_INIT(&tls->reset_tag_task, 0, ktls_reset_receive_tag, tls); - else + } else { TASK_INIT(&tls->reset_tag_task, 0, ktls_reset_send_tag, tls); + tls->inp = so->so_pcb; + in_pcbref(tls->inp); + tls->tx = true; + } tls->wq_index = ktls_get_cpu(so); @@ -757,12 +766,16 @@ ktls_clone_session(struct ktls_session *tls, int direction) counter_u64_add(ktls_offload_active, 1); refcount_init(&tls_new->refcount, 1); - if (direction == KTLS_RX) + if (direction == KTLS_RX) { TASK_INIT(&tls_new->reset_tag_task, 0, ktls_reset_receive_tag, tls_new); - else + } else { TASK_INIT(&tls_new->reset_tag_task, 0, 
ktls_reset_send_tag, tls_new); + tls_new->inp = tls->inp; + tls_new->tx = true; + in_pcbref(tls_new->inp); + } /* Copy fields from existing session. */ tls_new->params = tls->params; @@ -1272,6 +1285,7 @@ ktls_enable_tx(struct socket *so, struct tls_enable *en) { struct ktls_session *tls; struct inpcb *inp; + struct tcpcb *tp; int error; if (!ktls_offload_enable) @@ -1336,8 +1350,13 @@ ktls_enable_tx(struct socket *so, struct tls_enable *en) SOCKBUF_LOCK(&so->so_snd); so->so_snd.sb_tls_seqno = be64dec(en->rec_seq); so->so_snd.sb_tls_info = tls; - if (tls->mode != TCP_TLS_MODE
git: d24b032bec1b - main - ktls: Fix comments & whitespace issues with c0e4090e3d43
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=d24b032bec1b868b99fd1f3f23ec8116cd719e94 commit d24b032bec1b868b99fd1f3f23ec8116cd719e94 Author: Andrew Gallatin AuthorDate: 2023-02-09 19:09:05 + Commit: Andrew Gallatin CommitDate: 2023-02-09 19:11:24 + ktls: Fix comments & whitespace issues with c0e4090e3d43 Address some last minute review feedback on c0e4090e3d43 by fixing spacing around comments, and clarifying that the newly added destroy_task is not related to tls 1.0. No functional change intended. Pointed out by: jhb Sponsored by: Netflix --- sys/kern/uipc_ktls.c | 3 ++- sys/sys/ktls.h | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index b3895aee9249..cb2e3f272774 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -1478,6 +1478,7 @@ ktls_set_tx_mode(struct socket *so, int mode) /* Don't allow enabling ifnet ktls multiple times */ if (tp->t_nic_ktls_xmit) return (EALREADY); + /* * Don't enable ifnet ktls if we disabled it due to an * excessive retransmission rate @@ -1850,7 +1851,6 @@ ktls_destroy(struct ktls_session *tls) * know that we don't hold the inp rlock, and * can safely take the wlock */ - if (curthread->td_rw_rlocks == 0) { INP_WLOCK(inp); } else { @@ -3335,6 +3335,7 @@ ktls_disable_ifnet(void *arg) SOCK_UNLOCK(so); return; } + /* * note that t_nic_ktls_xmit_dis is never cleared; disabling * ifnet can only be done once per connection, so we never want diff --git a/sys/sys/ktls.h b/sys/sys/ktls.h index 909d5347bc47..549ce3ee869d 100644 --- a/sys/sys/ktls.h +++ b/sys/sys/ktls.h @@ -201,6 +201,8 @@ struct ktls_session { /* Only used for TLS 1.0. */ uint64_t next_seqno; STAILQ_HEAD(, mbuf) pending_records; + + /* Used to destroy any kTLS session */ struct task destroy_task; } __aligned(CACHE_LINE_SIZE);
git: abba58766fdd - main - LRO: Add missing checks for invalid IP addresses
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=abba58766fdd7f9720761aba39c2b9653eb4fbd3 commit abba58766fdd7f9720761aba39c2b9653eb4fbd3 Author: Andrew Gallatin AuthorDate: 2023-03-25 15:51:51 + Commit: Andrew Gallatin CommitDate: 2023-03-25 15:56:02 + LRO: Add missing checks for invalid IP addresses LRO bypasses normal ip_input()/tcp_input() and lacks several checks that are present in the normal path. Without these checks, it is possible to trigger assertions added in b0ccf53f2455 Reviewed by: glebius, rrs Sponsored by: Netflix --- sys/netinet/tcp_lro.c | 8 1 file changed, 8 insertions(+) diff --git a/sys/netinet/tcp_lro.c b/sys/netinet/tcp_lro.c index bde8fadbc05b..908f9cdd7ea4 100644 --- a/sys/netinet/tcp_lro.c +++ b/sys/netinet/tcp_lro.c @@ -292,6 +292,10 @@ tcp_lro_low_level_parser(void *ptr, struct lro_parser *parser, bool update_data, /* .. and the packet is not fragmented. */ if (parser->ip4->ip_off & htons(IP_MF|IP_OFFMASK)) break; + /* .. and the packet has valid src/dst addrs */ + if (__predict_false(parser->ip4->ip_src.s_addr == INADDR_ANY || + parser->ip4->ip_dst.s_addr == INADDR_ANY)) + break; ptr = (uint8_t *)ptr + (parser->ip4->ip_hl << 2); mlen -= sizeof(struct ip); if (update_data) { @@ -339,6 +343,10 @@ tcp_lro_low_level_parser(void *ptr, struct lro_parser *parser, bool update_data, parser->ip6 = ptr; if (__predict_false(mlen < sizeof(struct ip6_hdr))) return (NULL); + /* Ensure the packet has valid src/dst addrs */ + if (__predict_false(IN6_IS_ADDR_UNSPECIFIED(&parser->ip6->ip6_src) || + IN6_IS_ADDR_UNSPECIFIED(&parser->ip6->ip6_dst))) + return (NULL); ptr = (uint8_t *)ptr + sizeof(*parser->ip6); if (update_data) { parser->data.s_addr.v6 = parser->ip6->ip6_src;
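The address checks added to the LRO parser above can be exercised in userland with the standard netinet/in.h definitions. This is only a sketch of the predicate, assuming addresses in network byte order; ipv4_addrs_valid() and ipv6_addrs_valid() are hypothetical helper names and do not exist in tcp_lro.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical helper: true if neither IPv4 address is 0.0.0.0. */
static bool
ipv4_addrs_valid(struct in_addr src, struct in_addr dst)
{
	return (src.s_addr != htonl(INADDR_ANY) &&
	    dst.s_addr != htonl(INADDR_ANY));
}

/* Hypothetical helper: true if neither IPv6 address is "::". */
static bool
ipv6_addrs_valid(const struct in6_addr *src, const struct in6_addr *dst)
{
	assert(src != NULL && dst != NULL);
	return (!IN6_IS_ADDR_UNSPECIFIED(src) &&
	    !IN6_IS_ADDR_UNSPECIFIED(dst));
}
```

A packet failing either predicate would be left to the normal input path instead of being aggregated, which is what the `break` and `return (NULL)` in the diff accomplish.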
git: 2c6ff1d6320d - main - LRO: fix BPF filters for lagg in the hpts path
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=2c6ff1d6320d57a9d0dc62c10c83145ed49a51dd commit 2c6ff1d6320d57a9d0dc62c10c83145ed49a51dd Author: Andrew Gallatin AuthorDate: 2022-08-13 00:15:46 + Commit: Andrew Gallatin CommitDate: 2022-08-13 21:33:36 + LRO: fix BPF filters for lagg in the hpts path When in the hpts path, we need to handle BPF filters since aggregated packets do not pass up the stack in the normal way. This is already done for most interfaces, but lagg needs special handling. This is because packets received via a lagg are passed up the stack with the leaf interface's ifp stored in m_pkthdr.rcvif. To handle lagg packets, we must identify that the passed rcvif is currently a lagg port by checking for IFT_IEEE8023ADLAG or IFT_INFINIBANDLAG (since lagg changes the lagg port's type to that when an interface becomes a lagg member). Then we need to find the lagg's ifp, and handle any BPF listeners on the lagg. Note: It is possible to have multiple BPF filters, one on a member port and one on the lagg itself. That is why we have to have 2 checks and 2 ETHER_BPF_MTAPs. 
Reviewed by: jhb, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D36136 --- sys/netinet/tcp_lro.c | 30 ++ 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/sys/netinet/tcp_lro.c b/sys/netinet/tcp_lro.c index 2633ccd12afc..fcde002bac53 100644 --- a/sys/netinet/tcp_lro.c +++ b/sys/netinet/tcp_lro.c @@ -53,6 +53,11 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include +#include +#include +#include +#include #include #include @@ -85,7 +90,8 @@ static inttcp_lro_rx_common(struct lro_ctrl *lc, struct mbuf *m, #ifdef TCPHPTS static booldo_bpf_strip_and_compress(struct inpcb *, struct lro_ctrl *, - struct lro_entry *, struct mbuf **, struct mbuf **, struct mbuf **, bool *, bool); + struct lro_entry *, struct mbuf **, struct mbuf **, struct mbuf **, + bool *, bool, bool, struct ifnet *); #endif @@ -1283,7 +1289,8 @@ tcp_lro_flush_tcphpts(struct lro_ctrl *lc, struct lro_entry *le) struct inpcb *inp; struct tcpcb *tp; struct mbuf **pp, *cmp, *mv_to; - bool bpf_req, should_wake; + struct ifnet *lagg_ifp; + bool bpf_req, lagg_bpf_req, should_wake; /* Check if packet doesn't belongs to our network interface. */ if ((tcplro_stacks_wanting_mbufq == 0) || @@ -1341,13 +1348,25 @@ tcp_lro_flush_tcphpts(struct lro_ctrl *lc, struct lro_entry *le) should_wake = true; /* Check if packets should be tapped to BPF. */ bpf_req = bpf_peers_present(lc->ifp->if_bpf); + lagg_bpf_req = false; + lagg_ifp = NULL; + if (lc->ifp->if_type == IFT_IEEE8023ADLAG || + lc->ifp->if_type == IFT_INFINIBANDLAG) { + struct lagg_port *lp = lc->ifp->if_lagg; + struct lagg_softc *sc = lp->lp_softc; + + lagg_ifp = sc->sc_ifp; + if (lagg_ifp != NULL) + lagg_bpf_req = bpf_peers_present(lagg_ifp->if_bpf); + } /* Strip and compress all the incoming packets. 
*/ cmp = NULL; for (pp = &le->m_head; *pp != NULL; ) { mv_to = NULL; if (do_bpf_strip_and_compress(inp, lc, le, pp, -&cmp, &mv_to, &should_wake, bpf_req ) == false) { + &cmp, &mv_to, &should_wake, bpf_req, + lagg_bpf_req, lagg_ifp) == false) { /* Advance to next mbuf. */ pp = &(*pp)->m_nextpkt; } else if (mv_to != NULL) { @@ -1593,7 +1612,7 @@ build_ack_entry(struct tcp_ackent *ae, struct tcphdr *th, struct mbuf *m, static bool do_bpf_strip_and_compress(struct inpcb *inp, struct lro_ctrl *lc, struct lro_entry *le, struct mbuf **pp, struct mbuf **cmp, struct mbuf **mv_to, -bool *should_wake, bool bpf_req) +bool *should_wake, bool bpf_req, bool lagg_bpf_req, struct ifnet *lagg_ifp) { union { void *ptr; @@ -1619,6 +1638,9 @@ do_bpf_strip_and_compress(struct inpcb *inp, struct lro_ctrl *lc, if (__predict_false(bpf_req)) ETHER_BPF_MTAP(lc->ifp, m); + if (__predict_false(lagg_bpf_req)) + ETHER_BPF_MTAP(lagg_ifp, m); + tcp_hdr_offset = m->m_pkthdr.lro_tcp_h_off; lro_type = le->inner.data.lro_type; switch (lro_type) {
git: 8b19898a78d5 - main - Fix a panic on boot introduced by 555a861d6826
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=8b19898a78d52b351f4d7a6ad1d8b074d037e3b7 commit 8b19898a78d52b351f4d7a6ad1d8b074d037e3b7 Author: Andrew Gallatin AuthorDate: 2022-11-01 17:44:39 + Commit: Andrew Gallatin CommitDate: 2022-11-01 17:44:39 + Fix a panic on boot introduced by 555a861d6826 First, an sbuf_new() in device_get_path() shadows the sb passed in by dev_wired_cache_add(), leaving its sb in an unfinished state, leading to a failed KASSERT(). Fixing this is as simple as removing the sbuf_new() from device_get_path() Second, we cannot simply take a pointer to the sbuf memory and store it in the device location cache, because that sbuf is freed immediately after we add data to the cache, leading to a use-after-free and eventually a double-free. Fixing this requires allocating memory for the path. After a discussion with jhb, we decided that one malloc was better than two in dev_wired_cache_add, which is why it changed so much. Reviewed by: jhb Sponsored by: Netflix MFC after: 14 days --- sys/kern/subr_bus.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/sys/kern/subr_bus.c b/sys/kern/subr_bus.c index 5c165419af2d..2fcf650b0289 100644 --- a/sys/kern/subr_bus.c +++ b/sys/kern/subr_bus.c @@ -5310,7 +5310,7 @@ device_get_path(device_t dev, const char *locator, struct sbuf *sb) device_t parent; int error; - sb = sbuf_new(NULL, NULL, 0, SBUF_AUTOEXTEND | SBUF_INCLUDENUL); + KASSERT(sb != NULL, ("sb is NULL")); parent = device_get_parent(dev); if (parent == NULL) { error = sbuf_printf(sb, "/"); @@ -5663,8 +5663,6 @@ dev_wired_cache_fini(device_location_cache_t *dcp) struct device_location_node *dln, *tdln; TAILQ_FOREACH_SAFE(dln, &dcp->dlc_list, dln_link, tdln) { - /* Note: one allocation for both node and locator, but not path */ - free(__DECONST(void *, dln->dln_path), M_BUS); free(dln, M_BUS); } free(dcp, M_BUS); @@ -5687,12 +5685,15 @@ static struct device_location_node * 
dev_wired_cache_add(device_location_cache_t *dcp, const char *locator, const char *path) { struct device_location_node *dln; - char *l; - - dln = malloc(sizeof(*dln) + strlen(locator) + 1, M_BUS, M_WAITOK | M_ZERO); - dln->dln_locator = l = (char *)(dln + 1); - memcpy(l, locator, strlen(locator) + 1); - dln->dln_path = path; + size_t loclen, pathlen; + + loclen = strlen(locator) + 1; + pathlen = strlen(path) + 1; + dln = malloc(sizeof(*dln) + loclen + pathlen, M_BUS, M_WAITOK | M_ZERO); + dln->dln_locator = (char *)(dln + 1); + memcpy(__DECONST(char *, dln->dln_locator), locator, loclen); + dln->dln_path = dln->dln_locator + loclen; + memcpy(__DECONST(char *, dln->dln_path), path, pathlen); TAILQ_INSERT_HEAD(&dcp->dlc_list, dln, dln_link); return (dln);
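The single-allocation layout used by the new dev_wired_cache_add() can be sketched in userland as follows. This is an analogue under stated assumptions: struct loc_node and loc_node_new() are hypothetical names, and plain malloc() stands in for the kernel's malloc(9) with M_WAITOK, so the sketch has to handle a NULL return.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userland analogue of struct device_location_node. */
struct loc_node {
	const char	*locator;
	const char	*path;
};

/*
 * Allocate the node and private copies of both strings in one block:
 * [struct loc_node][locator\0][path\0].  A single free() releases
 * everything.  Returns NULL if the allocation fails.
 */
static struct loc_node *
loc_node_new(const char *locator, const char *path)
{
	struct loc_node *n;
	size_t loclen, pathlen;
	char *strs;

	assert(locator != NULL && path != NULL);
	loclen = strlen(locator) + 1;
	pathlen = strlen(path) + 1;
	n = malloc(sizeof(*n) + loclen + pathlen);
	if (n == NULL)
		return (NULL);
	strs = (char *)(n + 1);		/* string area follows the struct */
	memcpy(strs, locator, loclen);
	memcpy(strs + loclen, path, pathlen);
	n->locator = strs;
	n->path = strs + loclen;
	return (n);
}
```

Because the node owns its copy of the path, the caller's buffer (the sbuf in the commit) can be freed or reused immediately afterwards without leaving a dangling pointer in the cache.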
git: 17859d538c23 - main - ixl: silence runtime warning when PCI_IOV is not enabled
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=17859d538c23d6faa5a5512262d678377130e591 commit 17859d538c23d6faa5a5512262d678377130e591 Author: Andrew Gallatin AuthorDate: 2022-12-06 16:35:18 + Commit: Andrew Gallatin CommitDate: 2022-12-06 16:35:18 + ixl: silence runtime warning when PCI_IOV is not enabled When PCI_IOV is not enabled, do not attempt to call iflib_softirq_alloc_generic(...IFLIB_INTR_IOV), as it results in boot-time warnings similar to: taskqgroup_attach_cpu: qid not found for iov cpu=2 ixl2: taskqgroup_attach_cpu failed 22 Instead, make it conditional on PCI_IOV like the other SR-IOV related code. Reviewed by: erj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D37609 --- sys/dev/ixl/if_ixl.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/sys/dev/ixl/if_ixl.c b/sys/dev/ixl/if_ixl.c index cb3ce72a95ed..352a35d95512 100644 --- a/sys/dev/ixl/if_ixl.c +++ b/sys/dev/ixl/if_ixl.c @@ -1064,8 +1064,11 @@ ixl_if_msix_intr_assign(if_ctx_t ctx, int msix) "Failed to register Admin Que handler"); return (err); } + +#ifdef PCI_IOV /* Create soft IRQ for handling VFLRs */ iflib_softirq_alloc_generic(ctx, NULL, IFLIB_INTR_IOV, pf, 0, "iov"); +#endif /* Now set up the stations */ for (i = 0, vector = 1; i < vsi->shared->isc_nrxqsets; i++, vector++, rx_que++) {
git: c4a4b2633d97 - main - allocate inpcb aligned to cachelines
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=c4a4b2633d975bd0813afca6b8e23ead29d80e82 commit c4a4b2633d975bd0813afca6b8e23ead29d80e82 Author: Andrew Gallatin AuthorDate: 2022-12-14 19:19:35 + Commit: Andrew Gallatin CommitDate: 2022-12-14 19:19:35 + allocate inpcb aligned to cachelines The inpcb struct is one of the most heavily utilized in the kernel on a busy network server. By aligning it to a cacheline boundary, we can ensure that closely related fields in the inpcb and tcpcb can be predictably located on the same cacheline. rrs has already done a lot of this work to put related fields on the same line for the tcpcb. In combination with a forthcoming patch to align the start of the tcpcb, we see a roughly 3% reduction in CPU use on a busy web server serving traffic over roughly 50,000 TCP connections. Reviewed by: glebius, markj, tuexen Differential Revision: https://reviews.freebsd.org/D37687 Sponsored by: Netflix --- sys/netinet/in_pcb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sys/netinet/in_pcb.c b/sys/netinet/in_pcb.c index 3a83682b711f..e7f425f8593a 100644 --- a/sys/netinet/in_pcb.c +++ b/sys/netinet/in_pcb.c @@ -552,7 +552,7 @@ in_pcbstorage_init(void *arg) pcbstor->ips_zone = uma_zcreate(pcbstor->ips_zone_name, pcbstor->ips_size, NULL, inpcb_dtor, pcbstor->ips_pcbinit, - inpcb_fini, UMA_ALIGN_PTR, UMA_ZONE_SMR); + inpcb_fini, UMA_ALIGN_CACHE, UMA_ZONE_SMR); pcbstor->ips_portzone = uma_zcreate(pcbstor->ips_portzone_name, sizeof(struct inpcbport), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); uma_zone_set_smr(pcbstor->ips_portzone,
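To illustrate why the alignment matters: once an object starts on a cacheline boundary, the line that each field falls on depends only on its offset within the struct. The userland sketch below uses C11 aligned_alloc() rather than UMA; the 64-byte line size, struct conn and both helpers are assumptions made for the example, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size; the real value is CPU-specific */

/* Hypothetical connection state with two fields that are used together. */
struct conn {
    uint64_t hot_a;
    uint64_t hot_b;
    char     rest[100];
};

/* Allocate size bytes starting on a cacheline boundary. */
void *
alloc_cache_aligned(size_t size)
{
    /* aligned_alloc() expects size to be a multiple of the alignment. */
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);

    return (aligned_alloc(CACHE_LINE, rounded));
}

/* True when both addresses fall on the same cacheline. */
int
same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE);
}
```

With pointer-sized alignment only, hot_a and hot_b could straddle a line boundary for some allocations and not others; with cacheline alignment the outcome is the same for every object.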
git: 1cac76c93fb7 - main - vm: reduce lock contention when processing vm batchqueues
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=1cac76c93fb7f627fd9e304cbd99e8c8a2b8fce8 commit 1cac76c93fb7f627fd9e304cbd99e8c8a2b8fce8 Author: Andrew Gallatin AuthorDate: 2022-12-14 19:34:07 + Commit: Andrew Gallatin CommitDate: 2022-12-14 19:34:07 + vm: reduce lock contention when processing vm batchqueues Rather than waiting until the batchqueue is full to acquire the lock & process the queue, we now start trying to acquire the lock using trylocks when the batchqueue is 1/2 full. This removes almost all contention on the vm pagequeue mutex for our busy sendfile() based web workload. It also greatly reduces the amount of time a network driver ithread remains blocked on a mutex, and eliminates some packet drops under heavy load. So that the system does not lose the benefit of processing large batchqueues, I've doubled the size of the batchqueues. This way, when there is no contention, we process the same batch size as before. This has been run for several months on a busy Netflix server, as well as on my personal desktop. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D37305 --- sys/amd64/include/vmparam.h | 2 +- sys/powerpc/include/vmparam.h | 2 +- sys/vm/vm_page.c | 17 +++-- sys/vm/vm_pageout.c | 2 +- sys/vm/vm_pagequeue.h | 12 +++- 5 files changed, 25 insertions(+), 10 deletions(-) diff --git a/sys/amd64/include/vmparam.h b/sys/amd64/include/vmparam.h index fc88296f754c..205848489644 100644 --- a/sys/amd64/include/vmparam.h +++ b/sys/amd64/include/vmparam.h @@ -293,7 +293,7 @@ * Use a fairly large batch size since we expect amd64 systems to have lots of * memory. */ -#define VM_BATCHQUEUE_SIZE 31 +#define VM_BATCHQUEUE_SIZE 63 /* * The pmap can create non-transparent large page mappings. 
diff --git a/sys/powerpc/include/vmparam.h b/sys/powerpc/include/vmparam.h index 77457717a3fd..1b9873aede4a 100644 --- a/sys/powerpc/include/vmparam.h +++ b/sys/powerpc/include/vmparam.h @@ -263,7 +263,7 @@ extern int vm_level_0_order; * memory. */ #ifdef __powerpc64__ -#defineVM_BATCHQUEUE_SIZE 31 +#defineVM_BATCHQUEUE_SIZE 63 #endif /* diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index 2b7bc6a5b66e..797207205f42 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -3662,19 +3662,32 @@ vm_page_pqbatch_submit(vm_page_t m, uint8_t queue) { struct vm_batchqueue *bq; struct vm_pagequeue *pq; - int domain; + int domain, slots_remaining; KASSERT(queue < PQ_COUNT, ("invalid queue %d", queue)); domain = vm_page_domain(m); critical_enter(); bq = DPCPU_PTR(pqbatch[domain][queue]); - if (vm_batchqueue_insert(bq, m)) { + slots_remaining = vm_batchqueue_insert(bq, m); + if (slots_remaining > (VM_BATCHQUEUE_SIZE >> 1)) { + /* keep building the bq */ + critical_exit(); + return; + } else if (slots_remaining > 0 ) { + /* Try to process the bq if we can get the lock */ + pq = &VM_DOMAIN(domain)->vmd_pagequeues[queue]; + if (vm_pagequeue_trylock(pq)) { + vm_pqbatch_process(pq, bq, queue); + vm_pagequeue_unlock(pq); + } critical_exit(); return; } critical_exit(); + /* if we make it here, the bq is full so wait for the lock */ + pq = &VM_DOMAIN(domain)->vmd_pagequeues[queue]; vm_pagequeue_lock(pq); critical_enter(); diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c index bb12a7e335d5..2945b53835c6 100644 --- a/sys/vm/vm_pageout.c +++ b/sys/vm/vm_pageout.c @@ -1405,7 +1405,7 @@ vm_pageout_reinsert_inactive(struct scan_state *ss, struct vm_batchqueue *bq, pq = ss->pq; if (m != NULL) { - if (vm_batchqueue_insert(bq, m)) + if (vm_batchqueue_insert(bq, m) != 0) return; vm_pagequeue_lock(pq); delta += vm_pageout_reinsert_inactive_page(pq, marker, m); diff --git a/sys/vm/vm_pagequeue.h b/sys/vm/vm_pagequeue.h index a9d4c920e5be..268d53a391db 100644 --- a/sys/vm/vm_pagequeue.h 
+++ b/sys/vm/vm_pagequeue.h @@ -75,7 +75,7 @@ struct vm_pagequeue { } __aligned(CACHE_LINE_SIZE); #ifndef VM_BATCHQUEUE_SIZE -#defineVM_BATCHQUEUE_SIZE 7 +#defineVM_BATCHQUEUE_SIZE 15 #endif struct vm_batchqueue { @@ -356,15 +356,17 @@ vm_batchqueue_init(struct vm_batchqueue *bq) bq->bq_cnt = 0; } -static inline bool +static inline int vm_batchqueue_insert(struct vm_batchqueue *bq, vm_page_t m) { + int s
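The submit path described in this commit can be sketched in userland as follows. This is not the kernel code: BATCH_SIZE, the struct names and the toy spin lock built on atomic_flag are all assumptions for illustration, and the insert helper is assumed to return the number of free slots it saw before inserting (0 meaning the batch was full and nothing was stored).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BATCH_SIZE 8    /* hypothetical; stands in for VM_BATCHQUEUE_SIZE */

struct batch { int items[BATCH_SIZE]; int cnt; };
struct queue { atomic_flag lock; long total; int flushes; };

/* Toy spin lock standing in for the pagequeue mutex. */
bool queue_trylock(struct queue *q) { return (!atomic_flag_test_and_set(&q->lock)); }
void queue_lock(struct queue *q) { while (atomic_flag_test_and_set(&q->lock)) ; }
void queue_unlock(struct queue *q) { atomic_flag_clear(&q->lock); }

/* Returns the free slots seen before the insert; 0 means full, not inserted. */
int
batch_insert(struct batch *b, int v)
{
    int free_slots = BATCH_SIZE - b->cnt;

    if (free_slots > 0)
        b->items[b->cnt++] = v;
    return (free_slots);
}

/* Drain the batch into the shared queue; the caller holds the queue lock. */
void
batch_flush(struct queue *q, struct batch *b)
{
    for (int i = 0; i < b->cnt; i++)
        q->total += b->items[i];
    b->cnt = 0;
    q->flushes++;
}

void
submit(struct queue *q, struct batch *b, int v)
{
    int free_slots = batch_insert(b, v);

    if (free_slots > BATCH_SIZE / 2)
        return;                     /* under half full: keep building */
    if (free_slots > 0) {
        /* Half full or more: flush only if the lock is free right now. */
        if (queue_trylock(q)) {
            batch_flush(q, b);
            queue_unlock(q);
        }
        return;
    }
    /* Batch was full, so v was not stored: wait for the lock. */
    queue_lock(q);
    batch_flush(q, b);
    q->total += v;
    queue_unlock(q);
}
```

When the lock is uncontended the batch drains as soon as it crosses the halfway mark, which is why doubling the batch size keeps the effective batch about as large as before; only a sustained run of failed trylocks lets the batch fill completely and fall back to the blocking path.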
git: ac4e3a27ab49 - main - Unbreak the build when MAC is not defined
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=ac4e3a27ab499d3401e8810c6a11713e6ed6f76b commit ac4e3a27ab499d3401e8810c6a11713e6ed6f76b Author: Andrew Gallatin AuthorDate: 2022-12-14 22:33:30 + Commit: Andrew Gallatin CommitDate: 2022-12-14 22:39:25 + Unbreak the build when MAC is not defined 7a2c93b86ef7 removed the use of "error" when MAC was not defined, resulting in an unused variable error. Sponsored by: Netflix Reviewed by: jhb --- sys/kern/sys_socket.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/sys/kern/sys_socket.c b/sys/kern/sys_socket.c index 5cfb366c150b..2ad76b15cee6 100644 --- a/sys/kern/sys_socket.c +++ b/sys/kern/sys_socket.c @@ -145,7 +145,8 @@ soo_write(struct file *fp, struct uio *uio, struct ucred *active_cred, if (error) return (error); #endif - return (sousrsend(so, NULL, uio, NULL, 0, NULL)); + error = sousrsend(so, NULL, uio, NULL, 0, NULL); + return (error); } static int
git: 8ea418299548 - main - tcp: Build RACK and BBR stacks as a part of LINT
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=8ea41829954831d345c3aef58488adf0fc8dbb42 commit 8ea41829954831d345c3aef58488adf0fc8dbb42 Author: Andrew Gallatin AuthorDate: 2023-01-10 21:09:00 + Commit: Andrew Gallatin CommitDate: 2023-01-10 21:16:43 + tcp: Build RACK and BBR stacks as a part of LINT When RACK and BBR were added to the kernel, they were put behind 'WITH_EXTRA_TCP_STACKS=1'. Unfortunately that was never added to any NOTES file, so RACK & BBR were not compiled with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels. This led to the stacks sometimes being broken. This change: - Fixes RACK so that it compiles with the various LINT-NO* kernels - Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that RACK and BBR are compile tested regularly Sponsored by: Netflix Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D37903 --- sys/conf/NOTES | 1 + sys/netinet/tcp_stacks/rack.c | 23 --- 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/sys/conf/NOTES b/sys/conf/NOTES index a1c0e71551ae..6cea39d27ad6 100644 --- a/sys/conf/NOTES +++ b/sys/conf/NOTES @@ -677,6 +677,7 @@ options TCP_OFFLOAD # TCP offload support. 
optionsTCP_RFC7413 # TCP Fast Open optionsTCPHPTS +makeoptionsWITH_EXTRA_TCP_STACKS=1 # RACK and BBR TCP kernel modules # In order to enable IPSEC you MUST also add device crypto to # your kernel configuration diff --git a/sys/netinet/tcp_stacks/rack.c b/sys/netinet/tcp_stacks/rack.c index 6070ad5dc17a..dafe8184a8fd 100644 --- a/sys/netinet/tcp_stacks/rack.c +++ b/sys/netinet/tcp_stacks/rack.c @@ -32,6 +32,7 @@ __FBSDID("$FreeBSD$"); #include "opt_ipsec.h" #include "opt_ratelimit.h" #include "opt_kern_tls.h" +#if defined(INET) || defined(INET6) #include #include #include @@ -12347,6 +12348,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack) ip6, rack->r_ctl.fsb.th); } else #endif /* INET6 */ +#ifdef INET { rack->r_ctl.fsb.tcp_ip_hdr_len = sizeof(struct tcpiphdr); ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr; @@ -12366,6 +12368,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack) tp->t_port, ip, rack->r_ctl.fsb.th); } +#endif rack->r_fsb_inited = 1; } @@ -15611,7 +15614,7 @@ rack_fast_rsm_output(struct tcpcb *tp, struct tcp_rack *rack, struct rack_sendma struct tcpopt to; u_char opt[TCP_MAXOLEN]; uint32_t hdrlen, optlen; - int32_t slot, segsiz, max_val, tso = 0, error, ulen = 0; + int32_t slot, segsiz, max_val, tso = 0, error = 0, ulen = 0; uint16_t flags; uint32_t if_hw_tsomaxsegcount = 0, startseq; uint32_t if_hw_tsomaxsegsize; @@ -15935,8 +15938,6 @@ rack_fast_rsm_output(struct tcpcb *tp, struct tcp_rack *rack, struct rack_sendma &inp->inp_route6, 0, NULL, NULL, inp); } -#endif -#if defined(INET) && defined(INET6) else #endif #ifdef INET @@ -16102,7 +16103,9 @@ rack_fast_output(struct tcpcb *tp, struct tcp_rack *rack, uint64_t ts_val, * the max-burst). We have how much to send and all the info we * need to just send. 
*/ +#ifdef INET struct ip *ip = NULL; +#endif struct udphdr *udp = NULL; struct tcphdr *th = NULL; struct mbuf *m, *s_mb; @@ -16133,8 +16136,10 @@ rack_fast_output(struct tcpcb *tp, struct tcp_rack *rack, uint64_t ts_val, } else #endif /* INET6 */ { +#ifdef INET ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr; hdrlen = sizeof(struct tcpiphdr); +#endif } if (tp->t_port && (V_tcp_udp_tunneling_port == 0)) { m = NULL; @@ -16281,8 +16286,10 @@ again: else #endif { +#ifdef INET ip->ip_tos &= ~IPTOS_ECN_MASK; ip->ip_tos |= ect; +#endif } } tcp_set_flags(th, flags); @@ -18346,7 +18353,9 @@ send: ip6 = (struct ip6_hdr *)rack->r_ctl.fsb.tcp_ip_hdr; else #endif /* INET6 */ +#ifdef INET ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr; +#endif th = rack->r_ctl.fsb.th; udp = rack->r_ctl.fsb.udp; if (udp) { @@ -18375,6 +18384,7 @@ send: } else #endif /* INET6
git: 9cb6ba29cb70 - main - vm: centralize VM_BATCHQUEUE_SIZE definition
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=9cb6ba29cb704c180d5b82f409e280377a641a28 commit 9cb6ba29cb704c180d5b82f409e280377a641a28 Author: Andrew Gallatin AuthorDate: 2023-01-21 19:26:25 + Commit: Andrew Gallatin CommitDate: 2023-01-21 19:30:00 + vm: centralize VM_BATCHQUEUE_SIZE definition Remove the platform-specific definitions of VM_BATCHQUEUE_SIZE for amd64 and powerpc64, and instead treat all 64-bit platforms identically. This has the effect of increasing the arm64 and riscv VM_BATCHQUEUE_SIZE to match that of other platforms. Reviewed by: jhb, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D37707 --- sys/amd64/include/vmparam.h | 6 -- sys/powerpc/include/vmparam.h | 8 sys/vm/vm_pagequeue.h | 4 +++- 3 files changed, 3 insertions(+), 15 deletions(-) diff --git a/sys/amd64/include/vmparam.h b/sys/amd64/include/vmparam.h index 205848489644..880c46bba84d 100644 --- a/sys/amd64/include/vmparam.h +++ b/sys/amd64/include/vmparam.h @@ -289,12 +289,6 @@ #defineZERO_REGION_SIZE(2 * 1024 * 1024) /* 2MB */ -/* - * Use a fairly large batch size since we expect amd64 systems to have lots of - * memory. - */ -#defineVM_BATCHQUEUE_SIZE 63 - /* * The pmap can create non-transparent large page mappings. */ diff --git a/sys/powerpc/include/vmparam.h b/sys/powerpc/include/vmparam.h index 1b9873aede4a..0f3321379b47 100644 --- a/sys/powerpc/include/vmparam.h +++ b/sys/powerpc/include/vmparam.h @@ -258,14 +258,6 @@ extern int vm_level_0_order; #defineZERO_REGION_SIZE(64 * 1024) /* 64KB */ #endif -/* - * Use a fairly large batch size since we expect ppc64 systems to have lots of - * memory. - */ -#ifdef __powerpc64__ -#defineVM_BATCHQUEUE_SIZE 63 -#endif - /* * On 32-bit OEA, the only purpose for which sf_buf is used is to implement * an opaque pointer required by the machine-independent parts of the kernel. 
diff --git a/sys/vm/vm_pagequeue.h b/sys/vm/vm_pagequeue.h index 268d53a391db..9624d31a75b7 100644 --- a/sys/vm/vm_pagequeue.h +++ b/sys/vm/vm_pagequeue.h @@ -74,7 +74,9 @@ struct vm_pagequeue { uint64_t pq_pdpages; } __aligned(CACHE_LINE_SIZE); -#ifndef VM_BATCHQUEUE_SIZE +#if __SIZEOF_LONG__ == 8 +#define VM_BATCHQUEUE_SIZE 63 +#else #define VM_BATCHQUEUE_SIZE 15 #endif
git: da81cc6035f8 - main - dtrace: conditionally load the systrace_linux klds when loading dtrace.
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=da81cc6035f8283b6adda1ef466977e8c1c5389e commit da81cc6035f8283b6adda1ef466977e8c1c5389e Author: Andrew Gallatin AuthorDate: 2023-01-24 01:27:17 + Commit: Andrew Gallatin CommitDate: 2023-01-24 01:36:24 + dtrace: conditionally load the systrace_linux klds when loading dtrace. When dtrace starts, it tries to detect if the dtrace klds are loaded, and if not, it loads them by loading the dtraceall kld. This module depends on most dtrace modules, including systrace for the native freebsd and freebsd32 ABIs. However, it does not depend on the systrace_linux klds, as they in turn depend on the linux ABI klds, and we don't want to load an ABI module that the user has not explicitly requested. This can leave a naive user in a state where they think all syscall providers have been loaded, yet linux ABI syscalls are "invisible" to dtrace. To fix this, check to see if the linux ABI modules are loaded. If they are, then load their systrace klds. 
Reviewed by: markj, (emaste & jhb, earlier versions) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D37986 --- cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c | 9 + 1 file changed, 9 insertions(+) diff --git a/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c b/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c index 867259b5d77c..e11cdc954683 100644 --- a/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c +++ b/cddl/contrib/opensolaris/lib/libdtrace/common/dt_open.c @@ -1115,6 +1115,15 @@ dt_vopen(int version, int flags, int *errp, */ if (err == ENOENT && modfind("dtraceall") < 0) { kldload("dtraceall"); /* ignore the error */ +#if __SIZEOF_LONG__ == 8 + if (modfind("linux64elf") >= 0) + kldload("systrace_linux"); + if (modfind("linuxelf") >= 0) + kldload("systrace_linux32"); +#else + if (modfind("linuxelf") >= 0) { + kldload("systrace_linux"); +#endif dtfd = open("/dev/dtrace/dtrace", O_RDWR | O_CLOEXEC); err = errno; }
Re: git: ecdd0b48cbf4 - main - dtrace: remove stray {
On 1/24/23 03:36, Kristof Provost wrote:
> The branch main has been updated by kp:
> URL: https://cgit.FreeBSD.org/src/commit/?id=ecdd0b48cbf4dd5acc3fc14625de6dc25cf354ce
> commit ecdd0b48cbf4dd5acc3fc14625de6dc25cf354ce
> Author: Kristof Provost
> AuthorDate: 2023-01-24 07:39:37 +
> Commit: Kristof Provost
> CommitDate: 2023-01-24 07:39:37 +
>
> dtrace: remove stray {

Thanks so much. I can't believe I did that! :( I had done several make universe's on the previous version, and thought I'd done one on this one as well. Obviously I was wrong.

Drew
git: 9ba117960e17 - main - Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=9ba117960e1755a693f9361e4d076630dfe13dba commit 9ba117960e1755a693f9361e4d076630dfe13dba Author: Andrew Gallatin AuthorDate: 2022-01-27 15:28:15 + Commit: Andrew Gallatin CommitDate: 2022-01-27 15:34:34 + Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues When ip_output_send() returns EAGAIN due to issues with send tags (route change, lagg failover, etc), it must free the mbuf. This is because ip_output_send() was written as a wrapper/replacement for a direct call to if_output(), and the contract with if_output() has historically been that it owns the mbufs once called. When ip_output_send() failed to free mbufs, it violated this assumption and led to leaked mbufs. This was noticed when using NIC TLS in combination with hardware rate-limited connections. When seeing lots of NIC output drops triggered ratelimit send tag changes, we noticed we were leaking ktls_sessions, send tags and mbufs. This was due to ip_output_send() leaking mbufs which held references to ktls_sessions, which in turn held references to send tags. Many thanks to jhb, rrs, hselasky and markj for their help in debugging this. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34054 Reviewed by: hselasky, jhb, rrs MFC after: 2 weeks --- sys/netinet/ip_output.c | 2 ++ sys/netinet6/ip6_output.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c index e30339f8c4aa..f203bc165e61 100644 --- a/sys/netinet/ip_output.c +++ b/sys/netinet/ip_output.c @@ -239,6 +239,7 @@ ip_output_send(struct inpcb *inp, struct ifnet *ifp, struct mbuf *m, * packet. 
*/ if (mst == NULL) { + m_freem(m); error = EAGAIN; goto done; } @@ -263,6 +264,7 @@ ip_output_send(struct inpcb *inp, struct ifnet *ifp, struct mbuf *m, KASSERT(m->m_pkthdr.rcvif == NULL, ("trying to add a send tag to a forwarded packet")); if (mst->ifp != ifp) { + m_freem(m); error = EAGAIN; goto done; } diff --git a/sys/netinet6/ip6_output.c b/sys/netinet6/ip6_output.c index 848ec6694398..406776bdb5a4 100644 --- a/sys/netinet6/ip6_output.c +++ b/sys/netinet6/ip6_output.c @@ -336,6 +336,7 @@ ip6_output_send(struct inpcb *inp, struct ifnet *ifp, struct ifnet *origifp, * packet. */ if (mst == NULL) { + m_freem(m); error = EAGAIN; goto done; } @@ -360,6 +361,7 @@ ip6_output_send(struct inpcb *inp, struct ifnet *ifp, struct ifnet *origifp, KASSERT(m->m_pkthdr.rcvif == NULL, ("trying to add a send tag to a forwarded packet")); if (mst->ifp != ifp) { + m_freem(m); error = EAGAIN; goto done; }
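The ownership rule behind this fix is that the send function consumes the buffer on every path, including the early error returns, so the caller must never free it afterwards. A toy userland sketch of that contract follows; struct pkt, the pkts_live counter and output_send() are hypothetical and only model the rule, not the real mbuf or send tag code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical packet buffer standing in for an mbuf chain. */
struct pkt { int payload; };

int pkts_live;  /* outstanding buffers, used to detect leaks */

struct pkt *
pkt_alloc(void)
{
    struct pkt *p = calloc(1, sizeof(*p));

    if (p != NULL)
        pkts_live++;
    return (p);
}

void
pkt_free(struct pkt *p)
{
    free(p);
    pkts_live--;
}

/*
 * Consumes p on every path. tag_ok models whether a usable send tag is
 * present; when it is not, the buffer is freed before returning EAGAIN,
 * which is the step the fix above adds.
 */
int
output_send(struct pkt *p, int tag_ok)
{
    if (!tag_ok) {
        pkt_free(p);
        return (EAGAIN);
    }
    pkt_free(p);    /* pretend the driver transmitted and released it */
    return (0);
}
```

Keeping the rule unconditional means callers need no error-specific cleanup, which is the property the historical if_output() contract relies on.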
git: 8a7404b2aeeb - main - tcp: fix leaks in tcp_chg_pacing_rate error paths
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=8a7404b2aeeb6345bd82c13c432e56d8cbfba869 commit 8a7404b2aeeb6345bd82c13c432e56d8cbfba869 Author: Andrew Gallatin AuthorDate: 2022-01-27 15:35:03 + Commit: Andrew Gallatin CommitDate: 2022-01-27 15:35:03 + tcp: fix leaks in tcp_chg_pacing_rate error paths tcp_chg_pacing_rate() is expected to release the hw rate limit table, but failed to do so in several error cases, leading to ever increasing counts of flows using the rate. This patch was mostly done by rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34058 Reviewed by: hselasky, rrs, jhb (initial version, outside of Differential) --- sys/netinet/tcp_ratelimit.c | 21 + 1 file changed, 21 insertions(+) diff --git a/sys/netinet/tcp_ratelimit.c b/sys/netinet/tcp_ratelimit.c index 96a38b6afd54..2f36cea4faed 100644 --- a/sys/netinet/tcp_ratelimit.c +++ b/sys/netinet/tcp_ratelimit.c @@ -1411,6 +1411,7 @@ tcp_chg_pacing_rate(const struct tcp_hwrate_limit_table *crte, * tags if it didn't allocate one when an * existing rate was present, so ignore. */ + tcp_rel_pacing_rate(crte, tp); if (error) *error = EOPNOTSUPP; return (NULL); @@ -1419,6 +1420,7 @@ tcp_chg_pacing_rate(const struct tcp_hwrate_limit_table *crte, #endif if (tp->t_inpcb->inp_snd_tag == NULL) { /* Wrong interface */ + tcp_rel_pacing_rate(crte, tp); if (error) *error = EINVAL; return (NULL); @@ -1457,10 +1459,29 @@ tcp_chg_pacing_rate(const struct tcp_hwrate_limit_table *crte, #endif err = in_pcbmodify_txrtlmt(tp->t_inpcb, nrte->rate); if (err) { + struct tcp_rate_set *lrs; + uint64_t pre; + rl_decrement_using(nrte); + lrs = __DECONST(struct tcp_rate_set *, rs); + pre = atomic_fetchadd_64(&lrs->rs_flows_using, -1); /* Do we still have a snd-tag attached? */ if (tp->t_inpcb->inp_snd_tag) in_pcbdetach_txrtlmt(tp->t_inpcb); + + if (pre == 1) { + struct epoch_tracker et; + + NET_EPOCH_ENTER(et); + mtx_lock(&rs_mtx); + /* +* Is it dead? 
+*/ + if (lrs->rs_flags & RS_IS_DEAD) + rs_defer_destroy(lrs); + mtx_unlock(&rs_mtx); + NET_EPOCH_EXIT(et); + } if (error) *error = err; return (NULL);
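The error path above drops one flow reference with a fetch-and-add of -1 and treats a previous value of 1 as "this was the last user", at which point a rate set already marked dead can be destroyed. The same pattern in portable C11 atomics is sketched below; struct rate_set and its fields are hypothetical stand-ins, and the mutex and epoch section that the kernel code takes around the dead check are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct tcp_rate_set. */
struct rate_set {
    _Atomic uint64_t flows_using;   /* flows still referencing the set */
    bool is_dead;                   /* the owning interface went away */
    bool destroyed;                 /* set once teardown has been queued */
};

/*
 * Drop one flow reference. atomic_fetch_sub() returns the value held
 * before the subtraction, so pre == 1 identifies the final reference.
 */
void
rate_set_release(struct rate_set *rs)
{
    uint64_t pre = atomic_fetch_sub(&rs->flows_using, 1);

    assert(pre > 0);                /* underflow would mean a double release */
    if (pre == 1 && rs->is_dead)
        rs->destroyed = true;       /* stands in for deferred destruction */
}
```

Testing the returned previous value, rather than re-reading the counter afterwards, ensures that exactly one releasing thread observes the transition to zero.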
Re: git: b1f7154cb125 - main - gitignore: ignore vim swap files & .rej/.orig
On 1/17/22 04:35, Alexander V. Chernikov wrote:
> The branch main has been updated by melifaro:
> URL: https://cgit.FreeBSD.org/src/commit/?id=b1f7154cb12517162a51d19ae19ec3f2dee88e11
> commit b1f7154cb12517162a51d19ae19ec3f2dee88e11
> Author: Alexander V. Chernikov
> AuthorDate: 2022-01-08 16:14:47 +
> Commit: Alexander V. Chernikov
> CommitDate: 2022-01-17 09:35:15 +
>
> gitignore: ignore vim swap files & .rej/.orig
>
> Reviewed by: cem, avg
> MFC after: 2 weeks

Hi,

I was wondering if you might consider reverting this change? Alternatively, can you teach me how to override this file locally without carrying a diff?

I'm asking because this makes life painful for my workflow. Having git clean be able to handle .orig and .rej is incredibly handy when applying large patch sets. It makes finding a rejected patch as simple as 'git clean -n | grep rej'.

Also, when directories are removed *AND* they have an errant .orig or .rej file remaining in them, git rm will not garbage collect the directory. This causes the build to fail. I used to use the trick of 'git clean -nd' to find such directories, but now they are hidden. This scenario just cost me hours of parsing make output, trying to figure out why my build had failed.

Thanks,
Drew
git: 28d0a740dd9a - main - ktls: auto-disable ifnet (inline hw) kTLS
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=28d0a740dd9a67e4a4fa9fda5bb39b5963316f35 commit 28d0a740dd9a67e4a4fa9fda5bb39b5963316f35 Author: Andrew Gallatin AuthorDate: 2021-07-06 14:17:33 + Commit: Andrew Gallatin CommitDate: 2021-07-06 14:28:32 + ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. 
Reviewed by:hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908 --- sys/kern/uipc_ktls.c | 107 ++ sys/netinet/tcp_var.h | 13 +- sys/sys/ktls.h| 15 ++- 3 files changed, 133 insertions(+), 2 deletions(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 7e87e7c740e3..88e29157289d 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -30,6 +30,7 @@ __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" +#include "opt_kern_tls.h" #include "opt_ratelimit.h" #include "opt_rss.h" @@ -121,6 +122,11 @@ SYSCTL_INT(_kern_ipc_tls_stats, OID_AUTO, threads, CTLFLAG_RD, &ktls_number_threads, 0, "Number of TLS threads in thread-pool"); +unsigned int ktls_ifnet_max_rexmit_pct = 2; +SYSCTL_UINT(_kern_ipc_tls, OID_AUTO, ifnet_max_rexmit_pct, CTLFLAG_RWTUN, +&ktls_ifnet_max_rexmit_pct, 2, +"Max percent bytes retransmitted before ifnet TLS is disabled"); + static bool ktls_offload_enable; SYSCTL_BOOL(_kern_ipc_tls, OID_AUTO, enable, CTLFLAG_RWTUN, &ktls_offload_enable, 0, @@ -184,6 +190,14 @@ static COUNTER_U64_DEFINE_EARLY(ktls_switch_failed); SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, switch_failed, CTLFLAG_RD, &ktls_switch_failed, "TLS sessions unable to switch between SW and ifnet"); +static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_fail); +SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_failed, CTLFLAG_RD, +&ktls_ifnet_disable_fail, "TLS sessions unable to switch to SW from ifnet"); + +static COUNTER_U64_DEFINE_EARLY(ktls_ifnet_disable_ok); +SYSCTL_COUNTER_U64(_kern_ipc_tls_stats, OID_AUTO, ifnet_disable_ok, CTLFLAG_RD, +&ktls_ifnet_disable_ok, "TLS sessions able to switch to SW from ifnet"); + SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, sw, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "Software TLS session stats"); SYSCTL_NODE(_kern_ipc_tls, OID_AUTO, ifnet, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, @@ -2187,3 +2201,96 @@ ktls_work_thread(void *ctx) } } } + +static void +ktls_disable_ifnet_help(void 
*context, int pending __unused) +{ + struct ktls_session *tls; + struct inpcb *inp; + struct tcpcb *tp; + struct socket *so; + int err; + + tls = context; + inp = tls->inp; + if (inp == NULL) + return; + INP_WLOCK(inp); + so = inp->inp_socket; + MPASS(so != NULL); + if ((inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) || + (inp->inp_flags2 & INP_FREED)) { + goto out; + } + + if (so->so_snd.sb_tls_info != NULL) + err = ktls_set_tx_mode(so, TCP_TLS_MODE_SW); + else + err = ENXIO; + if (err == 0) { + counter_u64_add(ktls_ifnet_disable_ok, 1); + /* ktls_set_tx_mode() drops inp wlock, so recheck flags */ + if ((inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) == 0 && + (inp->inp_flags2 & INP_FREED) == 0 && + (tp = intotcpcb(inp)) != NULL && + tp->t_fb->tfb_hwtls_change != NULL) + (*tp->t_fb->tfb_hwtls_change)(tp, 0); + } else { + counter_u64_add(ktls_ifnet_disable_fail, 1); + } + +out: + SOCK_LOCK(so); + sorele(so); + if (!in_pcbrele_wlocked(inp)) +
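The policy in this commit comes down to a percentage check: a connection whose retransmitted share of traffic exceeds kern.ipc.tls.ifnet_max_rexmit_pct (default 2, per the hunk above) is moved from NIC TLS to software kTLS. The accounting that feeds the check is not shown in this excerpt, so the sketch below is an assumption about its shape; should_fall_back_to_sw() and its byte counters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical policy check: true when retransmitted bytes exceed
 * max_pct percent of all bytes handed to the NIC.
 */
bool
should_fall_back_to_sw(uint64_t sent_bytes, uint64_t rexmit_bytes,
    unsigned int max_pct)
{
    uint64_t total = sent_bytes + rexmit_bytes;

    if (total == 0)
        return (false);     /* nothing sent yet, nothing to judge */
    return ((100 * rexmit_bytes) / total > max_pct);
}
```

The low default follows from the cost the commit message describes: retransmitting one 1448-byte segment at the end of a 16KB record makes the NIC re-read 16384 bytes over PCIe, roughly 11 times the data that reaches the wire.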
git: 4150a5a87ed6 - main - ktls: fix NOINET build
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=4150a5a87ed6757cb6fd0118b4058eef77f735f7 commit 4150a5a87ed6757cb6fd0118b4058eef77f735f7 Author: Andrew Gallatin AuthorDate: 2021-07-07 14:38:57 + Commit: Andrew Gallatin CommitDate: 2021-07-07 14:40:02 + ktls: fix NOINET build Reported by: mjguzik Sponsored by: Netflix --- sys/kern/uipc_ktls.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 88e29157289d..5f7dde325740 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -2202,6 +2202,7 @@ ktls_work_thread(void *ctx) } } +#if defined(INET) || defined(INET6) static void ktls_disable_ifnet_help(void *context, int pending __unused) { @@ -2294,3 +2295,4 @@ ktls_disable_ifnet(void *arg) TASK_INIT(&tls->disable_ifnet_task, 0, ktls_disable_ifnet_help, tls); (void)taskqueue_enqueue(taskqueue_thread, &tls->disable_ifnet_task); } +#endif
Re: git: 28d0a740dd9a - main - ktls: auto-disable ifnet (inline hw) kTLS
On 7/7/21 7:00 AM, Mateusz Guzik wrote:
> This breaks NOIP kernel builds.

Thanks for pointing this out, it should be fixed in 4150a5a87ed

> On 7/6/21, Andrew Gallatin wrote:
>> The branch main has been updated by gallatin:
>> URL: https://cgit.FreeBSD.org/src/commit/?id=28d0a740dd9a67e4a4fa9fda5bb39b5963316f35
>> commit 28d0a740dd9a67e4a4fa9fda5bb39b5963316f35
>> Author: Andrew Gallatin
>> AuthorDate: 2021-07-06 14:17:33 +
>> Commit: Andrew Gallatin
>> CommitDate: 2021-07-06 14:28:32 +
>>
>> ktls: auto-disable ifnet (inline hw) kTLS
>>
>> Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmit rate.
>>
>> This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth.
git: b1e806c0ed96 - main - tcp: fix alternate stack build with LINT-NO{INET, INET6, IP}
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=b1e806c0ed960e1eb9ee889c7d0df3c168290c4f commit b1e806c0ed960e1eb9ee889c7d0df3c168290c4f Author: Andrew Gallatin AuthorDate: 2021-07-07 17:02:08 + Commit: Andrew Gallatin CommitDate: 2021-07-07 17:02:08 + tcp: fix alternate stack build with LINT-NO{INET,INET6,IP} When fixing another bug, I noticed that the alternate TCP stacks do not build when various combinations of ipv4 and ipv6 are disabled. Reviewed by:rrs, tuexen Differential Revision: https://reviews.freebsd.org/D31094 Sponsored by: Netflix --- sys/netinet/tcp_stacks/bbr.c | 9 --- sys/netinet/tcp_stacks/rack.c| 45 sys/netinet/tcp_stacks/rack_bbr_common.c | 6 - 3 files changed, 45 insertions(+), 15 deletions(-) diff --git a/sys/netinet/tcp_stacks/bbr.c b/sys/netinet/tcp_stacks/bbr.c index 8969e4e47ba1..c96fec07b6c9 100644 --- a/sys/netinet/tcp_stacks/bbr.c +++ b/sys/netinet/tcp_stacks/bbr.c @@ -3515,13 +3515,16 @@ bbr_get_header_oh(struct tcp_bbr *bbr) if (bbr->r_ctl.rc_inc_ip_oh) { /* Do we include IP overhead? */ #ifdef INET6 - if (bbr->r_is_v6) + if (bbr->r_is_v6) { seg_oh += sizeof(struct ip6_hdr); - else + } else #endif + { + #ifdef INET seg_oh += sizeof(struct ip); #endif + } } if (bbr->r_ctl.rc_inc_enet_oh) { /* Do we include the ethernet overhead? 
*/ @@ -11956,7 +11959,7 @@ bbr_output_wtime(struct tcpcb *tp, const struct timeval *tv) uint32_t tot_len = 0; uint32_t rtr_cnt = 0; uint32_t maxseg, pace_max_segs, p_maxseg; - int32_t csum_flags; + int32_t csum_flags = 0; int32_t hw_tls; #if defined(IPSEC) || defined(IPSEC_SUPPORT) unsigned ipsec_optlen = 0; diff --git a/sys/netinet/tcp_stacks/rack.c b/sys/netinet/tcp_stacks/rack.c index 75287282cf3e..f417f8a4ee7f 100644 --- a/sys/netinet/tcp_stacks/rack.c +++ b/sys/netinet/tcp_stacks/rack.c @@ -12043,7 +12043,9 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack) #ifdef INET struct ip *ip = NULL; #endif +#if defined(INET) || defined(INET6) struct udphdr *udp = NULL; +#endif /* Ok lets fill in the fast block, it can only be used with no IP options! */ #ifdef INET6 @@ -12067,6 +12069,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack) ip6, rack->r_ctl.fsb.th); } else #endif /* INET6 */ +#ifdef INET { rack->r_ctl.fsb.tcp_ip_hdr_len = sizeof(struct tcpiphdr); ip = (struct ip *)rack->r_ctl.fsb.tcp_ip_hdr; @@ -12086,6 +12089,7 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack) tp->t_port, ip, rack->r_ctl.fsb.th); } +#endif rack->r_fsb_inited = 1; } @@ -15226,7 +15230,7 @@ rack_fast_rsm_output(struct tcpcb *tp, struct tcp_rack *rack, struct rack_sendma struct tcpopt to; u_char opt[TCP_MAXOLEN]; uint32_t hdrlen, optlen; - int32_t slot, segsiz, max_val, tso = 0, error, flags, ulen = 0; + int32_t slot, segsiz, max_val, tso = 0, error = 0, flags, ulen = 0; uint32_t us_cts; uint32_t if_hw_tsomaxsegcount = 0, startseq; uint32_t if_hw_tsomaxsegsize; @@ -15706,7 +15710,7 @@ rack_fast_output(struct tcpcb *tp, struct tcp_rack *rack, uint64_t ts_val, u_char opt[TCP_MAXOLEN]; uint32_t hdrlen, optlen; int cnt_thru = 1; - int32_t slot, segsiz, len, max_val, tso = 0, sb_offset, error, flags, ulen = 0; + int32_t slot, segsiz, len, max_val, tso = 0, sb_offset, error = 0, flags, ulen = 0; uint32_t us_cts, s_soff; uint32_t if_hw_tsomaxsegcount = 0, 
startseq; uint32_t if_hw_tsomaxsegsize; @@ -16119,9 +16123,9 @@ rack_output(struct tcpcb *tp) long tot_len_this_send = 0; #ifdef INET struct ip *ip = NULL; -#endif #ifdef TCPDEBUG struct ipovly *ipov = NULL; +#endif #endif struct udphdr *udp = NULL; struct tcp_rack *rack; @@ -16130,7 +16134,10 @@ rack_output(struct tcpcb *tp) uint8_t mark = 0; uint8_t wanted_cookie = 0; u_char opt[TCP_MAXOLEN]; - unsigned ipoptlen, optlen, hdrlen, ulen=0; + unsigned ipoptlen, optlen, hdrlen; +#if defined(INET) || defined(INET6) + unsigned ulen=0; +#endif uint32_t rack_seq; #if defined(IPSEC) || defined(IPSEC_SUPPORT) @@ -17830,21 +17837,29 @@ send: #endif if ((ipoptlen == 0) && (rack->r_ctl.fsb.tcp_ip_hdr) && rack->r_fsb_inited) { #ifdef INET6 - if (isipv6) +
git: 0756bdf19c5c - main - ktls: make ktls_disable_ifnet() shim static
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=0756bdf19c5c97fabf4090e844f8df9505fbd566 commit 0756bdf19c5c97fabf4090e844f8df9505fbd566 Author: Andrew Gallatin AuthorDate: 2021-07-07 19:05:49 + Commit: Andrew Gallatin CommitDate: 2021-07-07 19:08:13 + ktls: make ktls_disable_ifnet() shim static A user reported that when compiling without KERN_TLS, and with -O0, the kernel failed to link due to ktls_disable_ifnet() being undefined. Making the shim static works around this issue. Reported by: Gary Jennejohn Sponsored by: Netflix --- sys/sys/ktls.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sys/sys/ktls.h b/sys/sys/ktls.h index 7fd8831878b4..a4156eb10395 100644 --- a/sys/sys/ktls.h +++ b/sys/sys/ktls.h @@ -238,7 +238,7 @@ extern unsigned int ktls_ifnet_max_rexmit_pct; void ktls_disable_ifnet(void *arg); #else #define ktls_ifnet_max_rexmit_pct 1 -inline void +static inline void ktls_disable_ifnet(void *arg __unused) { } ___ dev-commits-src-main@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/dev-commits-src-main To unsubscribe, send any mail to "dev-commits-src-main-unsubscr...@freebsd.org"
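The link failure described above is consistent with C99 inline semantics: a plain `inline` definition in a header does not provide an external definition, so when the compiler declines to inline the call (as it typically does at -O0) the linker has nothing to resolve the symbol against. A minimal userland sketch of the fixed pattern follows. `HAVE_FEATURE`, `feature_disable()`, and `caller()` are hypothetical names standing in for KERN_TLS, ktls_disable_ifnet(), and a call site; this is not the kernel's own code.

```c
#include <assert.h>
#include <stddef.h>

#ifdef HAVE_FEATURE
/* The real implementation would live in a separate .c file. */
void feature_disable(void *arg);
#else
/*
 * No-op shim. "static" gives every translation unit its own private
 * copy, so a call that is not inlined still resolves at link time.
 */
static inline void
feature_disable(void *arg)
{
	(void)arg;
}
#endif

/* Hypothetical call site; it compiles with or without the feature. */
static int
caller(void *arg)
{
	feature_disable(arg);
	return (0);
}
```

With the shim declared `static inline`, the call site needs no #ifdef of its own, which is the property the stub was meant to provide.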
git: 98215005b747 - main - ktls: start a thread to keep the 16k ktls buffer zone populated
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=98215005b747fef67f44794ca64abd473b98bade commit 98215005b747fef67f44794ca64abd473b98bade Author: Andrew Gallatin AuthorDate: 2021-08-05 14:15:09 + Commit: Andrew Gallatin CommitDate: 2021-08-05 14:19:12 + ktls: start a thread to keep the 16k ktls buffer zone populated Ktls recently received an optimization where we allocate 16k physically contiguous crypto destination buffers. This provides a large (more than 5%) reduction in CPU use in our workload. However, after several days of uptime, the performance benefit disappears because we have frequent allocation failures from the ktls buffer zone. It turns out that when load drops off, the ktls buffer zone is trimmed, and some 16k buffers are freed back to the OS. When load picks back up again, re-allocating those 16k buffers fails after some number of days of uptime because physical memory has become fragmented. This causes allocations to fail, because they are intentionally done with M_NORECLAIM, so as to avoid pausing the ktls crypto work thread while the VM system defragments memory. To work around this, this change starts one thread per VM domain to allocate ktls buffers without M_NORECLAIM, as we don't care if this thread is paused while memory is defragged. The thread then frees the buffers back into the ktls buffer zone, thus allowing future allocations to succeed. Note that waking up the thread is intentionally racy, but neither of the races really matters. In the worst case, we could have either spurious wakeups or we could have to wait 1 second until the next rate-limited allocation failure to wake up the thread. This patch has been in use at Netflix on a handful of servers, and seems to fix the issue. 
Differential Revision: https://reviews.freebsd.org/D31260 Reviewed by: jhb, markj, (jtl, rrs, and dhw reviewed earlier version) Sponsored by: Netflix --- sys/kern/uipc_ktls.c | 121 ++- 1 file changed, 120 insertions(+), 1 deletion(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 5f7dde325740..17b87195fc50 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -78,6 +78,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include struct ktls_wq { struct mtx mtx; @@ -87,9 +88,17 @@ struct ktls_wq { int lastallocfail; } __aligned(CACHE_LINE_SIZE); +struct ktls_alloc_thread { + uint64_t wakeups; + uint64_t allocs; + struct thread *td; + int running; +}; + struct ktls_domain_info { int count; int cpu[MAXCPU]; + struct ktls_alloc_thread alloc_td; }; struct ktls_domain_info ktls_domains[MAXMEMDOM]; @@ -142,6 +151,11 @@ SYSCTL_BOOL(_kern_ipc_tls, OID_AUTO, sw_buffer_cache, CTLFLAG_RDTUN, &ktls_sw_buffer_cache, 1, "Enable caching of output buffers for SW encryption"); +static int ktls_max_alloc = 128; +SYSCTL_INT(_kern_ipc_tls, OID_AUTO, max_alloc, CTLFLAG_RWTUN, +&ktls_max_alloc, 128, +"Max number of 16k buffers to allocate in thread context"); + static COUNTER_U64_DEFINE_EARLY(ktls_tasks_active); SYSCTL_COUNTER_U64(_kern_ipc_tls, OID_AUTO, tasks_active, CTLFLAG_RD, &ktls_tasks_active, "Number of active tasks"); @@ -278,6 +292,7 @@ static void ktls_cleanup(struct ktls_session *tls); static void ktls_reset_send_tag(void *context, int pending); #endif static void ktls_work_thread(void *ctx); +static void ktls_alloc_thread(void *ctx); #if defined(INET) || defined(INET6) static u_int @@ -418,6 +433,32 @@ ktls_init(void *dummy __unused) ktls_number_threads++; } + /* +* Start an allocation thread per-domain to perform blocking allocations +* of 16k physically contiguous TLS crypto destination buffers. 
+*/ + if (ktls_sw_buffer_cache) { + for (domain = 0; domain < vm_ndomains; domain++) { + if (VM_DOMAIN_EMPTY(domain)) + continue; + if (CPU_EMPTY(&cpuset_domain[domain])) + continue; + error = kproc_kthread_add(ktls_alloc_thread, + &ktls_domains[domain], &ktls_proc, + &ktls_domains[domain].alloc_td.td, + 0, 0, "KTLS", "alloc_%d", domain); + if (error) + panic("Can't add KTLS alloc thread %d error %d", + domain, error); + CPU_COPY(&cpuset_domain[domain], &mask); + error = cpuset_se
Re: git: 98215005b747 - main - ktls: start a thread to keep the 16k ktls buffer zone populated
On 8/5/21 11:59 AM, Ed Maste wrote:

On Thu, 5 Aug 2021 at 10:22, Andrew Gallatin wrote:

The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=98215005b747fef67f44794ca64abd473b98bade

commit 98215005b747fef67f44794ca64abd473b98bade
Author: Andrew Gallatin
AuthorDate: 2021-08-05 14:15:09 +
Commit: Andrew Gallatin
CommitDate: 2021-08-05 14:19:12 +

ktls: start a thread to keep the 16k ktls buffer zone populated

My Cirrus-CI boot smoke test is now failing with:

Starting KTLS alloc thread for domain 0
panic: sleeping without a lock
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfeb20ae0
vpanic() at vpanic+0x187/frame 0xfeb20b40
panic() at panic+0x43/frame 0xfeb20ba0
_sleep() at _sleep+0x484/frame 0xfeb20c40
ktls_alloc_thread() at ktls_alloc_thread+0x1c4/frame 0xfeb20cf0
fork_exit() at fork_exit+0x80/frame 0xfeb20d30
fork_trampoline() at fork_trampoline+0xe/frame 0xfeb20d30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 2 tid 100027 ]
Stopped at kdb_enter+0x37: movq $0,0x127877e(%rip)
db> qemu-system-x86_64: terminating on signal 15 from pid 32579 (timeout)
Did not boot successfully, see /tmp/ci-qemu-test-boot.log

I'd thought that I'd tested this with INVARIANTS, but I guess I was wrong. The assert is failing because I'm sleeping forever (sbt == 0). I don't understand the point of the assert, but I've reproduced the panic and am testing a workaround.

Drew
git: 2694c869ff9f - main - ktls: fix a panic with INVARIANTS
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=2694c869ff9ff60fd8e3d4d7936b7dc61763c18a commit 2694c869ff9ff60fd8e3d4d7936b7dc61763c18a Author: Andrew Gallatin AuthorDate: 2021-08-05 17:05:00 + Commit: Andrew Gallatin CommitDate: 2021-08-05 17:09:06 + ktls: fix a panic with INVARIANTS 98215005b747fef67f44794ca64abd473b98bade introduced a new thread that uses tsleep(..0) to sleep forever. This hit an assert due to sleeping with a 0 timeout. So spell "forever" using SBT_MAX instead, which does not trigger the assert. Pointy hat to: gallatin Pointed out by: emaste Sponsored by: Netflix --- sys/kern/uipc_ktls.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 17b87195fc50..47815c27 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -2240,7 +2240,7 @@ ktls_alloc_thread(void *ctx) nbufs = 0; for (;;) { atomic_store_int(&sc->running, 0); - tsleep(sc, PZERO, "waiting for work", 0); + tsleep_sbt(sc, PZERO, "waiting for work", SBT_MAX, SBT_1S, 0); atomic_store_int(&sc->running, 1); sc->wakeups++; if (nbufs != ktls_max_alloc) {
Re: git: 2694c869ff9f - main - ktls: fix a panic with INVARIANTS
On 8/5/21 1:41 PM, Ian Lepore wrote:

if (nbufs != ktls_max_alloc) {

Finding a different way to spell "forever" is not a valid way to fix a problem where you're being warned that it is not safe to sleep forever. The assert was warning you that the code was vulnerable to hanging forever due to a missed wakeup. The code is still vulnerable to that, but now the problem is hidden and will be very difficult to find (more so because the wait message also violates the convention of using a short name that can be displayed by tools such as ps(1) and SIGINFO, where the wait-channel display is currently likely to show as "waitin").

I haven't looked at the code outside of the few lines shown in the commit diff, but based on the names involved, I suspect the right fix is to protect sc->running with a mutex and use msleep() instead of trying to roll-your-own with atomics. That would certainly be true if the wakeup code is some form of "if (!sc->running) wakeup(sc);"

-- Ian

The code is a case where a missing wakeup does not matter. The thread is woken up by an allocation failure, which are themselves rate-limited to one attempt per second (since failures are expensive, and there is a less expensive fallback). So the worst thing that can happen is that we wait at most an extra second. Adding a mutex adds nothing except unneeded complexity.

Drew
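Drew's argument can be sketched as follows: because allocation attempts are rate-limited after a failure, a wakeup lost to the unlocked check of `running` is simply retried on the next failed attempt, about one second later. The code below is a userland sketch written from that description, not the code in uipc_ktls.c. `struct alloc_state`, `may_attempt()`, `on_alloc_failure()`, the tick-based clock, and the `wakeups_sent` counter (a stand-in for wakeup()) are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct alloc_state {
	uint32_t last_fail;	/* tick of the most recent failed attempt */
	atomic_int running;	/* set by the refill thread while it works */
	int wakeups_sent;	/* stand-in for wakeup(); counts requests */
};

/* True if a new allocation attempt is allowed at tick "now". */
static bool
may_attempt(const struct alloc_state *s, uint32_t now, uint32_t hz)
{
	/* Unsigned subtraction stays correct across tick wraparound. */
	return ((uint32_t)(now - s->last_fail) >= hz);
}

/*
 * Record a failed attempt and poke the refill thread unless it already
 * looks busy. The load is deliberately unlocked: a stale "running"
 * value costs one spurious or one delayed wakeup, nothing more.
 */
static void
on_alloc_failure(struct alloc_state *s, uint32_t now)
{
	s->last_fail = now;
	if (atomic_load(&s->running) == 0)
		s->wakeups_sent++;
}
```

In this model a wakeup skipped because of a stale `running` value is made up for at the next failure, which the rate limit places no sooner than `hz` ticks later. That matches the "at most an extra second" bound claimed in the reply.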
git: 1b97a054f3ac - main - tsleep: Add a PNOLOCK flag
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=1b97a054f3acaf13a5c8361b7b80e10ad16257b9 commit 1b97a054f3acaf13a5c8361b7b80e10ad16257b9 Author: Andrew Gallatin AuthorDate: 2021-08-05 21:16:30 + Commit: Andrew Gallatin CommitDate: 2021-08-05 21:16:30 + tsleep: Add a PNOLOCK flag Add a PNOLOCK flag so that, in the race circumstance where wakeup races are externally mitigated, tsleep() can be called with a sleep time of 0 without triggering an assertion. Reviewed by: jhb Sponsored by: Netflix --- sys/kern/kern_synch.c | 3 ++- sys/sys/param.h | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/sys/kern/kern_synch.c b/sys/kern/kern_synch.c index 793c5309a30b..7bf5193fb7b1 100644 --- a/sys/kern/kern_synch.c +++ b/sys/kern/kern_synch.c @@ -148,7 +148,8 @@ _sleep(const void *ident, struct lock_object *lock, int priority, #endif WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, lock, "Sleeping on \"%s\"", wmesg); - KASSERT(sbt != 0 || mtx_owned(&Giant) || lock != NULL, + KASSERT(sbt != 0 || mtx_owned(&Giant) || lock != NULL || + (priority & PNOLOCK) != 0, ("sleeping without a lock")); KASSERT(ident != NULL, ("_sleep: NULL ident")); KASSERT(TD_IS_RUNNING(td), ("_sleep: curthread not running")); diff --git a/sys/sys/param.h b/sys/sys/param.h index f842b344e9f9..8864063e3d9b 100644 --- a/sys/sys/param.h +++ b/sys/sys/param.h @@ -246,7 +246,8 @@ #define PRIMASK 0x0ff #define PCATCH 0x100 /* OR'd with pri for tsleep to check signals */ #define PDROP 0x200 /* OR'd with pri to stop re-entry of interlock mutex */ -#define PRILASTFLAG 0x200 /* Last flag defined above */ +#define PNOLOCK 0x400 /* OR'd with pri to allow sleeping w/o a lock */ +#define PRILASTFLAG 0x400 /* Last flag defined above */ #define NZERO 0 /* default "nice" */
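The relaxed assertion can be read as a predicate over the sleep arguments. The sketch below restates it in userland C so that the individual cases are easy to check. The flag values are the ones in the param.h hunk above; `sleep_assert_ok()` and its boolean parameters are hypothetical stand-ins for the kernel's `mtx_owned(&Giant)` and `lock != NULL` tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PRIMASK		0x0ff
#define PCATCH		0x100
#define PDROP		0x200
#define PNOLOCK		0x400	/* allow sleeping without a lock */

/*
 * Mirrors the condition inside the KASSERT: an untimed sleep (sbt == 0)
 * is accepted only with Giant held, an interlock passed in, or the
 * caller explicitly opting out with PNOLOCK.
 */
static bool
sleep_assert_ok(int64_t sbt, bool giant_owned, bool have_lock, int priority)
{
	return (sbt != 0 || giant_owned || have_lock ||
	    (priority & PNOLOCK) != 0);
}
```

Because PNOLOCK sits above PRIMASK, OR-ing it into the priority argument does not change the priority value itself.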
git: 09066b98663d - main - ktls: Use the new PNOLOCK flag
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=09066b98663d92f4d129bab25105805adf0abaf7 commit 09066b98663d92f4d129bab25105805adf0abaf7 Author: Andrew Gallatin AuthorDate: 2021-08-05 21:19:12 + Commit: Andrew Gallatin CommitDate: 2021-08-05 21:19:12 + ktls: Use the new PNOLOCK flag Use the new PNOLOCK flag to tsleep() to indicate that we are managing potential races, and don't need to sleep with a lock, or have a backstop timeout. Reviewed by: jhb Sponsored by: Netflix --- sys/kern/uipc_ktls.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 47815c27..1cc1f2e8b8c4 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -2240,7 +2240,7 @@ ktls_alloc_thread(void *ctx) nbufs = 0; for (;;) { atomic_store_int(&sc->running, 0); - tsleep_sbt(sc, PZERO, "waiting for work", SBT_MAX, SBT_1S, 0); + tsleep(sc, PZERO | PNOLOCK, "-", 0); atomic_store_int(&sc->running, 1); sc->wakeups++; if (nbufs != ktls_max_alloc) {
git: 739de953ecc1 - main - ktls: Move KERN_TLS ifdef to tcp_var.h
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=739de953ecc13afa930e2f55b7ee2a04e41e3519 commit 739de953ecc13afa930e2f55b7ee2a04e41e3519 Author: Andrew Gallatin AuthorDate: 2021-08-05 23:17:35 + Commit: Andrew Gallatin CommitDate: 2021-08-05 23:17:35 + ktls: Move KERN_TLS ifdef to tcp_var.h This allows us to remove stubs in ktls.h and allows us to sort the function prototypes. Reviewed by: jhb Sponsored by: Netflix --- sys/netinet/tcp_var.h | 6 +++--- sys/sys/ktls.h| 14 +++--- 2 files changed, 6 insertions(+), 14 deletions(-) diff --git a/sys/netinet/tcp_var.h b/sys/netinet/tcp_var.h index 8cfd2c5417c2..64e954cf4ad5 100644 --- a/sys/netinet/tcp_var.h +++ b/sys/netinet/tcp_var.h @@ -1144,8 +1144,6 @@ static inline void tcp_account_for_send(struct tcpcb *tp, uint32_t len, uint8_t is_rxt, uint8_t is_tlp, int hw_tls) { - uint64_t rexmit_percent; - if (is_tlp) { tp->t_sndtlppack++; tp->t_sndtlpbyte += len; @@ -1156,11 +1154,13 @@ tcp_account_for_send(struct tcpcb *tp, uint32_t len, uint8_t is_rxt, else tp->t_sndbytes += len; +#ifdef KERN_TLS if (hw_tls && is_rxt) { - rexmit_percent = (1000ULL * tp->t_snd_rxt_bytes) / (10ULL * (tp->t_snd_rxt_bytes + tp->t_sndbytes)); + uint64_t rexmit_percent = (1000ULL * tp->t_snd_rxt_bytes) / (10ULL * (tp->t_snd_rxt_bytes + tp->t_sndbytes)); if (rexmit_percent > ktls_ifnet_max_rexmit_pct) ktls_disable_ifnet(tp); } +#endif } #endif /* _KERNEL */ diff --git a/sys/sys/ktls.h b/sys/sys/ktls.h index a4156eb10395..437e36f965ea 100644 --- a/sys/sys/ktls.h +++ b/sys/sys/ktls.h @@ -197,7 +197,10 @@ struct ktls_session { bool disable_ifnet_pending; } __aligned(CACHE_LINE_SIZE); +extern unsigned int ktls_ifnet_max_rexmit_pct; + void ktls_check_rx(struct sockbuf *sb); +void ktls_disable_ifnet(void *arg); int ktls_enable_rx(struct socket *so, struct tls_enable *en); int ktls_enable_tx(struct socket *so, struct tls_enable *en); void ktls_destroy(struct ktls_session *tls); @@ -233,16 +236,5 @@ 
ktls_free(struct ktls_session *tls) ktls_destroy(tls); } -#ifdef KERN_TLS -extern unsigned int ktls_ifnet_max_rexmit_pct; -void ktls_disable_ifnet(void *arg); -#else -#define ktls_ifnet_max_rexmit_pct 1 -static inline void -ktls_disable_ifnet(void *arg __unused) -{ -} -#endif - #endif /* !_KERNEL */ #endif /* !_SYS_KTLS_H_ */
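The retransmit check in tcp_account_for_send() is plain integer arithmetic: (1000 * rxt) / (10 * (rxt + sent)) is the retransmitted share of all bytes sent, in whole percent, truncated toward zero. A small sketch of that calculation follows. The helper names `rexmit_percent()` and `should_disable_ifnet_tls()` are hypothetical, and the zero-total guard is an addition for the sketch that the kernel code shown in the diff does not have.

```c
#include <assert.h>
#include <stdint.h>

/* Retransmitted bytes as a truncated percentage of all bytes sent. */
static uint64_t
rexmit_percent(uint64_t rxt_bytes, uint64_t snd_bytes)
{
	uint64_t total = rxt_bytes + snd_bytes;

	if (total == 0)		/* guard added for the sketch only */
		return (0);
	return ((1000ULL * rxt_bytes) / (10ULL * total));
}

/* True when the share strictly exceeds the limit, as in the diff. */
static int
should_disable_ifnet_tls(uint64_t rxt_bytes, uint64_t snd_bytes,
    unsigned int max_pct)
{
	return (rexmit_percent(rxt_bytes, snd_bytes) > max_pct);
}
```

Because the division truncates and the comparison is strict, a limit of 2 is only exceeded once the retransmitted share reaches 3%. For example, 29 retransmitted bytes out of 1000 total (2.9%) yields 2 and does not trip the check.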
git: 95c51fafa40d - main - ktls: Init reset tag task for cloned sessions
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=95c51fafa40d56d0a32aff857261097acc65ec92 commit 95c51fafa40d56d0a32aff857261097acc65ec92 Author: Andrew Gallatin AuthorDate: 2021-08-11 18:06:43 + Commit: Andrew Gallatin CommitDate: 2021-08-11 18:06:43 + ktls: Init reset tag task for cloned sessions When cloning a ktls session (which is needed when we need to switch output NICs for a NIC TLS session), we need to also init the reset task, like we do when creating a new tls session. Reviewed by: jhb Sponsored by: Netflix --- sys/kern/uipc_ktls.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 1cc1f2e8b8c4..79da902095b3 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -709,6 +709,7 @@ ktls_clone_session(struct ktls_session *tls) counter_u64_add(ktls_offload_active, 1); refcount_init(&tls_new->refcount, 1); + TASK_INIT(&tls_new->reset_tag_task, 0, ktls_reset_send_tag, tls_new); /* Copy fields from existing session. */ tls_new->params = tls->params;
git: 43c72c45a185 - main - lacp: Remove racy kassert
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=43c72c45a1856c6cdf25a22d259528d5a4040973 commit 43c72c45a1856c6cdf25a22d259528d5a4040973 Author: Andrew Gallatin AuthorDate: 2022-06-13 15:32:10 + Commit: Andrew Gallatin CommitDate: 2022-06-13 15:32:10 + lacp: Remove racy kassert In lacp_select_tx_port_by_hash(), we assert that the selected port is DISTRIBUTING. However, the port state is protected by the LACP_LOCK(), which is not held around lacp_select_tx_port_by_hash(). So this assertion is racy, and can result in a spurious panic when links are flapping. It is certainly possible to fix it by acquiring LACP_LOCK(), but this seems like an early development assert, and it seems best to just remove it, rather than add complexity inside an ifdef INVARIANTS. Sponsored by: Netflix Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D35396 --- sys/net/ieee8023ad_lacp.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/sys/net/ieee8023ad_lacp.c b/sys/net/ieee8023ad_lacp.c index 6656ebb2b400..65b3a337eedc 100644 --- a/sys/net/ieee8023ad_lacp.c +++ b/sys/net/ieee8023ad_lacp.c @@ -876,9 +876,6 @@ lacp_select_tx_port_by_hash(struct lagg_softc *sc, uint32_t hash, hash %= count; lp = map[hash]; - KASSERT((lp->lp_state & LACP_STATE_DISTRIBUTING) != 0, - ("aggregated port is not distributing")); - return (lp->lp_lagg); }
git: 0aa150775179 - main - pmcstat: fix log analysis
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=0aa150775179a4f683fade5f1d6325a47b5f695f commit 0aa150775179a4f683fade5f1d6325a47b5f695f Author: Andrew Gallatin AuthorDate: 2022-07-04 16:40:35 + Commit: Andrew Gallatin CommitDate: 2022-07-04 16:42:39 + pmcstat: fix log analysis pmcstat has been broken for analyzing logs since D35342 / b6e28991bf3aadb. This is because the pmc for the first CPU is not added when reading logs because unlike its clones, its event id is not invalid. That causes us to fail the assertion at lib/libpmcstat/libpmcstat_logging.c:293 when encountering samples from cpu0. Fix this by removing the check that the PMC is invalid Reviewed by: tsoome Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35709 --- usr.sbin/pmcstat/pmcstat.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/usr.sbin/pmcstat/pmcstat.c b/usr.sbin/pmcstat/pmcstat.c index 08e43d5d446a..f366e2175a25 100644 --- a/usr.sbin/pmcstat/pmcstat.c +++ b/usr.sbin/pmcstat/pmcstat.c @@ -1187,8 +1187,7 @@ main(int argc, char **argv) */ STAILQ_FOREACH(ev, &args.pa_events, ev_next) { - if (ev->ev_pmcid == PMC_ID_INVALID && - pmc_allocate(ev->ev_spec, ev->ev_mode, + if (pmc_allocate(ev->ev_spec, ev->ev_mode, ev->ev_flags, ev->ev_cpu, &ev->ev_pmcid, ev->ev_count) < 0) err(EX_OSERR,
git: 713ceb99b685 - main - lagg: fix lagg ifioctl after SIOCSIFCAPNV
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=713ceb99b68568232bf9895bbe1811797bfde63c commit 713ceb99b68568232bf9895bbe1811797bfde63c Author: Andrew Gallatin AuthorDate: 2022-07-28 14:36:22 + Commit: Andrew Gallatin CommitDate: 2022-07-28 14:39:00 + lagg: fix lagg ifioctl after SIOCSIFCAPNV Lagg was broken by SIOCSIFCAPNV when all underlying devices support SIOCSIFCAPNV. This change updates lagg to work with SIOCSIFCAPNV and if_capabilities2. Reviewed by: kib, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35865 --- sys/net/if_lagg.c | 62 +++ sys/net/if_lagg.h | 1 + 2 files changed, 45 insertions(+), 18 deletions(-) diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c index 3894b6d55cea..8e273c4ed391 100644 --- a/sys/net/if_lagg.c +++ b/sys/net/if_lagg.c @@ -157,7 +157,7 @@ static void lagg_ratelimit_query(struct ifnet *, #endif static int lagg_setmulti(struct lagg_port *); static int lagg_clrmulti(struct lagg_port *); -static int lagg_setcaps(struct lagg_port *, int cap); +static voidlagg_setcaps(struct lagg_port *, int cap, int cap2); static int lagg_setflag(struct lagg_port *, int, int, int (*func)(struct ifnet *, int)); static int lagg_setflags(struct lagg_port *, int status); @@ -664,17 +664,20 @@ static void lagg_capabilities(struct lagg_softc *sc) { struct lagg_port *lp; - int cap, ena, pena; + int cap, cap2, ena, ena2, pena, pena2; uint64_t hwa; struct ifnet_hw_tsomax hw_tsomax; LAGG_XLOCK_ASSERT(sc); /* Get common enabled capabilities for the lagg ports */ - ena = ~0; - CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) + ena = ena2 = ~0; + CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) { ena &= lp->lp_ifp->if_capenable; - ena = (ena == ~0 ? 0 : ena); + ena2 &= lp->lp_ifp->if_capenable2; + } + if (CK_SLIST_FIRST(&sc->sc_ports) == NULL) + ena = ena2 = 0; /* * Apply common enabled capabilities back to the lagg ports. 
@@ -682,30 +685,36 @@ lagg_capabilities(struct lagg_softc *sc) */ do { pena = ena; + pena2 = ena2; CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) { - lagg_setcaps(lp, ena); + lagg_setcaps(lp, ena, ena2); ena &= lp->lp_ifp->if_capenable; + ena2 &= lp->lp_ifp->if_capenable2; } - } while (pena != ena); + } while (pena != ena || pena2 != ena2); /* Get other capabilities from the lagg ports */ - cap = ~0; + cap = cap2 = ~0; hwa = ~(uint64_t)0; memset(&hw_tsomax, 0, sizeof(hw_tsomax)); CK_SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) { cap &= lp->lp_ifp->if_capabilities; + cap2 &= lp->lp_ifp->if_capabilities2; hwa &= lp->lp_ifp->if_hwassist; if_hw_tsomax_common(lp->lp_ifp, &hw_tsomax); } - cap = (cap == ~0 ? 0 : cap); - hwa = (hwa == ~(uint64_t)0 ? 0 : hwa); + if (CK_SLIST_FIRST(&sc->sc_ports) == NULL) + cap = cap2 = hwa = 0; if (sc->sc_ifp->if_capabilities != cap || sc->sc_ifp->if_capenable != ena || + sc->sc_ifp->if_capenable2 != ena2 || sc->sc_ifp->if_hwassist != hwa || if_hw_tsomax_update(sc->sc_ifp, &hw_tsomax) != 0) { sc->sc_ifp->if_capabilities = cap; + sc->sc_ifp->if_capabilities2 = cap2; sc->sc_ifp->if_capenable = ena; + sc->sc_ifp->if_capenable2 = ena2; sc->sc_ifp->if_hwassist = hwa; getmicrotime(&sc->sc_ifp->if_lastchange); @@ -982,7 +991,7 @@ lagg_port_destroy(struct lagg_port *lp, int rundelport) if (lp->lp_detaching == 0) { lagg_setflags(lp, 0); - lagg_setcaps(lp, lp->lp_ifcapenable); + lagg_setcaps(lp, lp->lp_ifcapenable, lp->lp_ifcapenable2); if_setlladdr(ifp, lp->lp_lladdr, ifp->if_addrlen); } @@ -1038,6 +1047,7 @@ lagg_port_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data) break; case SIOCSIFCAP: + case SIOCSIFCAPNV: if (lp->lp_ioctl == NULL) { error = EINVAL; break; @@ -1690,6 +1700,7 @@ lagg_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data) break; case SIOCSIFCAP: + case SIOCSI
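The capability computation in lagg_capabilities() is an intersection over the member ports, with the no-ports case now decided by checking the list itself rather than by comparing the result with ~0. A rough userland sketch of that idea follows, using a hypothetical `common_caps()` helper and a plain array in place of the CK_SLIST of ports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * AND together the capability masks of all ports. With no ports the
 * result is 0, decided by the count rather than by the mask value.
 */
static uint32_t
common_caps(const uint32_t *port_caps, size_t nports)
{
	uint32_t caps = ~(uint32_t)0;
	size_t i;

	if (nports == 0)
		return (0);
	for (i = 0; i < nports; i++)
		caps &= port_caps[i];
	return (caps);
}
```

Deciding on the count keeps the two cases apart: a test on the mask value alone cannot tell "no ports" from "one port with every bit set". The same helper would be applied once per capability word, which is how the diff treats if_capenable and if_capenable2.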
git: b2921fdc2330 - main - arm64: Implement bus_get_resource and bus_delete_resource.
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=b2921fdc2330a5750f557fa321b94f972d5a7702 commit b2921fdc2330a5750f557fa321b94f972d5a7702 Author: Andrew Gallatin AuthorDate: 2023-11-11 17:54:19 + Commit: Andrew Gallatin CommitDate: 2023-11-11 17:57:39 + arm64: Implement bus_get_resource and bus_delete_resource. These devmethods were not defined, leading to the surprising result of using bus_set_resource(), and then immediately turning around and getting zeros back from bus_get_resource(). These are now simply passed through to the generic definitions, since there is no need for them to be arm64 specific. Note that jhb plans to replace most of the devmethods with the generic versions. Suggested by: jhb Sponsored by: Netflix --- sys/arm64/arm64/nexus.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/sys/arm64/arm64/nexus.c b/sys/arm64/arm64/nexus.c index b9871f0e9b3a..6ba73cd456ef 100644 --- a/sys/arm64/arm64/nexus.c +++ b/sys/arm64/arm64/nexus.c @@ -136,6 +136,8 @@ static device_method_t nexus_methods[] = { DEVMETHOD(bus_adjust_resource, nexus_adjust_resource), DEVMETHOD(bus_alloc_resource, nexus_alloc_resource), DEVMETHOD(bus_deactivate_resource, nexus_deactivate_resource), + DEVMETHOD(bus_delete_resource, bus_generic_rl_delete_resource), + DEVMETHOD(bus_get_resource, bus_generic_rl_get_resource), DEVMETHOD(bus_get_resource_list, nexus_get_reslist), DEVMETHOD(bus_map_resource, nexus_map_resource), DEVMETHOD(bus_release_resource, nexus_release_resource),
git: ab063ac4444e - main - ipmi_ssif: Fix typo in debug print
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=ab063ace426759cb5d053e50e02fa078a3c6 commit ab063ace426759cb5d053e50e02fa078a3c6 Author: Andrew Gallatin AuthorDate: 2023-11-14 00:44:27 + Commit: Andrew Gallatin CommitDate: 2023-11-14 00:46:56 + ipmi_ssif: Fix typo in debug print Fix a typo in a debug print that prevents compilation. Sponsored by: Netflix --- sys/dev/ipmi/ipmi_ssif.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sys/dev/ipmi/ipmi_ssif.c b/sys/dev/ipmi/ipmi_ssif.c index 3ac1e04c2eda..532d1f7f485c 100644 --- a/sys/dev/ipmi/ipmi_ssif.c +++ b/sys/dev/ipmi/ipmi_ssif.c @@ -200,7 +200,7 @@ read_start: goto fail; } #ifdef SSIF_DEBUG - device_printf("SSIF: READ_START: ok\n"); + device_printf(dev, "SSIF: READ_START: ok\n"); #endif /*
git: ba0e4d7971e0 - main - smbios: handle smbios3 for arm64
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=ba0e4d7971e05ee64281a4fc49a2fb408c8ad816 commit ba0e4d7971e05ee64281a4fc49a2fb408c8ad816 Author: Andrew Gallatin AuthorDate: 2023-11-15 16:11:53 + Commit: Andrew Gallatin CommitDate: 2023-11-15 16:20:04 + smbios: handle smbios3 for arm64 Get smbios working on arm64, where it seems to be exclusively smbios version 3.x. The "interesting" thing here is that the smbios table seems to be RAM in the EFI runtime services table. This makes it owned by "ram0", and not io memory. That prevents bus_alloc_resource() from being able to claim it, since ram0 already owns it. According to jhb, this is how things are supposed to work. E.g., bus_alloc_resource() is meant to be used with IO memory, not physical memory. Following his suggestion, I converted the driver to simply use pmap_mapbios(). This is a prerequisite for getting IPMI to attach via the SSIF attachment on arm64 servers, where all IPMI that I've seen uses SSIF. Note that this change is based on initial work by Allan Jude in https://reviews.freebsd.org/D28739. 
Reviewed by: imp Sponsored by: Netflix, Ampere Computing LLC (D28739) Differential Revision: https://reviews.freebsd.org/D42592 --- sys/dev/smbios/smbios.c | 176 sys/dev/smbios/smbios.h | 15 + 2 files changed, 132 insertions(+), 59 deletions(-) diff --git a/sys/dev/smbios/smbios.c b/sys/dev/smbios/smbios.c index b9dd8a40e9e4..7f89430226c8 100644 --- a/sys/dev/smbios/smbios.c +++ b/sys/dev/smbios/smbios.c @@ -57,41 +57,49 @@ struct smbios_softc { device_tdev; - struct resource * res; - int rid; - - struct smbios_eps * eps; + union { + struct smbios_eps * eps; + struct smbios3_eps *eps3; + }; + bool is_eps3; }; -#defineRES2EPS(res)((struct smbios_eps *)rman_get_virtual(res)) - static voidsmbios_identify (driver_t *, device_t); static int smbios_probe(device_t); static int smbios_attach (device_t); static int smbios_detach (device_t); static int smbios_modevent (module_t, int, void *); -static int smbios_cksum(struct smbios_eps *); +static int smbios_cksum(void *); +static boolsmbios_eps3 (void *); static void smbios_identify (driver_t *driver, device_t parent) { #ifdef ARCH_MAY_USE_EFI struct uuid efi_smbios = EFI_TABLE_SMBIOS; + struct uuid efi_smbios3 = EFI_TABLE_SMBIOS3; void *addr_efi; #endif struct smbios_eps *eps; + struct smbios3_eps *eps3; + void *ptr; device_t child; vm_paddr_t addr = 0; + size_t map_size = sizeof (*eps); int length; - int rid; if (!device_is_alive(parent)) return; #ifdef ARCH_MAY_USE_EFI - if (!efi_get_table(&efi_smbios, &addr_efi)) + if (!efi_get_table(&efi_smbios3, &addr_efi)) { addr = (vm_paddr_t)addr_efi; + map_size = sizeof (*eps3); + } else if (!efi_get_table(&efi_smbios, &addr_efi)) { + addr = (vm_paddr_t)addr_efi; + } + #endif #if defined(__amd64__) || defined(__i386__) @@ -101,28 +109,50 @@ smbios_identify (driver_t *driver, device_t parent) #endif if (addr != 0) { - eps = pmap_mapbios(addr, 0x1f); - rid = 0; - length = eps->length; - - if (length != 0x1f) { + ptr = pmap_mapbios(addr, map_size); + if (ptr == NULL) + return; + if 
(map_size == sizeof (*eps3)) { + eps3 = ptr; + length = eps3->length; + if (memcmp(eps3->anchor_string, + SMBIOS3_SIG, SMBIOS3_LEN) != 0) { + printf("smbios3: corrupt sig %s found\n", + eps3->anchor_string); + return; + } + } else { + eps = ptr; + length = eps->length; + if (memcmp(eps->anchor_string, + SMBIOS_SIG, SMBIOS_LEN) != 0) { + printf("smbios: corrupt sig %s found\n", + eps->anchor_string); + return; + } + } + if (length != map_size) { u_int8_t major, minor; major = eps->major_version; minor = eps->minor_version; /* SMBIOS
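The anchor-string validation added in this change can be modeled in user space. The following is a minimal sketch, assuming the `_SM_` and `_SM3_` anchors defined by the SMBIOS specification; `classify_eps` and the `eps_kind` enum are hypothetical names and not part of the driver, which picks the flavor from the EFI table GUID first and only then verifies the anchor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum eps_kind { EPS_NONE, EPS_V2, EPS_V3 };

/*
 * Hypothetical helper: classify a mapped entry point by its anchor
 * string. Only the anchor check from the commit is modeled here.
 */
static enum eps_kind
classify_eps(const unsigned char *p, size_t len)
{
	assert(p != NULL);
	/* SMBIOS 3.x (64-bit) entry point: 5-byte anchor "_SM3_". */
	if (len >= 5 && memcmp(p, "_SM3_", 5) == 0)
		return (EPS_V3);
	/* SMBIOS 2.x (32-bit) entry point: 4-byte anchor "_SM_". */
	if (len >= 4 && memcmp(p, "_SM_", 4) == 0)
		return (EPS_V2);
	return (EPS_NONE);
}
```

A caller would map at least the larger of the two entry-point structures before classifying, so that neither comparison reads past the mapping.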
git: 6f38d2e7b059 - main - acpi: Add workaround for Altra I2C memory resource
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=6f38d2e7b0599f9b61c04686eb9a7faf3264b8ec

commit 6f38d2e7b0599f9b61c04686eb9a7faf3264b8ec
Author:     Andrew Gallatin
AuthorDate: 2023-11-15 21:22:00 +0000
Commit:     Andrew Gallatin
CommitDate: 2023-11-15 21:25:00 +0000

    acpi: Add workaround for Altra I2C memory resource

    Submitted by: allanjude
    Sponsored by: Ampere Computing LLC
    Reviewed by: imp
    Differential Revision: https://reviews.freebsd.org/D28741
---
 sys/dev/acpica/acpi_resource.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/sys/dev/acpica/acpi_resource.c b/sys/dev/acpica/acpi_resource.c
index 373cc6da9820..b845fd146f67 100644
--- a/sys/dev/acpica/acpi_resource.c
+++ b/sys/dev/acpica/acpi_resource.c
@@ -517,6 +517,13 @@ acpi_parse_resources(device_t dev, ACPI_HANDLE handle,
 	acpi_MatchHid(handle, "ARMHD620") != ACPI_MATCHHID_NOMATCH)
 	arc.ignore_producer_flag = true;
 
+    /*
+     * The DesignWare I2C Controller on Ampere Altra sets ResourceProducer on
+     * memory resources.
+     */
+    if (acpi_MatchHid(handle, "APMC0D0F") != ACPI_MATCHHID_NOMATCH)
+	arc.ignore_producer_flag = true;
+
     status = AcpiWalkResources(handle, "_CRS", acpi_parse_resource, &arc);
     if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
 	printf("can't fetch resources for %s - %s\n",
git: 5972ffde919a - main - ig4(4): Add an EMAG device type
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=5972ffde919ab65ba29d4d51ccf735da18d52719 commit 5972ffde919ab65ba29d4d51ccf735da18d52719 Author: Andrew Gallatin AuthorDate: 2023-11-16 00:51:28 + Commit: Andrew Gallatin CommitDate: 2023-11-16 00:53:21 + ig4(4): Add an EMAG device type Sponsored by: Ampere Computing LLC, Netflix Submitted by: allanjude Differential Revision: https://reviews.freebsd.org/D28746 Reviewed by: imp --- sys/dev/ichiic/ig4_acpi.c | 12 ++-- sys/dev/ichiic/ig4_iic.c | 3 +++ sys/dev/ichiic/ig4_var.h | 1 + 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/sys/dev/ichiic/ig4_acpi.c b/sys/dev/ichiic/ig4_acpi.c index f88cca6cf13d..3f370ae7abb9 100644 --- a/sys/dev/ichiic/ig4_acpi.c +++ b/sys/dev/ichiic/ig4_acpi.c @@ -83,13 +83,21 @@ static int ig4iic_acpi_attach(device_t dev) { ig4iic_softc_t *sc; + char *str; int error; sc = device_get_softc(dev); sc->dev = dev; - /* All the HIDs matched are Atom SOCs. */ - sc->version = IG4_ATOM; + error = ACPI_ID_PROBE(device_get_parent(dev), dev, ig4iic_ids, &str); + if (error > 0) + return (error); + if (strcmp(str, "APMC0D0F") == 0) { + sc->version = IG4_EMAG; + } else { + /* All the other HIDs matched are Atom SOCs. 
*/ + sc->version = IG4_ATOM; + } sc->regs_rid = 0; sc->regs_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &sc->regs_rid, RF_ACTIVE); diff --git a/sys/dev/ichiic/ig4_iic.c b/sys/dev/ichiic/ig4_iic.c index 195bca62928a..3dc72c458b24 100644 --- a/sys/dev/ichiic/ig4_iic.c +++ b/sys/dev/ichiic/ig4_iic.c @@ -89,6 +89,9 @@ * Ig4 hardware parameters except Haswell are taken from intel_lpss driver */ static const struct ig4_hw ig4iic_hw[] = { + [IG4_EMAG] = { + .ic_clock_rate = 100, /* MHz */ + }, [IG4_HASWELL] = { .ic_clock_rate = 100, /* MHz */ .sda_hold_time = 90,/* nsec */ diff --git a/sys/dev/ichiic/ig4_var.h b/sys/dev/ichiic/ig4_var.h index 964a610e7408..989cf23779a2 100644 --- a/sys/dev/ichiic/ig4_var.h +++ b/sys/dev/ichiic/ig4_var.h @@ -42,6 +42,7 @@ #include "iicbus_if.h" enum ig4_vers { + IG4_EMAG, IG4_HASWELL, IG4_ATOM, IG4_SKYLAKE,
git: 5cd08d9ecf52 - main - apei: Mark ReadAckRegister resource as shareable
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=5cd08d9ecf52d37229f4888e38631cb91ce97eb9

commit 5cd08d9ecf52d37229f4888e38631cb91ce97eb9
Author:     Andrew Gallatin
AuthorDate: 2024-01-09 20:52:07 +0000
Commit:     Andrew Gallatin
CommitDate: 2024-01-09 21:07:34 +0000

    apei: Mark ReadAckRegister resource as shareable

    Work around vendors who use the same address for multiple
    ReadAckRegisters in their ACPI HEST table. This allows apei to
    attach cleanly on Ampere Altra servers.

    Note the issue is not specific to Ampere, I've run into it with
    at least one other vendor (whose server is not yet released).

    Sponsored by: Netflix
    Reviewed by: jhb
---
 sys/dev/acpica/acpi_apei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/dev/acpica/acpi_apei.c b/sys/dev/acpica/acpi_apei.c
index 6a3d9d10edd4..9cfd46c97430 100644
--- a/sys/dev/acpica/acpi_apei.c
+++ b/sys/dev/acpica/acpi_apei.c
@@ -711,7 +711,7 @@ apei_attach(device_t dev)
 		if (ge->v1.Header.Type == ACPI_HEST_TYPE_GENERIC_ERROR_V2) {
 			ge->res2_rid = rid++;
 			acpi_bus_alloc_gas(dev, &ge->res2_type, &ge->res2_rid,
-			    &ge->v2.ReadAckRegister, &ge->res2, 0);
+			    &ge->v2.ReadAckRegister, &ge->res2, RF_SHAREABLE);
 			if (ge->res2 == NULL)
 				device_printf(dev, "Can't allocate ack resource.\n");
 		}
git: be91b4797e2c - main - acpi_ged: Handle events directly
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=be91b4797e2c8f3440f6fe3aac7e246886f9ebca

commit be91b4797e2c8f3440f6fe3aac7e246886f9ebca
Author:     Andrew Gallatin
AuthorDate: 2023-10-12 15:15:06 +0000
Commit:     Andrew Gallatin
CommitDate: 2023-10-12 15:27:44 +0000

    acpi_ged: Handle events directly

    Handle ged interrupts directly from the interrupt handler, while
    the interrupt source is masked, so as to conform with the acpi
    spec, and avoid spurious interrupts and lockups on boot.

    When an acpi ged interrupt is encountered, the spec requires the
    os (as stated in 5.6.4: General Purpose Event Handling) to leave
    the interrupt source masked until it runs the EOI handler. This
    is not a good fit for our method of queuing the work (including
    the EOI ack of the interrupt), via the AcpiOsExecute() taskqueue
    mechanism.

    Note this fixes a bug where an arm64 server could lock up if it
    encountered a ged interrupt at boot. The lockup was due to
    running on a single core (due to arm64 not using
    EARLY_AP_STARTUP), and due to that core encountering a new
    interrupt each time the interrupt handler unmasked the interrupt
    source, and having the EOI queued on a taskqueue which never got
    a chance to run. This is also possible on any platform when
    using just a single processor. The symptom of this is a lockup
    at boot, with:

    "AcpiOsExecute: failed to enqueue task, consider increasing the
    debug.acpi.max_tasks tunable"

    scrolling on console.

    Similarly, spurious interrupts would occur when running with
    multiple cores, because it was likely that the interrupt would
    fire again immediately, before the ged task could be run, and
    before an EOI could be sent to lower the interrupt line. I would
    typically see 3-5 copies of every ged event due to this issue.

    This adds a tunable, debug.acpi.ged_defer, which can be set to 1
    to restore the old behavior. This was done because acpi is a
    complex system, and it is theoretically possible that something
    the ged handler does may sleep (though I cannot easily find
    anything by inspection).

    MFC after: 1 month
    Reviewed by: andrew, jhb, imp
    Sponsored by: Netflix
    Differential Revision: https://reviews.freebsd.org/D42158
---
 sys/dev/acpica/acpi_ged.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/sys/dev/acpica/acpi_ged.c b/sys/dev/acpica/acpi_ged.c
index e7dcc1ffb0ac..23e125f277c5 100644
--- a/sys/dev/acpica/acpi_ged.c
+++ b/sys/dev/acpica/acpi_ged.c
@@ -81,6 +81,11 @@ static driver_t acpi_ged_driver = {
 DRIVER_MODULE(acpi_ged, acpi, acpi_ged_driver, 0, 0);
 MODULE_DEPEND(acpi_ged, acpi, 1, 1, 1);
 
+static int acpi_ged_defer;
+SYSCTL_INT(_debug_acpi, OID_AUTO, ged_defer, CTLFLAG_RWTUN,
+    &acpi_ged_defer, 0,
+    "Handle ACPI GED via a task, rather than in the ISR");
+
 static void
 acpi_ged_evt(void *arg)
 {
@@ -92,7 +97,12 @@ acpi_ged_evt(void *arg)
 static void
 acpi_ged_intr(void *arg)
 {
-	AcpiOsExecute(OSL_GPE_HANDLER, acpi_ged_evt, arg);
+	struct acpi_ged_event *evt = arg;
+
+	if (acpi_ged_defer)
+		AcpiOsExecute(OSL_GPE_HANDLER, acpi_ged_evt, arg);
+	else
+		AcpiEvaluateObject(evt->ah, NULL, &evt->args, NULL);
 }
 
 static int acpi_ged_probe(device_t dev)
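The single-core lockup described in this commit can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `ged_sim`, `ged_intr_model`, `run_single_core` and `MAX_TASKS` are hypothetical names, the queue limit stands in for debug.acpi.max_tasks with an arbitrary value, and queued tasks never run because the only core is busy taking the re-asserted interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 8	/* arbitrary stand-in for debug.acpi.max_tasks */

struct ged_sim {
	bool line_asserted;	/* level-triggered source still asserted */
	int queued;		/* tasks waiting on a taskqueue that never runs */
	int handled;		/* times the event method actually ran */
	int dropped;		/* failed enqueue attempts */
};

/* Running the event method is what finally lowers the interrupt line. */
static void
evaluate_event(struct ged_sim *s)
{
	s->handled++;
	s->line_asserted = false;
}

/* Model of the handler: defer to a queue, or handle in place. */
static void
ged_intr_model(struct ged_sim *s, bool defer)
{
	if (!defer) {
		evaluate_event(s);
		return;
	}
	if (s->queued < MAX_TASKS)
		s->queued++;
	else
		s->dropped++;
}

/* Deliver interrupts while the line stays asserted, up to a limit. */
static int
run_single_core(struct ged_sim *s, bool defer, int limit)
{
	int fired = 0;

	assert(s != NULL);
	while (s->line_asserted && fired < limit) {
		ged_intr_model(s, defer);
		fired++;
	}
	return (fired);
}
```

With `defer` false the line drops after one delivery. With `defer` true the model keeps firing until the limit, filling the queue and then dropping enqueues, which corresponds to the "failed to enqueue task" message scrolling on the console.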
git: fd67ff5c7a6c - main - Use the correct idle routine on recent AMD EPYC servers
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=fd67ff5c7a6cd9a2e82e6a02ea249cec76a4c030

commit fd67ff5c7a6cd9a2e82e6a02ea249cec76a4c030
Author:     Andrew Gallatin
AuthorDate: 2024-11-08 21:37:32 +0000
Commit:     Andrew Gallatin
CommitDate: 2024-11-08 22:10:44 +0000

    Use the correct idle routine on recent AMD EPYC servers

    We have been incorrectly choosing the "hlt" idle method on modern
    AMD EPYC servers for C1 idle. This is because AMD also uses the
    Functional Fixed Hardware interface. Due to not parsing the table
    properly for AMD, and due to a weird quirk where the mwait
    latency for C1 is misinterpreted as the latency for hlt, we wind
    up choosing hlt for C1, which has a far higher wake-up latency
    (similar to IO) of roughly 400us on my test system (AMD 7502P).

    This patch fixes this by:

    - Looking for AMD in addition to Intel in the FFH (Note the
      vendor id of "2" for AMD is not publicly documented, but AMD
      has confirmed they are using "2" and has promised to document
      it.)
    - Using mwait on AMD when specified in the table, and when CPUID
      says it's supported
    - Fixing a weird issue where we copy the contents of cx_ptr for
      C1 and when moving to C2, we do not reinitialize cx_ptr. This
      leads to mwait being selected, and ignoring the specified i/o
      halt method unless we clear mwait before looking at the table
      for C2.
Differential Revision: https://reviews.freebsd.org/D47444 Reviewed by: dab, kib, vangyzen Sponsored by: Netflix --- sys/dev/acpica/acpi_cpu.c | 9 +++-- sys/x86/include/x86_var.h | 1 + sys/x86/x86/identcpu.c| 2 ++ 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/sys/dev/acpica/acpi_cpu.c b/sys/dev/acpica/acpi_cpu.c index 80855cf168e9..63e17de1ff28 100644 --- a/sys/dev/acpica/acpi_cpu.c +++ b/sys/dev/acpica/acpi_cpu.c @@ -131,6 +131,7 @@ struct acpi_cpu_device { #define PIIX4_PCNTRL_BST_EN(1<<10) #defineCST_FFH_VENDOR_INTEL1 +#defineCST_FFH_VENDOR_AMD 2 #defineCST_FFH_INTEL_CL_C1IO 1 #defineCST_FFH_INTEL_CL_MWAIT 2 #defineCST_FFH_MWAIT_HW_COORD 0x0001 @@ -855,7 +856,8 @@ acpi_cpu_cx_cst(struct acpi_cpu_softc *sc) acpi_cpu_cx_cst_free_plvlx(sc->cpu_dev, cx_ptr); #if defined(__i386__) || defined(__amd64__) if (acpi_PkgFFH_IntelCpu(pkg, 0, &vendor, &class, &address, - &accsize) == 0 && vendor == CST_FFH_VENDOR_INTEL) { + &accsize) == 0 && + (vendor == CST_FFH_VENDOR_INTEL || vendor == CST_FFH_VENDOR_AMD)) { if (class == CST_FFH_INTEL_CL_C1IO) { /* C1 I/O then Halt */ cx_ptr->res_rid = sc->cpu_cx_count; @@ -872,7 +874,9 @@ acpi_cpu_cx_cst(struct acpi_cpu_softc *sc) "degrading to C1 Halt", (int)address); } } else if (class == CST_FFH_INTEL_CL_MWAIT) { - acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize); + if (vendor == CST_FFH_VENDOR_INTEL || + (vendor == CST_FFH_VENDOR_AMD && cpu_mon_mwait_edx != 0)) + acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize); } } #endif @@ -922,6 +926,7 @@ acpi_cpu_cx_cst(struct acpi_cpu_softc *sc) acpi_PkgGas(sc->cpu_dev, pkg, 0, &cx_ptr->res_type, &cx_ptr->res_rid, &cx_ptr->p_lvlx, RF_SHAREABLE); if (cx_ptr->p_lvlx) { + cx_ptr->do_mwait = false; ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu%d: Got C%d - %d latency\n", device_get_unit(sc->cpu_dev), cx_ptr->type, diff --git a/sys/x86/include/x86_var.h b/sys/x86/include/x86_var.h index f19c557e270b..6609871bf89e 100644 --- a/sys/x86/include/x86_var.h +++ b/sys/x86/include/x86_var.h @@ 
-62,6 +62,7 @@ externcharcpu_vendor[]; extern charcpu_model[]; extern u_int cpu_vendor_id; extern u_int cpu_mon_mwait_flags; +extern u_int cpu_mon_mwait_edx; extern u_int cpu_mon_min_size; extern u_int cpu_mon_max_size; extern u_int cpu_maxphyaddr; diff --git a/sys/x86/x86/identcpu.c b/sys/x86/x86/identcpu.c index d3aec5b5e0c6..3f8f11fda011 100644 --- a/sys/x86/x86/identcpu.c +++ b/sys/x86/x86/identcpu.c @@ -117,6 +117,7 @@ u_int cpu_stdext_feature3;/* %edx */ uint64_t cpu_ia32_arch_caps; u_int cpu_max_ext_state_size; u_int cpu_mon_mwait_flags;/* MONITOR/MWAIT flags (CPUID.05H.ECX) */ +u_int cpu_mon_mwait_edx; /* MONITOR/MWAIT supported on AMD (CPUID.05H.EDX) */ u_int cpu_mon_min_size; /* MONITOR minimum range size, bytes */ u_int cpu_mon_max_size; /* MONITOR minimum range size
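The vendor and class checks in the acpi_cpu.c hunk can be summarized as a small decision function. This is a simplified sketch, not the driver's code: `pick_c1_method`, the `idle_method` enum and the `FFH_*` constants are hypothetical names mirroring the values in the diff, and the fallback from "I/O then halt" to plain halt on a failed resource allocation is left out.

```c
#include <assert.h>

/* Values mirror the CST_FFH_* defines in the diff; names are local. */
enum { FFH_VENDOR_INTEL = 1, FFH_VENDOR_AMD = 2 };
enum { FFH_CLASS_C1IO = 1, FFH_CLASS_MWAIT = 2 };

enum idle_method { IDLE_HLT, IDLE_IO_THEN_HLT, IDLE_MWAIT };

static enum idle_method
pick_c1_method(int vendor, int ffh_class, unsigned int mwait_edx)
{
	/* Unknown FFH vendors keep the default: plain hlt. */
	if (vendor != FFH_VENDOR_INTEL && vendor != FFH_VENDOR_AMD)
		return (IDLE_HLT);
	if (ffh_class == FFH_CLASS_C1IO)
		return (IDLE_IO_THEN_HLT);
	/* On AMD, mwait is used only if CPUID leaf 5 EDX is non-zero. */
	if (ffh_class == FFH_CLASS_MWAIT &&
	    (vendor == FFH_VENDOR_INTEL || mwait_edx != 0))
		return (IDLE_MWAIT);
	return (IDLE_HLT);
}

/* A few cases that exercise the decision above. */
static void
pick_c1_method_examples(void)
{
	assert(pick_c1_method(FFH_VENDOR_AMD, FFH_CLASS_MWAIT, 0x11) == IDLE_MWAIT);
	assert(pick_c1_method(FFH_VENDOR_AMD, FFH_CLASS_MWAIT, 0) == IDLE_HLT);
	assert(pick_c1_method(FFH_VENDOR_INTEL, FFH_CLASS_MWAIT, 0) == IDLE_MWAIT);
	assert(pick_c1_method(3, FFH_CLASS_MWAIT, 0x11) == IDLE_HLT);
}
```

Before the fix, only the Intel vendor id passed the first check, so an AMD table entry asking for mwait fell through to hlt.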
git: 49597c3e84c4 - main - mlx5e: Use M_WAITOK when allocating TLS tags
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=49597c3e84c4a1cc35f2c392d93db8d0a1cccac2

commit 49597c3e84c4a1cc35f2c392d93db8d0a1cccac2
Author:     Andrew Gallatin
AuthorDate: 2024-10-23 19:56:14 +0000
Commit:     Andrew Gallatin
CommitDate: 2024-10-23 19:56:14 +0000

    mlx5e: Use M_WAITOK when allocating TLS tags

    Now that it is clear we're in a sleepable context, use M_WAITOK
    when allocating TLS tags.

    Suggested by: kib
    Sponsored by: Netflix
---
 sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
index c347de650250..b5caa3ba53dd 100644
--- a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
+++ b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
@@ -335,9 +335,7 @@ mlx5e_tls_snd_tag_alloc(if_t ifp,
 		return (EOPNOTSUPP);
 
 	/* allocate new tag from zone, if any */
-	ptag = uma_zalloc(priv->tls.zone, M_NOWAIT);
-	if (ptag == NULL)
-		return (ENOMEM);
+	ptag = uma_zalloc(priv->tls.zone, M_WAITOK);
 
 	/* sanity check default values */
 	MPASS(ptag->dek_index == 0);
git: 81dbc22ce8b6 - main - mlx5e: Immediately initialize TLS send tags
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=81dbc22ce8b66759a9fc4ebdef5cfc7a6185af22

commit 81dbc22ce8b66759a9fc4ebdef5cfc7a6185af22
Author:     Andrew Gallatin
AuthorDate: 2024-10-23 19:16:19 +0000
Commit:     Andrew Gallatin
CommitDate: 2024-10-23 19:16:19 +0000

    mlx5e: Immediately initialize TLS send tags

    Under massive connection thrashing (web server restarting), we
    see long periods where the web server blocks when enabling ktls
    offload when NIC ktls offload is enabled.

    It turns out the driver uses a single-threaded linux work queue
    to serialize the commands that must be sent to the nic to
    allocate and free tls resources. When freeing sessions, this
    work is handled asynchronously. However, when allocating
    sessions, the work is handled synchronously and the driver waits
    for the work to complete before returning. When under massive
    connection thrashing, the work queue is first filled by TLS
    sessions closing. Then when new sessions arrive, the web server
    enables kTLS and blocks while the tens or hundreds of thousands
    of session closes queued up are processed by the NIC.

    Rather than using the work queue to open a TLS session on the
    NIC, switch to doing the open directly. This allows us to cut in
    front of all those sessions that are waiting to close, and
    minimize the amount of time the web server blocks. The risk is
    that the NIC may be out of resources because it has not
    processed all of those session frees. So if we fail to open a
    session directly, we fall back to using the work queue.
Differential Revision: https://reviews.freebsd.org/D47260 Sponsored by: Netflix Reviewed by: kib --- sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c | 86 +-- 1 file changed, 52 insertions(+), 34 deletions(-) diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c index a8522d68d5aa..c347de650250 100644 --- a/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c +++ b/sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c @@ -213,54 +213,63 @@ mlx5e_tls_cleanup(struct mlx5e_priv *priv) counter_u64_free(ptls->stats.arg[x]); } + +static int +mlx5e_tls_st_init(struct mlx5e_priv *priv, struct mlx5e_tls_tag *ptag) +{ + int err; + + /* try to open TIS, if not present */ + if (ptag->tisn == 0) { + err = mlx5_tls_open_tis(priv->mdev, 0, priv->tdn, + priv->pdn, &ptag->tisn); + if (err) { + MLX5E_TLS_STAT_INC(ptag, tx_error, 1); + return (err); + } + } + MLX5_SET(sw_tls_cntx, ptag->crypto_params, progress.pd, ptag->tisn); + + /* try to allocate a DEK context ID */ + err = mlx5_encryption_key_create(priv->mdev, priv->pdn, + MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_TYPE_TLS, + MLX5_ADDR_OF(sw_tls_cntx, ptag->crypto_params, key.key_data), + MLX5_GET(sw_tls_cntx, ptag->crypto_params, key.key_len), + &ptag->dek_index); + if (err) { + MLX5E_TLS_STAT_INC(ptag, tx_error, 1); + return (err); + } + + MLX5_SET(sw_tls_cntx, ptag->crypto_params, param.dek_index, ptag->dek_index); + + ptag->dek_index_ok = 1; + + MLX5E_TLS_TAG_LOCK(ptag); + if (ptag->state == MLX5E_TLS_ST_INIT) + ptag->state = MLX5E_TLS_ST_SETUP; + MLX5E_TLS_TAG_UNLOCK(ptag); + return (0); +} + static void mlx5e_tls_work(struct work_struct *work) { struct mlx5e_tls_tag *ptag; struct mlx5e_priv *priv; - int err; ptag = container_of(work, struct mlx5e_tls_tag, work); priv = container_of(ptag->tls, struct mlx5e_priv, tls); switch (ptag->state) { case MLX5E_TLS_ST_INIT: - /* try to open TIS, if not present */ - if (ptag->tisn == 0) { - err = mlx5_tls_open_tis(priv->mdev, 0, priv->tdn, - priv->pdn, &ptag->tisn); - if (err) { - 
MLX5E_TLS_STAT_INC(ptag, tx_error, 1); - break; - } - } - MLX5_SET(sw_tls_cntx, ptag->crypto_params, progress.pd, ptag->tisn); - - /* try to allocate a DEK context ID */ - err = mlx5_encryption_key_create(priv->mdev, priv->pdn, - MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_TYPE_TLS, - MLX5_ADDR_OF(sw_tls_cntx, ptag->crypto_params, key.key_data), - MLX5_GET(sw_tls_cntx, ptag->crypto_params, key.key_len), - &ptag->dek_index); - if (err) { - MLX5E_TLS_STAT_INC(ptag, tx_error,
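The "try directly, then fall back to the work queue" strategy from the commit message can be sketched in isolation. This is a user-space illustration only: `tls_tag`, `tag_setup`, `init_ok` and `init_fail` are hypothetical names, the two callbacks stand in for the direct firmware command path and the queued path, and the driver's real fallback code is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>

struct tls_tag {
	int ready;	/* set once the tag has been initialized */
	int used_queue;	/* set if the slow, queued path was needed */
};

typedef int (*tag_init_fn)(struct tls_tag *);

/* Hypothetical stand-ins: 0 is success, non-zero an errno-style error. */
static int init_ok(struct tls_tag *t) { (void)t; return (0); }
static int init_fail(struct tls_tag *t) { (void)t; return (12); }

/*
 * Try the direct path first so a new session does not wait behind
 * queued session frees. Fall back to the queued path on failure.
 */
static int
tag_setup(struct tls_tag *t, tag_init_fn direct, tag_init_fn queued)
{
	int err;

	assert(t != NULL && direct != NULL && queued != NULL);
	err = direct(t);
	if (err != 0) {
		t->used_queue = 1;
		err = queued(t);
	}
	if (err == 0)
		t->ready = 1;
	return (err);
}
```

The fallback matters because the direct path can fail when the NIC has not yet processed the pending frees. In that case waiting behind them on the queue is the only way to obtain resources.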
git: 4605a99b51ab - main - aio: remove write-only jobid & kernelinfo
The branch main has been updated by gallatin: URL: https://cgit.FreeBSD.org/src/commit/?id=4605a99b51ab72351d7554fbadbb24985f4667b1 commit 4605a99b51ab72351d7554fbadbb24985f4667b1 Author: Andrew Gallatin AuthorDate: 2024-11-15 15:41:34 + Commit: Andrew Gallatin CommitDate: 2024-11-15 15:47:46 + aio: remove write-only jobid & kernelinfo The jobid (which was stored in kernelinfo) was used to look up jobs until 1ce9182407f6, where it became essentially write only. Remove it to simplify the code and pave the way for future work to make aio scale better. Note this has been slated for removal "soon" for 18 years. Suggested by: jhb Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D47583 --- sys/kern/vfs_aio.c | 42 +- sys/sys/aio.h | 2 +- 2 files changed, 2 insertions(+), 42 deletions(-) diff --git a/sys/kern/vfs_aio.c b/sys/kern/vfs_aio.c index e7302f4b7a9e..eb08716fbeda 100644 --- a/sys/kern/vfs_aio.c +++ b/sys/kern/vfs_aio.c @@ -71,12 +71,6 @@ #include #include -/* - * Counter for allocating reference ids to new jobs. Wrapped to 1 on - * overflow. (XXX will be removed soon.) - */ -static u_long jobrefid; - /* * Counter for aio_fsync. 
*/ @@ -297,7 +291,6 @@ struct aiocb_ops { long(*fetch_error)(struct aiocb *ujob); int (*store_status)(struct aiocb *ujob, long status); int (*store_error)(struct aiocb *ujob, long error); - int (*store_kernelinfo)(struct aiocb *ujob, long jobref); int (*store_aiocb)(struct aiocb **ujobp, struct aiocb *ujob); }; @@ -418,7 +411,6 @@ aio_onceonly(void) aiolio_zone = uma_zcreate("AIOLIO", sizeof(struct aioliojob), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); aiod_lifetime = AIOD_LIFETIME_DEFAULT; - jobrefid = 1; p31b_setcfg(CTL_P1003_1B_ASYNCHRONOUS_IO, _POSIX_ASYNCHRONOUS_IO); p31b_setcfg(CTL_P1003_1B_AIO_MAX, MAX_AIO_QUEUE); p31b_setcfg(CTL_P1003_1B_AIO_PRIO_DELTA_MAX, 0); @@ -1455,13 +1447,6 @@ aiocb_store_error(struct aiocb *ujob, long error) return (suword(&ujob->_aiocb_private.error, error)); } -static int -aiocb_store_kernelinfo(struct aiocb *ujob, long jobref) -{ - - return (suword(&ujob->_aiocb_private.kernelinfo, jobref)); -} - static int aiocb_store_aiocb(struct aiocb **ujobp, struct aiocb *ujob) { @@ -1475,7 +1460,6 @@ static struct aiocb_ops aiocb_ops = { .fetch_error = aiocb_fetch_error, .store_status = aiocb_store_status, .store_error = aiocb_store_error, - .store_kernelinfo = aiocb_store_kernelinfo, .store_aiocb = aiocb_store_aiocb, }; @@ -1486,7 +1470,6 @@ static struct aiocb_ops aiocb_ops_osigevent = { .fetch_error = aiocb_fetch_error, .store_status = aiocb_store_status, .store_error = aiocb_store_error, - .store_kernelinfo = aiocb_store_kernelinfo, .store_aiocb = aiocb_store_aiocb, }; #endif @@ -1507,7 +1490,6 @@ aio_aqueue(struct thread *td, struct aiocb *ujob, struct aioliojob *lj, int opcode; int error; int fd, kqfd; - int jid; u_short evflags; if (p->p_aioinfo == NULL) @@ -1517,7 +1499,6 @@ aio_aqueue(struct thread *td, struct aiocb *ujob, struct aioliojob *lj, ops->store_status(ujob, -1); ops->store_error(ujob, 0); - ops->store_kernelinfo(ujob, -1); if (num_queue_count >= max_queue_count || ki->kaio_count >= max_aio_queue_per_proc) { @@ -1630,16 
+1611,8 @@ aio_aqueue(struct thread *td, struct aiocb *ujob, struct aioliojob *lj, job->fd_file = fp; mtx_lock(&aio_job_mtx); - jid = jobrefid++; job->seqno = jobseqno++; mtx_unlock(&aio_job_mtx); - error = ops->store_kernelinfo(ujob, jid); - if (error) { - error = EINVAL; - goto err3; - } - job->uaiocb._aiocb_private.kernelinfo = (void *)(intptr_t)jid; - if (opcode == LIO_NOP) { fdrop(fp, td); MPASS(job->uiop == &job->uio || job->uiop == NULL); @@ -2728,7 +2701,7 @@ filt_lio(struct knote *kn, long hint) struct __aiocb_private32 { int32_t status; int32_t error; - uint32_t kernelinfo; + uint32_t spare; }; #ifdef COMPAT_FREEBSD6 @@ -2807,7 +2780,6 @@ aiocb32_copyin_old_sigevent(struct aiocb *ujob, struct kaiocb *kjob, CP(job32, *kcb, aio_reqprio); CP(job32, *kcb, _aiocb_private.status); CP(job32, *kcb, _aiocb_private.error); - PTRIN_CP(job32, *kcb, _aiocb_private.kernelinfo); return (convert_old_sigevent32(&job32.aio_sigevent, &kcb->aio_sigevent)); } @@ -2844,7 +2816,6 @@ aiocb32_copyin(struct aiocb *ujob, struct kaioc
git: 194bb58b80c1 - main - x86: Fixes for nmi/pmi interrupt sharing
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=194bb58b80c184b8230edef0ed7f292b4bf706b0

commit 194bb58b80c184b8230edef0ed7f292b4bf706b0
Author:     Andrew Gallatin
AuthorDate: 2025-02-04 22:04:57 +0000
Commit:     Andrew Gallatin
CommitDate: 2025-02-05 15:26:27 +0000

    x86: Fixes for nmi/pmi interrupt sharing

    - Fix a bug where the semantics of refcount_release() were
      reversed. This would lead to the nmi interrupt being
      prematurely masked in the local apic, leading to an
      out-of-tree profiling tool only getting results the first
      time it was run.

    - Stop executing nmi handlers after one claims the interrupt.
      The core2 hwpmc handler seems to be especially heavy, and
      running it in addition to vtune's handler caused roughly 50%
      of the nmi interrupts to be lost (and caused vtune to give
      worse results).

    Reviewed by: bojan
    Sponsored by: Netflix
---
 sys/x86/x86/cpu_machdep.c | 11 ---
 sys/x86/x86/local_apic.c | 2 +-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/sys/x86/x86/cpu_machdep.c b/sys/x86/x86/cpu_machdep.c
index 4df652f1f2a8..5b4abfe71642 100644
--- a/sys/x86/x86/cpu_machdep.c
+++ b/sys/x86/x86/cpu_machdep.c
@@ -65,6 +65,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -955,6 +956,7 @@ nmi_handle_intr(struct trapframe *frame)
 {
 	int (*func)(struct trapframe *);
 	struct nmi_handler *hp;
+	int rv;
 	bool handled;
 
 #ifdef SMP
@@ -965,13 +967,16 @@ nmi_handle_intr(struct trapframe *frame)
 	handled = false;
 	hp = (struct nmi_handler *)atomic_load_acq_ptr(
 	    (uintptr_t *)&nmi_handlers_head);
-	while (hp != NULL) {
+	while (!handled && hp != NULL) {
 		func = hp->func;
 		if (func != NULL) {
 			atomic_add_int(&hp->running, 1);
-			if (func(frame) != 0)
-				handled = true;
+			rv = func(frame);
 			atomic_subtract_int(&hp->running, 1);
+			if (rv != 0) {
+				handled = true;
+				break;
+			}
 		}
 		hp = (struct nmi_handler *)atomic_load_acq_ptr(
 		    (uintptr_t *)&hp->next);
diff --git a/sys/x86/x86/local_apic.c b/sys/x86/x86/local_apic.c
index 86cbe9a050dc..db9a1eb757de 100644
--- a/sys/x86/x86/local_apic.c
+++ b/sys/x86/x86/local_apic.c
@@ -895,7 +895,7 @@ lapic_disable_pcint(void)
 	maxlvt = (lapic_read32(LAPIC_VERSION) & APIC_VER_MAXLVT) >> MAXLVTSHIFT;
 	if (maxlvt < APIC_LVT_PMC)
 		return;
-	if (refcount_release(&pcint_refcnt))
+	if (!refcount_release(&pcint_refcnt))
 		return;
 	lvts[APIC_LVT_PMC].lvt_masked = 1;
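The refcount fix hinges on the return value of refcount_release(), which is true only when the last reference is dropped. A user-space sketch of the corrected logic follows, using C11 atomics in place of the kernel's refcount(9). `ref_release`, `enable_pcint_model` and `disable_pcint_model` are hypothetical names, and the enable side is simplified to always unmask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint pcint_refcnt;	/* users of the PMC interrupt */
static bool pmc_lvt_masked = true;	/* models lvts[APIC_LVT_PMC].lvt_masked */

/* Drop one reference. Returns true only if it was the last one. */
static bool
ref_release(atomic_uint *cnt)
{
	unsigned int old = atomic_fetch_sub(cnt, 1);

	assert(old > 0);
	return (old == 1);
}

static void
enable_pcint_model(void)
{
	atomic_fetch_add(&pcint_refcnt, 1);
	pmc_lvt_masked = false;
}

static void
disable_pcint_model(void)
{
	/* Other users remain: leave the interrupt unmasked. */
	if (!ref_release(&pcint_refcnt))
		return;
	pmc_lvt_masked = true;
}
```

With the test inverted, as it was before the fix, the first of two users to call the disable path would mask the interrupt for both, and the last user would leave it unmasked.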
git: 36fdc42c6a4c - main - mlx5en: Fix SIOCSIFCAPNV
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=36fdc42c6a4c828d334471438c4f852e4b5a25e2

commit 36fdc42c6a4c828d334471438c4f852e4b5a25e2
Author:     Andrew Gallatin
AuthorDate: 2025-01-31 01:07:06 +0000
Commit:     Andrew Gallatin
CommitDate: 2025-01-31 01:57:35 +0000

    mlx5en: Fix SIOCSIFCAPNV

    In 4cc5d081d8c23, a change was introduced that manipulated
    drv_ioctl_data->reqcap using IFCAP2 bits. This was noticed when
    creating a mixed lagg with mce0 and ixl0 caused the interfaces'
    txcsum caps to be disabled.

    Fixes: 4cc5d081d8c23
    Reviewed by: glebius
    Sponsored by: Netflix
    MFC After: 7 days
---
 sys/dev/mlx5/mlx5_en/mlx5_en_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
index 3e7df834d080..c17da50c1a5e 100644
--- a/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
+++ b/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
@@ -3557,12 +3557,12 @@ siocsifcap_driver:
 			    IFCAP_TXTLS6);
 		}
 		if (!mlx5e_is_tlsrx_capable(priv->mdev)) {
-			drv_ioctl_data->reqcap &= ~(
+			drv_ioctl_data->reqcap2 &= ~(
 			    IFCAP2_BIT(IFCAP2_RXTLS4) |
 			    IFCAP2_BIT(IFCAP2_RXTLS6));
 		}
 		if (!mlx5e_is_ipsec_capable(priv->mdev)) {
-			drv_ioctl_data->reqcap &=
+			drv_ioctl_data->reqcap2 &=
 			    ~IFCAP2_BIT(IFCAP2_IPSEC_OFFLOAD);
 		}
 		if (!mlx5e_is_ratelimit_capable(priv->mdev)) {
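Why masking the wrong word disables txcsum can be shown with a few lines of bit arithmetic. This sketch assumes the second capability word numbers its bits from zero, so that its low bits overlap the checksum bits of the first word. The `CAP_*` and `CAP2_*` constants and the `cap_req` struct are illustrative stand-ins, not the definitions from net/if.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for first-word and second-word capabilities. */
#define CAP_TXCSUM	0x2u		/* bit 1 of the first word */
#define CAP2_RXTLS4	0		/* bit index in the second word */
#define CAP2_RXTLS6	1
#define CAP2_BIT(x)	(1u << (x))

struct cap_req {
	uint32_t reqcap;	/* first capability word */
	uint32_t reqcap2;	/* second capability word */
};

/* The bug: second-word bits applied to the first word. */
static void
strip_tlsrx_buggy(struct cap_req *r)
{
	assert(r != NULL);
	r->reqcap &= ~(CAP2_BIT(CAP2_RXTLS4) | CAP2_BIT(CAP2_RXTLS6));
}

/* The fix: second-word bits applied to the second word. */
static void
strip_tlsrx_fixed(struct cap_req *r)
{
	assert(r != NULL);
	r->reqcap2 &= ~(CAP2_BIT(CAP2_RXTLS4) | CAP2_BIT(CAP2_RXTLS6));
}
```

Under these assumptions the buggy variant clears bits 0 and 1 of `reqcap`, which removes the txcsum request while leaving the RX TLS request in `reqcap2` untouched.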
git: cf9070746742 - main - Introduce the UMA_ZONE_NOTRIM uma zone type
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=cf907074674206b1825f79c6864c4c4a32089ecc

commit cf907074674206b1825f79c6864c4c4a32089ecc
Author:     Andrew Gallatin
AuthorDate: 2025-01-15 17:11:51 +0000
Commit:     Andrew Gallatin
CommitDate: 2025-01-15 17:23:00 +0000

    Introduce the UMA_ZONE_NOTRIM uma zone type

    The ktls buffer zone allocates 16k contiguous buffers, and often
    needs to call vm_page_reclaim_contig_domain_ext() to free up
    contiguous memory, which can be expensive. Web servers which
    have a daily pattern of peaks and troughs end up having UMA trim
    the ktls_buffer_zone when they are in their trough, and end up
    re-building it on the way to their peak.

    Rather than calling vm_page_reclaim_contig_domain_ext() multiple
    times on a daily basis, let's mark the ktls_buffer_zone with a
    new UMA flag, UMA_ZONE_NOTRIM. This disables UMA_RECLAIM_TRIM on
    the zone, but allows UMA_RECLAIM_DRAIN* operations, so that if
    we become extremely short of memory (vm_page_count_severe()),
    the uma reclaim worker can still free up memory.

    Note that UMA_ZONE_UNMANAGED already exists, but can never be
    drained or trimmed, so it may hold on to memory during times of
    severe memory pressure. Using UMA_ZONE_NOTRIM rather than
    UMA_ZONE_UNMANAGED is an attempt to keep this zone more reactive
    in the face of severe memory pressure.
Sponsored by: Netflix Reviewed by: jhb, kib, markj (via slack) Differential Revision: https://reviews.freebsd.org/D48451 --- sys/kern/uipc_ktls.c | 2 +- sys/vm/uma.h | 1 + sys/vm/uma_core.c| 11 --- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/sys/kern/uipc_ktls.c b/sys/kern/uipc_ktls.c index 881825bf1d9f..6815667594a4 100644 --- a/sys/kern/uipc_ktls.c +++ b/sys/kern/uipc_ktls.c @@ -495,7 +495,7 @@ ktls_init(void) ktls_buffer_zone = uma_zcache_create("ktls_buffers", roundup2(ktls_maxlen, PAGE_SIZE), NULL, NULL, NULL, NULL, ktls_buffer_import, ktls_buffer_release, NULL, - UMA_ZONE_FIRSTTOUCH); + UMA_ZONE_FIRSTTOUCH | UMA_ZONE_NOTRIM); } /* diff --git a/sys/vm/uma.h b/sys/vm/uma.h index 38865df7ae02..4f2b143a2fae 100644 --- a/sys/vm/uma.h +++ b/sys/vm/uma.h @@ -252,6 +252,7 @@ uma_zone_t uma_zcache_create(const char *name, int size, uma_ctor ctor, #defineUMA_ZONE_SECONDARY 0x0200 /* Zone is a Secondary Zone */ #defineUMA_ZONE_NOBUCKET 0x0400 /* Do not use buckets. */ #defineUMA_ZONE_MAXBUCKET 0x0800 /* Use largest buckets. */ +#defineUMA_ZONE_NOTRIM 0x1000 /* Don't trim this zone */ #defineUMA_ZONE_CACHESPREAD0x2000 /* * Spread memory start locations across * all possible cache lines. May diff --git a/sys/vm/uma_core.c b/sys/vm/uma_core.c index e93c561d759a..4de850afcb66 100644 --- a/sys/vm/uma_core.c +++ b/sys/vm/uma_core.c @@ -1222,7 +1222,7 @@ zone_timeout(uma_zone_t zone, void *unused) trim: /* Trim caches not used for a long time. 
*/ - if ((zone->uz_flags & UMA_ZONE_UNMANAGED) == 0) { + if ((zone->uz_flags & (UMA_ZONE_UNMANAGED | UMA_ZONE_NOTRIM)) == 0) { for (int i = 0; i < vm_ndomains; i++) { if (bucket_cache_reclaim_domain(zone, false, false, i) && (zone->uz_flags & UMA_ZFLAG_CACHE) == 0) @@ -5306,8 +5306,13 @@ uma_reclaim_domain_cb(uma_zone_t zone, void *arg) struct uma_reclaim_args *args; args = arg; - if ((zone->uz_flags & UMA_ZONE_UNMANAGED) == 0) - uma_zone_reclaim_domain(zone, args->req, args->domain); + if ((zone->uz_flags & UMA_ZONE_UNMANAGED) != 0) + return; + if ((args->req == UMA_RECLAIM_TRIM) && + (zone->uz_flags & UMA_ZONE_NOTRIM) !=0) + return; + + uma_zone_reclaim_domain(zone, args->req, args->domain); } /* See uma.h */
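The reclaim filter introduced in uma_reclaim_domain_cb() reduces to a small predicate. The sketch below is a user-space restatement with hypothetical names (`zone_may_reclaim`, `ZONE_*`, `reclaim_req`); the flag and request values are local to the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the zone flags discussed in the commit. */
#define ZONE_UNMANAGED	0x0001u	/* never drained or trimmed */
#define ZONE_NOTRIM	0x1000u	/* drained under pressure, never trimmed */

enum reclaim_req { RECLAIM_DRAIN = 1, RECLAIM_DRAIN_CPU, RECLAIM_TRIM };

/* Should a reclaim request of this kind be applied to the zone? */
static bool
zone_may_reclaim(unsigned int flags, enum reclaim_req req)
{
	if ((flags & ZONE_UNMANAGED) != 0)
		return (false);
	if (req == RECLAIM_TRIM && (flags & ZONE_NOTRIM) != 0)
		return (false);
	return (true);
}

/* The three zone kinds compared in the commit message. */
static void
zone_may_reclaim_examples(void)
{
	assert(zone_may_reclaim(0, RECLAIM_TRIM));
	assert(!zone_may_reclaim(ZONE_NOTRIM, RECLAIM_TRIM));
	assert(zone_may_reclaim(ZONE_NOTRIM, RECLAIM_DRAIN));
	assert(!zone_may_reclaim(ZONE_UNMANAGED, RECLAIM_DRAIN));
}
```

This captures the difference the message draws between the two flags: a NOTRIM zone still gives memory back on a drain request, while an UNMANAGED zone never does.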
git: 709348c21351 - main - ifconfig: fix reporting optics on most 100g interfaces
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=709348c21351a783ff0025519d1f7cf884771077

commit 709348c21351a783ff0025519d1f7cf884771077
Author:     Andrew Gallatin
AuthorDate: 2025-02-25 19:17:14 +0000
Commit:     Andrew Gallatin
CommitDate: 2025-02-25 19:26:07 +0000

    ifconfig: fix reporting optics on most 100g interfaces

    This fixes a bug where optics on 100G and faster NICs were not
    properly reported. The problem is that we pull the string from
    the correct table in ifconfig_sfp_physical_spec only when
    sfp_eth_1040g contains the SFP_ETH_1040G_EXTENDED bit. However,
    we were never saving that bit when it was encountered. This
    change records that bit into sfp_eth_1040g, allowing us to later
    select the appropriate ID string.

    This should cause most 100G interfaces to stop being identified
    as "unknown" in the "plugged" output of ifconfig -v, and to
    start being identified as what they really are.

    Example output from a Chelsio T6 with SR4 optics in one port and
    DR1 optics in another:

    Before:
        plugged: QSFP28 Unknown (MPO 1x12 Parallel Optic)
        plugged: QSFP28 Unknown (LC)

    After:
        plugged: QSFP28 100GBASE-SR4 or 25GBASE-SR (MPO 1x12 Parallel Optic)
        plugged: QSFP28 100GBASE-DR (LC)

    Reviewed by: kbowling, np
    Sponsored by: Netflix
    Differential Revision: https://reviews.freebsd.org/D49127
    MFC after: 7 days
---
 lib/libifconfig/libifconfig_sfp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/libifconfig/libifconfig_sfp.c b/lib/libifconfig/libifconfig_sfp.c
index 17f130606765..8292135d3e47 100644
--- a/lib/libifconfig/libifconfig_sfp.c
+++ b/lib/libifconfig/libifconfig_sfp.c
@@ -181,6 +181,7 @@ get_qsfp_info(struct i2c_info *ii, struct ifconfig_sfp_info *sfp)
 	if (code & SFF_8636_EXT_COMPLIANCE) {
 		read_i2c(ii, SFF_8436_BASE, SFF_8436_OPTIONS_START, 1,
 		    &sfp->sfp_eth_ext);
+		sfp->sfp_eth_1040g = code;
 	} else {
 		/* Check 10/40G Ethernet class only */
 		sfp->sfp_eth_1040g =
git: 20e15e905c58 - main - mlx5: Decrease FW init timeout from 120 seconds to 5 seconds
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=20e15e905c58e9e2020b2c3e40caa2e8406e5827

commit 20e15e905c58e9e2020b2c3e40caa2e8406e5827
Author:     Andrew Gallatin
AuthorDate: 2025-06-29 20:51:50 +
Commit:     Andrew Gallatin
CommitDate: 2025-06-29 20:51:50 +

    mlx5: Decrease FW init timeout from 120 seconds to 5 seconds

    When encountering a failed NIC, the mlx5 driver will wait up to
    120 secs for the firmware to respond. This timeout is absurdly
    huge, and leads to boot times of 40 minutes to over an hour on our
    servers when a NIC fails. This is because the driver will attempt
    to attach to the failed NIC multiple times (once for each driver
    loaded after mlx5), and wait 2 minutes on each attempt. This
    happens because the mlx5 driver is still the best match for the
    device.

    This delay then triggers watchdog timeouts in our environment,
    rendering servers with a failed NIC entirely unbootable without
    manual intervention.

    Note that FW_INIT_WARN_MESSAGE_INTERVAL must also be decreased, as
    it must be less than the init timeout.

    Reviewed by: kib (initial version, before reducing warn interval)
    Sponsored by: Netflix
---
 sys/dev/mlx5/device.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/dev/mlx5/device.h b/sys/dev/mlx5/device.h
index e6d46507a5d2..3e2c4f15a5cc 100644
--- a/sys/dev/mlx5/device.h
+++ b/sys/dev/mlx5/device.h
@@ -32,8 +32,8 @@
 
 #define	FW_INIT_TIMEOUT_MILI	2000
 #define	FW_INIT_WAIT_MS	2
-#define	FW_PRE_INIT_TIMEOUT_MILI	120000
-#define	FW_INIT_WARN_MESSAGE_INTERVAL	20000
+#define	FW_PRE_INIT_TIMEOUT_MILI	5000
+#define	FW_INIT_WARN_MESSAGE_INTERVAL	2000
 
 #if defined(__LITTLE_ENDIAN)
 #define MLX5_SET_HOST_ENDIANNESS 0
git: 78bdaa57cfba - main - lagg: Fix if_hw_tsomax_update() not being called
The branch main has been updated by gallatin:

URL: https://cgit.FreeBSD.org/src/commit/?id=78bdaa57cfbac759a6d79ecad2fae570e294a4b3

commit 78bdaa57cfbac759a6d79ecad2fae570e294a4b3
Author:     Andrew Gallatin
AuthorDate: 2025-07-12 22:35:29 +
Commit:     Andrew Gallatin
CommitDate: 2025-07-12 22:35:29 +

    lagg: Fix if_hw_tsomax_update() not being called

    In a mixed lagg, it's likely that ifcaps or hwassist may not match
    between members. If this is true, the logical OR will be
    short-circuited and if_hw_tsomax_update() will not be called.

    Fix this by calling it inside the body of the if as well.

    Sponsored by: Netflix
    MFC after: 2 weeks
---
 sys/net/if_lagg.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sys/net/if_lagg.c b/sys/net/if_lagg.c
index 9867a718e148..5b52bfa80e3b 100644
--- a/sys/net/if_lagg.c
+++ b/sys/net/if_lagg.c
@@ -718,6 +718,7 @@ lagg_capabilities(struct lagg_softc *sc)
 		sc->sc_ifp->if_capenable = ena;
 		sc->sc_ifp->if_capenable2 = ena2;
 		sc->sc_ifp->if_hwassist = hwa;
+		(void)if_hw_tsomax_update(sc->sc_ifp, &hw_tsomax);
 		getmicrotime(&sc->sc_ifp->if_lastchange);
 
 		if (sc->sc_ifflags & IFF_DEBUG)