Re: [SeaBIOS] [SeaBIOS PATCH] xen: require Xen info structure at 0x1000 to detect Xen
On Wed, 2023-02-01 at 21:13 -0500, Kevin O'Connor wrote:
> On Fri, Jan 20, 2023 at 11:33:19AM +, David Woodhouse wrote:
> > From: David Woodhouse
> >
> > When running under Xen, hvmloader places a table at 0x1000 with the e820
> > information and BIOS tables. If this isn't present, SeaBIOS will
> > currently panic.
> >
> > We now have support for running Xen guests natively in QEMU/KVM, which
> > boots SeaBIOS directly instead of via hvmloader, and does not provide
> > the same structure.
> >
> > As it happens, this doesn't matter on first boot, because although we
> > set PlatformRunningOn to PF_QEMU|PF_XEN, reading it back again still
> > gives zero. Presumably because in true Xen, this is all already RAM. But
> > in QEMU with a faithfully-emulated PAM config in the host bridge, it's
> > still in ROM mode at this point so we don't see what we've just written.
> >
> > On reboot, however, the region *is* set to RAM mode and we do see the
> > updated value of PlatformRunningOn, do manage to remember that we've
> > detected Xen in CPUID, and hit the panic.
> >
> > It's not trivial to detect QEMU vs. real Xen at the time xen_preinit()
> > runs, because it's so early. We can't even make a XENVER_extraversion
> > hypercall to look for hints, because we haven't set up the hypercall
> > page (and don't have an allocator to give us a page in which to do so).
> >
> > So just make Xen detection contingent on the info structure being
> > present. If it wasn't, we were going to panic anyway. That leaves us
> > taking the standard QEMU init path for Xen guests in native QEMU,
> > which is just fine.
>
> Thanks. I committed this change.
>
> -Kevin

Thanks, Kevin.

I'd like to get the rest of the Xen platform support into qemu 8.0 if
possible, which is currently scheduled for March.

Is there likely to be a SeaBIOS release before then which Gerd would
pull into qemu anyway, or should I submit a submodule update to a
snapshot of today's tree? That would just pull in this commit, and the
one other fix that's in the SeaBIOS tree since 1.16.1:

$ git shortlog rel-1.16.1..
David Woodhouse (1):
      xen: require Xen info structure at 0x1000 to detect Xen

Qi Zhou (1):
      usb: fix wrong init of keyboard/mouse's if first interface is not boot protocol
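For context, the submodule update mentioned above is an ordinary
submodule bump in the qemu tree. A minimal sketch of what it looks like
(the commit hash is a placeholder, not the actual snapshot):

$ cd qemu/roms/seabios
$ git fetch origin && git checkout <snapshot-commit>
$ cd ../..
$ git add roms/seabios
$ git commit -s -m "seabios: update submodule to snapshot of master"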
qemu-x86_64: clang: RIP: 0010:sched_clock_cpu
Following kernel crash noticed on qemu-x86_64 intermittently with Linux
next. qemu-system-x86, installed at version: 7.2

Reported-by: Linux Kernel Functional Testing

Boot log:
---
<5>[0.00] Linux version 6.2.0-rc6-next-20230202 (tuxmake@tuxmake) (Debian clang version 15.0.7, Debian LLD 15.0.7) #1 SMP PREEMPT_DYNAMIC @1675308517
<6>[0.00] Command line: console=ttyS0,115200 rootwait root=/dev/sda debug verbose console_msg_format=syslog earlycon
<6>[0.00] x86/fpu: x87 FPU will use FXSAVE
<6>[0.00] signal: max sigframe size: 1440
<6>[0.00] BIOS-provided physical RAM map:
<6>[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
<6>[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
<6>[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
<6>[0.00] BIOS-e820: [mem 0x0010-0x7ffdefff] usable
<6>[0.00] BIOS-e820: [mem 0x7ffdf000-0x7fff] reserved
<6>[0.00] BIOS-e820: [mem 0xb000-0xbfff] reserved
<6>[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
<6>[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
<6>[0.00] BIOS-e820: [mem 0x0001-0x00017fff] usable
<5>[0.00] random: crng init done
<6>[0.00] NX (Execute Disable) protection: active
<7>[0.00] e820: update [mem 0x00f37c20-0x00f37c2f] usable ==> usable
<7>[0.00] e820: update [mem 0x00f37c20-0x00f37c2f] usable ==> usable
<6>[0.00] extended physical RAM map:
<6>[0.00] reserve setup_data: [mem 0x-0x0009fbff] usable
<6>[0.00] reserve setup_data: [mem 0x0009fc00-0x0009] reserved
<6>[0.00] reserve setup_data: [mem 0x000f-0x000f] reserved
<6>[0.00] reserve setup_data: [mem 0x0010-0x00f37c1f] usable
<6>[0.00] reserve setup_data: [mem 0x00f37c20-0x00f37c2f] usable
<6>[0.00] reserve setup_data: [mem 0x00f37c30-0x7ffdefff] usable
<6>[0.00] reserve setup_data: [mem 0x7ffdf000-0x7fff] reserved
<6>[0.00] reserve setup_data: [mem 0xb000-0xbfff] reserved
<6>[0.00] reserve setup_data: [mem 0xfed1c000-0xfed1] reserved
<6>[0.00] reserve setup_data: [mem 0xfffc-0x] reserved
<6>[0.00] reserve setup_data: [mem 0x0001-0x00017fff] usable
<6>[0.00] SMBIOS 2.8 present.
<6>[0.00] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<6>[0.00] tsc: Fast TSC calibration using PIT
<6>[0.00] tsc: Detected 2649.953 MHz processor
<7>[0.001000] e820: update [mem 0x-0x0fff] usable ==> reserved
<7>[0.001000] e820: remove [mem 0x000a-0x000f] usable
<6>[0.001000] last_pfn = 0x18 max_arch_pfn = 0x4
<6>[0.001000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
<6>[0.001000] last_pfn = 0x7ffdf max_arch_pfn = 0x4
<6>[0.001000] found SMP MP-table at [mem 0x000f5ce0-0x000f5cef]
<6>[0.001000] ACPI: Early table checksum verification disabled
<6>[0.001000] ACPI: RSDP 0x000F5B10 14 (v00 BOCHS )
<6>[0.001000] ACPI: RSDT 0x7FFE2319 38 (v01 BOCHS BXPC 0001 BXPC 0001)
<6>[0.001000] ACPI: FACP 0x7FFE2109 F4 (v03 BOCHS BXPC 0001 BXPC 0001)
<6>[0.001000] ACPI: DSDT 0x7FFE0040 0020C9 (v01 BOCHS BXPC 0001 BXPC 0001)
<6>[0.001000] ACPI: FACS 0x7FFE 40
<6>[0.001000] ACPI: APIC 0x7FFE21FD 80 (v01 BOCHS BXPC 0001 BXPC 0001)
<6>[0.001000] ACPI: HPET 0x7FFE227D 38 (v01 BOCHS BXPC 0001 BXPC 0001)
<6>[0.001000] ACPI: MCFG 0x7FFE22B5 3C (v01 BOCHS BXPC 0001 BXPC 0001)
<6>[0.001000] ACPI: WAET 0x7FFE22F1 28 (v01 BOCHS BXPC 0001 BXPC 0001)
<6>[0.001000] ACPI: Reserving FACP table memory at [mem 0x7ffe2109-0x7ffe21fc]
<6>[0.001000] ACPI: Reserving DSDT table memory at [mem 0x7ffe0040-0x7ffe2108]
<6>[0.001000] ACPI: Reserving FACS table memory at [mem 0x7ffe-0x7ffe003f]
<6>[0.001000] ACPI: Reserving APIC table memory at [mem 0x7ffe21fd-0x7ffe227c]
<6>[0.001000] ACPI: Reserving HPET table memory at [mem 0x7ffe227d-0x7ffe22b4]
<6>[0.001000] ACPI: Reserving MCFG table memor
Re: [PATCH v5 4/8] igb: implement VFRE and VFTE registers
On 2023/02/02 16:26, Sriram Yagnaraman wrote:

Also introduce:
- Checks for RXDCTL/TXDCTL queue enable bits
- IGB_NUM_VM_POOLS enum (Sec 1.5: Table 1-7)

Signed-off-by: Sriram Yagnaraman
---
 hw/net/igb_core.c | 39 +++
 hw/net/igb_core.h |  1 +
 hw/net/igb_regs.h |  3 +++
 3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 1ddf54f630..8437cd6829 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -780,6 +780,18 @@ igb_txdesc_writeback(IGBCore *core, dma_addr_t base,
     return igb_tx_wb_eic(core, txi->idx);
 }

+static inline bool
+igb_tx_enabled(IGBCore *core, const E1000E_RingInfo *txi)
+{
+    bool vmdq = core->mac[MRQC] & 1;
+    uint16_t qn = txi->idx;
+    uint16_t pool = qn % IGB_NUM_VM_POOLS;
+
+    return (core->mac[TCTL] & E1000_TCTL_EN) &&
+        (!vmdq || core->mac[VFTE] & BIT(pool)) &&
+        (core->mac[TXDCTL0 + (qn * 16)] & E1000_TXDCTL_QUEUE_ENABLE);
+}
+
 static void
 igb_start_xmit(IGBCore *core, const IGB_TxRing *txr)
 {
@@ -789,8 +801,7 @@ igb_start_xmit(IGBCore *core, const IGB_TxRing *txr)
     const E1000E_RingInfo *txi = txr->i;
     uint32_t eic = 0;

-    /* TODO: check if the queue itself is enabled too. */
-    if (!(core->mac[TCTL] & E1000_TCTL_EN)) {
+    if (!igb_tx_enabled(core, txi)) {
         trace_e1000e_tx_disabled();
         return;
     }
@@ -866,6 +877,9 @@ igb_can_receive(IGBCore *core)
     for (i = 0; i < IGB_NUM_QUEUES; i++) {
         E1000E_RxRing rxr;

+        if (!(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
+            continue;
+        }
         igb_rx_ring_init(core, &rxr, i);
         if (igb_ring_enabled(core, rxr.i) && igb_has_rxbufs(core, rxr.i, 1)) {
@@ -932,7 +946,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,

     if (core->mac[MRQC] & 1) {
         if (is_broadcast_ether_addr(ehdr->h_dest)) {
-            for (i = 0; i < 8; i++) {
+            for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
                 if (core->mac[VMOLR0 + i] & E1000_VMOLR_BAM) {
                     queues |= BIT(i);
                 }
@@ -966,7 +980,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
             f = ta_shift[(rctl >> E1000_RCTL_MO_SHIFT) & 3];
             f = (((ehdr->h_dest[5] << 8) | ehdr->h_dest[4]) >> f) & 0xfff;
             if (macp[f >> 5] & (1 << (f & 0x1f))) {
-                for (i = 0; i < 8; i++) {
+                for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
                     if (core->mac[VMOLR0 + i] & E1000_VMOLR_ROMPE) {
                         queues |= BIT(i);
                     }
@@ -989,7 +1003,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
                 }
             }
         } else {
-            for (i = 0; i < 8; i++) {
+            for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
                 if (core->mac[VMOLR0 + i] & E1000_VMOLR_AUPE) {
                     mask |= BIT(i);
                 }
@@ -1005,6 +1019,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
             queues = BIT(def_pl >> E1000_VT_CTL_DEFAULT_POOL_SHIFT);
         }

+        queues &= core->mac[VFRE];
         igb_rss_parse_packet(core, core->rx_pkt, external_tx != NULL, rss_info);
         if (rss_info->queue & 1) {
             queues <<= 8;
@@ -1562,12 +1577,12 @@ igb_receive_internal(IGBCore *core, const struct iovec *iov, int iovcnt,
     igb_rx_fix_l4_csum(core, core->rx_pkt);

     for (i = 0; i < IGB_NUM_QUEUES; i++) {
-        if (!(queues & BIT(i))) {
+        if (!(queues & BIT(i)) ||
+            !(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
             continue;
         }

         igb_rx_ring_init(core, &rxr, i);
-

This deletion of an empty line is irrelevant and should be removed from
the patch. This is probably the last change I request. I will test this
series and give Reviewed-by: and Tested-by: after that.

         if (!igb_has_rxbufs(core, rxr.i, total_size)) {
             n |= E1000_ICS_RXO;
             trace_e1000e_rx_not_written_to_guest(rxr.i->idx);
@@ -1966,9 +1981,16 @@ static void igb_set_vfmailbox(IGBCore *core, int index, uint32_t val)

 static void igb_vf_reset(IGBCore *core, uint16_t vfn)
 {
+    uint16_t qn0 = vfn;
+    uint16_t qn1 = vfn + IGB_NUM_VM_POOLS;
+
     /* disable Rx and Tx for the VF*/
-    core->mac[VFTE] &= ~BIT(vfn);
+    core->mac[RXDCTL0 + (qn0 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
+    core->mac[RXDCTL0 + (qn1 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
+    core->mac[TXDCTL0 + (qn0 * 16)] &= ~E1000_TXDCTL_QUEUE_ENABLE;
+    core->mac[TXDCTL0 + (qn1 * 16)] &= ~E1000_TXDCTL_QUEUE_ENABLE;
     core->mac[VFRE] &= ~BIT(vfn);
+    core->mac[VFTE] &= ~BIT(vfn);
     /* indicate VF reset to P
Re: [PATCH v5 04/20] scripts/clean-includes: Improve --git commit message
Juan Quintela writes:

> Markus Armbruster wrote:
>> Juan Quintela writes:
>>
>>> Markus Armbruster wrote:
 Juan Quintela writes:
> Markus Armbruster wrote:
>> The script drops #include "qemu/osdep.h" from headers. Mention it in
>> the commit message it uses for --git.
>>
>> Signed-off-by: Markus Armbruster
>> ---
>>  scripts/clean-includes | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/scripts/clean-includes b/scripts/clean-includes
>> index f0466a6262..f9722c3aec 100755
>> --- a/scripts/clean-includes
>> +++ b/scripts/clean-includes
>> @@ -193,8 +193,8 @@ if [ "$GIT" = "yes" ]; then
>>  git commit --signoff -F - <<EOF
>>  $GITSUBJ: Clean up includes
>>
>> -Clean up includes so that osdep.h is included first and headers
>> -which it implies are not included manually.
>> +Clean up includes so that qemu/osdep.h is included first in .c, and
>> +not in .h, and headers which it implies are not included manually.
>
> I give a tree.
>
> Clean up includes so qemu/osdep.h is never used in .h files. It makes
> sure that qemu/osdep.h is only used in .c files. Once there, it assures
> that .h files already included in qemu/osdep.h are not included a second
> time on the .c file.
>
> What do you think?

 Neglects to mention qemu/osdep.h goes first in .c.
>>>
>>> /me tries again
>>>
>>> What about:
>>>
>>> The file qemu/osdep.h should only be included in .c files. And it has
>>> to be the first included file.
>>
>> Suggest "has to be included first."
>
> Ok to this change.
>
>>> This script does several things:
>>> - Remove qemu/osdep.h from .h files.
>>
>> Correct. Could say "inclusion of qemu/osdep.h"
>
> I try to minimize whatever word that "includes" "includ*" (pun intended).
>
>>> - If qemu/osdep.h is included in a .c file it is moved to the first
>>>   included position if it is anywhere else.
>>
>> Not quite. The script ensures all the .c include it, and include it
>> first.
>
> Oh, then it is easier.
>
> - It ensures that qemu/osdep.h is the first included file in all .c files.
>
>>> - Remove from .h files all include files that are already present in
>>>   qemu/osdep.h.
>>
>> They're removed from .h, too.
>
> Ah, didn't know this bit.
>
>> Sure you want to continue wordsmithing? ;)
>
> Yeap, I *hate* error messages that I can't parse (or have to read it ten
> times before I understand them).
>
> So, we end with:
>
> The file qemu/osdep.h should only be included in .c files. And it has
> to be included first.
>
> This script does three things:
> - Remove qemu/osdep.h from .h files.
> - It ensures that qemu/osdep.h is the first included file in all .c files.
> - Include files contained in qemu/osdep.h are removed from all .c and .h
>   files.
>
> Is this better?

It's less terse. Fine with me.

The mix of passive and active voice feels a bit awkward, though.
Another try:

  All .c should include qemu/osdep.h first. This script performs three
  related cleanups:

  * Ensure .c files include qemu/osdep.h first.

  * Including it in a .h is redundant, since the .c already includes
    it. Drop such inclusions.

  * Likewise, including headers qemu/osdep.h includes is redundant.
    Drop these, too.
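For reference, the script under discussion is invoked roughly like this
(a sketch; the subject tag and file list are illustrative, not from the
thread). The --git argument becomes the $GITSUBJ prefix seen in the
commit message template above:

$ scripts/clean-includes --git include-cleanup hw/foo/bar.c include/hw/foo/bar.h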
[PATCH v6 0/8] igb: merge changes from <20221229190817.25500-1-sriram.yagnara...@est.tech>
Based-on: <20230201042615.34706-1-akihiko.od...@daynix.com>
([PATCH v7 0/9] Introduce igb)

Rebased on latest changes from Akihiko, and merged changes from my
original patchset:
https://lists.gnu.org/archive/html/qemu-devel/2022-12/msg04670.html

Changes since v5:
- Added back an unnecessarily removed empty line

Changes since v4:
- Removed the change implementing VTCTL.IGMAC, it needs more thought
  and implementation of DTXSWC.LLE and VLVF.LVLAN first

Changes since v3:
- Fix comments
- Rebased on latest patchset from Akihiko
- Remove Rx loop improvements that Akihiko has pulled into his patchset

Changes since v2:
- Fixed more comments from Akihiko
- Reordered the patches to make changes easier to understand

Changes since v1:
- Fix review comments from Akihiko

Sriram Yagnaraman (8):
  MAINTAINERS: Add Sriram Yagnaraman as an igb reviewer
  igb: handle PF/VF reset properly
  igb: add ICR_RXDW
  igb: implement VFRE and VFTE registers
  igb: check oversized packets for VMDq
  igb: respect E1000_VMOLR_RSSE
  igb: implement VF Tx and Rx stats
  igb: respect VMVIR and VMOLR for VLAN

 MAINTAINERS          |   1 +
 hw/net/e1000x_regs.h |   4 +
 hw/net/igb_core.c    | 199 ++-
 hw/net/igb_core.h    |   1 +
 hw/net/igb_regs.h    |   6 ++
 hw/net/trace-events  |   2 +
 6 files changed, 175 insertions(+), 38 deletions(-)

--
2.34.1
[PATCH v6 1/8] MAINTAINERS: Add Sriram Yagnaraman as an igb reviewer
I would like to review and be informed on changes to the igb device.

Signed-off-by: Sriram Yagnaraman
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ece23b2b15..7d0e84ce37 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2224,6 +2224,7 @@ F: tests/qtest/libqos/e1000e.*

 igb
 M: Akihiko Odaki
+R: Sriram Yagnaraman
 S: Maintained
 F: docs/system/devices/igb.rst
 F: hw/net/igb*
--
2.34.1
[PATCH v6 2/8] igb: handle PF/VF reset properly
Use PFRSTD to reset RSTI bit for VFs, and raise VFLRE interrupt when VF
is reset.

Signed-off-by: Sriram Yagnaraman
---
 hw/net/igb_core.c   | 33 +
 hw/net/igb_regs.h   |  3 +++
 hw/net/trace-events |  2 ++
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index cb3e2d0be3..b484e6ac30 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -1887,14 +1887,6 @@ static void igb_set_eims(IGBCore *core, int index, uint32_t val)
     igb_update_interrupt_state(core);
 }

-static void igb_vf_reset(IGBCore *core, uint16_t vfn)
-{
-    /* TODO: Reset of the queue enable and the interrupt registers of the VF. */
-
-    core->mac[V2PMAILBOX0 + vfn] &= ~E1000_V2PMAILBOX_RSTI;
-    core->mac[V2PMAILBOX0 + vfn] = E1000_V2PMAILBOX_RSTD;
-}
-
 static void mailbox_interrupt_to_vf(IGBCore *core, uint16_t vfn)
 {
     uint32_t ent = core->mac[VTIVAR_MISC + vfn];
@@ -1972,6 +1964,17 @@ static void igb_set_vfmailbox(IGBCore *core, int index, uint32_t val)
     }
 }

+static void igb_vf_reset(IGBCore *core, uint16_t vfn)
+{
+    /* disable Rx and Tx for the VF*/
+    core->mac[VFTE] &= ~BIT(vfn);
+    core->mac[VFRE] &= ~BIT(vfn);
+    /* indicate VF reset to PF */
+    core->mac[VFLRE] |= BIT(vfn);
+    /* VFLRE and mailbox use the same interrupt cause */
+    mailbox_interrupt_to_pf(core);
+}
+
 static void igb_w1c(IGBCore *core, int index, uint32_t val)
 {
     core->mac[index] &= ~val;
@@ -2226,14 +2229,20 @@ igb_set_status(IGBCore *core, int index, uint32_t val)
 static void
 igb_set_ctrlext(IGBCore *core, int index, uint32_t val)
 {
-    trace_e1000e_link_set_ext_params(!!(val & E1000_CTRL_EXT_ASDCHK),
-                                     !!(val & E1000_CTRL_EXT_SPD_BYPS));
-
-    /* TODO: PFRSTD */
+    trace_igb_link_set_ext_params(!!(val & E1000_CTRL_EXT_ASDCHK),
+                                  !!(val & E1000_CTRL_EXT_SPD_BYPS),
+                                  !!(val & E1000_CTRL_EXT_PFRSTD));

     /* Zero self-clearing bits */
     val &= ~(E1000_CTRL_EXT_ASDCHK | E1000_CTRL_EXT_EE_RST);
     core->mac[CTRL_EXT] = val;
+
+    if (core->mac[CTRL_EXT] & E1000_CTRL_EXT_PFRSTD) {
+        for (int vfn = 0; vfn < IGB_MAX_VF_FUNCTIONS; vfn++) {
+            core->mac[V2PMAILBOX0 + vfn] &= ~E1000_V2PMAILBOX_RSTI;
+            core->mac[V2PMAILBOX0 + vfn] |= E1000_V2PMAILBOX_RSTD;
+        }
+    }
 }

 static void
diff --git a/hw/net/igb_regs.h b/hw/net/igb_regs.h
index ebf3e95023..ddc0f931d6 100644
--- a/hw/net/igb_regs.h
+++ b/hw/net/igb_regs.h
@@ -240,6 +240,9 @@ union e1000_adv_rx_desc {

 /* from igb/e1000_defines.h */

+/* Physical Func Reset Done Indication */
+#define E1000_CTRL_EXT_PFRSTD 0x4000
+
 #define E1000_IVAR_VALID 0x80
 #define E1000_GPIE_NSICR 0x0001
 #define E1000_GPIE_MSIX_MODE 0x0010
diff --git a/hw/net/trace-events b/hw/net/trace-events
index 0092919b9b..56651f 100644
--- a/hw/net/trace-events
+++ b/hw/net/trace-events
@@ -280,6 +280,8 @@ igb_core_mdic_read_unhandled(uint32_t addr) "MDIC READ: PHY[%u] UNHANDLED"
 igb_core_mdic_write(uint32_t addr, uint32_t data) "MDIC WRITE: PHY[%u] = 0x%x"
 igb_core_mdic_write_unhandled(uint32_t addr) "MDIC WRITE: PHY[%u] UNHANDLED"

+igb_link_set_ext_params(bool asd_check, bool speed_select_bypass, bool pfrstd) "Set extended link params: ASD check: %d, Speed select bypass: %d, PF reset done: %d"
+
 igb_rx_desc_buff_size(uint32_t b) "buffer size: %u"
 igb_rx_desc_buff_write(uint64_t addr, uint16_t offset, const void* source, uint32_t len) "addr: 0x%"PRIx64", offset: %u, from: %p, length: %u"
--
2.34.1
[PATCH v6 8/8] igb: respect VMVIR and VMOLR for VLAN
Add support for stripping/inserting VLAN for VFs.

Had to move the checksum calculation back into the for loop, since
packet data is pulled inside the loop based on the strip VLAN decision
for every VF. net_rx_pkt_fix_l4_csum should be extended to accept a
buffer instead for igb; work for a future patch.

Signed-off-by: Sriram Yagnaraman
---
 hw/net/igb_core.c | 54 ++-
 1 file changed, 44 insertions(+), 10 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 25a5e0ec87..cd4fba383c 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -386,6 +386,25 @@ igb_rss_parse_packet(IGBCore *core, struct NetRxPkt *pkt, bool tx,
     info->queue = E1000_RSS_QUEUE(&core->mac[RETA], info->hash);
 }

+static inline bool
+igb_tx_insert_vlan(IGBCore *core, uint16_t qn,
+                   struct igb_tx *tx, bool desc_vle)
+{
+    if (core->mac[MRQC] & 1) {
+        uint16_t pool = qn % IGB_NUM_VM_POOLS;
+
+        if (core->mac[VMVIR0 + pool] & E1000_VMVIR_VLANA_DEFAULT) {
+            /* always insert default VLAN */
+            desc_vle = true;
+            tx->vlan = core->mac[VMVIR0 + pool] & 0xffff;
+        } else if (core->mac[VMVIR0 + pool] & E1000_VMVIR_VLANA_NEVER) {
+            return false;
+        }
+    }
+
+    return desc_vle && e1000x_vlan_enabled(core->mac);
+}
+
 static bool
 igb_setup_tx_offloads(IGBCore *core, struct igb_tx *tx)
 {
@@ -581,7 +600,8 @@ igb_process_tx_desc(IGBCore *core,

     if (cmd_type_len & E1000_TXD_CMD_EOP) {
         if (!tx->skip_cp && net_tx_pkt_parse(tx->tx_pkt)) {
-            if (cmd_type_len & E1000_TXD_CMD_VLE) {
+            if (igb_tx_insert_vlan(core, queue_index, tx,
+                !!(cmd_type_len & E1000_TXD_CMD_VLE))) {
                 net_tx_pkt_setup_vlan_header_ex(tx->tx_pkt, tx->vlan,
                     core->mac[VET] & 0xffff);
             }
@@ -1536,6 +1556,20 @@ igb_write_packet_to_guest(IGBCore *core, struct NetRxPkt *pkt,
     igb_update_rx_stats(core, rxi, size, total_size);
 }

+static bool
+igb_rx_strip_vlan(IGBCore *core, const E1000E_RingInfo *rxi)
+{
+    if (core->mac[MRQC] & 1) {
+        uint16_t pool = rxi->idx % IGB_NUM_VM_POOLS;
+        /* Sec 7.10.3.8: CTRL.VME is ignored, only VMOLR/RPLOLR is used */
+        return (net_rx_pkt_get_packet_type(core->rx_pkt) == ETH_PKT_MCAST) ?
+                core->mac[RPLOLR] & E1000_RPLOLR_STRVLAN :
+                core->mac[VMOLR0 + pool] & E1000_VMOLR_STRVLAN;
+    }
+
+    return e1000x_vlan_enabled(core->mac);
+}
+
 static inline void
 igb_rx_fix_l4_csum(IGBCore *core, struct NetRxPkt *pkt)
 {
@@ -1616,10 +1650,7 @@ igb_receive_internal(IGBCore *core, const struct iovec *iov, int iovcnt,

     ehdr = PKT_GET_ETH_HDR(filter_buf);
     net_rx_pkt_set_packet_type(core->rx_pkt, get_eth_packet_type(ehdr));
-
-    net_rx_pkt_attach_iovec_ex(core->rx_pkt, iov, iovcnt, iov_ofs,
-                               e1000x_vlan_enabled(core->mac),
-                               core->mac[VET] & 0xffff);
+    net_rx_pkt_set_protocols(core->rx_pkt, filter_buf, size);

     queues = igb_receive_assign(core, ehdr, size, &rss_info, external_tx);
     if (!queues) {
@@ -1627,11 +1658,6 @@ igb_receive_internal(IGBCore *core, const struct iovec *iov, int iovcnt,
         return orig_size;
     }

-    total_size = net_rx_pkt_get_total_len(core->rx_pkt) +
-        e1000x_fcs_len(core->mac);
-
-    igb_rx_fix_l4_csum(core, core->rx_pkt);
-
     for (i = 0; i < IGB_NUM_QUEUES; i++) {
         if (!(queues & BIT(i)) ||
             !(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
@@ -1640,12 +1666,20 @@ igb_receive_internal(IGBCore *core, const struct iovec *iov, int iovcnt,

         igb_rx_ring_init(core, &rxr, i);

+        net_rx_pkt_attach_iovec_ex(core->rx_pkt, iov, iovcnt, iov_ofs,
+                                   igb_rx_strip_vlan(core, rxr.i),
+                                   core->mac[VET] & 0xffff);
+
+        total_size = net_rx_pkt_get_total_len(core->rx_pkt) +
+            e1000x_fcs_len(core->mac);
+
         if (!igb_has_rxbufs(core, rxr.i, total_size)) {
             n |= E1000_ICS_RXO;
             trace_e1000e_rx_not_written_to_guest(rxr.i->idx);
             continue;
         }

+        igb_rx_fix_l4_csum(core, core->rx_pkt);
         igb_write_packet_to_guest(core, core->rx_pkt, &rxr, &rss_info);

         core->mac[EICR] |= igb_rx_wb_eic(core, rxr.i->idx);
--
2.34.1
[PATCH v6 5/8] igb: check oversized packets for VMDq
Signed-off-by: Sriram Yagnaraman
---
 hw/net/igb_core.c | 41 -
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index c4a2bff4c1..03139c76f7 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -915,12 +915,26 @@ igb_rx_l4_cso_enabled(IGBCore *core)
     return !!(core->mac[RXCSUM] & E1000_RXCSUM_TUOFLD);
 }

+static bool
+igb_rx_is_oversized(IGBCore *core, uint16_t qn, size_t size)
+{
+    uint16_t pool = qn % IGB_NUM_VM_POOLS;
+    bool lpe = !!(core->mac[VMOLR0 + pool] & E1000_VMOLR_LPE);
+    int max_ethernet_lpe_size =
+        core->mac[VMOLR0 + pool] & E1000_VMOLR_RLPML_MASK;
+    int max_ethernet_vlan_size = 1522;
+
+    return size > (lpe ? max_ethernet_lpe_size : max_ethernet_vlan_size);
+}
+
 static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
-                                   E1000E_RSSInfo *rss_info, bool *external_tx)
+                                   size_t size, E1000E_RSSInfo *rss_info,
+                                   bool *external_tx)
 {
     static const int ta_shift[] = { 4, 3, 2, 0 };
     uint32_t f, ra[2], *macp, rctl = core->mac[RCTL];
     uint16_t queues = 0;
+    uint16_t oversized = 0;
     uint16_t vid = lduw_be_p(&PKT_GET_VLAN_HDR(ehdr)->h_tci) & VLAN_VID_MASK;
     bool accepted = false;
     int i;
@@ -1020,9 +1034,26 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
         }

         queues &= core->mac[VFRE];
-        igb_rss_parse_packet(core, core->rx_pkt, external_tx != NULL, rss_info);
-        if (rss_info->queue & 1) {
-            queues <<= 8;
+        if (queues) {
+            for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
+                if ((queues & BIT(i)) && igb_rx_is_oversized(core, i, size)) {
+                    oversized |= BIT(i);
+                }
+            }
+            /* 8.19.37 increment ROC if packet is oversized for all queues */
+            if (oversized == queues) {
+                trace_e1000x_rx_oversized(size);
+                e1000x_inc_reg_if_not_full(core->mac, ROC);
+            }
+            queues &= ~oversized;
+        }
+
+        if (queues) {
+            igb_rss_parse_packet(core, core->rx_pkt,
+                                 external_tx != NULL, rss_info);
+            if (rss_info->queue & 1) {
+                queues <<= 8;
+            }
         }
     } else {
         switch (net_rx_pkt_get_packet_type(core->rx_pkt)) {
@@ -1565,7 +1596,7 @@ igb_receive_internal(IGBCore *core, const struct iovec *iov, int iovcnt,
                                e1000x_vlan_enabled(core->mac),
                                core->mac[VET] & 0xffff);

-    queues = igb_receive_assign(core, ehdr, &rss_info, external_tx);
+    queues = igb_receive_assign(core, ehdr, size, &rss_info, external_tx);
     if (!queues) {
         trace_e1000e_rx_flt_dropped();
         return orig_size;
--
2.34.1
[PATCH v6 6/8] igb: respect E1000_VMOLR_RSSE
RSS for VFs is only enabled if VMOLR[n].RSSE is set.

Signed-off-by: Sriram Yagnaraman
---
 hw/net/igb_core.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 03139c76f7..9994724a39 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -1051,8 +1051,15 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
         if (queues) {
             igb_rss_parse_packet(core, core->rx_pkt,
                                  external_tx != NULL, rss_info);
+            /* Sec 8.26.1: PQn = VFn + VQn*8 */
             if (rss_info->queue & 1) {
-                queues <<= 8;
+                for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
+                    if ((queues & BIT(i)) &&
+                        (core->mac[VMOLR0 + i] & E1000_VMOLR_RSSE)) {
+                        queues |= BIT(i + IGB_NUM_VM_POOLS);
+                        queues &= ~BIT(i);
+                    }
+                }
             }
         }
     } else {
--
2.34.1
[PATCH v6 7/8] igb: implement VF Tx and Rx stats
Please note that loopback counters for VM to VM traffic are not
implemented yet: VFGOTLBC, VFGPTLBC, VFGORLBC and VFGPRLBC.

Signed-off-by: Sriram Yagnaraman
---
 hw/net/igb_core.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 9994724a39..25a5e0ec87 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -490,7 +490,7 @@ igb_tx_pkt_send(IGBCore *core, struct igb_tx *tx, int queue_index)
 }

 static void
-igb_on_tx_done_update_stats(IGBCore *core, struct NetTxPkt *tx_pkt)
+igb_on_tx_done_update_stats(IGBCore *core, struct NetTxPkt *tx_pkt, int qn)
 {
     static const int PTCregs[6] = { PTC64, PTC127, PTC255, PTC511,
                                     PTC1023, PTC1522 };
@@ -517,6 +517,13 @@ igb_on_tx_done_update_stats(IGBCore *core, struct NetTxPkt *tx_pkt)
     core->mac[GPTC] = core->mac[TPT];
     core->mac[GOTCL] = core->mac[TOTL];
     core->mac[GOTCH] = core->mac[TOTH];
+
+    if (core->mac[MRQC] & 1) {
+        uint16_t pool = qn % IGB_NUM_VM_POOLS;
+
+        core->mac[PVFGOTC0 + (pool * 64)] += tot_len;
+        core->mac[PVFGPTC0 + (pool * 64)]++;
+    }
 }

 static void
@@ -579,7 +586,7 @@ igb_process_tx_desc(IGBCore *core,
                 core->mac[VET] & 0xffff);
         }
         if (igb_tx_pkt_send(core, tx, queue_index)) {
-            igb_on_tx_done_update_stats(core, tx->tx_pkt);
+            igb_on_tx_done_update_stats(core, tx->tx_pkt, queue_index);
         }
     }
@@ -1398,7 +1405,8 @@ igb_write_to_rx_buffers(IGBCore *core,
 }

 static void
-igb_update_rx_stats(IGBCore *core, size_t data_size, size_t data_fcs_size)
+igb_update_rx_stats(IGBCore *core, const E1000E_RingInfo *rxi,
+                    size_t data_size, size_t data_fcs_size)
 {
     e1000x_update_rx_total_stats(core->mac, data_size, data_fcs_size);

@@ -1414,6 +1422,16 @@ igb_update_rx_stats(IGBCore *core, size_t data_size, size_t data_fcs_size)
     default:
         break;
     }
+
+    if (core->mac[MRQC] & 1) {
+        uint16_t pool = rxi->idx % IGB_NUM_VM_POOLS;
+
+        core->mac[PVFGORC0 + (pool * 64)] += data_size + 4;
+        core->mac[PVFGPRC0 + (pool * 64)]++;
+        if (net_rx_pkt_get_packet_type(core->rx_pkt) == ETH_PKT_MCAST) {
+            core->mac[PVFMPRC0 + (pool * 64)]++;
+        }
+    }
 }

 static inline bool
@@ -1515,7 +1533,7 @@ igb_write_packet_to_guest(IGBCore *core, struct NetRxPkt *pkt,
     } while (desc_offset < total_size);

-    igb_update_rx_stats(core, size, total_size);
+    igb_update_rx_stats(core, rxi, size, total_size);
 }

 static inline void
--
2.34.1
[PATCH v6 4/8] igb: implement VFRE and VFTE registers
Also introduce:
- Checks for RXDCTL/TXDCTL queue enable bits
- IGB_NUM_VM_POOLS enum (Sec 1.5: Table 1-7)

Signed-off-by: Sriram Yagnaraman
---
 hw/net/igb_core.c | 38 +++---
 hw/net/igb_core.h |  1 +
 hw/net/igb_regs.h |  3 +++
 3 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 1ddf54f630..c4a2bff4c1 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -780,6 +780,18 @@ igb_txdesc_writeback(IGBCore *core, dma_addr_t base,
     return igb_tx_wb_eic(core, txi->idx);
 }

+static inline bool
+igb_tx_enabled(IGBCore *core, const E1000E_RingInfo *txi)
+{
+    bool vmdq = core->mac[MRQC] & 1;
+    uint16_t qn = txi->idx;
+    uint16_t pool = qn % IGB_NUM_VM_POOLS;
+
+    return (core->mac[TCTL] & E1000_TCTL_EN) &&
+        (!vmdq || core->mac[VFTE] & BIT(pool)) &&
+        (core->mac[TXDCTL0 + (qn * 16)] & E1000_TXDCTL_QUEUE_ENABLE);
+}
+
 static void
 igb_start_xmit(IGBCore *core, const IGB_TxRing *txr)
 {
@@ -789,8 +801,7 @@ igb_start_xmit(IGBCore *core, const IGB_TxRing *txr)
     const E1000E_RingInfo *txi = txr->i;
     uint32_t eic = 0;

-    /* TODO: check if the queue itself is enabled too. */
-    if (!(core->mac[TCTL] & E1000_TCTL_EN)) {
+    if (!igb_tx_enabled(core, txi)) {
         trace_e1000e_tx_disabled();
         return;
     }
@@ -866,6 +877,9 @@ igb_can_receive(IGBCore *core)
     for (i = 0; i < IGB_NUM_QUEUES; i++) {
         E1000E_RxRing rxr;

+        if (!(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
+            continue;
+        }
         igb_rx_ring_init(core, &rxr, i);
         if (igb_ring_enabled(core, rxr.i) && igb_has_rxbufs(core, rxr.i, 1)) {
@@ -932,7 +946,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,

     if (core->mac[MRQC] & 1) {
         if (is_broadcast_ether_addr(ehdr->h_dest)) {
-            for (i = 0; i < 8; i++) {
+            for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
                 if (core->mac[VMOLR0 + i] & E1000_VMOLR_BAM) {
                     queues |= BIT(i);
                 }
@@ -966,7 +980,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
             f = ta_shift[(rctl >> E1000_RCTL_MO_SHIFT) & 3];
             f = (((ehdr->h_dest[5] << 8) | ehdr->h_dest[4]) >> f) & 0xfff;
             if (macp[f >> 5] & (1 << (f & 0x1f))) {
-                for (i = 0; i < 8; i++) {
+                for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
                     if (core->mac[VMOLR0 + i] & E1000_VMOLR_ROMPE) {
                         queues |= BIT(i);
                     }
@@ -989,7 +1003,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
                 }
             }
         } else {
-            for (i = 0; i < 8; i++) {
+            for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
                 if (core->mac[VMOLR0 + i] & E1000_VMOLR_AUPE) {
                     mask |= BIT(i);
                 }
@@ -1005,6 +1019,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header *ehdr,
             queues = BIT(def_pl >> E1000_VT_CTL_DEFAULT_POOL_SHIFT);
         }

+        queues &= core->mac[VFRE];
         igb_rss_parse_packet(core, core->rx_pkt, external_tx != NULL, rss_info);
         if (rss_info->queue & 1) {
             queues <<= 8;
@@ -1562,7 +1577,8 @@ igb_receive_internal(IGBCore *core, const struct iovec *iov, int iovcnt,
     igb_rx_fix_l4_csum(core, core->rx_pkt);

     for (i = 0; i < IGB_NUM_QUEUES; i++) {
-        if (!(queues & BIT(i))) {
+        if (!(queues & BIT(i)) ||
+            !(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
             continue;
         }
@@ -1966,9 +1982,16 @@ static void igb_set_vfmailbox(IGBCore *core, int index, uint32_t val)

 static void igb_vf_reset(IGBCore *core, uint16_t vfn)
 {
+    uint16_t qn0 = vfn;
+    uint16_t qn1 = vfn + IGB_NUM_VM_POOLS;
+
     /* disable Rx and Tx for the VF*/
-    core->mac[VFTE] &= ~BIT(vfn);
+    core->mac[RXDCTL0 + (qn0 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
+    core->mac[RXDCTL0 + (qn1 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
+    core->mac[TXDCTL0 + (qn0 * 16)] &= ~E1000_TXDCTL_QUEUE_ENABLE;
+    core->mac[TXDCTL0 + (qn1 * 16)] &= ~E1000_TXDCTL_QUEUE_ENABLE;
     core->mac[VFRE] &= ~BIT(vfn);
+    core->mac[VFTE] &= ~BIT(vfn);
     /* indicate VF reset to PF */
     core->mac[VFLRE] |= BIT(vfn);
     /* VFLRE and mailbox use the same interrupt cause */
     mailbox_interrupt_to_pf(core);
 }
@@ -3874,6 +3897,7 @@ igb_phy_reg_init[] = {
 static const uint32_t igb_mac_reg_init[] = {
     [LEDCTL]        = 2 | (3 << 8) | BIT(15) | (6 << 16) | (7 << 24),
     [EEMNGCTL]      = BIT(31),
+    [TXDCTL0]       = E1000_TXDCTL_QUEUE_ENABLE,
     [RXDCTL0]       = E1000_RXDCTL_QUEUE_ENABLE | (1 << 16),
     [RXDCTL1]       = 1 << 16,
     [RXDCTL2]       = 1 << 16,
diff --git a/hw/net/igb_core.h b/hw
Re: [SeaBIOS] Re: [SeaBIOS PATCH] xen: require Xen info structure at 0x1000 to detect Xen
> Thanks, Kevin.
>
> I'd like to get the rest of the Xen platform support into qemu 8.0 if
> possible, which is currently scheduled for March.
>
> Is there likely to be a SeaBIOS release before then which Gerd would
> pull into qemu anyway, or should I submit a submodule update to a
> snapshot of today's tree? That would just pull in this commit, and the
> one other fix that's in the SeaBIOS tree since 1.16.1?

Tagging 1.16.2 in time for the qemu 8.0 release should not be a problem
given that we have only bugfixes in master. Roughly around soft freeze
is probably a good time for that.

take care,
  Gerd
Re: [PATCH 3/5] docs: flesh out raw format driver description
On Wed, Feb 01, 2023 at 04:12:32PM -0500, Stefan Hajnoczi wrote:
> Modernize the description and document the size=/offset= runtime
> options.
>
> Signed-off-by: Stefan Hajnoczi
> ---
>  docs/system/qemu-block-drivers.rst.inc | 32 ++
>  1 file changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> index be6eec1eb6..ec9ebb2066 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -16,11 +16,11 @@ options that are supported for it.
>  .. option:: raw
>
>    Raw disk image format. This format has the advantage of
> -  being simple and easily exportable to all other emulators. If your
> -  file system supports *holes* (for example in ext2 or ext3 on
> -  Linux or NTFS on Windows), then only the written sectors will reserve
> -  space. Use ``qemu-img info`` to know the real size used by the
> -  image or ``ls -ls`` on Unix/Linux.
> +  being simple and easily exportable to all other emulators. Modern
> +  file systems support *holes* (for example in btrfs/XFS/ext4 on
> +  Linux or NTFS on Windows) where space is allocated on demand as sectors are
> +  written. Use ``qemu-img info`` to know the real size used by the image or
> +  ``ls -ls`` on Unix/Linux.
>
>    Supported create options:
>
> @@ -33,6 +33,28 @@ options that are supported for it.
>      for image by writing data to underlying storage. This data may or
>      may not be zero, depending on the storage location.
>
> +  Supported runtime options:
> +
> +  .. program:: raw
> +  .. option:: offset
> +
> +    The byte position in the underlying file where the virtual disk starts.
> +    This is handy when you want to present just a single partition from a
> +    physical disk as the virtual disk. This option is usually used in
> +    conjunction with the ``size`` option.
> +
> +  .. option:: size
> +
> +    Limit the virtual disk size to the given number of bytes, regardless of how
> +    large the underlying file is. This option is usually used in conjunction
> +    with the ``offset`` option.
> +
> +  Note the raw format can be omitted when no runtime options are being used. In
> +  that case the raw format does nothing besides forwarding I/O requests to the
> +  protocol blockdev. You can improve performance slightly by eliminating
> +  ``--blockdev raw,file=file0,node-name=drive0`` and renaming the "file0"
> +  blockdev to "drive0".

This paragraph reads a bit softly. I would word this such that we
explicitly and strongly recommend against using the 'raw' format
driver. Almost no one will have a need for this.

All the protocol drivers expose a raw format, which can be consumed
directly. So the 'raw' format driver should NEVER be used unless the
user needs to apply a limited window over the underlying disk capacity,
which is pretty rare.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org          -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org   -o-    https://www.instagram.com/dberrange :|
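To make the offset/size discussion concrete, a minimal sketch of a raw
window over part of a file (the image name and byte values are
illustrative only):

$ qemu-system-x86_64 \
    --blockdev driver=file,node-name=file0,filename=disk.img \
    --blockdev driver=raw,node-name=drive0,file=file0,offset=1048576,size=104857600 \
    --device virtio-blk-pci,drive=drive0

This exposes a 100 MiB region starting 1 MiB into disk.img as the guest
disk; per the patch text above, when neither option is needed the raw
node can be dropped and the device attached to the file node directly.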
Re: An issue with x86 tcg and MMIO
On Wed, 1 Feb 2023 11:50:50 -1000
Richard Henderson wrote:

> On 2/1/23 05:24, Jørgen Hansen wrote:
> > Hello Richard,
> >
> > We are using x86 qemu to test some CXL stuff, and in that process we
> > are running into an issue with tcg. In qemu, CXL memory is mapped as
> > an MMIO region, so when using CXL memory as part of the system
> > memory, application code and data can be spread out across a
> > combination of DRAM and CXL memory (we are using the Linux tiered
> > memory numa balancing, that will migrate individual pages between
> > DRAM and CXL memory based on access patterns). When we are running
> > memory intensive applications, we hit the following assert in
> > translator_access:
> >
> >    /* We cannot handle MMIO as second page. */
> >    assert(tb->page_addr[1] != -1);
> >
> > introduced in your commit 50627f1b. This is using existing
> > applications and standard Linux. We discussed this with Alistair
> > Francis and he mentioned that it looks like a load across a page
> > boundary is happening, and it so happens that the first page is DRAM
> > and the second page MMIO. We tried - as a workaround - to return
> > NULL instead of the assert to trigger the slow path processing, and
> > that allows the system to make forward progress, but we aren't
> > familiar with tcg, and as such aren't sure if that is a correct fix.
> >
> > So we'd like to get your input on this - and understand whether the
> > above usage isn't supported at all or if there are other possible
> > workarounds.
>
> Well, this may answer my question in
>
> https://lore.kernel.org/qemu-devel/1d6b1894-9c45-2d70-abde-9c10c1b3b...@linaro.org/
>
> as to how this could occur.
>
> Until relatively recently, TCG would refuse to execute out of MMIO *at
> all*. This was relaxed to support Arm m-profile, which needs to
> execute a few instructions out of MMIO during the boot process, before
> jumping into flash.
>
> This works by reading one instruction, translating it, executing it,
> and immediately discarding the translation. It could be possible to
> adjust the translator to allow the second half of an instruction to be
> in MMIO, such that we execute and discard, however...
>
> What is it about CXL that requires modeling with MMIO? If it is
> intended to be used interchangeably with RAM by the guest, then you
> really won't like the performance you will see with TCG executing out
> of these regions.

To be honest I wasn't aware of this restriction. I 'thought' it was
necessary to support interleaving, as we need some callbacks in the
read/write paths to map through to the right address space. So
performance will suck even if we can model it as memory (and I'm not
sure how to do that whilst maintaining the interleaving code).

To have anything approaching accurate modeling we need to apply memory
decoders in the host, host bridge and switches. We could have faked all
this and mapped directly through to a single memory backend, but that
would then have been useless for actually testing the interleaving.

Specifically, see what happens in cxl_cfmws_find_device():
https://elixir.bootlin.com/qemu/latest/source/hw/cxl/cxl-host.c#L129

If we can do that for other types of region then I'm fine with changing
it.

> Could memory across the CXL link be modeled as a ROM device, similar
> to flash? This does not have the same restrictions as MMIO.

Not sure - if we can do the handling above then sure, we could make
that change. I can see there is a path to register the callbacks, but
I'd kind of assumed ROM meant read only...

Jonathan

> r~
[PATCH 09/18] i386: Fix comment style in topology.h
From: Zhao Liu

For function comments in this file, keep the comment style consistent
with other places.

Signed-off-by: Zhao Liu
---
 include/hw/i386/topology.h | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
index b0174c18b7bd..5de905dc00d3 100644
--- a/include/hw/i386/topology.h
+++ b/include/hw/i386/topology.h
@@ -24,7 +24,8 @@
 #ifndef HW_I386_TOPOLOGY_H
 #define HW_I386_TOPOLOGY_H

-/* This file implements the APIC-ID-based CPU topology enumeration logic,
+/*
+ * This file implements the APIC-ID-based CPU topology enumeration logic,
  * documented at the following document:
  * Intel® 64 Architecture Processor Topology Enumeration
  * http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/
@@ -41,7 +42,8 @@

 #include "qemu/bitops.h"

-/* APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC support
+/*
+ * APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC support
  */
 typedef uint32_t apic_id_t;

@@ -60,8 +62,7 @@ typedef struct X86CPUTopoInfo {
     unsigned threads_per_core;
 } X86CPUTopoInfo;

-/* Return the bit width needed for 'count' IDs
- */
+/* Return the bit width needed for 'count' IDs */
 static unsigned apicid_bitwidth_for_count(unsigned count)
 {
     g_assert(count >= 1);
@@ -69,15 +70,13 @@ static unsigned apicid_bitwidth_for_count(unsigned count)
     return count ? 32 - clz32(count) : 0;
 }

-/* Bit width of the SMT_ID (thread ID) field on the APIC ID
- */
+/* Bit width of the SMT_ID (thread ID) field on the APIC ID */
 static inline unsigned apicid_smt_width(X86CPUTopoInfo *topo_info)
 {
     return apicid_bitwidth_for_count(topo_info->threads_per_core);
 }

-/* Bit width of the Core_ID field
- */
+/* Bit width of the Core_ID field */
 static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info)
 {
     /*
@@ -94,8 +93,7 @@ static inline unsigned apicid_die_width(X86CPUTopoInfo *topo_info)
     return apicid_bitwidth_for_count(topo_info->dies_per_pkg);
 }

-/* Bit offset of the Core_ID field
- */
+/* Bit offset of the Core_ID field */
 static inline unsigned apicid_core_offset(X86CPUTopoInfo *topo_info)
 {
     return apicid_smt_width(topo_info);
@@ -107,14 +105,14 @@ static inline unsigned apicid_die_offset(X86CPUTopoInfo *topo_info)
     return apicid_core_offset(topo_info) + apicid_core_width(topo_info);
 }

-/* Bit offset of the Pkg_ID (socket ID) field
- */
+/* Bit offset of the Pkg_ID (socket ID) field */
 static inline unsigned apicid_pkg_offset(X86CPUTopoInfo *topo_info)
 {
     return apicid_die_offset(topo_info) + apicid_die_width(topo_info);
 }

-/* Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
+/*
+ * Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
  *
  * The caller must make sure core_id < nr_cores and smt_id < nr_threads.
  */
@@ -127,7 +125,8 @@ static inline apic_id_t x86_apicid_from_topo_ids(X86CPUTopoInfo *topo_info,
            topo_ids->smt_id;
 }

-/* Calculate thread/core/package IDs for a specific topology,
+/*
+ * Calculate thread/core/package IDs for a specific topology,
  * based on (contiguous) CPU index
  */
 static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info,
@@ -154,7 +153,8 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info,
     topo_ids->smt_id = cpu_index % nr_threads;
 }

-/* Calculate thread/core/package IDs for a specific topology,
+/*
+ * Calculate thread/core/package IDs for a specific topology,
  * based on APIC ID
  */
 static inline void x86_topo_ids_from_apicid(apic_id_t apicid,
@@ -178,7 +178,8 @@ static inline void x86_topo_ids_from_apicid(apic_id_t apicid,
     topo_ids->pkg_id = apicid >> apicid_pkg_offset(topo_info);
 }

-/* Make APIC ID for the CPU 'cpu_index'
+/*
+ * Make APIC ID for the CPU 'cpu_index'
  *
  * 'cpu_index' is a sequential, contiguous ID for the CPU.
  */
--
2.34.1
[PATCH 00/18] Support smp.clusters for x86
From: Zhao Liu

Hi list,

This series is based on 13356ed (Merge tag 'block-pull-request' of
https://gitlab.com/stefanha/qemu into staging), and adds the cluster
support for the x86 PC machine, which allows x86 to use smp.clusters to
configure the module level CPU topology of x86.

And since the compatibility issue (see section: ## Why not share L2
cache in cluster directly), this series also introduces a new command
to adjust the topology of the x86 L2 cache.

Welcome your comments!

# Background

The "clusters" parameter in "smp" is introduced by ARM [1], but x86
hasn't supported it.

At present, x86 defaults L2 cache is shared in one core, but this is
not enough. There're some platforms that multiple cores share the same
L2 cache, e.g., Alder Lake-P shares L2 cache for one module of Atom
cores [2], that is, every four Atom cores shares one L2 cache.
Therefore, we need the new CPU topology level (cluster/module).

Another reason is for hybrid architecture. cluster support not only
provides another level of topology definition in x86, but would also
provide required code change for our future hybrid topology support.

# Overview

## Introduction of module level for x86

"cluster" in smp is the CPU topology level which is between "core" and
die.

For x86, the "cluster" in smp is corresponding to the module level [3],
which is above the core level. So use the "module" other than "cluster"
in x86 code.

And please note that x86 already has a cpu topology level also named
"cluster" [3], this level is at the upper level of the package. Here,
the cluster in x86 cpu topology is completely different from the
"clusters" as the smp parameter. After the module level is introduced,
the cluster as the smp parameter will actually refer to the module
level of x86.

## Why not share L2 cache in cluster directly

Though "clusters" was introduced to help define L2 cache topology [1],
using cluster to define x86's L2 cache topology will cause the
compatibility problem:

Currently, x86 defaults that the L2 cache is shared in one core, which
actually implies a default setting "cores per L2 cache is 1" and
therefore implicitly defaults to having as many L2 caches as cores.

For example (i386 PC machine):
-smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 (*)

Considering the topology of the L2 cache, this (*) implicitly means "1
core per L2 cache" and "2 L2 caches per die".

If we use cluster to configure L2 cache topology with the new default
setting "clusters per L2 cache is 1", the above semantics will change
to "2 cores per cluster" and "1 cluster per L2 cache", that is, "2
cores per L2 cache".

So the same command (*) will cause changes in the L2 cache topology,
further affecting the performance of the virtual machine.

Therefore, x86 should only treat cluster as a cpu topology level and
avoid using it to change L2 cache by default for compatibility.

## module level in CPUID

Currently, we don't expose module level in CPUID.1FH because currently
linux (v6.2-rc6) doesn't support module level. And exposing module and
die levels at the same time in CPUID.1FH will cause linux to calculate
wrong die_id. The module level should be exposed until the real machine
has the module level in CPUID.1FH.

We can configure CPUID.04H.02H (L2 cache topology) with module level by
a new command:

    "-cpu,x-l2-cache-topo=cluster"

More information about this command, please see the section: "## New
property: x-l2-cache-topo".

## New cache topology info in CPUCacheInfo

Currently, by default, the cache topology is encoded as:
1. i/d cache is shared in one core. (* In fact, the original is SMT
   level, which will be changed to core level in the patch: [PATCH
   04/18] i386/cpu: Fix number of addressable IDs in CPUID.04H.)
2. L2 cache is shared in one core.
3. L3 cache is shared in one die.

This default general setting has caused a misunderstanding, that is,
the cache topology is completely equated with a specific cpu topology,
such as the connection between L2 cache and core level, and the
connection between L3 cache and die level.

In fact, the settings of these topologies depend on the specific
platform and are not static. For example, on Alder Lake-P, every four
Atom cores share the same L2 cache [2].

Thus, in this patch set, we explicitly define the corresponding cache
topology for different cpu models and this has two benefits:
1. Easy to expand to new CPU models in the future, which has different
   cache topology.
2. It can easily support custom cache topology by some command (e.g.,
   x-l2-cache-topo).

## New property: x-l2-cache-topo

The property x-l2-cache-topo will be used to change the L2 cache
topology in CPUID.04H.

Now it allows user to set the L2 cache is shared in core level or
cluster level.

If user passes "-cpu x-l2-cache-topo=[core|cluster]" then the older L2
cache topology will be overridden by the new topology setting.

Since CPUID.04H is used by intel cpus, this property is available on
intel cpus as fo
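Putting the two knobs of this series together, an invocation might look
like this (a sketch of the interface as proposed here, not of merged
code; the values are illustrative):

$ qemu-system-x86_64 -accel kvm \
    -smp 32,sockets=1,dies=2,clusters=2,cores=4,threads=2 \
    -cpu host,x-l2-cache-topo=cluster

Here each die holds two modules ("clusters") of four cores, and
x-l2-cache-topo=cluster makes CPUID.04H describe the L2 as shared by
the four cores of a module instead of by a single core.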
[PATCH 07/18] i386: Support modules_per_die in X86CPUTopoInfo
From: Zhuocheng Ding

Support module level in i386 cpu topology structure "X86CPUTopoInfo".

Before updating the APIC ID parsing rule with module level, the
apicid_core_width() temporarily combines the core and module levels
together.

At present, we don't expose module level in CPUID.1FH because currently
linux (v6.2-rc6) doesn't support module level. And exposing module and
die levels at the same time in CPUID.1FH will cause linux to calculate
the wrong die_id. The module level should be exposed until the real
machine has the module level in CPUID.1FH.

In addition, update the topology structure in test-x86-apicid.c.

Signed-off-by: Zhuocheng Ding
Co-developed-by: Zhao Liu
Signed-off-by: Zhao Liu
---
 hw/i386/x86.c                |  3 ++-
 include/hw/i386/topology.h   | 13 ---
 target/i386/cpu.c            | 17 --
 tests/unit/test-x86-apicid.c | 45 +++-
 4 files changed, 46 insertions(+), 32 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 66902d1c0923..3f6e0dd7b827 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -70,7 +70,8 @@ inline void init_topo_info(X86CPUTopoInfo *topo_info,
     MachineState *ms = MACHINE(x86ms);

     topo_info->dies_per_pkg = ms->smp.dies;
-    topo_info->cores_per_die = ms->smp.cores;
+    topo_info->modules_per_die = ms->smp.clusters;
+    topo_info->cores_per_module = ms->smp.cores;
     topo_info->threads_per_core = ms->smp.threads;
 }

diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
index 81573f6cfde0..bbb00dc4aad8 100644
--- a/include/hw/i386/topology.h
+++ b/include/hw/i386/topology.h
@@ -54,7 +54,8 @@ typedef struct X86CPUTopoIDs {

 typedef struct X86CPUTopoInfo {
     unsigned dies_per_pkg;
-    unsigned cores_per_die;
+    unsigned modules_per_die;
+    unsigned cores_per_module;
     unsigned threads_per_core;
 } X86CPUTopoInfo;

@@ -78,7 +79,12 @@ static inline unsigned apicid_smt_width(X86CPUTopoInfo *topo_info)
  */
 static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info)
 {
-    return apicid_bitwidth_for_count(topo_info->cores_per_die);
+    /*
+     * TODO: Will separate module info from core_width when update
+     * APIC ID with module level.
+     */
+    return apicid_bitwidth_for_count(topo_info->cores_per_module *
+                                     topo_info->modules_per_die);
 }

 /* Bit width of the Die_ID field */
@@ -128,7 +134,8 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info,
                                          X86CPUTopoIDs *topo_ids)
 {
     unsigned nr_dies = topo_info->dies_per_pkg;
-    unsigned nr_cores = topo_info->cores_per_die;
+    unsigned nr_cores = topo_info->cores_per_module *
+                        topo_info->modules_per_die;
     unsigned nr_threads = topo_info->threads_per_core;

     topo_ids->pkg_id = cpu_index / (nr_dies * nr_cores * nr_threads);
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 61ec9a7499b8..6f3d114c7d12 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -336,7 +336,9 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,

     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
+        l3_threads = topo_info->modules_per_die *
+                     topo_info->cores_per_module *
+                     topo_info->threads_per_core;
         *eax |= (l3_threads - 1) << 14;
     } else {
         *eax |= ((topo_info->threads_per_core - 1) << 14);
@@ -5218,11 +5220,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
     uint32_t cpus_per_pkg;

     topo_info.dies_per_pkg = env->nr_dies;
-    topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
+    topo_info.modules_per_die = env->nr_modules;
+    topo_info.cores_per_module = cs->nr_cores / env->nr_dies / env->nr_modules;
     topo_info.threads_per_core = cs->nr_threads;

-    cpus_per_pkg = topo_info.dies_per_pkg * topo_info.cores_per_die *
-                   topo_info.threads_per_core;
+    cpus_per_pkg = topo_info.dies_per_pkg * topo_info.modules_per_die *
+                   topo_info.cores_per_module * topo_info.threads_per_core;

     /* Calculate & apply limits for different index ranges */
     if (index >= 0xC0000000) {
@@ -5298,8 +5301,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
             if (*eax & 31) {
                 int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
                 int vcpus_per_socket = cpus_per_pkg;
-                int cores_per_socket = topo_info.cores_per_die *
-                                       topo_info.dies_per_pkg;
+                int cores_per_socket = cpus_per_pkg /
+                                       topo_info.threads_per_core;
                 if (cores_per_socket > 1) {
                     *eax &= ~0xFC000000;
                     *eax |= (pow2ceil(cores_per_socket) - 1) << 26;
@@ -548
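A quick worked example of the combined width (mine, not from the
patch): with clusters=2 and cores=4, cores_per_module * modules_per_die
is 8, and apicid_bitwidth_for_count(8) = 32 - clz32(8 - 1) = 3, so the
combined core field takes 3 bits in the APIC ID. With threads=2 the
SMT field below it takes 1 bit, putting core IDs at bit 1 and the die
offset at bit 4.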
[PATCH 03/18] softmmu: Fix CPUSTATE.nr_cores' calculation
From: Zhuocheng Ding

From CPUState.nr_cores' comment, it represents "number of cores within
this CPU package". After 003f230 (machine: Tweak the order of topology
members in struct CpuTopology), the meaning of smp.cores changed to
"the number of cores in one die", but this commit missed to change
CPUState.nr_cores' calculation, so that CPUState.nr_cores became wrong
and now it fails to consider the numbers of clusters and dies.

At present, only i386 is using CPUState.nr_cores. But as for i386,
which supports die level, the uses of CPUState.nr_cores are very
confusing:

Early uses are based on the meaning of "cores per package" (before die
is introduced into i386), and later uses are based on "cores per die"
(after die's introduction).

This difference is due to that commit a94e142 (target/i386: Add
CPUID.1F generation support for multi-dies PCMachine) misunderstood
that CPUState.nr_cores means "cores per die" when calculating
CPUID.1FH.01H:EBX. After that, the changes in i386 all followed this
wrong understanding.

With the influence of 003f230 and a94e142, for i386 currently the
result of CPUState.nr_cores is "cores per die", thus the original uses
of CPUState.cores based on the meaning of "cores per package" are wrong
when multiple dies exist:

1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16]
   is incorrect because it expects "cpus per package" but now the
   result is "cpus per die".
2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of
   CPUID.04H: EAX[bits 31:26] is incorrect because they expect "cpus
   per package" but now the result is "cpus per die". The error not
   only impacts the EAX calculation in the cache_info_passthrough case,
   but also impacts other cases of setting cache topology for Intel CPU
   according to cpu topology (specifically, the incoming parameter
   "num_cores" expects "cores per package" in encode_cache_cpuid4()).
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits
   15:00] is incorrect because the EBX of 0BH.01H (core level) expects
   "cpus per package", which may be different from 1FH.01H (The reason
   is 1FH can support more levels. For QEMU, 1FH also supports die,
   1FH.01H:EBX[bits 15:00] expects "cpus per die").
4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.80000001H is
   calculated, here "cpus per package" is expected to be checked, but
   in fact, now it checks "cpus per die". Though "cpus per die" also
   works for this code logic, this isn't consistent with AMD's APM.
5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.80000008H:ECX expects
   "cpus per package" but it obtains "cpus per die".
6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in
   kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in
   helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c,
   MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per
   package", but in these functions, it obtains "cpus per die" and
   "cores per die".

On the other hand, these uses are correct now (they are added in/after
a94e142):

1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die
   meets the actual meaning of CPUState.nr_cores ("cores per die").
2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in
   CPUID.04H's calculation) considers the number of dies, so it's
   correct.
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits
   15:00] needs "cpus per die" and it gets the correct result, and
   CPUID.1FH.02H:EBX[bits 15:00] gets correct "cpus per package".

When CPUState.nr_cores is correctly changed to "cores per package"
again, the above errors will be fixed without extra work, but the
"currently" correct cases will go wrong and need special handling to
pass the correct "cpus/cores per die" they want.

Thus in this patch, we fix CPUState.nr_cores' calculation to fit the
original meaning "cores per package", as well as changing the
calculation of topo_info.cores_per_die, vcpus_per_socket and CPUID.1FH.

In addition, in the nr_threads' comment, specify it represents the
number of threads in the "core" to avoid confusion.

Fixes: a94e142 (target/i386: Add CPUID.1F generation support for
multi-dies PCMachine)
Fixes: 003f230 (machine: Tweak the order of topology members in struct
CpuTopology)

Signed-off-by: Zhuocheng Ding
Co-developed-by: Zhao Liu
Signed-off-by: Zhao Liu
---
 include/hw/core/cpu.h | 2 +-
 softmmu/cpus.c        | 2 +-
 target/i386/cpu.c     | 9 -
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 2417597236bc..5253e4e839bb 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -274,7 +274,7 @@ struct qemu_work_item;
  *   QOM parent.
  * @tcg_cflags: Pre-computed cflags for this cpu.
  * @nr_cores: Number of cores within this CPU package.
- * @nr_threads: Number of threads within this CPU.
+ * @nr_threads: Number of threads within this CPU core.
  * @running: #true if CPU is currently running (lockless).
 * @has_
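A worked example of the semantic fix (mine, not from the commit
message): with -smp 16,sockets=2,dies=2,cores=2,threads=2, smp.cores is
2 ("cores per die"), so the unfixed CPUState.nr_cores is 2; with the
fix it becomes dies * cores = 4 ("cores per package"), and
CPUID.01H:EBX[bits 23:16] again reports 4 * 2 = 8 logical processors
per package rather than 4.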
[PATCH 06/18] i386: Introduce module-level cpu topology to CPUX86State
From: Zhuocheng Ding

The smp command line has the "clusters" parameter, but x86 hasn't supported that level. Though "clusters" was introduced to help define the L2 cache topology [1], using cluster to define x86's L2 cache topology would cause a compatibility problem: currently, x86 defaults to the L2 cache being shared in one core, which actually implies a default setting of "cores per L2 cache is 1" and therefore implicitly defaults to having as many L2 caches as cores. For example (i386 PC machine):

-smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 (*)

Considering the topology of the L2 cache, this (*) implicitly means "1 core per L2 cache" and "2 L2 caches per die". If we used cluster to configure the L2 cache topology with a new default setting of "clusters per L2 cache is 1", the above semantics would change to "2 cores per cluster" and "1 cluster per L2 cache", that is, "2 cores per L2 cache". So the same command (*) would change the L2 cache topology, further affecting the performance of the virtual machine. Therefore, x86 should only treat cluster as a cpu topology level and, for compatibility, avoid using it to change the L2 cache by default.

"cluster" in smp is the CPU topology level between "core" and "die". For x86, the "cluster" in smp corresponds to the module level [2], which is above the core level. So use "module" rather than "cluster" in i386 code.

And please note that x86 already has a cpu topology level also named "cluster" [2]; that level is above the package level. The cluster in the x86 cpu topology is completely different from the "clusters" smp parameter. After the module level is introduced, the cluster smp parameter will actually refer to the module level of x86.

[1]: 0d87178 (hw/core/machine: Introduce CPU cluster topology support)
[2]: SDM, vol.3, ch.9, 9.9.1 Hierarchical Mapping of Shared Resources.

Signed-off-by: Zhuocheng Ding
Co-developed-by: Zhao Liu
Signed-off-by: Zhao Liu
---
 hw/i386/x86.c     | 1 +
 target/i386/cpu.c | 1 +
 target/i386/cpu.h | 6 ++
 3 files changed, 8 insertions(+)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 78cc131926c8..66902d1c0923 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -305,6 +305,7 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, init_topo_info(&topo_info, x86ms); env->nr_dies = ms->smp.dies; +env->nr_modules = ms->smp.clusters; /* * If APIC ID is not set, diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 4cda84eb96f1..61ec9a7499b8 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6781,6 +6781,7 @@ static void x86_cpu_initfn(Object *obj) CPUX86State *env = &cpu->env; env->nr_dies = 1; +env->nr_modules = 1; cpu_set_cpustate_pointers(cpu); object_property_add(obj, "feature-words", "X86CPUFeatureWordInfo", diff --git a/target/i386/cpu.h b/target/i386/cpu.h index d4bc19577a21..f3afea765982 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1810,7 +1810,13 @@ typedef struct CPUArchState { TPRAccess tpr_access_type; +/* Number of dies per package. */ unsigned nr_dies; +/* + * Number of modules per die. Module level in x86 cpu topology is + * corresponding to smp.clusters. + */ +unsigned nr_modules; } CPUX86State; struct kvm_msrs; -- 2.34.1
[PATCH 11/18] i386/cpu: Introduce cluster-id to X86CPU
From: Zhuocheng Ding

We introduce cluster-id rather than module-id to be consistent with CpuInstanceProperties.cluster-id, and this avoids confusion of parameter names when hotplugging. Following the legacy smp check rules, also add cluster_id validity checks to x86_cpu_pre_plug().

Signed-off-by: Zhuocheng Ding
Co-developed-by: Zhao Liu
Signed-off-by: Zhao Liu
---
 hw/i386/x86.c     | 33 +
 target/i386/cpu.c |  2 ++
 target/i386/cpu.h |  1 +
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 68fce87d18ac..57d646bf360f 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -324,6 +324,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, cpu->die_id = 0; } +/* + * cluster-id was optional in QEMU 8.0 and older, so keep it optional + * if there's only one cluster per die. + */ +if (cpu->cluster_id < 0 && ms->smp.clusters == 1) { +cpu->cluster_id = 0; +} + if (cpu->socket_id < 0) { error_setg(errp, "CPU socket-id is not set"); return; @@ -340,6 +348,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, cpu->die_id, ms->smp.dies - 1); return; } +if (cpu->cluster_id < 0) { +error_setg(errp, "CPU cluster-id is not set"); +return; +} else if (cpu->cluster_id > ms->smp.clusters - 1) { +error_setg(errp, "Invalid CPU cluster-id: %u must be in range 0:%u", + cpu->cluster_id, ms->smp.clusters - 1); +return; +} if (cpu->core_id < 0) { error_setg(errp, "CPU core-id is not set"); return; @@ -359,16 +375,9 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, topo_ids.pkg_id = cpu->socket_id; topo_ids.die_id = cpu->die_id; +topo_ids.module_id = cpu->cluster_id; topo_ids.core_id = cpu->core_id; topo_ids.smt_id = cpu->thread_id; - -/* - * TODO: This is the temporary initialization for topo_ids.module_id to - * avoid "maybe-uninitialized" compilation errors. Will remove when - * X86CPU supports cluster_id.
- */ -topo_ids.module_id = 0; - cpu->apic_id = x86_apicid_from_topo_ids(&topo_info, &topo_ids); } @@ -415,6 +424,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, } cpu->die_id = topo_ids.die_id; +if (cpu->cluster_id != -1 && cpu->cluster_id != topo_ids.module_id) { +error_setg(errp, "property cluster-id: %u doesn't match set apic-id:" +" 0x%x (cluster-id: %u)", cpu->cluster_id, cpu->apic_id, +topo_ids.module_id); +return; +} +cpu->cluster_id = topo_ids.module_id; + if (cpu->core_id != -1 && cpu->core_id != topo_ids.core_id) { error_setg(errp, "property core-id: %u doesn't match set apic-id:" " 0x%x (core-id: %u)", cpu->core_id, cpu->apic_id, diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 6f3d114c7d12..27bbbc36b11c 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -6980,12 +6980,14 @@ static Property x86_cpu_properties[] = { DEFINE_PROP_UINT32("apic-id", X86CPU, apic_id, 0), DEFINE_PROP_INT32("thread-id", X86CPU, thread_id, 0), DEFINE_PROP_INT32("core-id", X86CPU, core_id, 0), +DEFINE_PROP_INT32("cluster-id", X86CPU, cluster_id, 0), DEFINE_PROP_INT32("die-id", X86CPU, die_id, 0), DEFINE_PROP_INT32("socket-id", X86CPU, socket_id, 0), #else DEFINE_PROP_UINT32("apic-id", X86CPU, apic_id, UNASSIGNED_APIC_ID), DEFINE_PROP_INT32("thread-id", X86CPU, thread_id, -1), DEFINE_PROP_INT32("core-id", X86CPU, core_id, -1), +DEFINE_PROP_INT32("cluster-id", X86CPU, cluster_id, -1), DEFINE_PROP_INT32("die-id", X86CPU, die_id, -1), DEFINE_PROP_INT32("socket-id", X86CPU, socket_id, -1), #endif diff --git a/target/i386/cpu.h b/target/i386/cpu.h index f3afea765982..8668e74e0c87 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1966,6 +1966,7 @@ struct ArchCPU { int32_t node_id; /* NUMA node this CPU belongs to */ int32_t socket_id; int32_t die_id; +int32_t cluster_id; int32_t core_id; int32_t thread_id; -- 2.34.1
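As an illustration of the new property (hypothetical IDs; the device type depends on the configured -cpu model), a hotplug request after this patch names all five topology IDs:

    (qemu) device_add qemu64-x86_64-cpu,id=cpu-extra,socket-id=1,die-id=0,cluster-id=1,core-id=0,thread-id=1

If the machine was started with clusters=1, cluster-id may be omitted and defaults to 0, matching the compatibility branch added above.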
[PATCH 17/18] i386: Use CPUCacheInfo.share_level to encode CPUID[0x8000001D].EAX[bits 25:14]
From: Zhao Liu

CPUID[0x8000001D].EAX[bits 25:14] is used to represent the cache topology for AMD CPUs. After cache models have topology information, we can use CPUCacheInfo.share_level to decide which topology level is encoded into CPUID[0x8000001D].EAX[bits 25:14].

Signed-off-by: Zhao Liu
---
 target/i386/cpu.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c index d691c02e3c06..5816dc99b1d4 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -355,20 +355,12 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) { -uint32_t sharing_apic_ids; assert(cache->size == cache->line_size * cache->associativity * cache->partitions * cache->sets); *eax = CACHE_TYPE(cache->type) | CACHE_LEVEL(cache->level) | (cache->self_init ? CACHE_SELF_INIT_LEVEL : 0); - -/* L3 is shared among multiple cores */ -if (cache->level == 3) { -sharing_apic_ids = 1 << apicid_die_offset(topo_info); -} else { -sharing_apic_ids = 1 << apicid_core_offset(topo_info); -} -*eax |= (sharing_apic_ids - 1) << 14; +*eax |= max_processor_ids_for_cache(cache, topo_info) << 14; assert(cache->line_size > 0); assert(cache->partitions > 0); -- 2.34.1
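A quick worked example of what this encodes (illustrative numbers, not a particular CPU model): with 2 threads per core, 8 cores per die and one module per die, apicid_core_offset() is 1 and apicid_die_offset() is 4, so:

    /* share_level == CORE (e.g. an L2): 2 logical processors share it. */
    uint32_t l2_field = (1u << 1) - 1;   /* EAX[25:14] = 1  */
    /* share_level == DIE (e.g. an L3): 16 logical processors share it. */
    uint32_t l3_field = (1u << 4) - 1;   /* EAX[25:14] = 15 */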
[PATCH 14/18] i386: Add cache topology info in CPUCacheInfo
From: Zhao Liu

Currently, by default, the cache topology is encoded as:
1. i/d cache is shared in one core.
2. L2 cache is shared in one core.
3. L3 cache is shared in one die.

This general default has caused a misunderstanding, namely that the cache topology is completely equated with specific cpu topology levels, such as the connection between L2 cache and the core level, and the connection between L3 cache and the die level. In fact, these topology settings depend on the specific platform and are not static. For example, on Alder Lake-P, every four Atom cores share the same L2 cache. Thus, we should explicitly define the corresponding cache topology for the different cache models to increase scalability.

Except for legacy_l2_cache_cpuid2 (whose default topo level is INVALID), explicitly set the corresponding topology level for all other cache models. In order to be compatible with the existing cache topology, set the CORE level for the i/d caches, the CORE level for the L2 cache, and the DIE level for the L3 cache.

The field for CPUID[4].EAX[bits 25:14] or CPUID[0x8000001D].EAX[bits 25:14] will be set based on CPUCacheInfo.share_level.

Signed-off-by: Zhao Liu
---
 target/i386/cpu.c | 19 +++
 target/i386/cpu.h | 16 
 2 files changed, 35 insertions(+)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 27bbbc36b11c..364534e84b1b 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -433,6 +433,7 @@ static CPUCacheInfo legacy_l1d_cache = { .sets = 64, .partitions = 1, .no_invd_sharing = true, +.share_level = CORE, }; /*FIXME: CPUID leaf 0x80000005 is inconsistent with leaves 2 & 4 */ @@ -447,6 +448,7 @@ static CPUCacheInfo legacy_l1d_cache_amd = { .partitions = 1, .lines_per_tag = 1, .no_invd_sharing = true, +.share_level = CORE, }; /* L1 instruction cache: */ @@ -460,6 +462,7 @@ static CPUCacheInfo legacy_l1i_cache = { .sets = 64, .partitions = 1, .no_invd_sharing = true, +.share_level = CORE, }; /*FIXME: CPUID leaf 0x80000005 is inconsistent with leaves 2 & 4 */ @@ -474,6 +477,7 @@ static CPUCacheInfo legacy_l1i_cache_amd = { .partitions = 1, .lines_per_tag = 1, .no_invd_sharing = true, +.share_level = CORE, }; /* Level 2 unified cache: */ @@ -487,6 +491,7 @@ static CPUCacheInfo legacy_l2_cache = { .sets = 4096, .partitions = 1, .no_invd_sharing = true, +.share_level = CORE, }; /*FIXME: CPUID leaf 2 descriptor is inconsistent with CPUID leaf 4 */ @@ -509,6 +514,7 @@ static CPUCacheInfo legacy_l2_cache_amd = { .associativity = 16, .sets = 512, .partitions = 1, +.share_level = CORE, }; /* Level 3 unified cache: */ @@ -524,6 +530,7 @@ static CPUCacheInfo legacy_l3_cache = { .self_init = true, .inclusive = true, .complex_indexing = true, +.share_level = DIE, }; /* TLB definitions: */ @@ -1668,6 +1675,7 @@ static const CPUCaches epyc_cache_info = { .lines_per_tag = 1, .self_init = 1, .no_invd_sharing = true, +.share_level = CORE, }, .l1i_cache = &(CPUCacheInfo) { .type = INSTRUCTION_CACHE, @@ -1680,6 +1688,7 @@ static const CPUCaches epyc_cache_info = { .lines_per_tag = 1, .self_init = 1, .no_invd_sharing = true, +.share_level = CORE, }, .l2_cache = &(CPUCacheInfo) { .type = UNIFIED_CACHE, @@ -1690,6 +1699,7 @@ static const CPUCaches epyc_cache_info = { .partitions = 1, .sets = 1024, .lines_per_tag = 1, +.share_level = CORE, }, .l3_cache = &(CPUCacheInfo) { .type = UNIFIED_CACHE, @@ -1703,6 +1713,7 @@ static const CPUCaches epyc_cache_info = { .self_init = true, .inclusive = true, .complex_indexing = true, +.share_level = DIE, }, }; @@ -1718,6 +1729,7 @@ static const CPUCaches epyc_rome_cache_info = {
.lines_per_tag = 1, .self_init = 1, .no_invd_sharing = true, +.share_level = CORE, }, .l1i_cache = &(CPUCacheInfo) { .type = INSTRUCTION_CACHE, @@ -1730,6 +1742,7 @@ static const CPUCaches epyc_rome_cache_info = { .lines_per_tag = 1, .self_init = 1, .no_invd_sharing = true, +.share_level = CORE, }, .l2_cache = &(CPUCacheInfo) { .type = UNIFIED_CACHE, @@ -1740,6 +1753,7 @@ static const CPUCaches epyc_rome_cache_info = { .partitions = 1, .sets = 1024, .lines_per_tag = 1, +.share_level = CORE, }, .l3_cache = &(CPUCacheInfo) { .type = UNIFIED_CACHE, @@ -1753,6 +1767,7 @@ static const CPUCaches epyc_rome_cache_info = { .self_init = true, .inclusive = true, .complex_indexing = true, +.share_level = DIE, }, }; @@ -1768,6 +1783,7 @@ static const CPUCaches epyc_milan_cache_info = { .lines_per_tag = 1, .
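Although the diff above is cut off, the shape of the addition can be sketched as follows. The enum constants CORE/DIE/MODULE and the INVALID default are inferred from their uses elsewhere in this series; the exact type names and placement in target/i386/cpu.h are assumptions:

    /* Sketch: topology level at which a cache is shared. */
    enum CPUTopoLevel {
        INVALID,
        SMT,
        CORE,
        MODULE,
        DIE,
        PACKAGE,
    };

    typedef struct CPUCacheInfo {
        /* ... existing fields: type, level, size, line_size, ... */
        enum CPUTopoLevel share_level; /* new field added by this patch */
    } CPUCacheInfo;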
[PATCH 02/18] tests: Rename test-x86-cpuid.c to test-x86-apicid.c
From: Zhao Liu

In fact, this unit tests the APIC ID rather than CPUID. Rename it to test-x86-apicid.c to bring the name in line with its actual content.

Signed-off-by: Zhao Liu
---
 MAINTAINERS                                        | 2 +-
 tests/unit/meson.build                             | 4 ++--
 tests/unit/{test-x86-cpuid.c => test-x86-apicid.c} | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
 rename tests/unit/{test-x86-cpuid.c => test-x86-apicid.c} (99%)

diff --git a/MAINTAINERS b/MAINTAINERS index c581c11a645a..a6a0c7fe5795 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1674,7 +1674,7 @@ F: include/hw/southbridge/piix.h F: hw/misc/sga.c F: hw/isa/apm.c F: include/hw/isa/apm.h -F: tests/unit/test-x86-cpuid.c +F: tests/unit/test-x86-apicid.c F: tests/qtest/test-x86-cpuid-compat.c PC Chipset diff --git a/tests/unit/meson.build b/tests/unit/meson.build index ffa444f4323c..a9df2843e92e 100644 --- a/tests/unit/meson.build +++ b/tests/unit/meson.build @@ -20,8 +20,8 @@ tests = { 'test-opts-visitor': [testqapi], 'test-visitor-serialization': [testqapi], 'test-bitmap': [], - # all code tested by test-x86-cpuid is inside topology.h - 'test-x86-cpuid': [], + # all code tested by test-x86-apicid is inside topology.h + 'test-x86-apicid': [], 'test-cutils': [], 'test-div128': [], 'test-shift128': [], diff --git a/tests/unit/test-x86-cpuid.c b/tests/unit/test-x86-apicid.c similarity index 99% rename from tests/unit/test-x86-cpuid.c rename to tests/unit/test-x86-apicid.c index bfabc0403a1a..2b104f86d7c2 100644 --- a/tests/unit/test-x86-cpuid.c +++ b/tests/unit/test-x86-apicid.c @@ -1,5 +1,5 @@ /* - * Test code for x86 CPUID and Topology functions + * Test code for x86 APIC ID and Topology functions * * Copyright (c) 2012 Red Hat Inc. * -- 2.34.1
[PATCH 18/18] i386: Add new property to control L2 cache topo in CPUID.04H
From: Zhao Liu

The property x-l2-cache-topo will be used to change the L2 cache topology in CPUID.04H. Now it allows the user to set whether the L2 cache is shared at the core level or at the cluster level. If the user passes "-cpu x-l2-cache-topo=[core|cluster]", the old L2 cache topology will be overridden by the new setting.

Here we expose "cluster" to the user instead of "module", to be consistent with the "cluster-id" naming.

Since CPUID.04H is used by Intel CPUs, this property is available on Intel CPUs for now. When necessary, it can be extended to CPUID.8000001DH for AMD CPUs.

Signed-off-by: Zhao Liu
---
 target/i386/cpu.c | 33 -
 target/i386/cpu.h |  2 ++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 5816dc99b1d4..cf84c720a431 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -240,12 +240,15 @@ static uint32_t max_processor_ids_for_cache(CPUCacheInfo *cache, case CORE: num_ids = 1 << apicid_core_offset(topo_info); break; +case MODULE: +num_ids = 1 << apicid_module_offset(topo_info); +break; case DIE: num_ids = 1 << apicid_die_offset(topo_info); break; default: /* - * Currently there is no use case for SMT, MODULE and PACKAGE, so use + * Currently there is no use case for SMT and PACKAGE, so use * assert directly to facilitate debugging. */ g_assert_not_reached(); @@ -6633,6 +6636,33 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp) env->cache_info_amd.l3_cache = &legacy_l3_cache; } +if (cpu->l2_cache_topo_level) { +/* + * FIXME: Currently only supports changing CPUID[4] (for intel), and + * will support changing CPUID[0x8000001D] when necessary. + */ +if (!IS_INTEL_CPU(env)) { +error_setg(errp, "only intel cpus supports x-l2-cache-topo"); +return; +} + +if (!strcmp(cpu->l2_cache_topo_level, "core")) { +env->cache_info_cpuid4.l2_cache->share_level = CORE; +} else if (!strcmp(cpu->l2_cache_topo_level, "cluster")) { +/* + * We expose to users "cluster" instead of "module", to be + * consistent with "cluster-id" naming. + */ +env->cache_info_cpuid4.l2_cache->share_level = MODULE; +} else { +error_setg(errp, + "x-l2-cache-topo doesn't support '%s', " + "and it only supports 'core' or 'cluster'", + cpu->l2_cache_topo_level); +return; +} +} + #ifndef CONFIG_USER_ONLY MachineState *ms = MACHINE(qdev_get_machine()); qemu_register_reset(x86_cpu_machine_reset_cb, cpu); @@ -7135,6 +7165,7 @@ static Property x86_cpu_properties[] = { false), DEFINE_PROP_BOOL("x-intel-pt-auto-level", X86CPU, intel_pt_auto_level, true), +DEFINE_PROP_STRING("x-l2-cache-topo", X86CPU, l2_cache_topo_level), DEFINE_PROP_END_OF_LIST() }; diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 5a955431f759..aa7e96c586c7 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1987,6 +1987,8 @@ struct ArchCPU { int32_t thread_id; int32_t hv_max_vps; + +char *l2_cache_topo_level; }; -- 2.34.1
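For illustration (the CPU model is arbitrary; any Intel model passes the IS_INTEL_CPU check), letting the L2 span a whole 2-core cluster instead of a single core:

    -smp 8,sockets=1,dies=1,clusters=2,cores=2,threads=2 \
    -cpu Skylake-Server,x-l2-cache-topo=cluster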
[PATCH 10/18] i386: Update APIC ID parsing rule to support module level
From: Zhuocheng Ding

Add module-level parsing support for the APIC ID. With this support, the conversion between X86CPUTopoIDs, X86CPUTopoInfo and the APIC ID is now complete.

Signed-off-by: Zhuocheng Ding
Co-developed-by: Zhao Liu
Signed-off-by: Zhao Liu
---
 hw/i386/x86.c              | 19 ---
 include/hw/i386/topology.h | 36 ++--
 2 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c index b66719230d57..68fce87d18ac 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -310,11 +310,11 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, /* * If APIC ID is not set, - * set it based on socket/die/core/thread properties. + * set it based on socket/die/cluster/core/thread properties. */ if (cpu->apic_id == UNASSIGNED_APIC_ID) { -int max_socket = (ms->smp.max_cpus - 1) / -smp_threads / smp_cores / ms->smp.dies; +int max_socket = (ms->smp.max_cpus - 1) / smp_threads / smp_cores / +ms->smp.clusters / ms->smp.dies; /* * die-id was optional in QEMU 4.0 and older, so keep it optional @@ -378,15 +378,12 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, x86_topo_ids_from_apicid(cpu->apic_id, &topo_info, &topo_ids); -/* - * TODO: Before APIC ID supports module level parsing, there's no need - * to expose module_id info. - */ error_setg(errp, -"Invalid CPU [socket: %u, die: %u, core: %u, thread: %u] with" -" APIC ID %" PRIu32 ", valid index range 0:%d", -topo_ids.pkg_id, topo_ids.die_id, topo_ids.core_id, topo_ids.smt_id, -cpu->apic_id, ms->possible_cpus->len - 1); +"Invalid CPU [socket: %u, die: %u, module: %u, core: %u, thread: %u]" +" with APIC ID %" PRIu32 ", valid index range 0:%d", +topo_ids.pkg_id, topo_ids.die_id, topo_ids.module_id, +topo_ids.core_id, topo_ids.smt_id, cpu->apic_id, +ms->possible_cpus->len - 1); return; } diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h index 5de905dc00d3..3cec97b377f2 100644 --- a/include/hw/i386/topology.h +++ b/include/hw/i386/topology.h @@ -79,12 +79,13 @@ static inline unsigned apicid_smt_width(X86CPUTopoInfo *topo_info) /* Bit width of the Core_ID field */ static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info) { -/* - * TODO: Will separate module info from core_width when update - * APIC ID with module level.
- */ -return apicid_bitwidth_for_count(topo_info->cores_per_module * - topo_info->modules_per_die); +return apicid_bitwidth_for_count(topo_info->cores_per_module); +} + +/* Bit width of the Module_ID (cluster ID) field */ +static inline unsigned apicid_module_width(X86CPUTopoInfo *topo_info) +{ +return apicid_bitwidth_for_count(topo_info->modules_per_die); } /* Bit width of the Die_ID field */ @@ -99,10 +100,16 @@ static inline unsigned apicid_core_offset(X86CPUTopoInfo *topo_info) return apicid_smt_width(topo_info); } +/* Bit offset of the Module_ID (cluster ID) field */ +static inline unsigned apicid_module_offset(X86CPUTopoInfo *topo_info) +{ +return apicid_core_offset(topo_info) + apicid_core_width(topo_info); +} + /* Bit offset of the Die_ID field */ static inline unsigned apicid_die_offset(X86CPUTopoInfo *topo_info) { -return apicid_core_offset(topo_info) + apicid_core_width(topo_info); +return apicid_module_offset(topo_info) + apicid_module_width(topo_info); } /* Bit offset of the Pkg_ID (socket ID) field */ @@ -121,6 +128,7 @@ static inline apic_id_t x86_apicid_from_topo_ids(X86CPUTopoInfo *topo_info, { return (topo_ids->pkg_id << apicid_pkg_offset(topo_info)) | (topo_ids->die_id << apicid_die_offset(topo_info)) | + (topo_ids->module_id << apicid_module_offset(topo_info)) | (topo_ids->core_id << apicid_core_offset(topo_info)) | topo_ids->smt_id; } @@ -138,11 +146,6 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info, unsigned nr_cores = topo_info->cores_per_module; unsigned nr_threads = topo_info->threads_per_core; -/* - * Currently smp for i386 doesn't support "clusters", modules_per_die is - * only 1. Therefore, the module_id generated from the module topology will - * not conflict with the module_id generated according to the apicid. - */ topo_ids->pkg_id = cpu_index / (nr_dies * nr_modules * nr_cores * nr_threads); topo_ids->die_id = cpu_index / (nr_modules * nr_cores * @@ -166,12 +169,9 @@ static inline void x86_topo_ids_from_apicid(apic_id_t apicid, topo_ids->core_id = (apicid >> apicid_core_offset(topo_info)) & ~(0xUL << apicid_core_width(topo
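To make the resulting bit layout concrete, a worked example (illustrative values): with threads_per_core=2, cores_per_module=2, modules_per_die=2 and dies_per_pkg=2, every width above is 1 bit, so the offsets are SMT=0, core=1, module=2, die=3 and package=4, and the CPU [pkg 1, die 0, module 1, core 0, thread 1] gets:

    apic_id = (1 << 4) | (0 << 3) | (1 << 2) | (0 << 1) | 1; /* 0b10101 = 21 */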
[PATCH 05/18] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()
From: Zhao Liu

In cpu_x86_cpuid() there are many variables representing the cpu topology, e.g., topo_info and cs->nr_cores/cs->nr_threads. Since the names of cs->nr_cores/cs->nr_threads do not accurately represent their meaning, their use is prone to confusion and mistakes. The structure X86CPUTopoInfo names its members clearly, so the variable "topo_info" should be preferred.

Suggested-by: Robert Hoo
Signed-off-by: Zhao Liu
---
 target/i386/cpu.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 7833505092d8..4cda84eb96f1 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -5215,11 +5215,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, uint32_t limit; uint32_t signature[3]; X86CPUTopoInfo topo_info; +uint32_t cpus_per_pkg; topo_info.dies_per_pkg = env->nr_dies; topo_info.cores_per_die = cs->nr_cores / env->nr_dies; topo_info.threads_per_core = cs->nr_threads; +cpus_per_pkg = topo_info.dies_per_pkg * topo_info.cores_per_die * + topo_info.threads_per_core; + /* Calculate & apply limits for different index ranges */ if (index >= 0xC0000000) { limit = env->cpuid_xlevel2; @@ -5255,8 +5259,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx |= CPUID_EXT_OSXSAVE; } *edx = env->features[FEAT_1_EDX]; -if (cs->nr_cores * cs->nr_threads > 1) { -*ebx |= (cs->nr_cores * cs->nr_threads) << 16; +if (cpus_per_pkg > 1) { +*ebx |= cpus_per_pkg << 16; *edx |= CPUID_HT; } if (!cpu->enable_pmu) { @@ -5293,10 +5297,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, */ if (*eax & 31) { int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14); -int vcpus_per_socket = cs->nr_cores * cs->nr_threads; -if (cs->nr_cores > 1) { +int vcpus_per_socket = cpus_per_pkg; +int cores_per_socket = topo_info.cores_per_die * + topo_info.dies_per_pkg; +if (cores_per_socket > 1) { *eax &= ~0xFC000000; -*eax |= (pow2ceil(cs->nr_cores) - 1) << 26; +*eax |= (pow2ceil(cores_per_socket) - 1) << 26; } if (host_vcpus_per_cache > vcpus_per_socket) { *eax &= ~0x3FFC000; *eax |= (pow2ceil(vcpus_per_socket) - 1) << 14; } } @@ -5436,12 +5442,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, switch (count) { case 0: *eax = apicid_core_offset(&topo_info); -*ebx = cs->nr_threads; +*ebx = topo_info.threads_per_core; *ecx |= CPUID_TOPOLOGY_LEVEL_SMT; break; case 1: *eax = apicid_pkg_offset(&topo_info); -*ebx = cs->nr_cores * cs->nr_threads; +*ebx = cpus_per_pkg; *ecx |= CPUID_TOPOLOGY_LEVEL_CORE; break; default: @@ -5472,7 +5478,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, switch (count) { case 0: *eax = apicid_core_offset(&topo_info); -*ebx = cs->nr_threads; +*ebx = topo_info.threads_per_core; *ecx |= CPUID_TOPOLOGY_LEVEL_SMT; break; case 1: @@ -5482,7 +5488,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, break; case 2: *eax = apicid_pkg_offset(&topo_info); -*ebx = cs->nr_cores * cs->nr_threads; +*ebx = cpus_per_pkg; *ecx |= CPUID_TOPOLOGY_LEVEL_DIE; break; default: @@ -5707,7 +5713,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, * discards multiple thread information if it is set. * So don't set it here for Intel to make Linux guests happy.
*/ -if (cs->nr_cores * cs->nr_threads > 1) { +if (cpus_per_pkg > 1) { if (env->cpuid_vendor1 != CPUID_VENDOR_INTEL_1 || env->cpuid_vendor2 != CPUID_VENDOR_INTEL_2 || env->cpuid_vendor3 != CPUID_VENDOR_INTEL_3) { @@ -5769,7 +5775,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *eax |= (cpu_x86_virtual_addr_width(env) << 8); } *ebx = env->features[FEAT_8000_0008_EBX]; -if (cs->nr_cores * cs->nr_threads > 1) { +if (cpus_per_pkg > 1) { /* * Bits 15:12 is "The number of bits in the initial * Core::X86::Apic::ApicId[ApicId] value that indicate @@ -5777,7 +5783,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, * Bits 7:0
[PATCH 13/18] hw/i386/pc: Support smp.clusters for x86 PC machine
From: Zhuocheng Ding

Now that module-level topology support has been added to X86CPU, we can enable support for the cluster parameter on PC machines. With this support, we can define a 5-level x86 CPU topology with "-smp":

-smp cpus=*,maxcpus=*,sockets=*,dies=*,clusters=*,cores=*,threads=*.

Additionally, add a 5-level topology example to the description of "-smp".

Signed-off-by: Zhuocheng Ding
Signed-off-by: Zhao Liu
---
 hw/i386/pc.c    |  1 +
 qemu-options.hx | 10 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c index 6e592bd969aa..c329df56ebd2 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -1929,6 +1929,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data) mc->default_cpu_type = TARGET_DEFAULT_CPU_TYPE; mc->nvdimm_supported = true; mc->smp_props.dies_supported = true; +mc->smp_props.clusters_supported = true; mc->default_ram_id = "pc.ram"; object_class_property_add(oc, PC_MACHINE_MAX_RAM_BELOW_4G, "size", diff --git a/qemu-options.hx b/qemu-options.hx index d59d19704bc5..3700e1aa97ea 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -312,14 +312,14 @@ SRST -smp 8,sockets=2,cores=2,threads=2,maxcpus=8 The following sub-option defines a CPU topology hierarchy (2 sockets -totally on the machine, 2 dies per socket, 2 cores per die, 2 threads -per core) for PC machines which support sockets/dies/cores/threads. -Some members of the option can be omitted but their values will be -automatically computed: +totally on the machine, 2 dies per socket, 2 clusters per die, 2 cores per +cluster, 2 threads per core) for PC machines which support sockets/dies +/clusters/cores/threads. Some members of the option can be omitted but +their values will be automatically computed: :: --smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 +-smp 32,sockets=2,dies=2,clusters=2,cores=2,threads=2,maxcpus=32 The following sub-option defines a CPU topology hierarchy (2 sockets totally on the machine, 2 clusters per socket, 2 cores per cluster, -- 2.34.1
[PATCH 01/18] machine: Fix comment of machine_parse_smp_config()
From: Zhao Liu

Now that smp supports dies and clusters, add descriptions of these two levels to the comment of machine_parse_smp_config().

Fixes: 864c3b5 (hw/core/machine: Introduce CPU cluster topology support)
Suggested-by: Robert Hoo
Signed-off-by: Zhao Liu
---
 hw/core/machine-smp.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/core/machine-smp.c b/hw/core/machine-smp.c index c3dab007dadc..3fd9e641efde 100644 --- a/hw/core/machine-smp.c +++ b/hw/core/machine-smp.c @@ -51,8 +51,8 @@ static char *cpu_hierarchy_to_string(MachineState *ms) * machine_parse_smp_config: Generic function used to parse the given * SMP configuration * - * Any missing parameter in "cpus/maxcpus/sockets/cores/threads" will be - * automatically computed based on the provided ones. + * Any missing parameter in "cpus/maxcpus/sockets/dies/clusters/cores/threads" + * will be automatically computed based on the provided ones. * * In the calculation of omitted sockets/cores/threads: we prefer sockets * over cores over threads before 6.2, while preferring cores over sockets @@ -66,7 +66,8 @@ static char *cpu_hierarchy_to_string(MachineState *ms) * * For compatibility, apart from the parameters that will be computed, newly * introduced topology members which are likely to be target specific should - * be directly set as 1 if they are omitted (e.g. dies for PC since 4.1). + * be directly set as 1 if they are omitted (e.g. dies for PC since v4.1 and + * clusters for arm since v7.0). */ void machine_parse_smp_config(MachineState *ms, const SMPConfiguration *config, Error **errp) -- 2.34.1
[PATCH 16/18] i386: Fix NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
From: Zhao Liu

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14]) means [1]:

The number of logical processors sharing this cache is the value of this field incremented by 1. To determine which logical processors are sharing a cache, determine a Share Id for each processor as follows: ShareId = LocalApicId >> log2(NumSharingCache+1) Logical processors with the same ShareId then share a cache. If NumSharingCache+1 is not a power of two, round it up to the next power of two.

From the description above, the calculation of this field should be the same as that of CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of the APIC ID to calculate this field.

Note: I don't have the hardware available; I hope someone can help me confirm whether this calculation is correct, thanks!

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology Information

Signed-off-by: Zhao Liu
---
 target/i386/cpu.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 96ef96860604..d691c02e3c06 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -355,7 +355,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) { -uint32_t l3_threads; +uint32_t sharing_apic_ids; assert(cache->size == cache->line_size * cache->associativity * cache->partitions * cache->sets); @@ -364,13 +364,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache, /* L3 is shared among multiple cores */ if (cache->level == 3) { -l3_threads = topo_info->modules_per_die * - topo_info->cores_per_module * - topo_info->threads_per_core; -*eax |= (l3_threads - 1) << 14; +sharing_apic_ids = 1 << apicid_die_offset(topo_info); } else { -*eax |= ((topo_info->threads_per_core - 1) << 14); +sharing_apic_ids = 1 << apicid_core_offset(topo_info); } +*eax |= (sharing_apic_ids - 1) << 14; assert(cache->line_size > 0); assert(cache->partitions > 0); -- 2.34.1
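A worked example of the fixed encoding (illustrative topology): with 2 threads per core, apicid_core_offset() is 1, so for a core-scoped L1/L2 cache:

    uint32_t num_sharing = (1u << 1) - 1;    /* field value: 1 */
    /* Guest side, per the APM quoted above:
     * ShareId = LocalApicId >> log2(NumSharingCache + 1) */
    uint32_t share_id = local_apic_id >> 1;  /* groups SMT siblings */

Both SMT siblings of a core get the same ShareId, which is exactly the intended "shared in one core" semantics.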
[PATCH 08/18] i386: Support module_id in X86CPUTopoIDs
From: Zhuocheng Ding

Add a module_id member to X86CPUTopoIDs. module_id can be parsed from the APIC ID, so before updating the parsing rule of the APIC ID, temporarily set the module_id generated in this way to 0. module_id can also be generated from the cpu topology, and before i386 supports "clusters" in smp, the default "clusters per die" is only 1, so the module_id generated in this way is 0 and will not conflict with the module_id generated by the APIC ID.

Signed-off-by: Zhuocheng Ding
Co-developed-by: Zhao Liu
Signed-off-by: Zhao Liu
---
 hw/i386/x86.c              | 17 +
 include/hw/i386/topology.h | 24 
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 3f6e0dd7b827..b66719230d57 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -361,6 +361,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, topo_ids.die_id = cpu->die_id; topo_ids.core_id = cpu->core_id; topo_ids.smt_id = cpu->thread_id; + +/* + * TODO: This is the temporary initialization for topo_ids.module_id to + * avoid "maybe-uninitialized" compilation errors. Will remove when + * X86CPU supports cluster_id. + */ +topo_ids.module_id = 0; + cpu->apic_id = x86_apicid_from_topo_ids(&topo_info, &topo_ids); } @@ -369,6 +377,11 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, MachineState *ms = MACHINE(x86ms); x86_topo_ids_from_apicid(cpu->apic_id, &topo_info, &topo_ids); + +/* + * TODO: Before APIC ID supports module level parsing, there's no need + * to expose module_id info. + */ error_setg(errp, "Invalid CPU [socket: %u, die: %u, core: %u, thread: %u] with" " APIC ID %" PRIu32 ", valid index range 0:%d", @@ -494,6 +507,10 @@ const CPUArchIdList *x86_possible_cpu_arch_ids(MachineState *ms) ms->possible_cpus->cpus[i].props.has_die_id = true; ms->possible_cpus->cpus[i].props.die_id = topo_ids.die_id; } +if (ms->smp.clusters > 1) { +ms->possible_cpus->cpus[i].props.has_cluster_id = true; +ms->possible_cpus->cpus[i].props.cluster_id = topo_ids.module_id; +} ms->possible_cpus->cpus[i].props.has_core_id = true; ms->possible_cpus->cpus[i].props.core_id = topo_ids.core_id; ms->possible_cpus->cpus[i].props.has_thread_id = true; diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h index bbb00dc4aad8..b0174c18b7bd 100644 --- a/include/hw/i386/topology.h +++ b/include/hw/i386/topology.h @@ -48,6 +48,7 @@ typedef uint32_t apic_id_t; typedef struct X86CPUTopoIDs { unsigned pkg_id; unsigned die_id; +unsigned module_id; unsigned core_id; unsigned smt_id; } X86CPUTopoIDs; @@ -134,12 +135,21 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info, X86CPUTopoIDs *topo_ids) { unsigned nr_dies = topo_info->dies_per_pkg; -unsigned nr_cores = topo_info->cores_per_module * -topo_info->modules_per_die; +unsigned nr_modules = topo_info->modules_per_die; +unsigned nr_cores = topo_info->cores_per_module; unsigned nr_threads = topo_info->threads_per_core; -topo_ids->pkg_id = cpu_index / (nr_dies * nr_cores * nr_threads); -topo_ids->die_id = cpu_index / (nr_cores * nr_threads) % nr_dies; +/* + * Currently smp for i386 doesn't support "clusters", modules_per_die is + * only 1. Therefore, the module_id generated from the module topology will + * not conflict with the module_id generated according to the apicid.
+ */ +topo_ids->pkg_id = cpu_index / (nr_dies * nr_modules * + nr_cores * nr_threads); +topo_ids->die_id = cpu_index / (nr_modules * nr_cores * + nr_threads) % nr_dies; +topo_ids->module_id = cpu_index / (nr_cores * nr_threads) % + nr_modules; topo_ids->core_id = cpu_index / nr_threads % nr_cores; topo_ids->smt_id = cpu_index % nr_threads; } @@ -156,6 +166,12 @@ static inline void x86_topo_ids_from_apicid(apic_id_t apicid, topo_ids->core_id = (apicid >> apicid_core_offset(topo_info)) & ~(0xUL << apicid_core_width(topo_info)); +/* + * TODO: This is the temporary initialization for topo_ids.module_id to + * avoid "maybe-uninitialized" compilation errors. Will remove when APIC + * ID supports module level parsing. + */ +topo_ids->module_id = 0; topo_ids->die_id = (apicid >> apicid_die_offset(topo_info)) & ~(0xUL << apicid_die_width(topo_info)); -- 2.34.1
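A worked example of the decomposition above (illustrative topology: 2 dies, 2 modules, 2 cores, 2 threads, i.e. 16 CPUs per package), for cpu_index = 11:

    pkg_id    = 11 / (2 * 2 * 2 * 2);  /* 0 */
    die_id    = 11 / (2 * 2 * 2) % 2;  /* 1 */
    module_id = 11 / (2 * 2) % 2;      /* 0 */
    core_id   = 11 / 2 % 2;            /* 1 */
    smt_id    = 11 % 2;                /* 1 */

i.e. [pkg 0, die 1, module 0, core 1, thread 1].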
[PATCH 12/18] tests: Add test case of APIC ID for module level parsing
From: Zhuocheng Ding

Now that i386 supports the module level, add a test for module-level APIC ID parsing.

Signed-off-by: Zhuocheng Ding
Co-developed-by: Zhao Liu
Signed-off-by: Zhao Liu
---
 tests/unit/test-x86-apicid.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/tests/unit/test-x86-apicid.c b/tests/unit/test-x86-apicid.c index f21b8a5d95c2..55b731ccae55 100644 --- a/tests/unit/test-x86-apicid.c +++ b/tests/unit/test-x86-apicid.c @@ -37,6 +37,7 @@ static void test_topo_bits(void) topo_info = (X86CPUTopoInfo) {1, 1, 1, 1}; g_assert_cmpuint(apicid_smt_width(&topo_info), ==, 0); g_assert_cmpuint(apicid_core_width(&topo_info), ==, 0); +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 0); g_assert_cmpuint(apicid_die_width(&topo_info), ==, 0); topo_info = (X86CPUTopoInfo) {1, 1, 1, 1}; @@ -74,13 +75,22 @@ static void test_topo_bits(void) topo_info = (X86CPUTopoInfo) {1, 1, 33, 2}; g_assert_cmpuint(apicid_core_width(&topo_info), ==, 6); -topo_info = (X86CPUTopoInfo) {1, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {1, 6, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3); +topo_info = (X86CPUTopoInfo) {1, 7, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3); +topo_info = (X86CPUTopoInfo) {1, 8, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3); +topo_info = (X86CPUTopoInfo) {1, 9, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 4); + +topo_info = (X86CPUTopoInfo) {1, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 0); -topo_info = (X86CPUTopoInfo) {2, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {2, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 1); -topo_info = (X86CPUTopoInfo) {3, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {3, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 2); -topo_info = (X86CPUTopoInfo) {4, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {4, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 2); /* build a weird topology and see if IDs are calculated correctly @@ -91,6 +101,7 @@ static void test_topo_bits(void) topo_info = (X86CPUTopoInfo) {1, 1, 6, 3}; g_assert_cmpuint(apicid_smt_width(&topo_info), ==, 2); g_assert_cmpuint(apicid_core_offset(&topo_info), ==, 2); +g_assert_cmpuint(apicid_module_offset(&topo_info), ==, 5); g_assert_cmpuint(apicid_die_offset(&topo_info), ==, 5); g_assert_cmpuint(apicid_pkg_offset(&topo_info), ==, 5); -- 2.34.1
[PATCH 04/18] i386/cpu: Fix number of addressable IDs in CPUID.04H
From: Zhao Liu

For i-cache and d-cache, the maximum IDs for CPUs sharing cache (CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are both 0, and this means i-cache and d-cache are shared at the SMT level. This is correct if there's a single thread per core, but is wrong for the hyper-threading case (one core contains multiple threads), since the i-cache and d-cache are shared at the core level rather than the SMT level. Therefore, in order to be compatible with both the multi-threaded and single-threaded situations, we should set i-cache and d-cache to be shared at the core level by default.

Referring to the fixes for cache_info_passthrough ([1], [2]) and the SDM, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] should use the nearest power-of-2 integer. The nearest power-of-2 integer can be calculated by pow2ceil() or by using the APIC ID offset (like the L3 topology using 1 << die_offset [3]). But in fact, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] are associated with the APIC ID. For example, in the Linux kernel, the field "num_threads_sharing" (bits 25 - 14) is parsed with the APIC ID. And for another example, on Alder Lake-P, CPUID.04H:EAX[bits 31:26] does not match the actual core count and is calculated by "(1 << (pkg_offset - core_offset)) - 1".

Therefore, the APIC ID offsets should be preferred when calculating the nearest power-of-2 integer for CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26]:

1. The d/i cache is shared in a core, so 1 << core_offset should be used instead of "1" in encode_cache_cpuid4() for CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14].

2. The L2 cache is supposed to be shared in a core as for now, so 1 << core_offset should also be used instead of "cs->nr_threads" in encode_cache_cpuid4() for CPUID.04H.02H:EAX[bits 25:14].

3. Similarly, the value for CPUID.04H:EAX[bits 31:26] should also be replaced by the offsets above the SMT level in the APIC ID.

And since [1] and [2] are good enough to make cache_info_passthrough work well, its pow2ceil() uses suffice, so they don't need to be replaced by the APIC ID offset approach.
[1]: efb3934 (x86: cpu: make sure number of addressable IDs for processor cores meets the spec) [2]: d7caf13 (x86: cpu: fixup number of addressable IDs for logical processors sharing cache) [3]: d65af28 (i386: Update new x86_apicid parsing rules with die_offset support) Fixes: 7e3482f (i386: Helpers to encode cache information consistently) Suggested-by: Robert Hoo Signed-off-by: Zhao Liu --- target/i386/cpu.c | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 29afec12c281..7833505092d8 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -5212,7 +5212,6 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, { X86CPU *cpu = env_archcpu(env); CPUState *cs = env_cpu(env); -uint32_t die_offset; uint32_t limit; uint32_t signature[3]; X86CPUTopoInfo topo_info; @@ -5308,27 +5307,38 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *eax = *ebx = *ecx = *edx = 0; } else { *eax = 0; +int addressable_cores_offset = apicid_pkg_offset(&topo_info) - + apicid_core_offset(&topo_info); +int core_offset, die_offset; + switch (count) { case 0: /* L1 dcache info */ +core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache, -1, cs->nr_cores, +(1 << core_offset), +(1 << addressable_cores_offset), eax, ebx, ecx, edx); break; case 1: /* L1 icache info */ +core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache, -1, cs->nr_cores, +(1 << core_offset), +(1 << addressable_cores_offset), eax, ebx, ecx, edx); break; case 2: /* L2 cache info */ +core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache, -cs->nr_threads, cs->nr_cores, +(1 << core_offset), +(1 << addressable_cores_offset), eax, ebx, ecx, edx); break; case 3: /* L3 cache info */ die_offset = apicid_die_offset(&topo_info); if (cpu->enable_l3_cache) { encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache, -
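A worked example of the new encoding (illustrative: 2 threads per core, 3 cores, 1 die): apicid_core_offset() is 1, the core width is ceil(log2(3)) = 2 bits, so apicid_pkg_offset() is 3 and:

    int addressable_cores_offset = 3 - 1;  /* pkg offset - core offset = 2 */
    /* CPUID.04H:EAX[25:14] = (1 << 1) - 1 = 1 -> 2 logical CPUs (one
     * core's SMT siblings) share the d/i/L2 cache. */
    /* CPUID.04H:EAX[31:26] = (1 << 2) - 1 = 3 -> 4 addressable core IDs,
     * the nearest power of 2 covering the 3 cores. */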
[PATCH 15/18] i386: Use CPUCacheInfo.share_level to encode CPUID[4].EAX[bits 25:14]
From: Zhao Liu

CPUID[4].EAX[bits 25:14] is used to represent the cache topology for Intel CPUs. After cache models have topology information, we can use CPUCacheInfo.share_level to decide which topology level is encoded into CPUID[4].EAX[bits 25:14].

Additionally, since maximum_processor_id (originally "num_apic_ids") is derived from cpu topology levels, which are verified when parsing smp, there's no need to check this value with "assert(num_apic_ids > 0)" again, so remove this assert.

Signed-off-by: Zhao Liu
---
 target/i386/cpu.c | 55 +++
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 364534e84b1b..96ef96860604 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -231,22 +231,50 @@ static uint8_t cpuid2_cache_descriptor(CPUCacheInfo *cache) ((t) == UNIFIED_CACHE) ? CACHE_TYPE_UNIFIED : \ 0 /* Invalid value */) +static uint32_t max_processor_ids_for_cache(CPUCacheInfo *cache, +X86CPUTopoInfo *topo_info) +{ +uint32_t num_ids = 0; + +switch (cache->share_level) { +case CORE: +num_ids = 1 << apicid_core_offset(topo_info); +break; +case DIE: +num_ids = 1 << apicid_die_offset(topo_info); +break; +default: +/* + * Currently there is no use case for SMT, MODULE and PACKAGE, so use + * assert directly to facilitate debugging. + */ +g_assert_not_reached(); +} + +return num_ids - 1; +} + +static uint32_t max_core_ids_in_package(X86CPUTopoInfo *topo_info) +{ +uint32_t num_cores = 1 << (apicid_pkg_offset(topo_info) - + apicid_core_offset(topo_info)); +return num_cores - 1; +} /* Encode cache info for CPUID[4] */ static void encode_cache_cpuid4(CPUCacheInfo *cache, -int num_apic_ids, int num_cores, +X86CPUTopoInfo *topo_info, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) { assert(cache->size == cache->line_size * cache->associativity * cache->partitions * cache->sets); -assert(num_apic_ids > 0); *eax = CACHE_TYPE(cache->type) | CACHE_LEVEL(cache->level) | (cache->self_init ? CACHE_SELF_INIT_LEVEL : 0) | - ((num_cores - 1) << 26) | - ((num_apic_ids - 1) << 14); + (max_core_ids_in_package(topo_info) << 26) | + (max_processor_ids_for_cache(cache, topo_info) << 14); assert(cache->line_size > 0); assert(cache->partitions > 0); @@ -5335,38 +5363,27 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *eax = *ebx = *ecx = *edx = 0; } else { *eax = 0; -int addressable_cores_offset = apicid_pkg_offset(&topo_info) - - apicid_core_offset(&topo_info); -int core_offset, die_offset; switch (count) { case 0: /* L1 dcache info */ -core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache, -(1 << core_offset), -(1 << addressable_cores_offset), +&topo_info, eax, ebx, ecx, edx); break; case 1: /* L1 icache info */ -core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache, -(1 << core_offset), -(1 << addressable_cores_offset), +&topo_info, eax, ebx, ecx, edx); break; case 2: /* L2 cache info */ -core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache, -(1 << core_offset), -(1 << addressable_cores_offset), +&topo_info, eax, ebx, ecx, edx); break; case 3: /* L3 cache info */ -die_offset = apicid_die_offset(&topo_info); if (cpu->enable_l3_cache) { encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache, -(1 << die_offset), -(1 << addressable_cores_offset), +&topo_info, eax, ebx, ecx, edx); break; } -- 2.34.1
Re: [PATCH] vhost-user-fs: add capability to allow migration
Anton Kuchin wrote:
> On 01/02/2023 16:26, Juan Quintela wrote:
>> Anton Kuchin wrote:
>>> On 19/01/2023 18:02, Stefan Hajnoczi wrote:
On Thu, 19 Jan 2023 at 10:29, Anton Kuchin wrote:
> On 19/01/2023 16:30, Stefan Hajnoczi wrote:
>> On Thu, 19 Jan 2023 at 07:43, Anton Kuchin wrote:
>>> On 18/01/2023 17:52, Stefan Hajnoczi wrote:
On Sun, 15 Jan 2023 at 12:21, Anton Kuchin wrote:

>> Once told that, I think that you are making your life harder in the
>> future when you add the other migratable devices.
>>
>> I am assuming here that your "underlying device" is:
>>
>> enum VhostUserFSType {
>>     VHOST_USER_NO_MIGRATABLE = 0,
>>     // The one we are doing here
>>     VHOST_USER_EXTERNAL_MIGRATABLE,
>>     // The one you describe for the future
>>     VHOST_USER_INTERNAL_MIGRATABLE,
>> };
>>
>> struct VHostUserFS {
>>     /*< private >*/
>>     VirtIODevice parent;
>>     VHostUserFSConf conf;
>>     struct vhost_virtqueue *vhost_vqs;
>>     struct vhost_dev vhost_dev;
>>     VhostUserState vhost_user;
>>     VirtQueue **req_vqs;
>>     VirtQueue *hiprio_vq;
>>     int32_t bootindex;
>>     enum VhostUserFSType migration_type;
>>     /*< public >*/
>> };
>>
>> static int vhost_user_fs_pre_save(void *opaque)
>> {
>>     VHostUserFS *s = opaque;
>>
>>     if (s->migration_type == VHOST_USER_NO_MIGRATABLE) {
>>         error_report("Migration of vhost-user-fs devices requires internal FUSE "
>>                      "state of backend to be preserved. If orchestrator can "
>>                      "guarantee this (e.g. dst connects to the same backend "
>>                      "instance or backend state is migrated) set 'vhost-user-fs' "
>>                      "migration capability to true to enable migration.");
>>         return -1;
>>     }
>>     if (s->migration_type == VHOST_USER_EXTERNAL_MIGRATABLE) {
>>         return 0;
>>     }
>>     if (s->migration_type == VHOST_USER_INTERNAL_MIGRATABLE) {
>>         error_report("still not implemented");
>>         return -1;
>>     }
>>     assert("we don't reach here");
>> }
>>
>> Your initial VMStateDescription:
>>
>> static const VMStateDescription vuf_vmstate = {
>>     .name = "vhost-user-fs",
>>     .unmigratable = 1,
>>     .minimum_version_id = 0,
>>     .version_id = 0,
>>     .fields = (VMStateField[]) {
>>         VMSTATE_INT8(migration_type, struct VHostUserFS),
>>         VMSTATE_VIRTIO_DEVICE,
>>         VMSTATE_END_OF_LIST()
>>     },
>>     .pre_save = vhost_user_fs_pre_save,
>> };
>>
>> And later you change it to something like:
>>
>> static bool vhost_fs_user_internal_state_needed(void *opaque)
>> {
>>     VHostUserFS *s = opaque;
>>
>>     return s->migration_type == VHOST_USER_INTERNAL_MIGRATABLE;
>> }
>>
>> static const VMStateDescription vmstate_vhost_user_fs_internal_sub = {
>>     .name = "vhost-user-fs/internal",
>>     .version_id = 1,
>>     .minimum_version_id = 1,
>>     .needed = &vhost_fs_user_internal_state_needed,
>>     .fields = (VMStateField[]) {
>>         // Whatever
>>         VMSTATE_END_OF_LIST()
>>     }
>> };
>>
>> static const VMStateDescription vuf_vmstate = {
>>     .name = "vhost-user-fs",
>>     .minimum_version_id = 0,
>>     .version_id = 0,
>>     .fields = (VMStateField[]) {
>>         VMSTATE_INT8(migration_type, struct VHostUserFS),
>>         VMSTATE_VIRTIO_DEVICE,
>>         VMSTATE_END_OF_LIST()
>>     },
>>     .pre_save = vhost_user_fs_pre_save,
>>     .subsections = (const VMStateDescription*[]) {
>>         &vmstate_vhost_user_fs_internal_sub,
>>         NULL
>>     }
>> };
>>
>> And you are done.
>>
>> I will propose to use a property to set migration_type, but I didn't
>> want to write the code right now.
>>
>> I think that this proposal will make Stefan happy, and it is just
>> adding an extra uint8_t that is helpful to implement everything.
>
> That is exactly the approach I'm trying to implement right now.
> A single flag can be convenient for the orchestrator but makes it too
> hard to account for in all cases for all devices on the qemu side
> without breaking future compatibility. So I'm rewriting it with
> properties.

Nice. That would be my proposal. Just a bit complicated for a proof of concept.

> BTW do you think each vhost-user device should have its own enum of
> migration types or maybe we could make them common for all device types?

I will add it for vhost-user, because as far as I know nobody else is asking for this functionality. The most similar device that I can think of right now is vfio devices. But they are implementing callbacks to save hardware device state, and they go device by device, i.e. there is nothing general there.

Later, Juan.
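For the property-based approach being settled on here, a minimal sketch of what the device property could look like (hypothetical property name, field and default; none of this is committed API):

    /* Sketch: let the orchestrator declare how backend state is handled. */
    static Property vuf_properties[] = {
        DEFINE_PROP_UINT8("migration-type", VHostUserFS, migration_type,
                          VHOST_USER_MIGRATION_TYPE_NONE),
        DEFINE_PROP_END_OF_LIST(),
    };

The pre_save hook can then reject migration unless the property says the backend state is preserved externally (or, later, migrated internally).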
Re: [PATCH v3 2/3] hvf: implement guest debugging on Apple Silicon hosts
> Are you running the Linux guest on multiple cores? If yes, could you
> check if the issue persists also when using a single core?

Yes, I was running with 2 cores. I just tested with 1 and 2 cores several times. I haven't reproduced the bug with 1 core, but it happens around 70% of the time with 2 cores.

— Mads Ynddal
Re: [PULL 0/7] Python patches
On Wed, 25 Jan 2023 at 02:34, John Snow wrote: > > The following changes since commit 13356edb87506c148b163b8c7eb0695647d00c2a: > > Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into > staging (2023-01-24 09:45:33 +) > > are available in the Git repository at: > > https://gitlab.com/jsnow/qemu.git tags/python-pull-request > > for you to fetch changes up to bd4c0ef409140bd1be393407c04005ac077d4574: > > python/qemu/machine: use socketpair() for QMP by default (2023-01-24 > 13:37:13 -0500) > > > Python > > Bits and pieces, kibbles'n'bits > > Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/8.0 for any user-visible changes. -- PMM
Re: [PATCH v3] migration: Remove res_compatible parameter
Vladimir Sementsov-Ogievskiy wrote:
> On 1/30/23 11:06, Juan Quintela wrote:
>> It was only used for RAM, and in that case, it means that this amount
>> of data was sent for memory. Just delete the field in all callers.
>
> Could you describe why it's safe to change the behavior for RAM?

I will start with the change in migration.c:

-    uint64_t pend_pre, pend_compat, pend_post;
+    uint64_t pend_pre, pend_post;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
-    qemu_savevm_state_pending_estimate(&pend_pre, &pend_compat, &pend_post);
-    uint64_t pending_size = pend_pre + pend_compat + pend_post;

Here we add the three to pending_size, so it doesn't matter.

+    qemu_savevm_state_pending_estimate(&pend_pre, &pend_post);
+    uint64_t pending_size = pend_pre + pend_post;
-    trace_migrate_pending_estimate(pending_size,
-                                   pend_pre, pend_compat, pend_post);
+    trace_migrate_pending_estimate(pending_size, pend_pre, pend_post);

A trace; we don't care again.

-    if (pend_pre + pend_compat <= s->threshold_size) {

We do the test on pend_pre + pend_compat. So, the only place where we are using pend_compat alone is in trace messages. I guess we can agree that changing trace messages doesn't matter.

-        qemu_savevm_state_pending_exact(&pend_pre, &pend_compat, &pend_post);
-        pending_size = pend_pre + pend_compat + pend_post;
-        trace_migrate_pending_exact(pending_size,
-                                    pend_pre, pend_compat, pend_post);
+    if (pend_pre <= s->threshold_size) {
+        qemu_savevm_state_pending_exact(&pend_pre, &pend_post);
+        pending_size = pend_pre + pend_post;
+        trace_migrate_pending_exact(pending_size, pend_pre, pend_post);

OK, now go to the ram.c changes. The only change that we do is that if we are on postcopy, we add the remaining size to res_postcopy instead:

 @@ -3448,9 +3444,9 @@ static void ram_state_pending_exact(void *opaque,
     if (migrate_postcopy_ram()) {
         /* We can do postcopy, and all the data is postcopiable */
-        *res_compatible += remaining_size;
+        *res_postcopy += remaining_size;
     } else {
-        *res_precopy_only += remaining_size;
+        *res_precopy += remaining_size;
     }
 }

The only use of res_compatible was when we were on postcopy; nothing else sets it. But when we are on postcopy, we can send those pages through postcopy, so we add them there; and if we are on precopy, res_compatible was already zero. So I think that is ok, i.e. no behaviour change. What do you think?

> Also, I think it would be a lot better to split the change for RAM
> (s/res_compatible/res_postcopy) in one patch, and then drop the
> totally unused res_compatible field in the second patch

Ok, will do for next version.

Later, Juan.
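For reference, the resulting simplified prototypes would look roughly like this (a sketch of the post-patch shape, with res_compatible gone and only the precopy/postcopy pair left; the exact parameter names in the merged version may differ):

    void qemu_savevm_state_pending_estimate(uint64_t *res_precopy,
                                            uint64_t *res_postcopy);
    void qemu_savevm_state_pending_exact(uint64_t *res_precopy,
                                         uint64_t *res_postcopy);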
Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode
On 31.01.23 19:18, Denis V. Lunev wrote:
> On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote:
>> + Den
>>
>> Den, I remember we thought about that, and probably had a solution?
>>
>> Another possible approach to get benefits from both modes is to switch
>> to blocking mode after the first loop of copying. [*]
>>
>> Hmm. Thinking about the proposed solution, it seems that [*] is better.
>> The main reason for "write-blocking" mode is to force convergence of
>> the mirror job. But you lose this if you postpone switching to
>> the moment when mirror becomes READY: we may never reach it.
>>
>> Or, maybe we can selectively skip or block guest writes, to keep some
>> specified level of convergence?
>>
>> Or, we can start in "background" mode, track the speed of convergence
>> (like dirty-delta per minute), and switch to blocking if the speed
>> becomes less than some threshold.
>
> That is absolutely correct. It would be very kind to start with
> asynchronous mirror and switch to the synchronous one after a
> specific amount of loops. This should be somehow measured.
>
> Originally we had thought about switching after one loop
> (when basic background sending is done), but maybe 2 iterations
> (initial + first one) would be better.
>
> Talking about this specific approach and our past experience,
> I should note that reaching the completion point in the
> asynchronous mode was a real pain and slowdown for the guest.
> We intentionally switched to the sync mode at that time
> for that purpose. Thus this patch is a "worst half-way" for
> us.

While the asynchronous mode can take longer of course, the slowdown of the guest is bigger with the synchronous mode, or what am I missing? We have been using asynchronous mode for years (starting migration after all drive-mirror jobs are READY) and AFAIK our users did not complain about jobs not reaching READY or heavy slowdowns in the guest. We had two reports about an assertion failure recently when drive-mirror still had work left at the time the block drives are inactivated by migration.

> Frankly speaking, I would say that this switch could be considered
> NOT a QEMU job and we should just send a notification (event) on the
> completion of each iteration, and management software should
> make the decision to switch from async mode to the sync one.

Unfortunately, our management software is a bit limited in that regard currently, and making listening for events available in the necessary place would take a bit of work. Having the switch command would nearly be enough for us (we'd just switch after READY). But we'd also need that, when the switch happens after READY, all remaining asynchronous operations are finished by the command. Otherwise, the original issue with inactivating block drives while mirror still has work remains. Do those semantics for the switch sound acceptable?

> Under such a condition, we should have a command to perform the mode
> switch. This seems more the QEMU/QAPI way :)

I feel like a switch at an arbitrary time might be hard to do correctly, i.e. to not miss some (implicit) assertion. But I can try and go for this approach if it's more likely to be accepted upstream. So a 'drive-mirror-change' command which takes the 'job-id' and a 'copy-mode' argument? And only allow the switch from 'background' to 'active' (initially)? Or a more generic 'block-job-change'? But that would need a way to take 'JobType'-specific arguments. Is that doable?

(...)
>>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, >>> Error **errp) >>> if (s->in_flight == 0 && cnt == 0) { >>> trace_mirror_before_flush(s); >>> if (!job_is_ready(&s->common.job)) { >>> + if (s->copy_mode == >>> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { >>> + /* >>> + * Pause guest I/O to check if we can switch to >>> active mode. >>> + * To set actively_synced to true below, it is >>> necessary to >>> + * have seen and synced all guest I/O. >>> + */ >>> + s->in_drain = true; >>> + bdrv_drained_begin(bs); >>> + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) { >>> + bdrv_drained_end(bs); >>> + s->in_drain = false; >>> + continue; >>> + } >>> + s->in_active_mode = true; >>> + bdrv_disable_dirty_bitmap(s->dirty_bitmap); >>> + bdrv_drained_end(bs); >>> + s->in_drain = false; >>> + } >> >> I doubt, do we really need the drained section? > > Frankly speaking I do not like this. I'd better would not > rely on the enable/disable of the whole bitmap but encode > mode into MirrorOp and mark blocks dirty only for those > operations for which in_active_mode was set at the > operation start. > Do you mean
Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode
Am 31.01.23 um 18:44 schrieb Vladimir Sementsov-Ogievskiy: >> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, >> Error **errp) >> if (s->in_flight == 0 && cnt == 0) { >> trace_mirror_before_flush(s); >> if (!job_is_ready(&s->common.job)) { >> + if (s->copy_mode == >> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { >> + /* >> + * Pause guest I/O to check if we can switch to >> active mode. >> + * To set actively_synced to true below, it is >> necessary to >> + * have seen and synced all guest I/O. >> + */ >> + s->in_drain = true; >> + bdrv_drained_begin(bs); >> + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) { >> + bdrv_drained_end(bs); >> + s->in_drain = false; >> + continue; >> + } >> + s->in_active_mode = true; >> + bdrv_disable_dirty_bitmap(s->dirty_bitmap); >> + bdrv_drained_end(bs); >> + s->in_drain = false; >> + } > > I doubt, do we really need the drained section? > > Why can't we simply set s->in_active_mode here and don't care? > > I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, as > cnt is zero, we are in context of bs and we don't have yield point. (ok, > we have one in drained_begin(), but what I think we can simply drop > drained section here). > Thank you for the explanation! I wasn't sure if it's really needed and wanted to take it safe (should've mentioned that in the original mail). >> + >> if (mirror_flush(s) < 0) { >> /* Go check s->ret. */ >> continue; >> } >> + >> /* We're out of the streaming phase. From now on, >> if the job >> * is cancelled we will actually complete all >> pending I/O and >> * report completion. This way, block-job-cancel >> will leave >> @@ -1443,7 +1465,7 @@ static int coroutine_fn >> bdrv_mirror_top_do_write(BlockDriverState *bs, >> if (s->job) { >> copy_to_target = s->job->ret >= 0 && >> !job_is_cancelled(&s->job->common.job) && >> - s->job->copy_mode == >> MIRROR_COPY_MODE_WRITE_BLOCKING; >> + s->job->in_active_mode; >> } >> if (copy_to_target) { >> @@ -1494,7 +1516,7 @@ static int coroutine_fn >> bdrv_mirror_top_pwritev(BlockDriverState *bs, >> if (s->job) { >> copy_to_target = s->job->ret >= 0 && >> !job_is_cancelled(&s->job->common.job) && >> - s->job->copy_mode == >> MIRROR_COPY_MODE_WRITE_BLOCKING; >> + s->job->in_active_mode; >> } >> if (copy_to_target) { >> @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job( >> goto fail; >> } >> if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) { >> + s->in_active_mode = true; >> bdrv_disable_dirty_bitmap(s->dirty_bitmap); >> + } else { >> + s->in_active_mode = false; >> } >> ret = block_job_add_bdrv(&s->common, "source", bs, 0, >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 95ac4fa634..2a983ed78d 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -1200,10 +1200,13 @@ >> # addition, data is copied in background just like in >> # @background mode. >> # >> +# @write-blocking-after-ready: starts out in @background mode and >> switches to >> +# @write-blocking when transitioning to >> ready. > > You should add also (Since: 8.0) label > I'll try to remember it next time. >> +# >> # Since: 3.0 >> ## >> { 'enum': 'MirrorCopyMode', >> - 'data': ['background', 'write-blocking'] } >> + 'data': ['background', 'write-blocking', >> 'write-blocking-after-ready'] } >> ## >> # @BlockJobInfo: >
Re: [PULL 5/5] migration: simplify migration_iteration_run()
Vladimir Sementsov-Ogievskiy wrote:
> On 30.01.23 11:03, Juan Quintela wrote:
>> Signed-off-by: Juan Quintela
>> Reviewed-by: Dr. David Alan Gilbert
>> ---
>> migration/migration.c | 24
>> 1 file changed, 12 insertions(+), 12 deletions(-)
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 594a42f085..644c61e91d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3764,23 +3764,23 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>                              pend_pre, pend_compat, pend_post);
>>  }
>> -if (pending_size && pending_size >= s->threshold_size) {
>> -    /* Still a significant amount to transfer */
>> -    if (!in_postcopy && pend_pre <= s->threshold_size &&
>> -        qatomic_read(&s->start_postcopy)) {
>> -        if (postcopy_start(s)) {
>> -            error_report("%s: postcopy failed to start", __func__);
>> -        }
>> -        return MIG_ITERATE_SKIP;
>> -    }
>> -    /* Just another iteration step */
>> -    qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
>> -} else {
>> +if (pending_size < s->threshold_size) {
>
> to keep the logic, formally it should be "if (!pending_size || pending_size <
> s->threshold_size)"...
> Actually, could s->threshold_size be 0 here? Or, worth an assertion
> assert(s->threshold_size) ?

It is weird, but it could:

    bandwidth = (double)transferred / time_spent;
    s->threshold_size = bandwidth * s->parameters.downtime_limit;

That is the only (real) place where we set it, and if the network was
completely down, bandwidth could be zero.

But I think that it is better to explain it the other way around. What
does the current code do?

    if (too much dirty memory to converge)
        do another iteration
    else
        do final iteration

What the new code does is:

    if (low enough memory to converge)
        do final iteration
    do another iteration

So, how do we decide if we converge?

    if pending_size < s->threshold_size

We "could" change it to

    if pending_size <= s->threshold_size

for the zero case. But I think that we would be shooting ourselves in
the foot, because we are having network trouble (the only reason that
s->threshold_size could be zero) and we still have all the device state
to migrate. And once we enter that state, there is no way back: the
guest is stopped on the source and we are not able to send anything
else. So I think it is better this way.

What do you think?

Later, Juan.
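To make the corner case above concrete, the zero-threshold situation can be sketched in a few lines of C (variable names as in the quoted code; this is an illustration of the argument, not the committed hunk):

    /* threshold_size is recomputed from the measured bandwidth: */
    bandwidth = (double)transferred / time_spent;
    s->threshold_size = bandwidth * s->parameters.downtime_limit;

    /*
     * If the network stalled completely, transferred == 0, so
     * threshold_size == 0 and "pending_size < s->threshold_size" can
     * never be true: we keep iterating instead of entering a final
     * phase we could never finish.
     */
    if (pending_size < s->threshold_size) {
        /* converged: stop the guest and do the final iteration */
    }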
Re: [PATCH v2 2/3] util/userfaultfd: Add uffd_open()
Peter Xu wrote: > Add a helper to create the uffd handle. > > Reviewed-by: Philippe Mathieu-Daudé > Signed-off-by: Peter Xu Reviewed-by: Juan Quintela I can get this one through migration tree.
Re: [PATCH v3 6/6] i386: Add new CPU model SapphireRapids
On Fri, 6 Jan 2023 00:38:26 -0800 Lei Wang wrote:

> The new CPU model mostly inherits features from Icelake-Server, while
> adding new features:
> - AMX (Advanced Matrix eXtensions)
> - Bus Lock Debug Exception
> and new instructions:
> - AVX VNNI (Vector Neural Network Instruction):
>    - VPDPBUSD: Multiply and Add Unsigned and Signed Bytes
>    - VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation
>    - VPDPWSSD: Multiply and Add Signed Word Integers
>    - VPDPWSSDS: Multiply and Add Signed Integers with Saturation
> - FP16: Replicates existing AVX512 computational SP (FP32) instructions
>    using FP16 instead of FP32 for ~2X performance gain
> - SERIALIZE: Provide software with a simple way to force the processor to
>    complete all modifications, faster, allowed in all privilege levels and
>    not causing an unconditional VM exit
> - TSX Suspend Load Address Tracking: Allows programmers to choose which
>    memory accesses do not need to be tracked in the TSX read set
> - AVX512_BF16: Vector Neural Network Instructions supporting BFLOAT16
>    inputs and conversion instructions from IEEE single precision

You should mention all new features here, preferably in the format
"feature-flag: expected CPUID feature bits", so the reviewer will be
able to find them in the spec.

Also, you mention that it inherits most of the features from the
Icelake CPU; it would be nice to provide a CPUID diff between a real
Icelake and the new CPU somewhere in the cover letter, to simplify
comparison.

>
> Features may be added in future versions:
> - CET (virtualization support hasn't been merged)
> Instructions may be added in future versions:
> - fast zero-length MOVSB (KVM doesn't support yet)
> - fast short STOSB (KVM doesn't support yet)
> - fast short CMPSB, SCASB (KVM doesn't support yet)
>
> Signed-off-by: Lei Wang
> Reviewed-by: Robert Hoo
> ---
> target/i386/cpu.c | 135 ++
> target/i386/cpu.h | 4 ++
> 2 files changed, 139 insertions(+)
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 946df29a3d..5e1ecd50b3 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -3576,6 +3576,141 @@ static const X86CPUDefinition builtin_x86_defs[] = {
>          { /* end of list */ }
>      }
> },
> +{
> +    .name = "SapphireRapids",
> +    .level = 0x20,
> +    .vendor = CPUID_VENDOR_INTEL,
> +    .family = 6,
> +    .model = 143,
> +    .stepping = 4,
> +    /*
> +     * please keep the ascending order so that we can have a clear view of
> +     * bit position of each feature.
> +     */

Unless you have a way to enforce this comment, I wouldn't expect it to
work in practice.

Also, if you wish to bring more order here, you should start with a
prerequisite patch that does the same to all existing CPU models. That
way one can easily see the difference between Icelake and the new CPU
model. As it is in this patch (with all the unnecessary reordering),
it's hard for a reviewer to see the differences.
I'd suggest to just copy Icelake definition and modify it to suit new cpu model (without reordering all feature bits) > +.features[FEAT_1_EDX] = > +CPUID_FP87 | CPUID_VME | CPUID_DE | CPUID_PSE | CPUID_TSC | > +CPUID_MSR | CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | > +CPUID_SEP | CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | > +CPUID_PAT | CPUID_PSE36 | CPUID_CLFLUSH | CPUID_MMX | CPUID_FXSR > | > +CPUID_SSE | CPUID_SSE2, > +.features[FEAT_1_ECX] = > +CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | CPUID_EXT_SSSE3 | > +CPUID_EXT_FMA | CPUID_EXT_CX16 | CPUID_EXT_PCID | > CPUID_EXT_SSE41 | > +CPUID_EXT_SSE42 | CPUID_EXT_X2APIC | CPUID_EXT_MOVBE | > +CPUID_EXT_POPCNT | CPUID_EXT_TSC_DEADLINE_TIMER | CPUID_EXT_AES | > +CPUID_EXT_XSAVE | CPUID_EXT_AVX | CPUID_EXT_F16C | > CPUID_EXT_RDRAND, > +.features[FEAT_8000_0001_EDX] = > +CPUID_EXT2_SYSCALL | CPUID_EXT2_NX | CPUID_EXT2_PDPE1GB | > +CPUID_EXT2_RDTSCP | CPUID_EXT2_LM, > +.features[FEAT_8000_0001_ECX] = > +CPUID_EXT3_LAHF_LM | CPUID_EXT3_ABM | CPUID_EXT3_3DNOWPREFETCH, > +.features[FEAT_8000_0008_EBX] = > +CPUID_8000_0008_EBX_WBNOINVD, > +.features[FEAT_7_0_EBX] = > +CPUID_7_0_EBX_FSGSBASE | CPUID_7_0_EBX_BMI1 | CPUID_7_0_EBX_HLE | > +CPUID_7_0_EBX_AVX2 | CPUID_7_0_EBX_SMEP | CPUID_7_0_EBX_BMI2 | > +CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID | CPUID_7_0_EBX_RTM | > +CPUID_7_0_EBX_AVX512F | CPUID_7_0_EBX_AVX512DQ | > +CPUID_7_0_EBX_RDSEED | CPUID_7_0_EBX_ADX | CPUID_7_0_EBX_SMAP | > +CPUID_7_0_EBX_AVX512IFMA | CPUID_7_0_EBX_CLFLUSHOPT | > +CPUID_7_0_EBX_CLWB | CPUID_7_0_EBX_AVX512CD | > CPUID_7_0_EBX_SHA_NI | > +CPUID_7_0_EBX_AVX512BW | C
Re: [PATCH v15 01/11] s390x/cpu topology: adding s390 specificities to CPU topology
On 01/02/2023 14.20, Pierre Morel wrote: S390 adds two new SMP levels, drawers and books to the CPU topology. The S390 CPU have specific toplogy features like dedication Nit: s/toplogy/topology/ and polarity to give to the guest indications on the host vCPUs scheduling and help the guest take the best decisions on the scheduling of threads on the vCPUs. Let us provide the SMP properties with books and drawers levels and S390 CPU with dedication and polarity, Signed-off-by: Pierre Morel --- Apart from the nit: Reviewed-by: Thomas Huth
Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs
Hi Eric,

On 3/7/20 17:53, Peter Maydell wrote:
From: Eric Auger

At the moment the virtio-iommu translates MSI transactions. This
behavior is inherited from ARM SMMU. The virt machine code knows where
the guest MSI doorbells are, so we can easily declare those regions as
VIRTIO_IOMMU_RESV_MEM_T_MSI. With that setting the guest will not map
MSIs through the IOMMU and those transactions will be simply bypassed.

Depending on which MSI controller is in use (ITS or GICV2M), we declare
either:
- the ITS interrupt translation space (ITS_base + 0x10000), containing
  the GITS_TRANSLATER, or
- the GICV2M single frame, containing the MSI_SETSPI_NS register.

Signed-off-by: Eric Auger
Message-id: 20200629070404.10969-6-eric.au...@redhat.com
Reviewed-by: Peter Maydell
Signed-off-by: Peter Maydell
---
include/hw/arm/virt.h | 7 +++
hw/arm/virt.c | 30 ++
2 files changed, 37 insertions(+)

static void create_gic(VirtMachineState *vms)
@@ -2198,8 +2200,36 @@ out:
static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
                                            DeviceState *dev, Error **errp)
{
+    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
+
     if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
         virt_memory_pre_plug(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
+        hwaddr db_start = 0, db_end = 0;
+        char *resv_prop_str;
+
+        switch (vms->msi_controller) {
+        case VIRT_MSI_CTRL_NONE:
+            return;
+        case VIRT_MSI_CTRL_ITS:
+            /* GITS_TRANSLATER page */
+            db_start = base_memmap[VIRT_GIC_ITS].base + 0x10000;
+            db_end = base_memmap[VIRT_GIC_ITS].base +
+                     base_memmap[VIRT_GIC_ITS].size - 1;
+            break;
+        case VIRT_MSI_CTRL_GICV2M:
+            /* MSI_SETSPI_NS page */
+            db_start = base_memmap[VIRT_GIC_V2M].base;
+            db_end = db_start + base_memmap[VIRT_GIC_V2M].size - 1;
+            break;
+        }
+        resv_prop_str = g_strdup_printf("0x%"PRIx64":0x%"PRIx64":%u",
+                                        db_start, db_end,
+                                        VIRTIO_IOMMU_RESV_MEM_T_MSI);
+
+        qdev_prop_set_uint32(dev, "len-reserved-regions", 1);

Where is "len-reserved-regions" declared? Since qdev_prop_set_uint32()
uses &error_abort, isn't this call aborting the process?

I am confused how this code path is exercised; what am I missing?

+        qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str);
+        g_free(resv_prop_str);
     }
}
Re: [PULL 00/35] Testing, docs, semihosting and plugin updates
On Wed, 1 Feb 2023 at 18:07, Alex Bennée wrote:
> Peter Maydell writes:
> > I think this is "you can't put labels in qemu-options.hx,
> > because it gets included in two .rst files (invocation.rst
> > and qemu-manpage.rst), and Sphinx complains about the
> > duplicate labels, even though one of the two files is
> > only used in the HTML and one is only used in the manpages".
>
> Oh boo - any way to work around that, because they are helpful links?

Nothing easy. The problem is that Sphinx looks at every .rst file in
the source directory, regardless of whether it's reachable from the
document you specify as the root of the manual or not. So both lots of
.rst files get processed for both the HTML manual set and the manpages,
even though they don't need to be[*]. This is a long-standing design
deficiency in Sphinx.

The only thing I could think of was splitting the manpages and html
docs entirely into separate subdirectories, and having meson symlink
the files which are actually shared between them. But that seems like
quite a lot of extra machinery.

[*] This shows up for instance in the HTML docs getting a
not-linked-to-from-anywhere HTML version of the qemu(1) manpage:
https://www.qemu.org/docs/master/system/qemu-manpage.html

-- PMM
Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs
On 2/2/23 11:47, Philippe Mathieu-Daudé wrote: Hi Eric, On 3/7/20 17:53, Peter Maydell wrote: From: Eric Auger At the moment the virtio-iommu translates MSI transactions. This behavior is inherited from ARM SMMU. The virt machine code knows where the guest MSI doorbells are so we can easily declare those regions as VIRTIO_IOMMU_RESV_MEM_T_MSI. With that setting the guest will not map MSIs through the IOMMU and those transactions will be simply bypassed. Depending on which MSI controller is in use (ITS or GICV2M), we declare either: - the ITS interrupt translation space (ITS_base + 0x1), containing the GITS_TRANSLATOR or - The GICV2M single frame, containing the MSI_SETSP_NS register. Signed-off-by: Eric Auger Message-id: 20200629070404.10969-6-eric.au...@redhat.com Reviewed-by: Peter Maydell Signed-off-by: Peter Maydell --- include/hw/arm/virt.h | 7 +++ hw/arm/virt.c | 30 ++ 2 files changed, 37 insertions(+) static void create_gic(VirtMachineState *vms) @@ -2198,8 +2200,36 @@ out: static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev, Error **errp) { + VirtMachineState *vms = VIRT_MACHINE(hotplug_dev); + if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { virt_memory_pre_plug(hotplug_dev, dev, errp); + } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) { + hwaddr db_start = 0, db_end = 0; + char *resv_prop_str; + + switch (vms->msi_controller) { + case VIRT_MSI_CTRL_NONE: + return; + case VIRT_MSI_CTRL_ITS: + /* GITS_TRANSLATER page */ + db_start = base_memmap[VIRT_GIC_ITS].base + 0x1; + db_end = base_memmap[VIRT_GIC_ITS].base + + base_memmap[VIRT_GIC_ITS].size - 1; + break; + case VIRT_MSI_CTRL_GICV2M: + /* MSI_SETSPI_NS page */ + db_start = base_memmap[VIRT_GIC_V2M].base; + db_end = db_start + base_memmap[VIRT_GIC_V2M].size - 1; + break; + } + resv_prop_str = g_strdup_printf("0x%"PRIx64":0x%"PRIx64":%u", + db_start, db_end, + VIRTIO_IOMMU_RESV_MEM_T_MSI); + + qdev_prop_set_uint32(dev, "len-reserved-regions", 1); Where is "len-reserved-regions" declared? Since qdev_prop_set_uint32() uses &error_abort, isn't this call aborting the process? I am confused how this code path is exercised, what am I missing? The call path is: qdev_prop_set_uint32 -> object_property_set_int -> object_property_set_qobject -> object_property_set -> object_property_find_err So QEMU should abort displaying: "Property 'virtio-iommu-pci.len-reserved-regions' not found". + qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str); + g_free(resv_prop_str); } }
Re: [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd
Peter Xu wrote:
> Teach QEMU to use /dev/userfaultfd when it exists and fall back to the
> system call if either it's not there or doesn't have enough permission.
>
> Firstly, as long as the app has permission to access /dev/userfaultfd, it
> always has the ability to trap kernel faults, which QEMU mostly wants.
> Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
> forbidden, so it can be the major way to use postcopy in a restricted
> environment with strict seccomp setup.
>
> Reviewed-by: Philippe Mathieu-Daudé
> Signed-off-by: Peter Xu

Hi,

Can we change this code to not use the global variable?

> ---
> util/trace-events | 1 +
> util/userfaultfd.c | 37 +
> 2 files changed, 38 insertions(+)
>
> diff --git a/util/trace-events b/util/trace-events
> index c8f53d7d9f..16f78d8fe5 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_siz
> qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
>
> #userfaultfd.c
> +uffd_detect_open_mode(int mode) "%d"
> uffd_query_features_nosys(int err) "errno: %i"
> uffd_query_features_api_failed(int err) "errno: %i"
> uffd_create_fd_nosys(int err) "errno: %i"
> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> index 9845a2ec81..7dceab51d6 100644
> --- a/util/userfaultfd.c
> +++ b/util/userfaultfd.c
> @@ -18,10 +18,47 @@
> #include
> #include
> #include
> +#include
> +
> +typedef enum {
> +    UFFD_UNINITIALIZED = 0,
> +    UFFD_USE_DEV_PATH,
> +    UFFD_USE_SYSCALL,
> +} uffd_open_mode;
> +
> +static int uffd_dev;
> +
> +static uffd_open_mode uffd_detect_open_mode(void)
> +{
> +    static uffd_open_mode open_mode;
> +
> +    if (open_mode == UFFD_UNINITIALIZED) {
> +        /*
> +         * Make /dev/userfaultfd the default approach because it has better
> +         * permission controls, meanwhile allows kernel faults without any
> +         * privilege requirement (e.g. SYS_CAP_PTRACE).
> +         */
> +        uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> +        if (uffd_dev >= 0) {
> +            open_mode = UFFD_USE_DEV_PATH;
> +        } else {
> +            /* Fallback to the system call */
> +            open_mode = UFFD_USE_SYSCALL;
> +        }
> +        trace_uffd_detect_open_mode(open_mode);
> +    }
> +
> +    return open_mode;
> +}
>
> int uffd_open(int flags)
> {
> #if defined(__linux__) && defined(__NR_userfaultfd)
> +    if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) {
> +        assert(uffd_dev >= 0);
> +        return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags);
> +    }
> +
>     return syscall(__NR_userfaultfd, flags);
> #else
>     return -EINVAL;

static int open_userfaultd(void)
{
    /*
     * Make /dev/userfaultfd the default approach because it has better
     * permission controls, meanwhile allows kernel faults without any
     * privilege requirement (e.g. SYS_CAP_PTRACE).
     */
    int uffd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);

    if (uffd >= 0) {
        return uffd;
    }
    return -1;
}

int uffd_open(int flags)
{
#if defined(__linux__) && defined(__NR_userfaultfd)
    static int uffd = -2;

    if (uffd == -2) {
        uffd = open_userfaultd();
    }
    if (uffd >= 0) {
        return ioctl(uffd, USERFAULTFD_IOC_NEW, flags);
    }
    return syscall(__NR_userfaultfd, flags);
#else
    return -EINVAL;

27 lines vs 42.
No need for the enum type.
No need for the global variable.

What do you think?

Later, Juan.
Re: [PATCH v2 1/3] linux-headers: Update to v6.1
Peter Xu wrote:
> Signed-off-by: Peter Xu

How does this change get into the tree? I know that it is
"automagically" generated, but who decides when it goes into the tree,
as we need it for the following patch?

Later, Juan.
Re: [PATCH] accel/tcg: Complete cpu initialization before registration
On Wed, 1 Feb 2023 at 20:37, Richard Henderson wrote: > > On 2/1/23 04:20, Eric Auger wrote: > > What I fail to understand is why this code is called with a kvm > > accelerated qemu (the test runs by default with kvm). > ... > > #2 0x02aaab1500f0 in vmsa_ttbr_write > > (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at > > ../target/arm/helper.c:3784 > > #3 0x02aaab14e5a8 in write_raw_cp_reg > > (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0, > > v=v@entry=2154950976315703518) > > This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" > .writefn, > not the .raw_writefn. We're not supposed to arrive here at all. If you only provide a cooked .writefn and no .raw_writefn, the default is to assume that the cooked function will also work as the raw one. None of the ARMCPRegInfo structs that use vmsa_ttbr_write specify a raw_writefn... thanks -- PMM
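The direction this points to follows from Peter's explanation: give such registers an explicit raw accessor so migration's raw writes do not go through the cooked hook. A sketch of the shape only (other fields abbreviated; raw_write is the plain field store already used elsewhere in target/arm/helper.c):

    { .name = "TTBR0_EL1", /* ... other fields ... */
      .writefn = vmsa_ttbr_write,   /* cooked: TLB-flush side effects */
      .raw_writefn = raw_write },   /* raw: just stores the field */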
Re: Display update issue on M1 Macs
On Tue, 31 Jan 2023, BALATON Zoltan wrote:

On Tue, 31 Jan 2023, Akihiko Odaki wrote:
[...]

To summarise the previous discussion:

- There's a problem on Apple M1 Macs with the sm501 and ati-vga 2d
accel functions drawing from the device model into the video memory of
the emulated card, which is not shown on screen when the display update
callback is called from another thread. This works on an x86_64 host,
so I suspect it may be related to missing memory synchronisation that
ARM may need.

- This can be reproduced running AmigaOS4 on sam460ex or MorphOS (demo
iso downloadable from their web site) on sam460ex, pegasos2 or
mac99,via=pmu with -device ati-vga,romfile="" as described here:
http://zero.eik.bme.hu/~balaton/qemu/amiga/

- I can't test it myself, lacking the hardware, so I have to rely on
reports from people who have this hardware, so there may be some
uncertainty in the info I get.

- We have confirmed it's not related to a known race condition, as
disabling dirty tracking and always doing full updates of the whole
screen did not fix it:

But there is an exception: memory_region_snapshot_and_clear_dirty()
releases the iothread lock, and that broke the raspi3b display device:
https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/

It is unexpected that the gfx_update() callback releases the iothread
lock, so it may break things in peculiar ways. Peter, is there any
change in the situation regarding the race introduced by
memory_region_snapshot_and_clear_dirty()?

For now, to work around the issue, I think you can create another mutex
and make the entire sm501_2d_engine_write() and sm501_update_display()
critical sections.

Interesting thread, but I'm not sure it's the same problem, so this
workaround may not be enough to fix my issue. Here's a video posted by
one of the people who reported it, showing the problem on an M1 Mac:

https://www.youtube.com/watch?v=FDqoNbp6PQs

and here's how it looks on other machines:

https://www.youtube.com/watch?v=ML7-F4HNFKQ

There are also videos showing it running on RPi 4 and G5 Mac without
this issue, so it seems to only happen on Apple Silicon M1 Macs. What's
strange is that the graphics elements are not just delayed, which is
what I would expect with missing thread synchronisation, where the
update callback would miss some pixels rendered while it is running but
subsequent update callbacks would eventually draw those, wouldn't they?
Also, setting full_update to 1 in the sm501_update_display() callback
to disable dirty tracking does not fix the problem. So it looks as if
sm501_2d_operation() running on one CPU core only writes data to the
local cache of that core, which sm501_update_display() running on
another core can't see. So maybe some cache synchronisation is needed
in memory_region_set_dirty(), or if that's already there, maybe I
should call it for all changes, not only those in the visible display
area? I'm still not sure I understand the problem and don't know what
could be a fix for it, so anything to test to identify the issue better
might also bring us closer to a solution.

If you set full_update to 1, you may also comment out
memory_region_snapshot_and_clear_dirty() and
memory_region_snapshot_get_dirty() to avoid the iothread mutex being
unlocked. The iothread mutex should ensure cache coherency as well. But
as you say, it's weird that the rendered result is not just delayed but
missed. That may imply other possibilities (e.g., the results are
overwritten by someone else).
If the problem persists after commenting out
memory_region_snapshot_and_clear_dirty() and
memory_region_snapshot_get_dirty(), I think you can assume the
inter-thread coherency between sm501_2d_operation() and
sm501_update_display() is not causing the problem.

I've asked people who reported it and can reproduce it to test this,
but it did not change anything, so that confirmed it's not that race
condition; it looks more like some cache inconsistency, maybe. Any
other ideas?

I can come up with two important differences between x86 and Arm which
can affect the execution of QEMU:
1. Memory model. Arm uses a memory model more relaxed than x86, so it
is more sensitive to synchronization failures among threads.
2. Different instructions. TCG uses JIT, so differences in instructions
matter.

We should be able to exclude 1) as a potential cause of the problem.
The iothread mutex should take care of race conditions and even cache
coherency problems; a mutex includes memory barrier functionality.

[...]

For difference 2), you may try to use TCI. You can find details of TCI
in tcg/tci/README.

This was tested, and TCI gave the same results, just much slower.

Common sense suggests, however, that the memory model is usually the
cause of the problem when you see behavioral differences between x86
and Arm, and TCG should work fine with both x86 and Arm as they should
have been tested well.

[...]

Fortunately macOS provides Rosetta 2 for x86 emulation.
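For reference, the extra-mutex workaround suggested earlier in the thread could look roughly like the sketch below. The lock name and exact placement are illustrative only; the real functions sm501_2d_operation() and sm501_update_display() live in hw/display/sm501.c, and the mutex would need a qemu_mutex_init() call at device init:

    static QemuMutex sm501_2d_lock;  /* protects VRAM drawing vs. display */

    /* in sm501_2d_engine_write(), around the drawing operation: */
    qemu_mutex_lock(&sm501_2d_lock);
    sm501_2d_operation(s);           /* draws into the emulated VRAM */
    qemu_mutex_unlock(&sm501_2d_lock);

    /* in sm501_update_display(), around reading the VRAM: */
    qemu_mutex_lock(&sm501_2d_lock);
    /* ... scan and redraw the dirty scanlines ... */
    qemu_mutex_unlock(&sm501_2d_lock);

Besides serializing the two paths, taking and releasing the mutex also implies the memory barriers that would paper over a pure cache-visibility problem, which is what makes it a useful diagnostic even if it is not the final fix.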
Re: [PATCH v5 04/20] scripts/clean-includes: Improve --git commit message
Markus Armbruster wrote: > It's less terse. Fine with me. The mix of passive and active voice > feels a bit awkward, though. Another try: > > All .c should include qemu/osdep.h first. This script performs three > related cleanups: > > * Ensure .c files include qemu/osdep.h first. > * Including it in a .h is redundant, since the .c already includes > it. Drop such inclusions. > * Likewise, including headers qemu/osdep.h includes is redundant. > Drop these, too. Perfect, thanks. Reviewed-by: Juan Quintela or whatever you want it O:-)
Re: An issue with x86 tcg and MMIO
On 2/1/23 23:39, Jonathan Cameron wrote: Not sure - if we can do the handling above then sure we could make that change. I can see there is a path to register the callbacks but I'd kind of assumed ROM meant read only... I think "romd" means "read mostly". In the case of flash, I believe that a write changes modes (block erase something something) and the region changes state into MMIO. But normal state is read mode where read+execute go through unchallenged. It has been a long time since I studied how all that works, so I may well have forgotten something. r~
Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs
On Thu, 2 Feb 2023 at 10:47, Philippe Mathieu-Daudé wrote:
> Where is "len-reserved-regions" declared?

DEFINE_PROP_ARRAY("reserved-regions", ...) does this. For an array
property "foo" the machinery creates an integer property "len-foo",
which must be set first. Setting that then creates properties "foo[0]",
"foo[1]", ... which can be set.

thanks
-- PMM
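On the declaring side, the device lists the array once and the property machinery synthesizes the "len-" property from it. A sketch of what such a declaration looks like (the struct field names here are assumptions for illustration; qdev_prop_reserved_region and ReservedRegion are the real element property and type):

    static Property virtio_iommu_pci_properties[] = {
        /* Creates "len-reserved-regions" plus "reserved-regions[N]". */
        DEFINE_PROP_ARRAY("reserved-regions", VirtIOIOMMUPCI,
                          nr_resv_regions,   /* assumed length field */
                          resv_regions,      /* assumed array field  */
                          qdev_prop_reserved_region, ReservedRegion),
        DEFINE_PROP_END_OF_LIST(),
    };

This is why the quoted virt code works: qdev_prop_set_uint32(dev, "len-reserved-regions", 1) instantiates "reserved-regions[0]", which the following qdev_prop_set_string() call can then fill in.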
Re: [RFC v5 1/3] rcu: introduce rcu_read_is_locked()
Chuang Xu wrote:
> add rcu_read_is_locked() to detect holding of the rcu lock.
>
> Signed-off-by: Chuang Xu

Reviewed-by: Juan Quintela

Although I think that getting a review from Paolo or anyone that knows
more about RCU would be a good idea.

> ---
> include/qemu/rcu.h | 7 +++
> 1 file changed, 7 insertions(+)
>
> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
> index b063c6fde8..719916d9d3 100644
> --- a/include/qemu/rcu.h
> +++ b/include/qemu/rcu.h
> @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
>     }
> }
>
> +static inline bool rcu_read_is_locked(void)
> +{
> +    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
> +
> +    return p_rcu_reader->depth > 0;
> +}
> +
> extern void synchronize_rcu(void);
>
> /*
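As a usage note: per the series cover letter, this helper backs a sanity check in address_space_to_flatview(). Something along these lines (a sketch of the intent, not the exact hunk from the series):

    static inline FlatView *address_space_to_flatview(AddressSpace *as)
    {
        /*
         * The returned FlatView is only stable while the caller holds
         * the RCU read lock (or the BQL, which blocks updaters).
         */
        assert(rcu_read_is_locked() || qemu_mutex_iothread_locked());
        return qatomic_rcu_read(&as->current_map);
    }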
Re: [RFC v5 3/3] migration: reduce time of loading non-iterable vmstate
Chuang Xu wrote:
> The duration of loading non-iterable vmstate accounts for a significant
> portion of downtime (starting with the timestamp of source qemu stop and
> ending with the timestamp of target qemu start). Most of the time is spent
> committing memory region changes repeatedly.
>
> This patch packs all the changes to memory region during the period of
> loading non-iterable vmstate in a single memory transaction. With the
> increase of devices, this patch will greatly improve the performance.
>
> Here are the test1 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 16-queue vhost-net device
>   - 16 4-queue vhost-user-blk device.
>
>           time of loading non-iterable vmstate    downtime
> before    about 150 ms                            740+ ms
> after     about 30 ms                             630+ ms
>
> In test2, we keep the number of devices the same as test1, and reduce
> the number of queues per device:
>
> Here are the test2 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 1-queue vhost-net device
>   - 16 1-queue vhost-user-blk device.
>
>           time of loading non-iterable vmstate    downtime
> before    about 90 ms                             about 250 ms
> after     about 25 ms                             about 160 ms
>
> In test3, we keep the number of queues per device the same as test1, and
> reduce the number of devices:
>
> Here are the test3 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 1 16-queue vhost-net device
>   - 1 4-queue vhost-user-blk device.
>
>           time of loading non-iterable vmstate    downtime
> before    about 20 ms                             about 70 ms
> after     about 11 ms                             about 60 ms
>
> As we can see from the test results above, both the number of queues and
> the number of devices have a great impact on the time of loading non-iterable
> vmstate. The growth of the number of devices and queues will lead to more
> mr commits, and the time consumption caused by the flatview reconstruction
> will also increase.
>
> Signed-off-by: Chuang Xu

Reviewed-by: Juan Quintela

And BTW, nice load time improvements.
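For readers unfamiliar with the mechanism: the win comes from QEMU's existing memory-transaction API, which queues region changes and rebuilds the flatviews once at commit time. The shape, roughly (where exactly the series places the begin/commit calls differs; this is only a sketch):

    /* Sketch: wrap the non-iterable load phase in one transaction. */
    static int load_vmstate_in_one_transaction(QEMUFile *f,
                                               MigrationIncomingState *mis)
    {
        int ret;

        memory_region_transaction_begin();
        /*
         * Every memory-region change made by device post_load handlers
         * is queued instead of each triggering a flatview rebuild.
         */
        ret = qemu_loadvm_state_main(f, mis);
        memory_region_transaction_commit();   /* one rebuild, not many */

        return ret;
    }

This is why the improvement scales with device and queue counts: each device's load used to pay for its own flatview reconstruction.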
Re: [PATCH v5 0/8] virtio-mem: Handle preallocation with migration
David Hildenbrand wrote:
> On 17.01.23 12:22, David Hildenbrand wrote:
>> While playing with migration of virtio-mem with an ordinary file backing,
>> I realized that migration and prealloc don't currently work as expected
>> for virtio-mem. Further, Jing Qi reported that setup issues (insufficient
>> huge pages on the destination) result in QEMU getting killed with SIGBUS
>> instead of failing gracefully.
>>
>> In contrast to ordinary memory backend preallocation, virtio-mem
>> preallocates memory before plugging blocks to the guest. Consequently,
>> when migrating we are not actually preallocating on the destination but
>> "only" migrating pages. Fix that by migrating the bitmap early, before any
>> RAM content, and use that information to preallocate memory early, before
>> migrating any RAM.
>>
>> Postcopy needs some extra care, and I realized that prealloc+postcopy is
>> shaky in general. Let's at least try to mimic what ordinary
>> prealloc+postcopy does: temporarily allocate the memory, discard it, and
>> cross fingers that we'll still have sufficient memory when postcopy
>> actually tries placing pages.
>>
>> Cc: Dr. David Alan Gilbert
>> Cc: Juan Quintela
>> Cc: Peter Xu
>> Cc: Michael S. Tsirkin
>> Cc: Michal Privoznik
>
> @Juan, David: I can similarly take this via my tree as well.

Reviewing it. I can get it through the migration tree, thanks.

Later, Juan.
Re: [PATCH v3 0/6] Support for new CPU model SapphireRapids
On Fri, 6 Jan 2023 00:38:20 -0800 Lei Wang wrote:
> This series aims to add a new CPU model SapphireRapids, and tries to
> address the problem stated in
> https://lore.kernel.org/all/20220812055751.14553-1-lei4.w...@intel.com/T/#mcf67dbd1ad37c65d7988c36a2b267be9afd2fb30,
> so that a named CPU model can define its own AMX values, and QEMU won't
> pass the wrong AMX values to KVM in future platforms if they have
> different values supported.
>
> The original patch is
> https://lore.kernel.org/all/20220812055751.14553-1-lei4.w...@intel.com/T/#u.

MultiBitFeatureInfo looks like an interesting idea, but among fixing
whatever issues this has atm, you'd probably need to introduce a new
qdev_bitfield property infrastructure so that such features could be
treated like any other qdev properties.

Also, when MultiBitFeatureInfo is added, one should convert all other
use cases where it's applicable (not only the new code) so that we
would end up with a consolidated approach instead of a zoo.

I'm not sure all the MultiBitFeatureInfo complexity is necessary just
for adding a new CPU model. I'd rather use the ad hoc approach that we
were using before for non-boolean features, and then try to develop
MultiBitFeatureInfo & co as a separate series to demonstrate how much
it will improve the current CPU model definitions.

PS: 'make check-acceptance' is broken with this

> ---
>
> Changelog:
>
> v3:
> - Rebase on the latest QEMU (d1852caab131ea898134fdcea8c14bc2ee75fbe9).
> - v2:
> https://lore.kernel.org/qemu-devel/20221102085256.81139-1-lei4.w...@intel.com/
>
> v2:
> - Fix when passing all zeros of AMX-related CPUID, QEMU will warn
>   unsupported.
> - Remove unnecessary function definition and make code cleaner.
> - Fix some typos.
> - v1:
> https://lore.kernel.org/qemu-devel/20221027020036.373140-1-lei4.w...@intel.com/T/#t
>
>
> Lei Wang (6):
>   i386: Introduce FeatureWordInfo for AMX CPUID leaf 0x1D and 0x1E
>   i386: Remove unused parameter "uint32_t bit" in
>     feature_word_description()
>   i386: Introduce new struct "MultiBitFeatureInfo" for multi-bit
>     features
>   i386: Mask and report unavailable multi-bit feature values
>   i386: Initialize AMX CPUID leaves with corresponding env->features[]
>     leaves
>   i386: Add new CPU model SapphireRapids
>
> target/i386/cpu-internal.h | 11 ++
> target/i386/cpu.c | 311 +++--
> target/i386/cpu.h | 16 ++
> 3 files changed, 322 insertions(+), 16 deletions(-)
>
>
> base-commit: d1852caab131ea898134fdcea8c14bc2ee75fbe9
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Chuang Xu wrote:
> In this version:
>
> - rename rcu_read_locked() to rcu_read_is_locked().
> - adjust the sanity check in address_space_to_flatview().
> - improve some comments.
>
> The duration of loading non-iterable vmstate accounts for a significant
> portion of downtime (starting with the timestamp of source qemu stop and
> ending with the timestamp of target qemu start). Most of the time is spent
> committing memory region changes repeatedly.
>
> This patch packs all the changes to memory region during the period of
> loading non-iterable vmstate in a single memory transaction. With the
> increase of devices, this patch will greatly improve the performance.
>
> Here are the test1 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 16-queue vhost-net device
>   - 16 4-queue vhost-user-blk device.
>
>           time of loading non-iterable vmstate    downtime
> before    about 150 ms                            740+ ms
> after     about 30 ms                             630+ ms
>
> (This result is different from that of v1. It may be that someone has
> changed something on my host, but it does not affect the display of
> the optimization effect.)
>
> In test2, we keep the number of devices the same as test1, and reduce
> the number of queues per device:
>
> Here are the test2 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 1-queue vhost-net device
>   - 16 1-queue vhost-user-blk device.
>
>           time of loading non-iterable vmstate    downtime
> before    about 90 ms                             about 250 ms
> after     about 25 ms                             about 160 ms
>
> In test3, we keep the number of queues per device the same as test1, and
> reduce the number of devices:
>
> Here are the test3 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 1 16-queue vhost-net device
>   - 1 4-queue vhost-user-blk device.
>
>           time of loading non-iterable vmstate    downtime
> before    about 20 ms                             about 70 ms
> after     about 11 ms                             about 60 ms
>
> As we can see from the test results above, both the number of queues and
> the number of devices have a great impact on the time of loading non-iterable
> vmstate. The growth of the number of devices and queues will lead to more
> mr commits, and the time consumption caused by the flatview reconstruction
> will also increase.
>
> Please review, Chuang.

Hi,

As in my review, I agree with the patches, but I am waiting for Paolo
to review the rcu change (which I think is trivial, but I am not the
rcu maintainer).

If it happens that you need to send another version, I think that you
can change the RFC to PATCH.

Again, very good job.

Later, Juan.

> [v4]
>
> - attach more information in the cover letter.
> - remove changes on virtio_load.
> - add rcu_read_locked() to detect holding of rcu lock.
>
> [v3]
>
> - move virtio_load_check_delay() from virtio_memory_listener_commit() to
>   virtio_vmstate_change().
> - add delay_check flag to VirtIODevice to make sure virtio_load_check_delay()
>   will be called when delay_check is true.
>
> [v2]
>
> - rebase to latest upstream.
> - add sanity check to address_space_to_flatview().
> - postpone the init of the vring cache until migration's loading completes.
>
> [v1]
>
> The duration of loading non-iterable vmstate accounts for a significant
> portion of downtime (starting with the timestamp of source qemu stop and
> ending with the timestamp of target qemu start). Most of the time is spent
> committing memory region changes repeatedly.
>
> This patch packs all the changes to memory region during the period of
> loading non-iterable vmstate in a single memory transaction. With the
> increase of devices, this patch will greatly improve the performance.
>
> Here are the test results:
> test vm info:
> - 32 CPUs 128GB RAM
> - 8 16-queue vhost-net device
> - 16 4-queue vhost-user-blk device.
>
>           time of loading non-iterable vmstate
> before    about 210 ms
> after     about 40 ms
Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs
On 2/2/23 11:58, Peter Maydell wrote: On Thu, 2 Feb 2023 at 10:47, Philippe Mathieu-Daudé wrote: Where is "len-reserved-regions" declared? DEFINE_PROP_ARRAY("reserved-regions", ...) does this. For an array property "foo" the machinery creates an integer property "foo-len", which must be set first. Setting that then creates properties "foo[0]", "foo[1]", ... which can be set. Oh indeed now I see the array prefix: #define PROP_ARRAY_LEN_PREFIX "len-" Not obvious to realize while grepping. Thanks for the quick answer!
Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode
On 2/2/23 11:19, Fiona Ebner wrote: Am 31.01.23 um 19:18 schrieb Denis V. Lunev: On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote: + Den Den, I remember we thought about that, and probably had a solution? Another possible approach to get benefits from both modes is to switch to blocking mode after first loop of copying. [*] Hmm. Thinking about proposed solution it seems, that [*] is better. The main reason of "write-blocking" mode is to force convergence of the mirror job. But you lose this thing if you postpone switching to the moment when mirror becomes READY: we may never reach it. Or, may be we can selectively skip or block guest writes, to keep some specified level of convergence? Or, we can start in "background" mode, track the speed of convergence (like dirty-delta per minute), and switch to blocking if speed becomes less than some threshold. That is absolutely correct. It would be very kind to start with asynchronous mirror and switch to the synchronous one after specific amount of loops. This should be somehow measured. Originally we have thought about switching after the one loop (when basic background sending is done), but may be 2 iterations (initial + first one) would be better. Talking about this specific approach and our past experience I should note that reaching completion point in the asynchronous mode was a real pain and slow down for the guest. We have intentionally switched at that time to the sync mode for that purpose. Thus this patch is a "worst half-way" for us. While the asynchronous mode can take longer of course, the slowdown of the guest is bigger with the synchronous mode, or what am I missing? We have been using asynchronous mode for years (starting migration after all drive-mirror jobs are READY) and AFAIK our users did not complain about them not reaching READY or heavy slowdowns in the guest. We had two reports about an assertion failure recently when drive-mirror still got work left at the time the blockdrives are inactivated by migration. We have tests internally which performs migration with different IO write rates during the process. We use such kind of tests more than 10 years and before we have switched to QEMU from previous proprietary hypervisor. Once the switch happened, we have started to experience failures here due to asynchronous nature of the mirror. That is why we have raised initial discussion and this code have appeared. I may be incorrect or too self oriented here, but this could be my understanding of the situation :) Frankly speaking I would say that this switch could be considered NOT QEMU job and we should just send a notification (event) for the completion of the each iteration and management software should take a decision to switch from async mode to the sync one. Unfortunately, our management software is a bit limited in that regard currently and making listening for events available in the necessary place would take a bit of work. Having the switch command would nearly be enough for us (we'd just switch after READY). But we'd also need that when the switch happens after READY, that all remaining asynchronous operations are finished by the command. That could be a matter of the other event I believe. We switch mode and reset the state. New READY event will be sent once the bitmap is cleared. That seems fair. Otherwise, the original issue with inactivating block drives while mirror still has work remains. Do those semantics for the switch sound acceptable? Under such a condition we should have a command to perform mode switch. 
This seems more QEMU/QAPI way :) I feel like a switch at an arbitrary time might be hard to do correctly, i.e. to not miss some (implicit) assertion. But I can try and go for this approach if it's more likely to be accepted upstream. So a 'drive-mirror-change' command which takes the 'job-id' and a 'copy-mode' argument? And only allow the switch from 'background' to 'active' (initially)? That is what I can not guarantee as we have Kevin and Stefan here. Though in general this opens interesting possibilities for other use-cases :) Or a more generic 'block-job-change'? But that would need a way to take 'JobType'-specific arguments. Is that doable? (...) @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) if (s->in_flight == 0 && cnt == 0) { trace_mirror_before_flush(s); if (!job_is_ready(&s->common.job)) { + if (s->copy_mode == + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) { + /* + * Pause guest I/O to check if we can switch to active mode. + * To set actively_synced to true below, it is necessary to + * have seen and synced all guest I/O. + */ + s->in_drain = true; + bdrv_drained_begin(bs); + if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
Re: [PATCH v5 0/8] virtio-mem: Handle preallocation with migration
On Tue, Jan 17, 2023 at 12:22:41PM +0100, David Hildenbrand wrote:
> While playing with migration of virtio-mem with an ordinary file backing,
> I realized that migration and prealloc don't currently work as expected
> for virtio-mem. Further, Jing Qi reported that setup issues (insufficient
> huge pages on the destination) result in QEMU getting killed with SIGBUS
> instead of failing gracefully.
>
> In contrast to ordinary memory backend preallocation, virtio-mem
> preallocates memory before plugging blocks to the guest. Consequently,
> when migrating we are not actually preallocating on the destination but
> "only" migrating pages. Fix that by migrating the bitmap early, before any
> RAM content, and use that information to preallocate memory early, before
> migrating any RAM.
>
> Postcopy needs some extra care, and I realized that prealloc+postcopy is
> shaky in general. Let's at least try to mimic what ordinary
> prealloc+postcopy does: temporarily allocate the memory, discard it, and
> cross fingers that we'll still have sufficient memory when postcopy
> actually tries placing pages.
>
> Cc: Dr. David Alan Gilbert
> Cc: Juan Quintela
> Cc: Peter Xu
> Cc: Michael S. Tsirkin
> Cc: Michal Privoznik

Reviewed-by: Michael S. Tsirkin

> v4 -> v5:
> - "migration/savevm: Move more savevm handling into vmstate_save()"
> -- Extended patch description regarding tracing
> - "migration/savevm: Prepare vmdesc json writer in
>    qemu_savevm_state_setup()"
> -- Move freeing to migrate_fd_cleanup()
> - "migration/savevm: Allow immutable device state to be migrated early (i.e.,
>    before RAM)"
> -- "immutable" -> "early_setup"
> -- Extend comment
> - Added some RBs (thanks!)
>
> v3 -> v4:
> - First 3 patches:
> -- Minimze code changes and simplify
> -- Save immutable device state during qemu_savevm_state_setup()
> -- Don't use vmsd priorities, use a new flag
> -- Split it logically up
> - "migration/ram: Factor out check for advised postcopy"
> -- Don't factor out postcopy_is_running()
> - "virtio-mem: Migrate immutable properties early"
> -- Adjust to changed vmsd interface
> - "virtio-mem: Proper support for preallocation with migration"
> -- Drop sanity check in virtio_mem_post_load_early()
>
> v2 -> v3:
> - New approach/rewrite, drop RB and TB of last patch
>
> v1 -> v2:
> - Added RBs and Tested-bys
> - "virtio-mem: Fail if a memory backend with "prealloc=on" is specified"
> -- Fail instead of warn
> -- Adjust subject/description
>
>
> David Hildenbrand (8):
>   migration/savevm: Move more savevm handling into vmstate_save()
>   migration/savevm: Prepare vmdesc json writer in
>     qemu_savevm_state_setup()
>   migration/savevm: Allow immutable device state to be migrated early
>     (i.e., before RAM)
>   migration/vmstate: Introduce VMSTATE_WITH_TMP_TEST() and
>     VMSTATE_BITMAP_TEST()
>   migration/ram: Factor out check for advised postcopy
>   virtio-mem: Fail if a memory backend with "prealloc=on" is specified
>   virtio-mem: Migrate immutable properties early
>   virtio-mem: Proper support for preallocation with migration
>
> hw/core/machine.c | 4 +-
> hw/virtio/virtio-mem.c | 144 -
> include/hw/virtio/virtio-mem.h | 8 ++
> include/migration/misc.h | 4 +-
> include/migration/vmstate.h | 28 ++-
> migration/migration.c | 9 +++
> migration/migration.h | 4 +
> migration/ram.c | 8 +-
> migration/savevm.c | 105 +---
> 9 files changed, 255 insertions(+), 59 deletions(-)
>
> --
> 2.39.0
Re: [PATCH v1 1/5] migration/ram: Fix populate_read_range()
David Hildenbrand wrote:
> Unfortunately, commit f7b9dcfbcf44 broke populate_read_range(): the loop
> end condition is very wrong, resulting in that function not populating the
> full range. Let's fix that.
>
> Fixes: f7b9dcfbcf44 ("migration/ram: Factor out populating pages readable in
> ram_block_populate_pages()")
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: David Hildenbrand

Reviewed-by: Juan Quintela
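For context, the intended shape of that loop is to touch every host page in [offset, offset + size); a minimal sketch of a correct version (illustrative, not necessarily the exact committed hunk):

    static void populate_read_range(RAMBlock *block, ram_addr_t offset,
                                    ram_addr_t size)
    {
        const ram_addr_t limit = offset + size;   /* correct loop end */

        for (; offset < limit; offset += qemu_real_host_page_size()) {
            /* Read one byte per host page so the kernel populates it. */
            char tmp = *((volatile char *)block->host + offset);
            (void)tmp;
        }
    }

A loop end expressed in terms of size instead of offset + size silently stops early once offset starts above zero, which matches the "not populating the full range" symptom described above.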
Re: [PATCH v1 2/5] migration/ram: Fix error handling in ram_write_tracking_start()
David Hildenbrand wrote:
> If something goes wrong during uffd_change_protection(), we would fail
> to unregister uffd-wp and not release our reference. Fix it by
> performing the uffd_change_protection(true) last.
>
> Note that a uffd_change_protection(false) on the recovery path without a
> prior uffd_change_protection(true) is fine.
>
> Fixes: 278e2f551a09 ("migration: support UFFD write fault processing in
> ram_save_iterate()")
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: David Hildenbrand

Reviewed-by: Juan Quintela
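The ordering change can be sketched as follows (helper signatures approximated from util/userfaultfd.c, error handling condensed; treat this as an outline of the idea rather than the patch itself):

    /* 1) Register the range with uffd and take our reference first... */
    if (uffd_register_memory(rs->uffdio_fd, block->host, block->max_length,
                             UFFDIO_REGISTER_MODE_WP, NULL)) {
        goto fail;                       /* nothing to undo yet */
    }
    block->flags |= RAM_UF_WRITEPROTECT;
    memory_region_ref(block->mr);        /* our reference */

    /* 2) ...and only then write-protect, as the last fallible step, so
     * an error here can still unwind the registration and reference. */
    if (uffd_change_protection(rs->uffdio_fd, block->host,
                               block->used_length, true, false)) {
        goto fail_unregister;
    }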
Re: [PATCH v1 3/5] migration/ram: Don't explicitly unprotect when unregistering uffd-wp
David Hildenbrand wrote: > When unregistering uffd-wp, older kernels before commit f369b07c86143 > ("mm/uffd:reset write protection when unregister with wp-mode") won't > clear the uffd-wp PTE bit. When re-registering uffd-wp, the previous > uffd-wp PTE bits would trigger again. With above commit, the kernel will > clear the uffd-wp PTE bits when unregistering itself. > > Consequently, we'll clear the uffd-wp PTE bits now twice -- whereby we > don't care about clearing them at all: a new background snapshot will > re-register uffd-wp and re-protect all memory either way. > > So let's skip the manual clearing of uffd-wp. If ever relevant, we > could clear conditionally in uffd_unregister_memory() -- we just need a > way to figure out more recent kernels. > > Signed-off-by: David Hildenbrand Reviewed-by: Juan Quintela Fixing a bug by removing code O:-)
Re: [PATCH v1 4/5] migration/ram: Rely on used_length for uffd_change_protection()
David Hildenbrand wrote:
> ram_mig_ram_block_resized() will abort migration (including background
> snapshots) when resizing a RAMBlock. ram_block_populate_read() will only
> populate RAM up to used_length, so at least for anonymous memory,
> protecting everything between used_length and max_length won't
> actually protect anything and is just a NOP.
>
> So let's only protect everything up to used_length.
>
> Note: it still makes sense to register uffd-wp for max_length, such
> that RAM_UF_WRITEPROTECT is independent of a changing used_length.
>
> Signed-off-by: David Hildenbrand

Reviewed-by: Juan Quintela
Re: [PATCH v1 0/5] migration/ram: background snapshot fixes and optimiations
David Hildenbrand wrote: > On 05.01.23 13:45, David Hildenbrand wrote: >> Playing with background snapshots in combination with hugetlb and >> virtio-mem, I found two issues and some reasonable optimizations (skip >> unprotecting when unregistering). >> With virtio-mem (RamDiscardManager), we now won't be allocating >> unnecessary >> page tables for unplugged ranges when using uffd-wp with shmem/hugetlb. >> Cc: Juan Quintela (maintainer:Migration) >> Cc: "Dr. David Alan Gilbert" (maintainer:Migration) >> Cc: Peter Xu >> Cc: Andrey Gruzdev > > @Juan, David: I can take this via my tree (especially the last patch > only optimized virtio-mem for now). Just let me know. I reviewed everything now. Queued on my tree. Later, Juan.
Re: [PATCH v1 5/5] migration/ram: Optimize ram_write_tracking_start() for RamDiscardManager
David Hildenbrand wrote: > ram_block_populate_read() already optimizes for RamDiscardManager. > However, ram_write_tracking_start() will still try protecting discarded > memory ranges. > > Let's optimize, because discarded ranges don't map any pages and > > (1) For anonymous memory, trying to protect using uffd-wp without a mapped > page is ignored by the kernel and consequently a NOP. > > (2) For shared/file-backed memory, we will fill present page tables in the > range with PTE markers. However, we will even allocate page tables > just to fill them with unnecessary PTE markers and effectively > waste memory. > > So let's exclude these ranges, just like ram_block_populate_read() > already does. > > Signed-off-by: David Hildenbrand Reviewed-by: Juan Quintela
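The optimization can be pictured like this, under the assumption that the series reuses the RamDiscardManager replay helper the same way ram_block_populate_read() does (the uffd_protect_section callback name is made up for illustration):

    if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
        /*
         * Only walk populated parts; discarded ranges are skipped, so
         * no page tables get allocated just to hold PTE markers.
         */
        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
        MemoryRegionSection section = {
            .mr = rb->mr,
            .offset_within_region = 0,
            .size = rb->mr->size,
        };

        ram_discard_manager_replay_populated(rdm, &section,
                                             uffd_protect_section, /* assumed */
                                             (void *)(uintptr_t)uffd_fd);
    } else {
        /* No discard manager: protect the whole used range as before. */
        uffd_change_protection(uffd_fd, rb->host, rb->used_length,
                               true, false);
    }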
[PATCH v6 3/8] igb: add ICR_RXDW
IGB uses RXDW ICR bit to indicate that rx descriptor has been written
back. This is the same as RXT0 bit in older HW.

Signed-off-by: Sriram Yagnaraman
---
 hw/net/e1000x_regs.h | 4 ++++
 hw/net/igb_core.c    | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/net/e1000x_regs.h b/hw/net/e1000x_regs.h
index fb5b861135..f509db73a7 100644
--- a/hw/net/e1000x_regs.h
+++ b/hw/net/e1000x_regs.h
@@ -335,6 +335,7 @@
 #define E1000_ICR_RXDMT0    0x0010 /* rx desc min. threshold (0) */
 #define E1000_ICR_RXO       0x0040 /* rx overrun */
 #define E1000_ICR_RXT0      0x0080 /* rx timer intr (ring 0) */
+#define E1000_ICR_RXDW      0x0080 /* rx desc written back */
 #define E1000_ICR_MDAC      0x0200 /* MDIO access complete */
 #define E1000_ICR_RXCFG     0x0400 /* RX /c/ ordered set */
 #define E1000_ICR_GPI_EN0   0x0800 /* GP Int 0 */
@@ -378,6 +379,7 @@
 #define E1000_ICS_RXDMT0    E1000_ICR_RXDMT0    /* rx desc min. threshold */
 #define E1000_ICS_RXO       E1000_ICR_RXO       /* rx overrun */
 #define E1000_ICS_RXT0      E1000_ICR_RXT0      /* rx timer intr */
+#define E1000_ICS_RXDW      E1000_ICR_RXDW      /* rx desc written back */
 #define E1000_ICS_MDAC      E1000_ICR_MDAC      /* MDIO access complete */
 #define E1000_ICS_RXCFG     E1000_ICR_RXCFG     /* RX /c/ ordered set */
 #define E1000_ICS_GPI_EN0   E1000_ICR_GPI_EN0   /* GP Int 0 */
@@ -407,6 +409,7 @@
 #define E1000_IMS_RXDMT0    E1000_ICR_RXDMT0    /* rx desc min. threshold */
 #define E1000_IMS_RXO       E1000_ICR_RXO       /* rx overrun */
 #define E1000_IMS_RXT0      E1000_ICR_RXT0      /* rx timer intr */
+#define E1000_IMS_RXDW      E1000_ICR_RXDW      /* rx desc written back */
 #define E1000_IMS_MDAC      E1000_ICR_MDAC      /* MDIO access complete */
 #define E1000_IMS_RXCFG     E1000_ICR_RXCFG     /* RX /c/ ordered set */
 #define E1000_IMS_GPI_EN0   E1000_ICR_GPI_EN0   /* GP Int 0 */
@@ -441,6 +444,7 @@
 #define E1000_IMC_RXDMT0    E1000_ICR_RXDMT0    /* rx desc min. threshold */
 #define E1000_IMC_RXO       E1000_ICR_RXO       /* rx overrun */
 #define E1000_IMC_RXT0      E1000_ICR_RXT0      /* rx timer intr */
+#define E1000_IMC_RXDW      E1000_ICR_RXDW      /* rx desc written back */
 #define E1000_IMC_MDAC      E1000_ICR_MDAC      /* MDIO access complete */
 #define E1000_IMC_RXCFG     E1000_ICR_RXCFG     /* RX /c/ ordered set */
 #define E1000_IMC_GPI_EN0   E1000_ICR_GPI_EN0   /* GP Int 0 */
diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index b484e6ac30..1ddf54f630 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -1582,7 +1582,7 @@ igb_receive_internal(IGBCore *core, const struct iovec *iov, int iovcnt,
             n |= E1000_ICS_RXDMT0;
         }
 
-        n |= E1000_ICR_RXT0;
+        n |= E1000_ICR_RXDW;
 
         trace_e1000e_rx_written_to_guest(rxr.i->idx);
     }
-- 
2.34.1
Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration
On Thu, Feb 2, 2023 at 2:00 AM Si-Wei Liu wrote:
>
> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> > It's possible to migrate vdpa net devices if they are shadowed from the
> > start. But always shadowing the dataplane effectively breaks its host
> > passthrough, so it's not convenient in vDPA scenarios.
> >
> > This series enables dynamically switching to shadow mode only at
> > migration time. This allows full data virtqueue passthrough all the
> > time qemu is not migrating.
> >
> > Successfully tested with vdpa_sim_net (but it needs some patches, I
> > will send them soon) and qemu emulated device with vp_vdpa with
> > some restrictions:
> > * No CVQ.
> > * VIRTIO_RING_F_STATE patches.
>
> What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE; is it
> a new feature that other vdpa drivers would need for live migration)?
>

Not really. Since vp_vdpa wraps a virtio-net-pci driver to give it vdpa
capabilities, it needs a virtio in-band method to set and fetch the
virtqueue state. Jason sent a proposal some time ago [1], and I
implemented it in qemu's virtio emulated device.

I can send them as an RFC, but I didn't worry about making them pretty,
nor do I think they should be merged at the moment. vdpa parent drivers
should follow the vdpa_sim changes.

Thanks!

[1] https://lists.oasis-open.org/archives/virtio-comment/202103/msg00036.html

> -Siwei
>
> > * Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like
> >   DPDK.
> >
> > Comments are welcome, especially in the patches with RFC in the message.
> >
> > v2:
> > - Use a migration listener instead of a memory listener to know when
> >   the migration starts.
> > - Add stuff not picked with ASID patches, like enable rings after
> >   driver_ok
> > - Add rewinding on the migration src, not in dst
> > - v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html
> >
> > Eugenio Pérez (13):
> >   vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
> >   vdpa net: move iova tree creation from init to start
> >   vdpa: copy cvq shadow_data from data vqs, not from x-svq
> >   vdpa: rewind at get_base, not set_base
> >   vdpa net: add migration blocker if cannot migrate cvq
> >   vhost: delay set_vring_ready after DRIVER_OK
> >   vdpa: delay set_vring_ready after DRIVER_OK
> >   vdpa: Negotiate _F_SUSPEND feature
> >   vdpa: add feature_log parameter to vhost_vdpa
> >   vdpa net: allow VHOST_F_LOG_ALL
> >   vdpa: add vdpa net migration state notifier
> >   vdpa: preemptive kick at enable
> >   vdpa: Conditionally expose _F_LOG in vhost_net devices
> >
> > include/hw/virtio/vhost-backend.h | 4 +
> > include/hw/virtio/vhost-vdpa.h | 1 +
> > hw/net/vhost_net.c | 25 ++-
> > hw/virtio/vhost-vdpa.c | 64 +---
> > hw/virtio/vhost.c | 3 +
> > net/vhost-vdpa.c | 247 +-
> > 6 files changed, 278 insertions(+), 66 deletions(-)
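The "migration listener" mentioned in the changelog is a migration state notifier. Its rough shape could be as below; the names are borrowed from the series' patch titles and the details are assumptions, not the merged code:

    static void vdpa_net_migration_state_notifier(Notifier *notifier,
                                                  void *data)
    {
        MigrationState *migration = data;
        VhostVDPAState *s = container_of(notifier, VhostVDPAState,
                                         migration_state);

        if (migration_in_setup(migration)) {
            /* Switch the dataplane to shadow virtqueues for tracking. */
            vhost_vdpa_net_switch_to_svq(s, true);    /* assumed helper */
        } else if (migration_has_failed(migration)) {
            /* Migration aborted: return to plain passthrough. */
            vhost_vdpa_net_switch_to_svq(s, false);
        }
    }

    /* registered once per backend with: */
    add_migration_state_change_notifier(&s->migration_state);

The appeal of this design is that the passthrough datapath pays no shadowing cost outside of migration, at the price of a mode flip (and virtqueue state save/restore) when migration starts.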
Re: [PATCH 0/6] Shorten the runtime of some gitlab-CI shared runner jobs
On 30/1/23 11:44, Thomas Huth wrote:
> We're currently facing the problem that the gitlab-CI jobs for the shared runners take too much of the limited CI minutes on gitlab.com. Here are now some patches that optimize some of the jobs a little bit so that they take less runtime. We slightly lose some test coverage with some of these changes (e.g. by dropping ppc-softmmu from a Clang-based test and only continuing to test ppc64-softmmu with Clang in another job), but that should still be much better than running out of CI minutes after 3/4 of a month.

FWIW, the last time I wanted to add some tests, Alex suggested I run a before/after gcov report and justify the newly tested paths. Maybe we should enforce something similar: either cover new paths or fix a bug.
Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode
Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
> > Frankly speaking I would say that this switch could be considered
> > NOT a QEMU job and we should just send a notification (event) for the
> > completion of each iteration, and management software should
> > take the decision to switch from async mode to the sync one.

My first thought was very similar. We should provide a building block that just switches between the two modes, and then the management tool can decide what the right policy is.

Adding a new event when the first iteration is done (I'm not sure if there is much value in having it for later iterations) makes sense to me if someone wants to use it. If we add it, let's not forget that events can be lost and clients must be able to query the same information, so we'd have to add it to query-jobs, too - which in turn requires adding a job type specific struct to JobInfo first.

Once we have this generic infrastructure with its low-level building blocks, I wouldn't necessarily be opposed to having an option built on top where QEMU automatically does what we consider most useful for most users. auto-finalize/dismiss already do something similar.

> Unfortunately, our management software is a bit limited in that regard
> currently and making listening for events available in the necessary
> place would take a bit of work. Having the switch command would nearly
> be enough for us (we'd just switch after READY). But we'd also need
> that when the switch happens after READY, all remaining
> asynchronous operations are finished by the command. Otherwise, the
> original issue with inactivating block drives while mirror still has
> work remains. Do those semantics for the switch sound acceptable?

Completing the remaining asynchronous operations can take a while, so I don't think it's something to be done in a synchronous QMP command. Do we need an event that tells you that the switch has completed?

But having to switch the mirror job to sync mode just to avoid doing I/O on an inactive device sounds wrong to me. It doesn't fix the root cause of that problem, but just papers over it. Why does your management tool not complete the mirror job before it does the migration switchover that inactivates images?

Kevin
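A rough sketch of the "building block" shape described above: a hypothetical QMP command that flips a ready mirror job into write-blocking mode. block_job_get() and the MIRROR_COPY_MODE_* constants are QEMU's real API; the command name, the copy_mode field access and the locking are assumptions:

    #include "qapi/error.h"

    /* hypothetical QMP handler: switch a mirror job's copy mode */
    void qmp_x_block_job_change_copy_mode(const char *id,
                                          MirrorCopyMode mode,
                                          Error **errp)
    {
        BlockJob *job = block_job_get(id); /* real helper, blockjob.h */
        MirrorBlockJob *s;

        if (!job) {
            error_setg(errp, "Block job '%s' not found", id);
            return;
        }
        if (mode != MIRROR_COPY_MODE_WRITE_BLOCKING) {
            error_setg(errp, "Only switching to write-blocking is supported");
            return;
        }

        s = container_of(job, MirrorBlockJob, common);
        /* new guest writes are mirrored synchronously from here on;
         * in-flight asynchronous operations still have to drain before
         * source and target are actually in sync */
        qatomic_set(&s->copy_mode, mode); /* assumed field/atomicity */
    }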
Re: An issue with x86 tcg and MMIO
On Thu, 2 Feb 2023 at 10:56, Richard Henderson wrote:
> On 2/1/23 23:39, Jonathan Cameron wrote:
> > Not sure - if we can do the handling above then sure we could make that change.
> > I can see there is a path to register the callbacks but I'd kind of assumed
> > ROM meant read only...
>
> I think "romd" means "read mostly".
>
> In the case of flash, I believe that a write changes modes (block erase something
> something) and the region changes state into MMIO. But normal state is read mode
> where read+execute go through unchallenged.

In QEMU a ROMD MemoryRegion (created by memory_region_init_rom_device()) is a memory region backed by RAM for reads and by callbacks for writes. (I think ROMD stands for "ROM device".) You can then use memory_region_rom_device_set_romd() to put the ROMD into either ROMD mode (the default, reads backed by RAM) or MMIO mode (reads backed by MMIO callbacks). Writes are always callbacks regardless.

This is mainly meant for flash devices, which are usually reads-as-data but have a programming mode where you write a command to it and then read back command results. It's possible to use it for other tricks too.

When a ROMD is in ROMD mode then execution from it is as fast as execution from any RAM; when it is in MMIO mode then execution from it is as slow as execution from any other MMIO-backed MemoryRegion. Note that AFAIK you can't execute from MMIO at all with KVM (either ROMD-in-MMIO mode or a plain old MMIO device).

You might want to look at whether QEMU's iommu functionality is helpful to you -- I'm assuming CXL doesn't do weird stuff on a less-than-page granularity, and the iommu APIs will let you do "programmatically decide where this address should actually go". The other option involves mapping and unmapping MemoryRegions inside a container MR.

thanks
-- PMM
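A minimal sketch of the flash-style pattern Peter describes; memory_region_init_rom_device() and memory_region_rom_device_set_romd() are the real QEMU API, while the device state and the command byte are made up for illustration:

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    typedef struct MyFlashState {
        MemoryRegion mem;
        uint8_t status;
    } MyFlashState;

    /* read callback: only consulted while the region is in MMIO mode */
    static uint64_t my_flash_read(void *opaque, hwaddr addr, unsigned size)
    {
        MyFlashState *s = opaque;
        return s->status; /* e.g. report programming status */
    }

    /* write callback: always used, in both ROMD and MMIO mode */
    static void my_flash_write(void *opaque, hwaddr addr, uint64_t val,
                               unsigned size)
    {
        MyFlashState *s = opaque;
        if (val == 0xf0) { /* made-up "read array" command */
            /* back to ROMD mode: reads served straight from RAM again */
            memory_region_rom_device_set_romd(&s->mem, true);
        } else {
            /* any other command: route reads through the MMIO callback */
            memory_region_rom_device_set_romd(&s->mem, false);
        }
    }

    static const MemoryRegionOps my_flash_ops = {
        .read = my_flash_read,
        .write = my_flash_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    static void my_flash_realize(MyFlashState *s, Object *owner, Error **errp)
    {
        memory_region_init_rom_device(&s->mem, owner, &my_flash_ops, s,
                                      "my-flash", 0x10000, errp);
    }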
Re: [PATCH] hw/arm: Use TYPE_ARM_SMMUV3
On Tue, 24 Jan 2023 at 23:21, Richard Henderson wrote: > > Use the macro instead of two explicit string literals. > > Signed-off-by: Richard Henderson > --- > hw/arm/sbsa-ref.c | 3 ++- > hw/arm/virt.c | 2 +- > 2 files changed, 3 insertions(+), 2 deletions(-) Applied to target-arm.next, thanks. -- PMM
Re: [PATCH] target/arm: Fix physical address resolution for Stage2
On Thu, 26 Jan 2023 at 23:32, Richard Henderson wrote: > > Conversion to probe_access_full missed applying the page offset. > > Cc: qemu-sta...@nongnu.org > Reported-by: Sid Manning > Fixes: f3639a64f602 ("target/arm: Use softmmu tlbs for page table walking") > Signed-off-by: Richard Henderson > --- > target/arm/ptw.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Applied to target-arm.next, thanks. -- PMM
Re: [PATCH v5 1/8] migration/savevm: Move more savevm handling into vmstate_save()
David Hildenbrand wrote: > Let's move more code into vmstate_save(), reducing code duplication and > preparing for reuse of vmstate_save() in qemu_savevm_state_setup(). We > have to move vmstate_save() to make the compiler happy. > > We'll now also trace from qemu_save_device_state(), triggering the same > tracepoints as previously called from > qemu_savevm_state_complete_precopy_non_iterable() only. Note that > qemu_save_device_state() ignores iterable device state, such as RAM, > and consequently doesn't trigger some other trace points (e.g., > trace_savevm_state_setup()). > > Signed-off-by: David Hildenbrand Reviewed-by: Juan Quintela
Re: [PATCH v5 2/8] migration/savevm: Prepare vmdesc json writer in qemu_savevm_state_setup()
David Hildenbrand wrote: > ... and store it in the migration state. This is a preparation for > storing selected vmds's already in qemu_savevm_state_setup(). > > Reviewed-by: Dr. David Alan Gilbert > Signed-off-by: David Hildenbrand Reviewed-by: Juan Quintela
Re: [PATCH v5 3/8] migration/savevm: Allow immutable device state to be migrated early (i.e., before RAM)
David Hildenbrand wrote:
> For virtio-mem, we want to have the plugged/unplugged state of memory
> blocks available before migrating any actual RAM content, and perform
> sanity checks before touching anything on the destination. This
> information is immutable on the migration source while migration is active.
>
> We want to use this information for proper preallocation support with
> migration: currently, we don't preallocate memory on the migration target,
> and especially with hugetlb, we can easily run out of hugetlb pages during
> RAM migration and will crash (SIGBUS) instead of catching this gracefully
> via preallocation.
>
> Migrating device state via a VMSD before we start iterating is currently
> impossible: the only approach that would be possible is avoiding a VMSD
> and migrating state manually during save_setup(), to be restored during
> load_state().
>
> Let's allow for migrating device state via a VMSD early, during the
> setup phase in qemu_savevm_state_setup(). To keep it simple, we
> indicate applicable VMSDs using an "early_setup" flag.
>
> Note that only very selected devices (i.e., ones seriously messing with
> RAM setup) are supposed to make use of such early state migration.
>
> While at it, also use a bool for the "unmigratable" member.
>
> Signed-off-by: David Hildenbrand

Reviewed-by: Juan Quintela
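For illustration, a device using the new flag would declare its VMSD like the sketch below; the "early_setup" member is what this patch introduces, while the device struct and fields are placeholders rather than virtio-mem's actual state:

    #include "migration/vmstate.h"

    typedef struct MyDeviceState {
        uint64_t size;
        /* ... */
    } MyDeviceState;

    /* migrated from qemu_savevm_state_setup(), i.e. before any RAM pages */
    static const VMStateDescription vmstate_my_device_early = {
        .name = "my-device-early",
        .version_id = 1,
        .minimum_version_id = 1,
        .early_setup = true, /* the new flag from this patch */
        .fields = (VMStateField[]) {
            VMSTATE_UINT64(size, MyDeviceState),
            VMSTATE_END_OF_LIST()
        },
    };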
Re: [PATCH v5 4/8] migration/vmstate: Introduce VMSTATE_WITH_TMP_TEST() and VMSTATE_BITMAP_TEST()
David Hildenbrand wrote: > We'll make use of both next in the context of virtio-mem. > > Reviewed-by: Dr. David Alan Gilbert > Signed-off-by: David Hildenbrand Reviewed-by: Juan Quintela
Re: [PATCH v5 5/8] migration/ram: Factor out check for advised postcopy
David Hildenbrand wrote: > Let's factor out this check, to be used in virtio-mem context next. > > While at it, fix a spelling error in a related comment. > > Signed-off-by: David Hildenbrand Reviewed-by: Juan Quintela
Re: [PATCH v5 6/8] virtio-mem: Fail if a memory backend with "prealloc=on" is specified
David Hildenbrand wrote: > "prealloc=on" for the memory backend does not work as expected, as > virtio-mem will simply discard all preallocated memory immediately again. > In the best case, it's an expensive NOP. In the worst case, it's an > unexpected allocation error. > > Instead, "prealloc=on" should be specified for the virtio-mem device only, > such that virtio-mem will try preallocating memory before plugging > memory dynamically to the guest. Fail if such a memory backend is > provided. > > Tested-by: Michal Privoznik > Reviewed-by: Dr. David Alan Gilbert > Signed-off-by: David Hildenbrand Reviewed-by: Juan Quintela
Re: [PATCH v5 7/8] virtio-mem: Migrate immutable properties early
David Hildenbrand wrote:
> The bitmap and the size are immutable while migration is active: see
> virtio_mem_is_busy(). We can migrate this information early, before
> migrating any actual RAM content. Further, all information we need for
> sanity checks is immutable as well.
>
> Having this information in place early will, for example, allow for
> properly preallocating memory before touching these memory locations
> during RAM migration: this way, we can make sure that all memory was
> actually preallocated and that any user errors (e.g., insufficient
> hugetlb pages) can be handled gracefully.
>
> In contrast, usable_region_size and requested_size can theoretically
> still be modified on the source while the VM is running. Keep migrating
> these properties the usual, late, way.
>
> Use a new device property to keep behavior of compat machines
> unmodified.
>
> Signed-off-by: David Hildenbrand

Reviewed-by: Juan Quintela

> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 616f3a207c..29b57f6448 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,7 +41,9 @@
>  #include "hw/virtio/virtio-pci.h"
>  #include "qom/object_interfaces.h"
>
> -GlobalProperty hw_compat_7_2[] = {};
> +GlobalProperty hw_compat_7_2[] = {
> +    { "virtio-mem", "x-early-migration", "false" },
> +};

We can stop pretending that we don't support this kind of properties O:-)
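For context, the device side of such a compat knob usually looks like the sketch below; the property name matches the diff above, while the struct and field names are assumptions about the series rather than quoted code:

    #include "hw/qdev-properties.h"

    typedef struct VirtIOMEM {
        /* ... */
        bool early_migration; /* assumed field behind "x-early-migration" */
    } VirtIOMEM;

    static Property virtio_mem_properties[] = {
        /* defaults to on; hw_compat_7_2 flips it off for 7.2 machines */
        DEFINE_PROP_BOOL("x-early-migration", VirtIOMEM,
                         early_migration, true),
        DEFINE_PROP_END_OF_LIST(),
    };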
Re: [PATCH v5 8/8] virtio-mem: Proper support for preallocation with migration
David Hildenbrand wrote:
> Ordinary memory preallocation runs when QEMU starts up and creates the
> memory backends, before processing the incoming migration stream. With
> virtio-mem, we don't know which memory blocks to preallocate before
> migration started. Now that we migrate the virtio-mem bitmap early, before
> migrating any RAM content, we can safely preallocate memory for all plugged
> memory blocks before migrating any RAM content.
>
> This is especially relevant for the following cases:
>
> (1) User errors
>
> With hugetlb/files, if we don't have sufficient backend memory available on
> the migration destination, we'll crash QEMU (SIGBUS) during RAM migration
> when running out of backend memory. Preallocating memory before actual
> RAM migration allows for failing gracefully and informing the user about
> the setup problem.
>
> (2) Excluded memory ranges during migration
>
> For example, virtio-balloon free page hinting will exclude some pages
> from getting migrated. In that case, we won't crash during RAM
> migration, but later, when running the VM on the destination, which is
> bad.
>
> To fix this for new QEMU machines that migrate the bitmap early,
> preallocate the memory early, before any RAM migration. Warn with old
> QEMU machines.
>
> Getting postcopy right is a bit tricky, but we essentially now implement
> the same (problematic) preallocation logic as ordinary preallocation:
> preallocate memory early and discard it again before precopy starts. During
> ordinary preallocation, discarding of RAM happens when postcopy is advised.
> As the state (bitmap) is loaded after postcopy was advised but before
> postcopy starts listening, we have to discard memory we preallocated
> immediately again ourselves.
>
> Note that nothing (not even hugetlb reservations) guarantees for postcopy
> that backend memory (especially, hugetlb pages) is still free after it
> was freed once while discarding RAM. Still, allocating that memory at
> least once helps catch some basic setup problems.
>
> Before this change, trying to restore a VM when insufficient hugetlb
> pages are around resulted in the process crashing due to a "Bus error"
> (SIGBUS). With this change, QEMU fails gracefully:
>
> qemu-system-x86_64: qemu_prealloc_mem: preallocating memory failed: Bad address
> qemu-system-x86_64: error while loading state for instance 0x0 of device ':00:03.0/virtio-mem-device-early'
> qemu-system-x86_64: load of migration failed: Cannot allocate memory
>
> And we can even introspect the early migration data, including the
> bitmap:
>
> $ ./scripts/analyze-migration.py -f STATEFILE
> {
>     "ram (2)": {
>         "section sizes": {
>             ":00:03.0/mem0": "0x00078000",
>             ":00:04.0/mem1": "0x00078000",
>             "pc.ram": "0x0001",
>             "/rom@etc/acpi/tables": "0x0002",
>             "pc.bios": "0x0004",
>             ":00:02.0/e1000.rom": "0x0004",
>             "pc.rom": "0x0002",
>             "/rom@etc/table-loader": "0x1000",
>             "/rom@etc/acpi/rsdp": "0x1000"
>         }
>     },
>     ":00:03.0/virtio-mem-device-early (51)": {
>         "tmp": "00 00 00 01 40 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
>         "size": "0x4000",
>         "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]"
>     },
>     ":00:04.0/virtio-mem-device-early (53)": {
>         "tmp": "00 00 00 08 c0 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
>         "size": "0x0001fa40",
>         "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]"
>     },
>     [...]
>
> Reported-by: Jing Qi
> Reviewed-by: Dr. David Alan Gilbert
> Signed-off-by: David Hildenbrand

Reviewed-by: Juan Quintela
Re: [PATCH 01/14] linux-user/sparc: Raise SIGILL for all unhandled software traps
On Wed, 2023-02-01 at 14:51 -1000, Richard Henderson wrote: > The linux kernel's trap tables vector all unassigned trap > numbers to BAD_TRAP, which then raises SIGILL. > > Reported-by: Ilya Leoshkevich > Signed-off-by: Richard Henderson > --- > linux-user/sparc/cpu_loop.c | 8 > 1 file changed, 8 insertions(+) Thanks! Tested-by: Ilya Leoshkevich
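For readers who don't have the patch body at hand, the change presumably boils down to something like the sketch below; force_sig_fault() and the TARGET_* constants are QEMU's real linux-user API, but the exact shape of the dispatch is an assumption:

    /* linux-user/sparc/cpu_loop.c (sketch): default arm of the software
     * trap dispatch */
    default:
        /* the kernel vectors unassigned trap numbers to BAD_TRAP, which
         * delivers SIGILL with si_code ILL_ILLTRP */
        force_sig_fault(TARGET_SIGILL, TARGET_ILL_ILLTRP, env->pc);
        break;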
[PATCH v2 00/10] TriCore instruction bugfixes
Hi,

while resolving [1], I noticed a few more bugs in DEXTR and LD_BU_PREINC which this patch series fixes. I also included the solo patch [2] into this series.

For v2 I added testcases in tests/tcg/tricore that exercise the errors that these patches fix.

Cheers,
Bastian

[1] https://gitlab.com/qemu-project/qemu/-/issues/653
[2] https://lore.kernel.org/qemu-devel/20230113123759.677960-1-kbast...@mail.uni-paderborn.de/

Bastian Koppelmann (10):
  target/tricore: Fix OPC2_32_RCRW_IMASK translation
  tests/tcg/tricore: Add test for OPC2_32_RCRW_IMASK
  target/tricore: Fix OPC2_32_RCRW_INSERT translation
  tests/tcg/tricore: Add test for OPC2_32_RCRW_INSERT
  target/tricore: Fix RRPW_DEXTR
  tests/tcg/tricore: Add tests for RRPW_DEXTR
  target/tricore: Fix OPC2_32_RRRR_DEXTR
  tests/tcg/tricore: Add OPC2_32_RRRR_DEXTR tests
  target/tricore: Fix OPC2_32_BO_LD_BU_PREINC
  tests/tcg/tricore: Add LD.BU tests

 target/tricore/translate.c                | 39 ++--
 tests/tcg/tricore/Makefile.softmmu-target |  4 ++
 tests/tcg/tricore/macros.h                | 63 +++--
 tests/tcg/tricore/test_dextr.S            | 75 +++
 tests/tcg/tricore/test_imask.S            | 10 +++
 tests/tcg/tricore/test_insert.S           |  9 +++
 tests/tcg/tricore/test_ld_bu.S            | 15 +
 7 files changed, 193 insertions(+), 22 deletions(-)
 create mode 100644 tests/tcg/tricore/test_dextr.S
 create mode 100644 tests/tcg/tricore/test_imask.S
 create mode 100644 tests/tcg/tricore/test_insert.S
 create mode 100644 tests/tcg/tricore/test_ld_bu.S

--
2.39.1
[PATCH v2 04/10] tests/tcg/tricore: Add test for OPC2_32_RCRW_INSERT
DREG_RS2 and DREG_CALC_RESULT were mapped to the same register, which would not trigger https://gitlab.com/qemu-project/qemu/-/issues/653. So let's make each register unique.

Signed-off-by: Bastian Koppelmann
---
 tests/tcg/tricore/Makefile.softmmu-target |  1 +
 tests/tcg/tricore/macros.h                | 16 ++++++++++++----
 tests/tcg/tricore/test_insert.S           |  9 +++++++++
 3 files changed, 22 insertions(+), 4 deletions(-)
 create mode 100644 tests/tcg/tricore/test_insert.S

diff --git a/tests/tcg/tricore/Makefile.softmmu-target b/tests/tcg/tricore/Makefile.softmmu-target
index bc0cfae8d0..afabba8631 100644
--- a/tests/tcg/tricore/Makefile.softmmu-target
+++ b/tests/tcg/tricore/Makefile.softmmu-target
@@ -11,6 +11,7 @@ TESTS += test_fadd.tst
 TESTS += test_fmul.tst
 TESTS += test_ftoi.tst
 TESTS += test_imask.tst
+TESTS += test_insert.tst
 TESTS += test_madd.tst
 TESTS += test_msub.tst
 TESTS += test_muls.tst
diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h
index ceb7e9c0b7..4f2bc3cb0f 100644
--- a/tests/tcg/tricore/macros.h
+++ b/tests/tcg/tricore/macros.h
@@ -9,10 +9,10 @@
 /* Register definitions */
 #define DREG_RS1 %d0
 #define DREG_RS2 %d1
-#define DREG_RS3 %d4
-#define DREG_CALC_RESULT %d1
-#define DREG_CALC_PSW %d2
-#define DREG_CORRECT_PSW %d3
+#define DREG_RS3 %d2
+#define DREG_CALC_RESULT %d3
+#define DREG_CALC_PSW %d4
+#define DREG_CORRECT_PSW %d5
 #define DREG_TEMP_LI %d10
 #define DREG_TEMP %d11
 #define DREG_TEST_NUM %d14
@@ -103,6 +103,14 @@
     insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, imm; \
 )

+#define TEST_D_DIDI(insn, num, result, rs1, imm1, rs2, imm2)  \
+    TEST_CASE(num, DREG_CALC_RESULT, result,                  \
+    LI(DREG_RS1, rs1);                                        \
+    LI(DREG_RS2, rs2);                                        \
+    rstv;                                                     \
+    insn DREG_CALC_RESULT, DREG_RS1, imm1, DREG_RS2, imm2;    \
+    )
+
 #define TEST_E_ED(insn, num, res_hi, res_lo, rs1_hi, rs1_lo, rs2) \
     TEST_CASE_E(num, res_lo, res_hi,                              \
     LI(EREG_RS1_LO, rs1_lo);                                      \
diff --git a/tests/tcg/tricore/test_insert.S b/tests/tcg/tricore/test_insert.S
new file mode 100644
index 00..d5fd2237e1
--- /dev/null
+++ b/tests/tcg/tricore/test_insert.S
@@ -0,0 +1,9 @@
+#include "macros.h"
+.text
+.global _start
+_start:
+#                  insn  num  result  rs1  imm1  rs2   imm2
+#                   |     |     |      |    |     |     |
+TEST_D_DIDI(insert,      1, 0x7fff, 0x, 0xa, 0x10, 0x8)
+
+TEST_PASSFAIL
--
2.39.1
[PATCH v2 03/10] target/tricore: Fix OPC2_32_RCRW_INSERT translation
We were mixing up the "c" and "d" registers: we used "d" as the destination register and "c" as the source. According to the TriCore ISA manual 1.6 vol 2 it is the other way round.

Reviewed-by: Richard Henderson
Signed-off-by: Bastian Koppelmann
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/653
---
 target/tricore/translate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 8de4e56b1f..6149d4f5c0 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -5805,8 +5805,8 @@ static void decode_rcrw_insert(DisasContext *ctx)
         tcg_gen_movi_tl(temp, width);
         tcg_gen_movi_tl(temp2, const4);
-        tcg_gen_andi_tl(temp3, cpu_gpr_d[r4], 0x1f);
-        gen_insert(cpu_gpr_d[r3], cpu_gpr_d[r1], temp2, temp, temp3);
+        tcg_gen_andi_tl(temp3, cpu_gpr_d[r3], 0x1f);
+        gen_insert(cpu_gpr_d[r4], cpu_gpr_d[r1], temp2, temp, temp3);
         tcg_temp_free(temp3);
         break;
--
2.39.1
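Since the fix hinges on which operand is the destination, a plain-C model of the INSERT semantics from the ISA manual may help; this is an illustrative sketch, not QEMU code, and it assumes pos + width <= 32:

    #include <stdint.h>

    /* TriCore INSERT: replace `width` bits of `a` starting at bit `pos`
     * with the low bits of `b`; assumes pos + width <= 32 */
    static uint32_t tricore_insert(uint32_t a, uint32_t b,
                                   unsigned pos, unsigned width)
    {
        uint32_t mask = (width == 32 ? ~0u : ((1u << width) - 1)) << pos;
        return (a & ~mask) | ((b << pos) & mask);
    }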
[PATCH v2 07/10] target/tricore: Fix OPC2_32_RRRR_DEXTR
If cpu_gpr_d[r3] == 0, then we were shifting the lower register to the right by 32, which is undefined behaviour. In this case the TriCore would do nothing and just return the higher register cpu_gpr_d[r1]. We fix that by detecting whether cpu_gpr_d[r3] is zero and clearing the lower register.

Reviewed-by: Richard Henderson
Signed-off-by: Bastian Koppelmann
---
v1 -> v2:
- use tcg_constant_tl(0) for TCGv zero
- add extra line on top of comment

 target/tricore/translate.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 3b4ec530b1..8bf78b46d0 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -8245,10 +8245,19 @@ static void decode_rrrr_extract_insert(DisasContext *ctx)
         if (r1 == r2) {
             tcg_gen_rotl_tl(cpu_gpr_d[r4], cpu_gpr_d[r1], tmp_pos);
         } else {
+            TCGv msw = tcg_temp_new();
+            TCGv zero = tcg_constant_tl(0);
             tcg_gen_shl_tl(tmp_width, cpu_gpr_d[r1], tmp_pos);
-            tcg_gen_subfi_tl(tmp_pos, 32, tmp_pos);
-            tcg_gen_shr_tl(tmp_pos, cpu_gpr_d[r2], tmp_pos);
-            tcg_gen_or_tl(cpu_gpr_d[r4], tmp_width, tmp_pos);
+            tcg_gen_subfi_tl(msw, 32, tmp_pos);
+            tcg_gen_shr_tl(msw, cpu_gpr_d[r2], msw);
+            /*
+             * if pos == 0, then we do cpu_gpr_d[r2] >> 32, which is undefined
+             * behaviour. So check that case here and set the low bits to zero,
+             * which effectively returns cpu_gpr_d[r1]
+             */
+            tcg_gen_movcond_tl(TCG_COND_EQ, msw, tmp_pos, zero, zero, msw);
+            tcg_gen_or_tl(cpu_gpr_d[r4], tmp_width, msw);
+            tcg_temp_free(msw);
         }
         break;
     case OPC2_32_RRRR_EXTR:
--
2.39.1
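To see why the guard is needed, here is a plain-C model of the DEXTR extraction being implemented, for illustration only, with the undefined case called out:

    #include <stdint.h>

    /* DEXTR: upper 32 bits of ({hi,lo} << pos), with pos in [0, 31] */
    static uint32_t tricore_dextr(uint32_t hi, uint32_t lo, unsigned pos)
    {
        if (pos == 0) {
            /* lo >> (32 - pos) would shift by 32: undefined behaviour in
             * C, and equally out of range for the TCG shift ops */
            return hi;
        }
        return (hi << pos) | (lo >> (32 - pos));
    }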
[PATCH v2 09/10] target/tricore: Fix OPC2_32_BO_LD_BU_PREINC
We were sign-extending the result of the load, while the instruction clearly states that the result should be unsigned.

Reviewed-by: Richard Henderson
Signed-off-by: Bastian Koppelmann
---
 target/tricore/translate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 8bf78b46d0..ab386cef50 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -4964,7 +4964,7 @@ static void decode_bo_addrmode_ld_post_pre_base(DisasContext *ctx)
         tcg_gen_addi_tl(cpu_gpr_a[r2], cpu_gpr_a[r2], off10);
         break;
     case OPC2_32_BO_LD_BU_PREINC:
-        gen_ld_preincr(ctx, cpu_gpr_d[r1], cpu_gpr_a[r2], off10, MO_SB);
+        gen_ld_preincr(ctx, cpu_gpr_d[r1], cpu_gpr_a[r2], off10, MO_UB);
         break;
     case OPC2_32_BO_LD_D_SHORTOFF:
         CHECK_REG_PAIR(r1);
--
2.39.1
Re: [PATCH 1/2] tests/avocado: Invert parameter vs. tag precedence during setUp
On 1/20/23 19:14, Fabiano Rosas wrote: We currently never pass parameters to the avocado process via Makefile. To start doing so we need to invert the precedence between command line parameters and tags, otherwise a command line parameter would override values for all the tests, which is unlikely to be a common use-case. A more likely use-case is to force certain values for the tests that have no tags. For instance, if a test has no 'arch' tags and therefore can run for all targets, one could possibly force it to run on a certain target with an arch=foo parameter. This applies to the variables set during setUp(): arch, machine, cpu, distro_name, distro_version. Parameters used directly in tests or read via self.params.get are left unchanged. Signed-off-by: Fabiano Rosas --- Reviewed-by: Daniel Henrique Barboza tests/avocado/avocado_qemu/__init__.py | 32 +++--- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py index 910f3ba1ea..a181cac383 100644 --- a/tests/avocado/avocado_qemu/__init__.py +++ b/tests/avocado/avocado_qemu/__init__.py @@ -240,12 +240,23 @@ def _get_unique_tag_val(self, tag_name): return vals.pop() return None +def _get_prop(self, name): +""" +Infer test properties based on tags. If no tag is present, +look for a command line parameter of the same name. +""" +val = self._get_unique_tag_val(name) +if not val: +# If there's no tag, look for a command line +# parameter. This allows the user to override any defaults +# the caller of this function would choose if we were to +# return None. +val = self.params.get(name) +return val + def setUp(self, bin_prefix): -self.arch = self.params.get('arch', -default=self._get_unique_tag_val('arch')) - -self.cpu = self.params.get('cpu', - default=self._get_unique_tag_val('cpu')) +self.arch = self._get_prop('arch') +self.cpu = self._get_prop('cpu') default_qemu_bin = pick_default_qemu_bin(bin_prefix, arch=self.arch) self.qemu_bin = self.params.get('qemu_bin', @@ -274,8 +285,7 @@ def setUp(self): super().setUp('qemu-system-') -self.machine = self.params.get('machine', - default=self._get_unique_tag_val('machine')) +self.machine = self._get_prop('machine') def require_accelerator(self, accelerator): """ @@ -529,15 +539,11 @@ class LinuxTest(LinuxSSHMixIn, QemuSystemTest): memory = '1024' def _set_distro(self): -distro_name = self.params.get( -'distro', -default=self._get_unique_tag_val('distro')) +distro_name = self._get_prop('distro') if not distro_name: distro_name = 'fedora' -distro_version = self.params.get( -'distro_version', -default=self._get_unique_tag_val('distro_version')) +distro_version = self._get_prop('distro_version') if not distro_version: distro_version = '31'
[PATCH v2 08/10] tests/tcg/tricore: Add OPC2_32_RRRR_DEXTR tests
Signed-off-by: Bastian Koppelmann --- tests/tcg/tricore/macros.h | 9 + tests/tcg/tricore/test_dextr.S | 35 ++ 2 files changed, 44 insertions(+) diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h index 8bc0faf1e4..06bdbf83cb 100644 --- a/tests/tcg/tricore/macros.h +++ b/tests/tcg/tricore/macros.h @@ -78,6 +78,15 @@ test_ ## num: \ insn DREG_CORRECT_RESULT, DREG_RS1; \ ) +#define TEST_D_DDD(insn, num, result, rs1, rs2, rs3)\ +TEST_CASE(num, DREG_CALC_RESULT, result,\ +LI(DREG_RS1, rs1); \ +LI(DREG_RS2, rs2); \ +LI(DREG_RS3, rs3); \ +rstv; \ +insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, DREG_RS3; \ +) + #define TEST_D_DD_PSW(insn, num, result, psw, rs1, rs2) \ TEST_CASE_PSW(num, DREG_CALC_RESULT, result, psw, \ LI(DREG_RS1, rs1); \ diff --git a/tests/tcg/tricore/test_dextr.S b/tests/tcg/tricore/test_dextr.S index c8a9fc453a..82c8fe5185 100644 --- a/tests/tcg/tricore/test_dextr.S +++ b/tests/tcg/tricore/test_dextr.S @@ -37,4 +37,39 @@ _start: TEST_D_DDI(dextr, 31, 0x48d159e2, 0xabcdef01, 0x23456789, 30) TEST_D_DDI(dextr, 32, 0x91a2b3c4, 0xabcdef01, 0x23456789, 31) +# insn numresult rs1 rs2rs3 +#| | | || | +TEST_D_DDD(dextr, 33, 0xabcdef01, 0xabcdef01, 0x23456789, 0) +TEST_D_DDD(dextr, 34, 0x579bde02, 0xabcdef01, 0x23456789, 1) +TEST_D_DDD(dextr, 35, 0xaf37bc04, 0xabcdef01, 0x23456789, 2) +TEST_D_DDD(dextr, 36, 0x5e6f7809, 0xabcdef01, 0x23456789, 3) +TEST_D_DDD(dextr, 37, 0xbcdef012, 0xabcdef01, 0x23456789, 4) +TEST_D_DDD(dextr, 38, 0x79bde024, 0xabcdef01, 0x23456789, 5) +TEST_D_DDD(dextr, 39, 0xf37bc048, 0xabcdef01, 0x23456789, 6) +TEST_D_DDD(dextr, 40, 0xe6f78091, 0xabcdef01, 0x23456789, 7) +TEST_D_DDD(dextr, 41, 0xcdef0123, 0xabcdef01, 0x23456789, 8) +TEST_D_DDD(dextr, 42, 0x9bde0246, 0xabcdef01, 0x23456789, 9) +TEST_D_DDD(dextr, 43, 0x37bc048d, 0xabcdef01, 0x23456789, 10) +TEST_D_DDD(dextr, 44, 0x6f78091a, 0xabcdef01, 0x23456789, 11) +TEST_D_DDD(dextr, 45, 0xdef01234, 0xabcdef01, 0x23456789, 12) +TEST_D_DDD(dextr, 46, 0xbde02468, 0xabcdef01, 0x23456789, 13) +TEST_D_DDD(dextr, 47, 0x7bc048d1, 0xabcdef01, 0x23456789, 14) +TEST_D_DDD(dextr, 48, 0xf78091a2, 0xabcdef01, 0x23456789, 15) +TEST_D_DDD(dextr, 49, 0xef012345, 0xabcdef01, 0x23456789, 16) +TEST_D_DDD(dextr, 51, 0xde02468a, 0xabcdef01, 0x23456789, 17) +TEST_D_DDD(dextr, 52, 0xbc048d15, 0xabcdef01, 0x23456789, 18) +TEST_D_DDD(dextr, 53, 0x78091a2b, 0xabcdef01, 0x23456789, 19) +TEST_D_DDD(dextr, 54, 0xf0123456, 0xabcdef01, 0x23456789, 20) +TEST_D_DDD(dextr, 55, 0xe02468ac, 0xabcdef01, 0x23456789, 21) +TEST_D_DDD(dextr, 56, 0xc048d159, 0xabcdef01, 0x23456789, 22) +TEST_D_DDD(dextr, 57, 0x8091a2b3, 0xabcdef01, 0x23456789, 23) +TEST_D_DDD(dextr, 58, 0x01234567, 0xabcdef01, 0x23456789, 24) +TEST_D_DDD(dextr, 59, 0x02468acf, 0xabcdef01, 0x23456789, 25) +TEST_D_DDD(dextr, 60, 0x048d159e, 0xabcdef01, 0x23456789, 26) +TEST_D_DDD(dextr, 61, 0x091a2b3c, 0xabcdef01, 0x23456789, 27) +TEST_D_DDD(dextr, 62, 0x12345678, 0xabcdef01, 0x23456789, 28) +TEST_D_DDD(dextr, 63, 0x2468acf1, 0xabcdef01, 0x23456789, 29) +TEST_D_DDD(dextr, 64, 0x48d159e2, 0xabcdef01, 0x23456789, 30) +TEST_D_DDD(dextr, 65, 0x91a2b3c4, 0xabcdef01, 0x23456789, 31) + TEST_PASSFAIL -- 2.39.1
[PATCH v2 06/10] tests/tcg/tricore: Add tests for RRPW_DEXTR
Signed-off-by: Bastian Koppelmann --- tests/tcg/tricore/Makefile.softmmu-target | 1 + tests/tcg/tricore/macros.h| 8 + tests/tcg/tricore/test_dextr.S| 40 +++ 3 files changed, 49 insertions(+) create mode 100644 tests/tcg/tricore/test_dextr.S diff --git a/tests/tcg/tricore/Makefile.softmmu-target b/tests/tcg/tricore/Makefile.softmmu-target index afabba8631..e83cc4b7cd 100644 --- a/tests/tcg/tricore/Makefile.softmmu-target +++ b/tests/tcg/tricore/Makefile.softmmu-target @@ -6,6 +6,7 @@ ASFLAGS = TESTS += test_abs.tst TESTS += test_bmerge.tst TESTS += test_clz.tst +TESTS += test_dextr.tst TESTS += test_dvstep.tst TESTS += test_fadd.tst TESTS += test_fmul.tst diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h index 4f2bc3cb0f..8bc0faf1e4 100644 --- a/tests/tcg/tricore/macros.h +++ b/tests/tcg/tricore/macros.h @@ -95,6 +95,14 @@ test_ ## num: \ insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, DREG_RS3; \ ) +#define TEST_D_DDI(insn, num, result, rs1, rs2, imm) \ +TEST_CASE(num, DREG_CALC_RESULT, result, \ +LI(DREG_RS1, rs1); \ +LI(DREG_RS2, rs2); \ +rstv;\ +insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, imm; \ +) + #define TEST_D_DDI_PSW(insn, num, result, psw, rs1, rs2, imm) \ TEST_CASE_PSW(num, DREG_CALC_RESULT, result, psw, \ LI(DREG_RS1, rs1);\ diff --git a/tests/tcg/tricore/test_dextr.S b/tests/tcg/tricore/test_dextr.S new file mode 100644 index 00..c8a9fc453a --- /dev/null +++ b/tests/tcg/tricore/test_dextr.S @@ -0,0 +1,40 @@ +#include "macros.h" +.text +.global _start +_start: +# insn numresult rs1 rs2imm +#| | | || | +TEST_D_DDI(dextr, 1, 0xabcdef01, 0xabcdef01, 0x23456789, 0) +TEST_D_DDI(dextr, 2, 0x579bde02, 0xabcdef01, 0x23456789, 1) +TEST_D_DDI(dextr, 3, 0xaf37bc04, 0xabcdef01, 0x23456789, 2) +TEST_D_DDI(dextr, 4, 0x5e6f7809, 0xabcdef01, 0x23456789, 3) +TEST_D_DDI(dextr, 5, 0xbcdef012, 0xabcdef01, 0x23456789, 4) +TEST_D_DDI(dextr, 6, 0x79bde024, 0xabcdef01, 0x23456789, 5) +TEST_D_DDI(dextr, 7, 0xf37bc048, 0xabcdef01, 0x23456789, 6) +TEST_D_DDI(dextr, 8, 0xe6f78091, 0xabcdef01, 0x23456789, 7) +TEST_D_DDI(dextr, 9, 0xcdef0123, 0xabcdef01, 0x23456789, 8) +TEST_D_DDI(dextr, 10, 0x9bde0246, 0xabcdef01, 0x23456789, 9) +TEST_D_DDI(dextr, 11, 0x37bc048d, 0xabcdef01, 0x23456789, 10) +TEST_D_DDI(dextr, 12, 0x6f78091a, 0xabcdef01, 0x23456789, 11) +TEST_D_DDI(dextr, 13, 0xdef01234, 0xabcdef01, 0x23456789, 12) +TEST_D_DDI(dextr, 14, 0xbde02468, 0xabcdef01, 0x23456789, 13) +TEST_D_DDI(dextr, 15, 0x7bc048d1, 0xabcdef01, 0x23456789, 14) +TEST_D_DDI(dextr, 16, 0xf78091a2, 0xabcdef01, 0x23456789, 15) +TEST_D_DDI(dextr, 17, 0xef012345, 0xabcdef01, 0x23456789, 16) +TEST_D_DDI(dextr, 18, 0xde02468a, 0xabcdef01, 0x23456789, 17) +TEST_D_DDI(dextr, 19, 0xbc048d15, 0xabcdef01, 0x23456789, 18) +TEST_D_DDI(dextr, 20, 0x78091a2b, 0xabcdef01, 0x23456789, 19) +TEST_D_DDI(dextr, 21, 0xf0123456, 0xabcdef01, 0x23456789, 20) +TEST_D_DDI(dextr, 22, 0xe02468ac, 0xabcdef01, 0x23456789, 21) +TEST_D_DDI(dextr, 23, 0xc048d159, 0xabcdef01, 0x23456789, 22) +TEST_D_DDI(dextr, 24, 0x8091a2b3, 0xabcdef01, 0x23456789, 23) +TEST_D_DDI(dextr, 25, 0x01234567, 0xabcdef01, 0x23456789, 24) +TEST_D_DDI(dextr, 26, 0x02468acf, 0xabcdef01, 0x23456789, 25) +TEST_D_DDI(dextr, 27, 0x048d159e, 0xabcdef01, 0x23456789, 26) +TEST_D_DDI(dextr, 28, 0x091a2b3c, 0xabcdef01, 0x23456789, 27) +TEST_D_DDI(dextr, 29, 0x12345678, 0xabcdef01, 0x23456789, 28) +TEST_D_DDI(dextr, 30, 0x2468acf1, 0xabcdef01, 0x23456789, 29) +TEST_D_DDI(dextr, 31, 0x48d159e2, 0xabcdef01, 0x23456789, 30) +TEST_D_DDI(dextr, 32, 0x91a2b3c4, 0xabcdef01, 0x23456789, 31) + 
+TEST_PASSFAIL -- 2.39.1
[PATCH v2 01/10] target/tricore: Fix OPC2_32_RCRW_IMASK translation
We were mixing up the "c" and "d" registers: we used "d" as the destination register and "c" as the source. According to the TriCore ISA manual 1.6 vol 2 it is the other way round.

Reviewed-by: Richard Henderson
Signed-off-by: Bastian Koppelmann
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/653
---
 target/tricore/translate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index df9e46c649..8de4e56b1f 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -5794,11 +5794,11 @@ static void decode_rcrw_insert(DisasContext *ctx)
     switch (op2) {
     case OPC2_32_RCRW_IMASK:
-        tcg_gen_andi_tl(temp, cpu_gpr_d[r4], 0x1f);
+        tcg_gen_andi_tl(temp, cpu_gpr_d[r3], 0x1f);
         tcg_gen_movi_tl(temp2, (1 << width) - 1);
-        tcg_gen_shl_tl(cpu_gpr_d[r3 + 1], temp2, temp);
+        tcg_gen_shl_tl(cpu_gpr_d[r4 + 1], temp2, temp);
         tcg_gen_movi_tl(temp2, const4);
-        tcg_gen_shl_tl(cpu_gpr_d[r3], temp2, temp);
+        tcg_gen_shl_tl(cpu_gpr_d[r4], temp2, temp);
         break;
     case OPC2_32_RCRW_INSERT:
         temp3 = tcg_temp_new();
--
2.39.1
[PATCH v2 05/10] target/tricore: Fix RRPW_DEXTR
If we used const16 == 0, we would crash qemu with the error:

    ../tcg/tcg-op.c:196: tcg_gen_shri_i32: Assertion `arg2 >= 0 && arg2 < 32' failed

The whole instruction can be handled by 'tcg_gen_extract2_tl', which takes care of this special case as well.

Signed-off-by: Bastian Koppelmann
---
v1 -> v2: Use tcg_gen_extract2_tl and remove all special cases.

 target/tricore/translate.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 6149d4f5c0..3b4ec530b1 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -8706,15 +8706,9 @@ static void decode_32Bit_opc(DisasContext *ctx)
         r2 = MASK_OP_RRPW_S2(ctx->opcode);
         r3 = MASK_OP_RRPW_D(ctx->opcode);
         const16 = MASK_OP_RRPW_POS(ctx->opcode);
-        if (r1 == r2) {
-            tcg_gen_rotli_tl(cpu_gpr_d[r3], cpu_gpr_d[r1], const16);
-        } else {
-            temp = tcg_temp_new();
-            tcg_gen_shli_tl(temp, cpu_gpr_d[r1], const16);
-            tcg_gen_shri_tl(cpu_gpr_d[r3], cpu_gpr_d[r2], 32 - const16);
-            tcg_gen_or_tl(cpu_gpr_d[r3], cpu_gpr_d[r3], temp);
-            tcg_temp_free(temp);
-        }
+
+        tcg_gen_extract2_tl(cpu_gpr_d[r3], cpu_gpr_d[r2], cpu_gpr_d[r1],
+                            32 - const16);
         break;
     /* RRR Format */
     case OPCM_32_RRR_COND_SELECT:
--
2.39.1
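For reference, tcg_gen_extract2_tl(ret, al, ah, ofs) produces 32 bits starting at bit ofs of the 64-bit concatenation ah:al, and ofs may be anywhere in [0, 32], so const16 == 0 (ofs == 32) is well defined. Here is a plain-C model of what it computes, for illustration:

    #include <stdint.h>

    /* model of tcg_gen_extract2 on 32-bit values; ofs in [0, 32] */
    static uint32_t extract2_model(uint32_t al, uint32_t ah, unsigned ofs)
    {
        uint64_t cat = ((uint64_t)ah << 32) | al;
        return (uint32_t)(cat >> ofs); /* 64-bit shift: no UB at ofs == 32 */
    }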
[PATCH v2 10/10] tests/tcg/tricore: Add LD.BU tests
Signed-off-by: Bastian Koppelmann --- tests/tcg/tricore/Makefile.softmmu-target | 1 + tests/tcg/tricore/macros.h| 23 +++ tests/tcg/tricore/test_ld_bu.S| 15 +++ 3 files changed, 39 insertions(+) create mode 100644 tests/tcg/tricore/test_ld_bu.S diff --git a/tests/tcg/tricore/Makefile.softmmu-target b/tests/tcg/tricore/Makefile.softmmu-target index e83cc4b7cd..b6c19dbecd 100644 --- a/tests/tcg/tricore/Makefile.softmmu-target +++ b/tests/tcg/tricore/Makefile.softmmu-target @@ -13,6 +13,7 @@ TESTS += test_fmul.tst TESTS += test_ftoi.tst TESTS += test_imask.tst TESTS += test_insert.tst +TESTS += test_ld_bu.tst TESTS += test_madd.tst TESTS += test_msub.tst TESTS += test_muls.tst diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h index 06bdbf83cb..109ef62a4d 100644 --- a/tests/tcg/tricore/macros.h +++ b/tests/tcg/tricore/macros.h @@ -4,6 +4,10 @@ movh DREG_TEMP_LI, up:val; \ or reg, reg, DREG_TEMP_LI; \ +#define LIA(reg, val)\ +LI(DREG_TEMP, val) \ +mov.a reg, DREG_TEMP; + /* Address definitions */ #define TESTDEV_ADDR 0xf000 /* Register definitions */ @@ -18,6 +22,10 @@ #define DREG_TEST_NUM %d14 #define DREG_CORRECT_RESULT %d15 +#define AREG_ADDR %a0 +#define AREG_CORRECT_RESULT %a3 +#define MEM_BASE_ADDR 0xd000 + #define DREG_DEV_ADDR %a15 #define EREG_RS1 %e6 @@ -60,11 +68,24 @@ test_ ## num: \ mov DREG_TEST_NUM, num;\ jne DREG_CALC_PSW, DREG_CORRECT_PSW, fail; +#define TEST_LD(insn, num, result, addr_result, ld_pattern) \ +test_ ## num: \ +LIA(AREG_ADDR, test_data) \ +insn DREG_CALC_RESULT, ld_pattern; \ +LI(DREG_CORRECT_RESULT, result) \ +mov DREG_TEST_NUM, num; \ +jne DREG_CALC_RESULT, DREG_CORRECT_RESULT, fail;\ +mov.d DREG_CALC_RESULT, AREG_ADDR; \ +LI(DREG_CORRECT_RESULT, addr_result)\ +jne DREG_CALC_RESULT, DREG_CORRECT_RESULT, fail; + /* Actual test case type * e.g inst %dX, %dY -> TEST_D_D * inst %dX, %dY, %dZ -> TEST_D_DD * inst %eX, %dY, %dZ -> TEST_E_DD */ + + #define TEST_D_D(insn, num, result, rs1) \ TEST_CASE(num, DREG_CALC_RESULT, result, \ LI(DREG_RS1, rs1);\ @@ -143,6 +164,8 @@ test_ ## num: \ insn EREG_CALC_RESULT, imm1, DREG_RS1, imm2); \ ) + + /* Pass/Fail handling part */ #define TEST_PASSFAIL \ j pass; \ diff --git a/tests/tcg/tricore/test_ld_bu.S b/tests/tcg/tricore/test_ld_bu.S new file mode 100644 index 00..ff9dac128b --- /dev/null +++ b/tests/tcg/tricore/test_ld_bu.S @@ -0,0 +1,15 @@ +#include "macros.h" +.data +test_data: +.word 0xaffedead +.word 0x001122ff +.text +.global _start +_start: +#expect. addr reg val after load +# insn num expect. load value | pattern for loading +# || || | +TEST_LD(ld.bu, 1, 0xff, MEM_BASE_ADDR + 4, [+AREG_ADDR]4) # pre_inc +TEST_LD(ld.bu, 2, 0xad, MEM_BASE_ADDR + 4, [AREG_ADDR+]4) # post_inc + +TEST_PASSFAIL -- 2.39.1
Re: [PATCH 2/2] tests/avocado: Allow passing command line parameters via Makefile
On 1/20/23 19:15, Fabiano Rosas wrote: Add support for the 'avocado run' "-p" option, which allows us to pass parameters in the form key=value to be applied to all tests selected for a given run. This is useful to force generic tests to use a specific machine, cpu or qemu-binary where otherwise the defaults would be used. E.g.: $ make check-avocado AVOCADO_PARAMS="machine=virt arch=riscv64" Signed-off-by: Fabiano Rosas --- Reviewed-by: Daniel Henrique Barboza tests/Makefile.include | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tests/Makefile.include b/tests/Makefile.include index 9422ddaece..f92e730aa0 100644 --- a/tests/Makefile.include +++ b/tests/Makefile.include @@ -107,6 +107,10 @@ else AVOCADO_CMDLINE_TAGS=$(addprefix -t , $(AVOCADO_TAGS)) endif +ifdef AVOCADO_PARAMS + AVOCADO_CMDLINE_PARAMS=$(addprefix -p , $(AVOCADO_PARAMS)) +endif + quiet-venv-pip = $(quiet-@)$(call quiet-command-run, \ $(TESTS_PYTHON) -m pip -q --disable-pip-version-check $1, \ "VENVPIP","$1") @@ -144,7 +148,7 @@ check-avocado: check-venv $(TESTS_RESULTS_DIR) get-vm-images --show=$(AVOCADO_SHOW) run --job-results-dir=$(TESTS_RESULTS_DIR) \ $(if $(AVOCADO_TAGS),, --filter-by-tags-include-empty \ --filter-by-tags-include-empty-key) \ -$(AVOCADO_CMDLINE_TAGS) \ +$(AVOCADO_CMDLINE_TAGS) $(AVOCADO_CMDLINE_PARAMS) \ $(if $(GITLAB_CI),,--failfast) $(AVOCADO_TESTS), \ "AVOCADO", "tests/avocado")
[PATCH v2 02/10] tests/tcg/tricore: Add test for OPC2_32_RCRW_IMASK
Signed-off-by: Bastian Koppelmann --- tests/tcg/tricore/Makefile.softmmu-target | 1 + tests/tcg/tricore/macros.h| 7 +++ tests/tcg/tricore/test_imask.S| 10 ++ 3 files changed, 18 insertions(+) create mode 100644 tests/tcg/tricore/test_imask.S diff --git a/tests/tcg/tricore/Makefile.softmmu-target b/tests/tcg/tricore/Makefile.softmmu-target index 5007c60ce8..bc0cfae8d0 100644 --- a/tests/tcg/tricore/Makefile.softmmu-target +++ b/tests/tcg/tricore/Makefile.softmmu-target @@ -10,6 +10,7 @@ TESTS += test_dvstep.tst TESTS += test_fadd.tst TESTS += test_fmul.tst TESTS += test_ftoi.tst +TESTS += test_imask.tst TESTS += test_madd.tst TESTS += test_msub.tst TESTS += test_muls.tst diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h index 0d76fc403a..ceb7e9c0b7 100644 --- a/tests/tcg/tricore/macros.h +++ b/tests/tcg/tricore/macros.h @@ -111,6 +111,13 @@ test_ ## num: \ insn EREG_CALC_RESULT, EREG_RS1, DREG_RS2;\ ) +#define TEST_E_IDI(insn, num, res_hi, res_lo, imm1, rs1, imm2) \ +TEST_CASE_E(num, res_lo, res_hi, \ +LI(DREG_RS1, rs1); \ +rstv; \ +insn EREG_CALC_RESULT, imm1, DREG_RS1, imm2); \ +) + /* Pass/Fail handling part */ #define TEST_PASSFAIL \ j pass; \ diff --git a/tests/tcg/tricore/test_imask.S b/tests/tcg/tricore/test_imask.S new file mode 100644 index 00..356cf398b8 --- /dev/null +++ b/tests/tcg/tricore/test_imask.S @@ -0,0 +1,10 @@ +#include "macros.h" +.text +.global _start +_start: +# res[31:0] +# insn num res[63:32] |imm1 rs1 imm2 +#|| || || | +TEST_E_IDI(imask, 1, 0x000f, 0x0005, 0x5, 0x10, 0x4) + +TEST_PASSFAIL -- 2.39.1
Lost partition tables on ide-hd + ahci drive
Hi,

over the years we've got 1-2 dozen reports[0] about suddenly missing/corrupted MBR/partition tables. The issue seems to be very rare and attempts to reproduce it have not succeeded yet. I'm asking here in the hope that somebody has seen something similar.

The only commonality seems to be the use of an ide-hd drive with an ahci bus. It does seem to happen with both Linux and Windows guests (one of the reports even mentions FreeBSD), and backing storages for the VMs include ZFS, RBD, LVM-Thin as well as file-based storages. Relevant part of an example configuration:

> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \

The first reports are from before io_uring was used, and there are also reports with writeback cache mode and discard=on,detect-zeroes=unmap. Some reports say that the issue occurred under high IO load. Many reports suspect backups causing the issue. Our backup mechanism uses backup_job_create() for each drive and runs the jobs sequentially. It uses a custom block driver as the backup target which just forwards the writes to the actual target, which can be a file or our backup server. (If you really want to see the details, apply the patches in [1] and see pve-backup.c and block/backup-dump.c).

Of course, the backup job will read sector 0 of the source disk, but I really can't see where a stray write would happen, why the issue would trigger so rarely, or why seemingly only ide-hd+ahci would be affected.

So again, just asking if somebody has seen something similar or has a hunch of what the cause might be.

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
[1]: https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD
Re: [PATCH v6 1/2] io: Add support for MSG_PEEK for socket channel
"manish.mishra" wrote: > MSG_PEEK peeks at the channel, The data is treated as unread and > the next read shall still return this data. This support is > currently added only for socket class. Extra parameter 'flags' > is added to io_readv calls to pass extra read flags like MSG_PEEK. > > Reviewed-by: Peter Xu > Reviewed-by: Daniel P. Berrange > Suggested-by: Daniel P. Berrange > Signed-off-by: manish.mishra This change breaks RDMA migration. FAILED: libcommon.fa.p/migration_rdma.c.o cc -m64 -mcx16 -Ilibcommon.fa.p -I/usr/include/pixman-1 -I/usr/include/libpng16 -I/usr/include/spice-server -I/usr/include/spice-1 -I/usr/include/cacard -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/sysprof-4 -I/usr/include/nss3 -I/usr/include/nspr4 -I/usr/include/PCSC -I/usr/include/p11-kit-1 -I/usr/include/libusb-1.0 -I/usr/include/SDL2 -I/usr/include/libmount -I/usr/include/blkid -I/usr/include/gio-unix-2.0 -I/usr/include/slirp -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /mnt/code/qemu/full/linux-headers -isystem linux-headers -iquote . -iquote /mnt/code/qemu/full -iquote /mnt/code/qemu/full/include -iquote /mnt/code/qemu/full/tcg/i386 -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -D_REENTRANT -Wno-undef -DSTRUCT_IOVEC_DEFINED -MD -MQ libcommon.fa.p/migration_rdma.c.o -MF libcommon.fa.p/migration_rdma.c.o.d -o libcommon.fa.p/migration_rdma.c.o -c ../../../../mnt/code/qemu/full/migration/rdma.c ../../../../mnt/code/qemu/full/migration/rdma.c: In function ‘qio_channel_rdma_class_init’: ../../../../mnt/code/qemu/full/migration/rdma.c:4020:25: error: assignment to ‘ssize_t (*)(QIOChannel *, const struct iovec *, size_t, int **, size_t *, int, Error **)’ {aka ‘long int (*)(QIOChannel *, const struct iovec *, long unsigned int, int **, long unsigned int *, int, Error **)’} from incompatible pointer type ‘ssize_t (*)(QIOChannel *, const struct iovec *, size_t, int **, size_t *, Error **)’ {aka ‘long int (*)(QIOChannel *, const struct iovec *, long unsigned int, int **, long unsigned int *, Error **)’} [-Werror=incompatible-pointer-types] 4020 | ioc_klass->io_readv = qio_channel_rdma_readv; | ^ cc1: all warnings being treated as errors And I don't really know how to fix it, because the problem is that rdma don't use qio_channel_readv_full() at all. Sorry, I have to drop the series until a fix is found. Later, Juan.
Re: [PATCH v6 1/2] io: Add support for MSG_PEEK for socket channel
On Thu, Feb 02, 2023 at 01:22:12PM +0100, Juan Quintela wrote:
> "manish.mishra" wrote:
> > MSG_PEEK peeks at the channel. The data is treated as unread and
> > the next read shall still return this data. This support is
> > currently added only for socket class. Extra parameter 'flags'
> > is added to io_readv calls to pass extra read flags like MSG_PEEK.
> >
> > Reviewed-by: Peter Xu
> > Reviewed-by: Daniel P. Berrange
> > Suggested-by: Daniel P. Berrange
> > Signed-off-by: manish.mishra
>
> This change breaks RDMA migration.
>
> FAILED: libcommon.fa.p/migration_rdma.c.o
> cc -m64 -mcx16 -Ilibcommon.fa.p -I/usr/include/pixman-1 -I/usr/include/libpng16 -I/usr/include/spice-server -I/usr/include/spice-1 -I/usr/include/cacard -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/sysprof-4 -I/usr/include/nss3 -I/usr/include/nspr4 -I/usr/include/PCSC -I/usr/include/p11-kit-1 -I/usr/include/libusb-1.0 -I/usr/include/SDL2 -I/usr/include/libmount -I/usr/include/blkid -I/usr/include/gio-unix-2.0 -I/usr/include/slirp -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /mnt/code/qemu/full/linux-headers -isystem linux-headers -iquote . -iquote /mnt/code/qemu/full -iquote /mnt/code/qemu/full/include -iquote /mnt/code/qemu/full/tcg/i386 -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -D_REENTRANT -Wno-undef -DSTRUCT_IOVEC_DEFINED -MD -MQ libcommon.fa.p/migration_rdma.c.o -MF libcommon.fa.p/migration_rdma.c.o.d -o libcommon.fa.p/migration_rdma.c.o -c ../../../../mnt/code/qemu/full/migration/rdma.c
> ../../../../mnt/code/qemu/full/migration/rdma.c: In function ‘qio_channel_rdma_class_init’:
> ../../../../mnt/code/qemu/full/migration/rdma.c:4020:25: error: assignment to ‘ssize_t (*)(QIOChannel *, const struct iovec *, size_t, int **, size_t *, int, Error **)’ {aka ‘long int (*)(QIOChannel *, const struct iovec *, long unsigned int, int **, long unsigned int *, int, Error **)’} from incompatible pointer type ‘ssize_t (*)(QIOChannel *, const struct iovec *, size_t, int **, size_t *, Error **)’ {aka ‘long int (*)(QIOChannel *, const struct iovec *, long unsigned int, int **, long unsigned int *, Error **)’} [-Werror=incompatible-pointer-types]
> 4020 | ioc_klass->io_readv = qio_channel_rdma_readv;
>      | ^
> cc1: all warnings being treated as errors
>
> And I don't really know how to fix it, because the problem is that rdma
> doesn't use qio_channel_readv_full() at all.

Likely qio_channel_rdma_readv just needs the 'int flags' param added. It doesn't need to actually do anything with the flags, as they are checked before they reach it.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
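Concretely, the fix would presumably be the one-line signature change sketched below in migration/rdma.c; the parameter list is taken from the io_readv callback type in the compiler error above, and as Daniel notes the new flag can simply be ignored, since the generic code validates flags before the backend is called:

    /* migration/rdma.c (sketch): add the new parameter, body unchanged */
    static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
                                          const struct iovec *iov,
                                          size_t niov,
                                          int **fds,
                                          size_t *nfds,
                                          int flags, /* new, ignored: RDMA
                                                        has no MSG_PEEK */
                                          Error **errp);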