date:20230202

Based-on: <20230201042615.34706-1-akihiko.od...@daynix.com>
([PATCH v7 0/9] Introduce igb)

Rebased on latest changes from Akihiko, and merged changes from my
original patchset:
https://lists.gnu.org/archive/html/qemu-devel/2022-12/msg04670.html

Changes since v5:
- Added back an unecessarily removed empty line

Changes since v4:
- Removed the change implementing VTCTL.IGMAC, it needs more thought
  and implementation of DTXSWC.LLE and VLVF.LVLAN first

Changes since v3:
- Fix comments
- Rebased on latest patchset from Akihiko
- Remove Rx loop improvements that Akihiko has pulled into his patchset

Changes since v2:
- Fixed more comments from Akhiko
- Reordered the patches to make changes easier to understand

Changes since v1:
- Fix review comments from Akihiko

Sriram Yagnaraman (8):
  MAINTAINERS: Add Sriram Yagnaraman as a igb reviewer
  igb: handle PF/VF reset properly
  igb: add ICR_RXDW
  igb: implement VFRE and VFTE registers
  igb: check oversized packets for VMDq
  igb: respect E1000_VMOLR_RSSE
  igb: implement VF Tx and Rx stats
  igb: respect VMVIR and VMOLR for VLAN

 MAINTAINERS  |   1 +
 hw/net/e1000x_regs.h |   4 +
 hw/net/igb_core.c| 199 ++-
 hw/net/igb_core.h|   1 +
 hw/net/igb_regs.h|   6 ++
 hw/net/trace-events  |   2 +
 6 files changed, 175 insertions(+), 38 deletions(-)

-- 
2.34.1

[PATCH v6 1/8] MAINTAINERS: Add Sriram Yagnaraman as a igb reviewer

I would like to review and be informed on changes to igb device

Signed-off-by: Sriram Yagnaraman 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ece23b2b15..7d0e84ce37 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2224,6 +2224,7 @@ F: tests/qtest/libqos/e1000e.*
 
 igb
 M: Akihiko Odaki 
+R: Sriram Yagnaraman 
 S: Maintained
 F: docs/system/devices/igb.rst
 F: hw/net/igb*
-- 
2.34.1

[PATCH v6 2/8] igb: handle PF/VF reset properly

Use PFRSTD to reset RSTI bit for VFs, and raise VFLRE interrupt when VF
is reset.

Signed-off-by: Sriram Yagnaraman 
---
 hw/net/igb_core.c   | 33 +
 hw/net/igb_regs.h   |  3 +++
 hw/net/trace-events |  2 ++
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index cb3e2d0be3..b484e6ac30 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -1887,14 +1887,6 @@ static void igb_set_eims(IGBCore *core, int index, 
uint32_t val)
 igb_update_interrupt_state(core);
 }
 
-static void igb_vf_reset(IGBCore *core, uint16_t vfn)
-{
-/* TODO: Reset of the queue enable and the interrupt registers of the VF. 
*/
-
-core->mac[V2PMAILBOX0 + vfn] &= ~E1000_V2PMAILBOX_RSTI;
-core->mac[V2PMAILBOX0 + vfn] = E1000_V2PMAILBOX_RSTD;
-}
-
 static void mailbox_interrupt_to_vf(IGBCore *core, uint16_t vfn)
 {
 uint32_t ent = core->mac[VTIVAR_MISC + vfn];
@@ -1972,6 +1964,17 @@ static void igb_set_vfmailbox(IGBCore *core, int index, 
uint32_t val)
 }
 }
 
+static void igb_vf_reset(IGBCore *core, uint16_t vfn)
+{
+/* disable Rx and Tx for the VF*/
+core->mac[VFTE] &= ~BIT(vfn);
+core->mac[VFRE] &= ~BIT(vfn);
+/* indicate VF reset to PF */
+core->mac[VFLRE] |= BIT(vfn);
+/* VFLRE and mailbox use the same interrupt cause */
+mailbox_interrupt_to_pf(core);
+}
+
 static void igb_w1c(IGBCore *core, int index, uint32_t val)
 {
 core->mac[index] &= ~val;
@@ -2226,14 +2229,20 @@ igb_set_status(IGBCore *core, int index, uint32_t val)
 static void
 igb_set_ctrlext(IGBCore *core, int index, uint32_t val)
 {
-trace_e1000e_link_set_ext_params(!!(val & E1000_CTRL_EXT_ASDCHK),
- !!(val & E1000_CTRL_EXT_SPD_BYPS));
-
-/* TODO: PFRSTD */
+trace_igb_link_set_ext_params(!!(val & E1000_CTRL_EXT_ASDCHK),
+  !!(val & E1000_CTRL_EXT_SPD_BYPS),
+  !!(val & E1000_CTRL_EXT_PFRSTD));
 
 /* Zero self-clearing bits */
 val &= ~(E1000_CTRL_EXT_ASDCHK | E1000_CTRL_EXT_EE_RST);
 core->mac[CTRL_EXT] = val;
+
+if (core->mac[CTRL_EXT] & E1000_CTRL_EXT_PFRSTD) {
+for (int vfn = 0; vfn < IGB_MAX_VF_FUNCTIONS; vfn++) {
+core->mac[V2PMAILBOX0 + vfn] &= ~E1000_V2PMAILBOX_RSTI;
+core->mac[V2PMAILBOX0 + vfn] |= E1000_V2PMAILBOX_RSTD;
+}
+}
 }
 
 static void
diff --git a/hw/net/igb_regs.h b/hw/net/igb_regs.h
index ebf3e95023..ddc0f931d6 100644
--- a/hw/net/igb_regs.h
+++ b/hw/net/igb_regs.h
@@ -240,6 +240,9 @@ union e1000_adv_rx_desc {
 
 /* from igb/e1000_defines.h */
 
+/* Physical Func Reset Done Indication */
+#define E1000_CTRL_EXT_PFRSTD   0x4000
+
 #define E1000_IVAR_VALID 0x80
 #define E1000_GPIE_NSICR 0x0001
 #define E1000_GPIE_MSIX_MODE 0x0010
diff --git a/hw/net/trace-events b/hw/net/trace-events
index 0092919b9b..56651f 100644
--- a/hw/net/trace-events
+++ b/hw/net/trace-events
@@ -280,6 +280,8 @@ igb_core_mdic_read_unhandled(uint32_t addr) "MDIC READ: 
PHY[%u] UNHANDLED"
 igb_core_mdic_write(uint32_t addr, uint32_t data) "MDIC WRITE: PHY[%u] = 0x%x"
 igb_core_mdic_write_unhandled(uint32_t addr) "MDIC WRITE: PHY[%u] UNHANDLED"
 
+igb_link_set_ext_params(bool asd_check, bool speed_select_bypass, bool pfrstd) 
"Set extended link params: ASD check: %d, Speed select bypass: %d, PF reset 
done: %d"
+
 igb_rx_desc_buff_size(uint32_t b) "buffer size: %u"
 igb_rx_desc_buff_write(uint64_t addr, uint16_t offset, const void* source, 
uint32_t len) "addr: 0x%"PRIx64", offset: %u, from: %p, length: %u"
 
-- 
2.34.1

[PATCH v6 8/8] igb: respect VMVIR and VMOLR for VLAN

Add support for stripping/inserting VLAN for VFs.

Had to move CSUM calculation back into the for loop, since packet data
is pulled inside the loop based on strip VLAN decision for every VF.

net_rx_pkt_fix_l4_csum should be extended to accept a buffer instead for
igb. Work for a future patch.

Signed-off-by: Sriram Yagnaraman 
---
 hw/net/igb_core.c | 54 ++-
 1 file changed, 44 insertions(+), 10 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 25a5e0ec87..cd4fba383c 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -386,6 +386,25 @@ igb_rss_parse_packet(IGBCore *core, struct NetRxPkt *pkt, 
bool tx,
 info->queue = E1000_RSS_QUEUE(&core->mac[RETA], info->hash);
 }
 
+static inline bool
+igb_tx_insert_vlan(IGBCore *core, uint16_t qn,
+   struct igb_tx *tx, bool desc_vle)
+{
+if (core->mac[MRQC] & 1) {
+uint16_t pool = qn % IGB_NUM_VM_POOLS;
+
+if (core->mac[VMVIR0 + pool] & E1000_VMVIR_VLANA_DEFAULT) {
+/* always insert default VLAN */
+desc_vle = true;
+tx->vlan = core->mac[VMVIR0 + pool] & 0x;
+} else if (core->mac[VMVIR0 + pool] & E1000_VMVIR_VLANA_NEVER) {
+return false;
+}
+}
+
+return desc_vle && e1000x_vlan_enabled(core->mac);
+}
+
 static bool
 igb_setup_tx_offloads(IGBCore *core, struct igb_tx *tx)
 {
@@ -581,7 +600,8 @@ igb_process_tx_desc(IGBCore *core,
 
 if (cmd_type_len & E1000_TXD_CMD_EOP) {
 if (!tx->skip_cp && net_tx_pkt_parse(tx->tx_pkt)) {
-if (cmd_type_len & E1000_TXD_CMD_VLE) {
+if (igb_tx_insert_vlan(core, queue_index, tx,
+!!(cmd_type_len & E1000_TXD_CMD_VLE))) {
 net_tx_pkt_setup_vlan_header_ex(tx->tx_pkt, tx->vlan,
 core->mac[VET] & 0x);
 }
@@ -1536,6 +1556,20 @@ igb_write_packet_to_guest(IGBCore *core, struct NetRxPkt 
*pkt,
 igb_update_rx_stats(core, rxi, size, total_size);
 }
 
+static bool
+igb_rx_strip_vlan(IGBCore *core, const E1000E_RingInfo *rxi)
+{
+if (core->mac[MRQC] & 1) {
+uint16_t pool = rxi->idx % IGB_NUM_VM_POOLS;
+/* Sec 7.10.3.8: CTRL.VME is ignored, only VMOLR/RPLOLR is used */
+return (net_rx_pkt_get_packet_type(core->rx_pkt) == ETH_PKT_MCAST) ?
+core->mac[RPLOLR] & E1000_RPLOLR_STRVLAN :
+core->mac[VMOLR0 + pool] & E1000_VMOLR_STRVLAN;
+}
+
+return e1000x_vlan_enabled(core->mac);
+}
+
 static inline void
 igb_rx_fix_l4_csum(IGBCore *core, struct NetRxPkt *pkt)
 {
@@ -1616,10 +1650,7 @@ igb_receive_internal(IGBCore *core, const struct iovec 
*iov, int iovcnt,
 
 ehdr = PKT_GET_ETH_HDR(filter_buf);
 net_rx_pkt_set_packet_type(core->rx_pkt, get_eth_packet_type(ehdr));
-
-net_rx_pkt_attach_iovec_ex(core->rx_pkt, iov, iovcnt, iov_ofs,
-   e1000x_vlan_enabled(core->mac),
-   core->mac[VET] & 0x);
+net_rx_pkt_set_protocols(core->rx_pkt, filter_buf, size);
 
 queues = igb_receive_assign(core, ehdr, size, &rss_info, external_tx);
 if (!queues) {
@@ -1627,11 +1658,6 @@ igb_receive_internal(IGBCore *core, const struct iovec 
*iov, int iovcnt,
 return orig_size;
 }
 
-total_size = net_rx_pkt_get_total_len(core->rx_pkt) +
-e1000x_fcs_len(core->mac);
-
-igb_rx_fix_l4_csum(core, core->rx_pkt);
-
 for (i = 0; i < IGB_NUM_QUEUES; i++) {
 if (!(queues & BIT(i)) ||
 !(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
@@ -1640,12 +1666,20 @@ igb_receive_internal(IGBCore *core, const struct iovec 
*iov, int iovcnt,
 
 igb_rx_ring_init(core, &rxr, i);
 
+net_rx_pkt_attach_iovec_ex(core->rx_pkt, iov, iovcnt, iov_ofs,
+   igb_rx_strip_vlan(core, rxr.i),
+   core->mac[VET] & 0x);
+
+total_size = net_rx_pkt_get_total_len(core->rx_pkt) +
+e1000x_fcs_len(core->mac);
+
 if (!igb_has_rxbufs(core, rxr.i, total_size)) {
 n |= E1000_ICS_RXO;
 trace_e1000e_rx_not_written_to_guest(rxr.i->idx);
 continue;
 }
 
+igb_rx_fix_l4_csum(core, core->rx_pkt);
 igb_write_packet_to_guest(core, core->rx_pkt, &rxr, &rss_info);
 core->mac[EICR] |= igb_rx_wb_eic(core, rxr.i->idx);
 
-- 
2.34.1

[PATCH v6 5/8] igb: check oversized packets for VMDq

Signed-off-by: Sriram Yagnaraman 
---
 hw/net/igb_core.c | 41 -
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index c4a2bff4c1..03139c76f7 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -915,12 +915,26 @@ igb_rx_l4_cso_enabled(IGBCore *core)
 return !!(core->mac[RXCSUM] & E1000_RXCSUM_TUOFLD);
 }
 
+static bool
+igb_rx_is_oversized(IGBCore *core, uint16_t qn, size_t size)
+{
+uint16_t pool = qn % IGB_NUM_VM_POOLS;
+bool lpe = !!(core->mac[VMOLR0 + pool] & E1000_VMOLR_LPE);
+int max_ethernet_lpe_size =
+core->mac[VMOLR0 + pool] & E1000_VMOLR_RLPML_MASK;
+int max_ethernet_vlan_size = 1522;
+
+return size > (lpe ? max_ethernet_lpe_size : max_ethernet_vlan_size);
+}
+
 static uint16_t igb_receive_assign(IGBCore *core, const struct eth_header 
*ehdr,
-   E1000E_RSSInfo *rss_info, bool *external_tx)
+   size_t size, E1000E_RSSInfo *rss_info,
+   bool *external_tx)
 {
 static const int ta_shift[] = { 4, 3, 2, 0 };
 uint32_t f, ra[2], *macp, rctl = core->mac[RCTL];
 uint16_t queues = 0;
+uint16_t oversized = 0;
 uint16_t vid = lduw_be_p(&PKT_GET_VLAN_HDR(ehdr)->h_tci) & VLAN_VID_MASK;
 bool accepted = false;
 int i;
@@ -1020,9 +1034,26 @@ static uint16_t igb_receive_assign(IGBCore *core, const 
struct eth_header *ehdr,
 }
 
 queues &= core->mac[VFRE];
-igb_rss_parse_packet(core, core->rx_pkt, external_tx != NULL, 
rss_info);
-if (rss_info->queue & 1) {
-queues <<= 8;
+if (queues) {
+for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
+if ((queues & BIT(i)) && igb_rx_is_oversized(core, i, size)) {
+oversized |= BIT(i);
+}
+}
+/* 8.19.37 increment ROC if packet is oversized for all queues */
+if (oversized == queues) {
+trace_e1000x_rx_oversized(size);
+e1000x_inc_reg_if_not_full(core->mac, ROC);
+}
+queues &= ~oversized;
+}
+
+if (queues) {
+igb_rss_parse_packet(core, core->rx_pkt,
+ external_tx != NULL, rss_info);
+if (rss_info->queue & 1) {
+queues <<= 8;
+}
 }
 } else {
 switch (net_rx_pkt_get_packet_type(core->rx_pkt)) {
@@ -1565,7 +1596,7 @@ igb_receive_internal(IGBCore *core, const struct iovec 
*iov, int iovcnt,
e1000x_vlan_enabled(core->mac),
core->mac[VET] & 0x);
 
-queues = igb_receive_assign(core, ehdr, &rss_info, external_tx);
+queues = igb_receive_assign(core, ehdr, size, &rss_info, external_tx);
 if (!queues) {
 trace_e1000e_rx_flt_dropped();
 return orig_size;
-- 
2.34.1

[PATCH v6 6/8] igb: respect E1000_VMOLR_RSSE

RSS for VFs is only enabled if VMOLR[n].RSSE is set.

Signed-off-by: Sriram Yagnaraman 
---
 hw/net/igb_core.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 03139c76f7..9994724a39 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -1051,8 +1051,15 @@ static uint16_t igb_receive_assign(IGBCore *core, const 
struct eth_header *ehdr,
 if (queues) {
 igb_rss_parse_packet(core, core->rx_pkt,
  external_tx != NULL, rss_info);
+/* Sec 8.26.1: PQn = VFn + VQn*8 */
 if (rss_info->queue & 1) {
-queues <<= 8;
+for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
+if ((queues & BIT(i)) &&
+(core->mac[VMOLR0 + i] & E1000_VMOLR_RSSE)) {
+queues |= BIT(i + IGB_NUM_VM_POOLS);
+queues &= ~BIT(i);
+}
+}
 }
 }
 } else {
-- 
2.34.1

[PATCH v6 7/8] igb: implement VF Tx and Rx stats

Please note that loopback counters for VM to VM traffic is not
implemented yet: VFGOTLBC, VFGPTLBC, VFGORLBC and VFGPRLBC.

Signed-off-by: Sriram Yagnaraman 
---
 hw/net/igb_core.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 9994724a39..25a5e0ec87 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -490,7 +490,7 @@ igb_tx_pkt_send(IGBCore *core, struct igb_tx *tx, int 
queue_index)
 }
 
 static void
-igb_on_tx_done_update_stats(IGBCore *core, struct NetTxPkt *tx_pkt)
+igb_on_tx_done_update_stats(IGBCore *core, struct NetTxPkt *tx_pkt, int qn)
 {
 static const int PTCregs[6] = { PTC64, PTC127, PTC255, PTC511,
 PTC1023, PTC1522 };
@@ -517,6 +517,13 @@ igb_on_tx_done_update_stats(IGBCore *core, struct NetTxPkt 
*tx_pkt)
 core->mac[GPTC] = core->mac[TPT];
 core->mac[GOTCL] = core->mac[TOTL];
 core->mac[GOTCH] = core->mac[TOTH];
+
+if (core->mac[MRQC] & 1) {
+uint16_t pool = qn % IGB_NUM_VM_POOLS;
+
+core->mac[PVFGOTC0 + (pool * 64)] += tot_len;
+core->mac[PVFGPTC0 + (pool * 64)]++;
+}
 }
 
 static void
@@ -579,7 +586,7 @@ igb_process_tx_desc(IGBCore *core,
 core->mac[VET] & 0x);
 }
 if (igb_tx_pkt_send(core, tx, queue_index)) {
-igb_on_tx_done_update_stats(core, tx->tx_pkt);
+igb_on_tx_done_update_stats(core, tx->tx_pkt, queue_index);
 }
 }
 
@@ -1398,7 +1405,8 @@ igb_write_to_rx_buffers(IGBCore *core,
 }
 
 static void
-igb_update_rx_stats(IGBCore *core, size_t data_size, size_t data_fcs_size)
+igb_update_rx_stats(IGBCore *core, const E1000E_RingInfo *rxi,
+size_t data_size, size_t data_fcs_size)
 {
 e1000x_update_rx_total_stats(core->mac, data_size, data_fcs_size);
 
@@ -1414,6 +1422,16 @@ igb_update_rx_stats(IGBCore *core, size_t data_size, 
size_t data_fcs_size)
 default:
 break;
 }
+
+if (core->mac[MRQC] & 1) {
+uint16_t pool = rxi->idx % IGB_NUM_VM_POOLS;
+
+core->mac[PVFGORC0 + (pool * 64)] += data_size + 4;
+core->mac[PVFGPRC0 + (pool * 64)]++;
+if (net_rx_pkt_get_packet_type(core->rx_pkt) == ETH_PKT_MCAST) {
+core->mac[PVFMPRC0 + (pool * 64)]++;
+}
+}
 }
 
 static inline bool
@@ -1515,7 +1533,7 @@ igb_write_packet_to_guest(IGBCore *core, struct NetRxPkt 
*pkt,
 
 } while (desc_offset < total_size);
 
-igb_update_rx_stats(core, size, total_size);
+igb_update_rx_stats(core, rxi, size, total_size);
 }
 
 static inline void
-- 
2.34.1

[PATCH v6 4/8] igb: implement VFRE and VFTE registers

Also introduce:
- Checks for RXDCTL/TXDCTL queue enable bits
- IGB_NUM_VM_POOLS enum (Sec 1.5: Table 1-7)

Signed-off-by: Sriram Yagnaraman 
---
 hw/net/igb_core.c | 38 +++---
 hw/net/igb_core.h |  1 +
 hw/net/igb_regs.h |  3 +++
 3 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index 1ddf54f630..c4a2bff4c1 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -780,6 +780,18 @@ igb_txdesc_writeback(IGBCore *core, dma_addr_t base,
 return igb_tx_wb_eic(core, txi->idx);
 }
 
+static inline bool
+igb_tx_enabled(IGBCore *core, const E1000E_RingInfo *txi)
+{
+bool vmdq = core->mac[MRQC] & 1;
+uint16_t qn = txi->idx;
+uint16_t pool = qn % IGB_NUM_VM_POOLS;
+
+return (core->mac[TCTL] & E1000_TCTL_EN) &&
+(!vmdq || core->mac[VFTE] & BIT(pool)) &&
+(core->mac[TXDCTL0 + (qn * 16)] & E1000_TXDCTL_QUEUE_ENABLE);
+}
+
 static void
 igb_start_xmit(IGBCore *core, const IGB_TxRing *txr)
 {
@@ -789,8 +801,7 @@ igb_start_xmit(IGBCore *core, const IGB_TxRing *txr)
 const E1000E_RingInfo *txi = txr->i;
 uint32_t eic = 0;
 
-/* TODO: check if the queue itself is enabled too. */
-if (!(core->mac[TCTL] & E1000_TCTL_EN)) {
+if (!igb_tx_enabled(core, txi)) {
 trace_e1000e_tx_disabled();
 return;
 }
@@ -866,6 +877,9 @@ igb_can_receive(IGBCore *core)
 
 for (i = 0; i < IGB_NUM_QUEUES; i++) {
 E1000E_RxRing rxr;
+if (!(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
+continue;
+}
 
 igb_rx_ring_init(core, &rxr, i);
 if (igb_ring_enabled(core, rxr.i) && igb_has_rxbufs(core, rxr.i, 1)) {
@@ -932,7 +946,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const 
struct eth_header *ehdr,
 
 if (core->mac[MRQC] & 1) {
 if (is_broadcast_ether_addr(ehdr->h_dest)) {
-for (i = 0; i < 8; i++) {
+for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
 if (core->mac[VMOLR0 + i] & E1000_VMOLR_BAM) {
 queues |= BIT(i);
 }
@@ -966,7 +980,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const 
struct eth_header *ehdr,
 f = ta_shift[(rctl >> E1000_RCTL_MO_SHIFT) & 3];
 f = (((ehdr->h_dest[5] << 8) | ehdr->h_dest[4]) >> f) & 0xfff;
 if (macp[f >> 5] & (1 << (f & 0x1f))) {
-for (i = 0; i < 8; i++) {
+for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
 if (core->mac[VMOLR0 + i] & E1000_VMOLR_ROMPE) {
 queues |= BIT(i);
 }
@@ -989,7 +1003,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const 
struct eth_header *ehdr,
 }
 }
 } else {
-for (i = 0; i < 8; i++) {
+for (i = 0; i < IGB_NUM_VM_POOLS; i++) {
 if (core->mac[VMOLR0 + i] & E1000_VMOLR_AUPE) {
 mask |= BIT(i);
 }
@@ -1005,6 +1019,7 @@ static uint16_t igb_receive_assign(IGBCore *core, const 
struct eth_header *ehdr,
 queues = BIT(def_pl >> E1000_VT_CTL_DEFAULT_POOL_SHIFT);
 }
 
+queues &= core->mac[VFRE];
 igb_rss_parse_packet(core, core->rx_pkt, external_tx != NULL, 
rss_info);
 if (rss_info->queue & 1) {
 queues <<= 8;
@@ -1562,7 +1577,8 @@ igb_receive_internal(IGBCore *core, const struct iovec 
*iov, int iovcnt,
 igb_rx_fix_l4_csum(core, core->rx_pkt);
 
 for (i = 0; i < IGB_NUM_QUEUES; i++) {
-if (!(queues & BIT(i))) {
+if (!(queues & BIT(i)) ||
+!(core->mac[RXDCTL0 + (i * 16)] & E1000_RXDCTL_QUEUE_ENABLE)) {
 continue;
 }
 
@@ -1966,9 +1982,16 @@ static void igb_set_vfmailbox(IGBCore *core, int index, 
uint32_t val)
 
 static void igb_vf_reset(IGBCore *core, uint16_t vfn)
 {
+uint16_t qn0 = vfn;
+uint16_t qn1 = vfn + IGB_NUM_VM_POOLS;
+
 /* disable Rx and Tx for the VF*/
-core->mac[VFTE] &= ~BIT(vfn);
+core->mac[RXDCTL0 + (qn0 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
+core->mac[RXDCTL0 + (qn1 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
+core->mac[TXDCTL0 + (qn0 * 16)] &= ~E1000_TXDCTL_QUEUE_ENABLE;
+core->mac[TXDCTL0 + (qn1 * 16)] &= ~E1000_TXDCTL_QUEUE_ENABLE;
 core->mac[VFRE] &= ~BIT(vfn);
+core->mac[VFTE] &= ~BIT(vfn);
 /* indicate VF reset to PF */
 core->mac[VFLRE] |= BIT(vfn);
 /* VFLRE and mailbox use the same interrupt cause */
@@ -3874,6 +3897,7 @@ igb_phy_reg_init[] = {
 static const uint32_t igb_mac_reg_init[] = {
 [LEDCTL]= 2 | (3 << 8) | BIT(15) | (6 << 16) | (7 << 24),
 [EEMNGCTL]  = BIT(31),
+[TXDCTL0]   = E1000_TXDCTL_QUEUE_ENABLE,
 [RXDCTL0]   = E1000_RXDCTL_QUEUE_ENABLE | (1 << 16),
 [RXDCTL1]   = 1 << 16,
 [RXDCTL2]   = 1 << 16,
diff --git a/hw/net/igb_core.h b/hw

Re: [SeaBIOS] Re: [SeaBIOS PATCH] xen: require Xen info structure at 0x1000 to detect Xen

2023-02-02 Thread Gerd Hoffmann

> Thanks, Kevin.
> 
> I'd like to get the rest of the Xen platform support in to qemu 8.0 if
> possible. Which is currently scheduled for March.
> 
> Is there likely to be a SeaBIOS release before then which Gerd would
> pull into qemu anyway, or should I submit a submodule update to a
> snapshot of today's tree? That would just pull in this commit, and the
> one other fix that's in the SeaBIOS tree since 1.16.1?

Tagging 1.16.2 in time for the qemu 8.0 should not be a problem given
that we have only bugfixes in master.  Roughly around soft freeze is
probably a good time for that.

take care,
  Gerd

Re: [PATCH 3/5] docs: flesh out raw format driver description

2023-02-02 Thread Daniel P . Berrangé

On Wed, Feb 01, 2023 at 04:12:32PM -0500, Stefan Hajnoczi wrote:
> Modernize the description and document the size=/offset= runtime
> options.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  docs/system/qemu-block-drivers.rst.inc | 32 ++
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/docs/system/qemu-block-drivers.rst.inc 
> b/docs/system/qemu-block-drivers.rst.inc
> index be6eec1eb6..ec9ebb2066 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -16,11 +16,11 @@ options that are supported for it.
>  .. option:: raw
>  
>Raw disk image format. This format has the advantage of
> -  being simple and easily exportable to all other emulators. If your
> -  file system supports *holes* (for example in ext2 or ext3 on
> -  Linux or NTFS on Windows), then only the written sectors will reserve
> -  space. Use ``qemu-img info`` to know the real size used by the
> -  image or ``ls -ls`` on Unix/Linux.
> +  being simple and easily exportable to all other emulators. Modern
> +  file systems support *holes* (for example in btrfs/XFS/ext4 on
> +  Linux or NTFS on Windows) where space is allocated on demand as sectors are
> +  written. Use ``qemu-img info`` to know the real size used by the image or
> +  ``ls -ls`` on Unix/Linux.
>  
>Supported create options:
>  
> @@ -33,6 +33,28 @@ options that are supported for it.
>  for image by writing data to underlying storage. This data may or
>  may not be zero, depending on the storage location.
>  
> +  Supported runtime options:
> +
> +  .. program:: raw
> +  .. option:: offset
> +
> +The byte position in the underlying file where the virtual disk starts.
> +This is handy when you want to present just a single partition from a
> +physical disk as the virtual disk. This option is usually used in
> +conjunction with the ``size`` option.
> +
> +  .. option:: size
> +
> +Limit the virtual disk size to the given number of bytes, regardless of 
> how
> +large the underlying file is. This option is usually used in conjunction
> +with the ``offset`` option.
> +
> +  Note the raw format can be omitted when no runtime options are being used. 
> In
> +  that case the raw format does nothing besides forwarding I/O requests to 
> the
> +  protocol blockdev. You can improve performance slightly by eliminating
> +  ``--blockdev raw,file=file0,node-name=drive0`` and renaming the "file0"
> +  blockdev to "drive0".

This paragraphs reads a bit softly. I would word this such that we
explicitly and strongly recommend against using the 'raw' format
driver. Almost no one will have a need for this.

All the protocol drivers expose a raw format, which can be consumed
directly. So the 'raw' format driver should NEVER be used unless the
user needs to apply a limited window over the underlying disk capacity,
which is pretty rare.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: An issue with x86 tcg and MMIO

2023-02-02 Thread Jonathan Cameron via

On Wed, 1 Feb 2023 11:50:50 -1000
Richard Henderson  wrote:

> On 2/1/23 05:24, Jørgen Hansen wrote:
> > Hello Richard,
> > 
> > We are using x86 qemu to test some CXL stuff, and in that process we are
> > running into an issue with tcg. In qemu, CXL memory is mapped as an MMIO
> > region, so when using CXL memory as part of the system memory,
> > application code and data can be spread out across a combination of DRAM
> > and CXL memory (we are using the Linux tiered memory numa balancing,
> > that will migrate individual pages between DRAM and CXL memory based on
> > access patterns). When we are running memory intensive applications, we
> > hit the following assert in translator_access:
> > 
> >/* We cannot handle MMIO as second page. */
> >assert(tb->page_addr[1] != -1);
> > 
> > introduced in your commit 50627f1b. This is using existing applications
> > and standard Linux. We discussed this with Alistair Francis and he
> > mentioned that it looks like a load across a page boundary is happening,
> > and it so happens that the first page is DRAM and second page MMIO. We
> > tried - as a workaround - to return NULL instead of the assert to
> > trigger the slow path processing, and that allows the system to make
> > forward progress, but we aren't familiar with tcg, and as such aren't
> > sure if that is a correct fix.
> > 
> > So we'd like to get your input on this - and understand whether the
> > above usage isn't supported at all or if there are other possible
> > workarounds.  
> 
> Well, this may answer my question in
> 
> https://lore.kernel.org/qemu-devel/1d6b1894-9c45-2d70-abde-9c10c1b3b...@linaro.org/
> 
> as to how this could occur.
> 
> Until relatively recently, TCG would refuse to execute out of MMIO *at all*.  
> This was 
> relaxed to support Arm m-profile, which needs to execute a few instructions 
> out of MMIO 
> during the boot process, before jumping into flash.
> 
> This works by reading one instruction, translating it, executing it, and 
> immediately 
> discarding the translation.  It could be possible to adjust the translator to 
> allow the 
> second half of an instruction to be in MMIO, such that we execute and 
> discard, however...
> 
> What is it about CXL that requires modeling with MMIO?  If it is intended to 
> be used 
> interchangeably with RAM by the guest, then you really won't like the 
> performance you will 
> see with TCG executing out of these regions.

To be honest I wasn't aware of this restriction.

I 'thought' it was necessary to support interleaving as we need some callbacks
in the read / write paths to map through to the right address space.
So performance will suck even if we can model as memory (I'm not sure how to do
whilst maintaining the interleaving code).

To have anything approaching accurate modeling we need to apply a memory 
decoders
in the host, host-bridge and switches.  We could have faked all this and mapped 
directly
through to a single memory backend, but that would then have been useless for 
actually
testing the interleaving.

Specifically what happens in cxl_cfmws_find_device()
https://elixir.bootlin.com/qemu/latest/source/hw/cxl/cxl-host.c#L129

If we can do that for other types of region then I'm fine with changing it.

> 
> Could memory across the CXL link be modeled as a ROM device, similar to 
> flash?  This does 
> not have the same restrictions as MMIO.

Not sure - if we can do the handling above then sure we could make that change.
I can see there is a path to register the callbacks but I'd kind of assumed
ROM meant read only...

Jonathan

> 
> 
> r~
>

[PATCH 09/18] i386: Fix comment style in topology.h

From: Zhao Liu 

For function comments in this file, keep the comment style consistent
with other places.

Signed-off-by: Zhao Liu 
---
 include/hw/i386/topology.h | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
index b0174c18b7bd..5de905dc00d3 100644
--- a/include/hw/i386/topology.h
+++ b/include/hw/i386/topology.h
@@ -24,7 +24,8 @@
 #ifndef HW_I386_TOPOLOGY_H
 #define HW_I386_TOPOLOGY_H
 
-/* This file implements the APIC-ID-based CPU topology enumeration logic,
+/*
+ * This file implements the APIC-ID-based CPU topology enumeration logic,
  * documented at the following document:
  *   Intel® 64 Architecture Processor Topology Enumeration
  *   
http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/
@@ -41,7 +42,8 @@
 
 #include "qemu/bitops.h"
 
-/* APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC support
+/*
+ * APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC support
  */
 typedef uint32_t apic_id_t;
 
@@ -60,8 +62,7 @@ typedef struct X86CPUTopoInfo {
 unsigned threads_per_core;
 } X86CPUTopoInfo;
 
-/* Return the bit width needed for 'count' IDs
- */
+/* Return the bit width needed for 'count' IDs */
 static unsigned apicid_bitwidth_for_count(unsigned count)
 {
 g_assert(count >= 1);
@@ -69,15 +70,13 @@ static unsigned apicid_bitwidth_for_count(unsigned count)
 return count ? 32 - clz32(count) : 0;
 }
 
-/* Bit width of the SMT_ID (thread ID) field on the APIC ID
- */
+/* Bit width of the SMT_ID (thread ID) field on the APIC ID */
 static inline unsigned apicid_smt_width(X86CPUTopoInfo *topo_info)
 {
 return apicid_bitwidth_for_count(topo_info->threads_per_core);
 }
 
-/* Bit width of the Core_ID field
- */
+/* Bit width of the Core_ID field */
 static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info)
 {
 /*
@@ -94,8 +93,7 @@ static inline unsigned apicid_die_width(X86CPUTopoInfo 
*topo_info)
 return apicid_bitwidth_for_count(topo_info->dies_per_pkg);
 }
 
-/* Bit offset of the Core_ID field
- */
+/* Bit offset of the Core_ID field */
 static inline unsigned apicid_core_offset(X86CPUTopoInfo *topo_info)
 {
 return apicid_smt_width(topo_info);
@@ -107,14 +105,14 @@ static inline unsigned apicid_die_offset(X86CPUTopoInfo 
*topo_info)
 return apicid_core_offset(topo_info) + apicid_core_width(topo_info);
 }
 
-/* Bit offset of the Pkg_ID (socket ID) field
- */
+/* Bit offset of the Pkg_ID (socket ID) field */
 static inline unsigned apicid_pkg_offset(X86CPUTopoInfo *topo_info)
 {
 return apicid_die_offset(topo_info) + apicid_die_width(topo_info);
 }
 
-/* Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
+/*
+ * Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
  *
  * The caller must make sure core_id < nr_cores and smt_id < nr_threads.
  */
@@ -127,7 +125,8 @@ static inline apic_id_t 
x86_apicid_from_topo_ids(X86CPUTopoInfo *topo_info,
topo_ids->smt_id;
 }
 
-/* Calculate thread/core/package IDs for a specific topology,
+/*
+ * Calculate thread/core/package IDs for a specific topology,
  * based on (contiguous) CPU index
  */
 static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info,
@@ -154,7 +153,8 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo 
*topo_info,
 topo_ids->smt_id = cpu_index % nr_threads;
 }
 
-/* Calculate thread/core/package IDs for a specific topology,
+/*
+ * Calculate thread/core/package IDs for a specific topology,
  * based on APIC ID
  */
 static inline void x86_topo_ids_from_apicid(apic_id_t apicid,
@@ -178,7 +178,8 @@ static inline void x86_topo_ids_from_apicid(apic_id_t 
apicid,
 topo_ids->pkg_id = apicid >> apicid_pkg_offset(topo_info);
 }
 
-/* Make APIC ID for the CPU 'cpu_index'
+/*
+ * Make APIC ID for the CPU 'cpu_index'
  *
  * 'cpu_index' is a sequential, contiguous ID for the CPU.
  */
-- 
2.34.1

[PATCH 00/18] Support smp.clusters for x86

From: Zhao Liu 

Hi list,

This series is based on 13356ed (Merge tag 'block-pull-request' of
https://gitlab.com/stefanha/qemu into staging), and adds the cluster
support for x86 PC machine, which allows x86 can use smp.clusters
to configure the modlue level CPU topology of x86.

And since the compatibility issue (see section: ## Why not share L2
cache in cluster directly), this series also introduce a new command
to adjust the topology of the x86 L2 cache.

Welcome your comments!


# Backgroud

The "clusters" parameter in "smp" is introduced by ARM [1], but x86
hasn't supported it.

At present, x86 defaults L2 cache is shared in one core, but this is
not enough. There're some platforms that multiple cores share the
same L2 cache, e.g., Alder Lake-P shares L2 cache for one module of
Atom cores [2], that is, every four Atom cores shares one L2 cache.
Therefore, we need the new CPU topology level (cluster/module).

Another reason is for hybrid architecture. cluster support not only
provides another level of topology definition in x86, but would aslo
provide required code change for future our hybrid topology support.


# Overview

## Introduction of module level for x86

"cluster" in smp is the CPU topology level which is between "core" and
die.

For x86, the "cluster" in smp is corresponding to the module level [3],
which is above the core level. So use the "module" other than "cluster"
in x86 code.

And please note that x86 already has a cpu topology level also named
"cluster" [3], this level is at the upper level of the package. Here,
the cluster in x86 cpu topology is completely different from the
"clusters" as the smp parameter. After the module level is introduced,
the cluster as the smp parameter will actually refer to the module level
of x86.


## Why not share L2 cache in cluster directly

Though "clusters" was introduced to help define L2 cache topology
[1], using cluster to define x86's L2 cache topology will cause the
compatibility problem:

Currently, x86 defaults that the L2 cache is shared in one core, which
actually implies a default setting "cores per L2 cache is 1" and
therefore implicitly defaults to having as many L2 caches as cores.

For example (i386 PC machine):
-smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 (*)

Considering the topology of the L2 cache, this (*) implicitly means "1
core per L2 cache" and "2 L2 caches per die".

If we use cluster to configure L2 cache topology with the new default
setting "clusters per L2 cache is 1", the above semantics will change
to "2 cores per cluster" and "1 cluster per L2 cache", that is, "2
cores per L2 cache".

So the same command (*) will cause changes in the L2 cache topology,
further affecting the performance of the virtual machine.

Therefore, x86 should only treat cluster as a cpu topology level and
avoid using it to change L2 cache by default for compatibility.


## module level in CPUID

Currently, we don't expose module level in CPUID.1FH because currently
linux (v6.2-rc6) doesn't support module level. And exposing module and
die levels at the same time in CPUID.1FH will cause linux to calculate
wrong die_id. The module level should be exposed until the real machine
has the module level in CPUID.1FH.

We can configure CPUID.04H.02H (L2 cache topology) with module level by
a new command:

"-cpu,x-l2-cache-topo=cluster"

More information about this command, please see the section: "## New
property: x-l2-cache-topo".


## New cache topology info in CPUCacheInfo

Currently, by default, the cache topology is encoded as:
1. i/d cache is shared in one core. 
   * In fact, the original is SMT level, which will be changed to core
 level in the patch: [PATCH 04/18] i386/cpu: Fix number of
 addressable IDs in CPUID.04H).
2. L2 cache is shared in one core.
3. L3 cache is shared in one die.

This default general setting has caused a misunderstanding, that is, the
cache topology is completely equated with a specific cpu topology, such
as the connection between L2 cache and core level, and the connection
between L3 cache and die level.

In fact, the settings of these topologies depend on the specific
platform and are not static. For example, on Alder Lake-P, every
four Atom cores share the same L2 cache [2].

Thus, in this patch set, we explicitly define the corresponding cache
topology for different cpu models and this has two benefits:
1. Easy to expand to new CPU models in the future, which has different
   cache topology.
2. It can easily support custom cache topology by some command (e.g.,
   x-l2-cache-topo).


## New property: x-l2-cache-topo

The property l2-cache-topo will be used to change the L2 cache topology
in CPUID.04H.

Now it allows user to set the L2 cache is shared in core level or
cluster level.

If user passes "-cpu x-l2-cache-topo=[core|cluster]" then older L2 cache
topology will be overrided by the new topology setting.

Since CPUID.04H is used by intel cpus, this property is available on
intel cpus as fo

[PATCH 07/18] i386: Support modules_per_die in X86CPUTopoInfo

From: Zhuocheng Ding 

Support module level in i386 cpu topology structure "X86CPUTopoInfo".

Before updating APIC ID parsing rule with module level, the
apicid_core_width() temporarily combines the core and module levels
together.

At present, we don't expose module level in CPUID.1FH because currently
linux (v6.2-rc6) doesn't support module level. And exposing module and
die levels at the same time in CPUID.1FH will cause linux to calculate
the wrong die_id. The module level should be exposed until the real
machine has the module level in CPUID.1FH.

In addition, update topology structure in test-x86-apicid.c.

Signed-off-by: Zhuocheng Ding 
Co-developed-by: Zhao Liu 
Signed-off-by: Zhao Liu 
---
 hw/i386/x86.c|  3 ++-
 include/hw/i386/topology.h   | 13 ---
 target/i386/cpu.c| 17 --
 tests/unit/test-x86-apicid.c | 45 +++-
 4 files changed, 46 insertions(+), 32 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 66902d1c0923..3f6e0dd7b827 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -70,7 +70,8 @@ inline void init_topo_info(X86CPUTopoInfo *topo_info,
 MachineState *ms = MACHINE(x86ms);
 
 topo_info->dies_per_pkg = ms->smp.dies;
-topo_info->cores_per_die = ms->smp.cores;
+topo_info->modules_per_die = ms->smp.clusters;
+topo_info->cores_per_module = ms->smp.cores;
 topo_info->threads_per_core = ms->smp.threads;
 }
 
diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
index 81573f6cfde0..bbb00dc4aad8 100644
--- a/include/hw/i386/topology.h
+++ b/include/hw/i386/topology.h
@@ -54,7 +54,8 @@ typedef struct X86CPUTopoIDs {
 
 typedef struct X86CPUTopoInfo {
 unsigned dies_per_pkg;
-unsigned cores_per_die;
+unsigned modules_per_die;
+unsigned cores_per_module;
 unsigned threads_per_core;
 } X86CPUTopoInfo;
 
@@ -78,7 +79,12 @@ static inline unsigned apicid_smt_width(X86CPUTopoInfo 
*topo_info)
  */
 static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info)
 {
-return apicid_bitwidth_for_count(topo_info->cores_per_die);
+/*
+ * TODO: Will separate module info from core_width when update
+ * APIC ID with module level.
+ */
+return apicid_bitwidth_for_count(topo_info->cores_per_module *
+ topo_info->modules_per_die);
 }
 
 /* Bit width of the Die_ID field */
@@ -128,7 +134,8 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo 
*topo_info,
  X86CPUTopoIDs *topo_ids)
 {
 unsigned nr_dies = topo_info->dies_per_pkg;
-unsigned nr_cores = topo_info->cores_per_die;
+unsigned nr_cores = topo_info->cores_per_module *
+topo_info->modules_per_die;
 unsigned nr_threads = topo_info->threads_per_core;
 
 topo_ids->pkg_id = cpu_index / (nr_dies * nr_cores * nr_threads);
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 61ec9a7499b8..6f3d114c7d12 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -336,7 +336,9 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache,
 
 /* L3 is shared among multiple cores */
 if (cache->level == 3) {
-l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
+l3_threads = topo_info->modules_per_die *
+ topo_info->cores_per_module *
+ topo_info->threads_per_core;
 *eax |= (l3_threads - 1) << 14;
 } else {
 *eax |= ((topo_info->threads_per_core - 1) << 14);
@@ -5218,11 +5220,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 uint32_t cpus_per_pkg;
 
 topo_info.dies_per_pkg = env->nr_dies;
-topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
+topo_info.modules_per_die = env->nr_modules;
+topo_info.cores_per_module = cs->nr_cores / env->nr_dies / env->nr_modules;
 topo_info.threads_per_core = cs->nr_threads;
 
-cpus_per_pkg = topo_info.dies_per_pkg * topo_info.cores_per_die *
-   topo_info.threads_per_core;
+cpus_per_pkg = topo_info.dies_per_pkg * topo_info.modules_per_die *
+   topo_info.cores_per_module * topo_info.threads_per_core;
 
 /* Calculate & apply limits for different index ranges */
 if (index >= 0xC000) {
@@ -5298,8 +5301,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 if (*eax & 31) {
 int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
 int vcpus_per_socket = cpus_per_pkg;
-int cores_per_socket = topo_info.cores_per_die *
-   topo_info.dies_per_pkg;
+int cores_per_socket = cpus_per_pkg /
+   topo_info.threads_per_core;
 if (cores_per_socket > 1) {
 *eax &= ~0xFC00;
 *eax |= (pow2ceil(cores_per_socket) - 1) << 26;
@@ -548

[PATCH 03/18] softmmu: Fix CPUSTATE.nr_cores' calculation

From: Zhuocheng Ding 

>From CPUState.nr_cores' comment, it represents "number of cores within
this CPU package".

After 003f230 (machine: Tweak the order of topology members in struct
CpuTopology), the meaning of smp.cores changed to "the number of cores
in one die", but this commit missed to change CPUState.nr_cores'
caculation, so that CPUState.nr_cores became wrong and now it
misses to consider numbers of clusters and dies.

At present, only i386 is using CPUState.nr_cores.

But as for i386, which supports die level, the uses of CPUState.nr_cores
are very confusing:

Early uses are based on the meaning of "cores per package" (before die
is introduced into i386), and later uses are based on "cores per die"
(after die's introduction).

This difference is due to that commit a94e142 (target/i386: Add CPUID.1F
generation support for multi-dies PCMachine) misunderstood that
CPUState.nr_cores means "cores per die" when caculated
CPUID.1FH.01H:EBX. After that, the changes in i386 all followed this
wrong understanding.

With the influence of 003f230 and a94e142, for i386 currently the result
of CPUState.nr_cores is "cores per die", thus the original uses of
CPUState.cores based on the meaning of "cores per package" are wrong
when mutiple dies exist:
1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16] is
   incorrect because it expects "cpus per package" but now the
   result is "cpus per die".
2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of CPUID.04H:
   EAX[bits 31:26] is incorrect because they expect "cpus per package"
   but now the result is "cpus per die". The error not only impacts the
   EAX caculation in cache_info_passthrough case, but also impacts other
   cases of setting cache topology for Intel CPU according to cpu
   topology (specifically, the incoming parameter "num_cores" expects
   "cores per package" in encode_cache_cpuid4()).
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits
   15:00] is incorrect because the EBX of 0BH.01H (core level) expects
   "cpus per package", which may be different with 1FH.01H (The reason
   is 1FH can support more levels. For QEMU, 1FH also supports die,
   1FH.01H:EBX[bits 15:00] expects "cpus per die").
4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.8001H is
   caculated, here "cpus per package" is expected to be checked, but in
   fact, now it checks "cpus per die". Though "cpus per die" also works
   for this code logic, this isn't consistent with AMD's APM.
5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.8008H:ECX expects
   "cpus per package" but it obtains "cpus per die".
6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in
   kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in
   helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c,
   MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per
   package", but in these functions, it obtains "cpus per die" and
   "cores per die".

On the other hand, these uses are correct now (they are added in/after
a94e142):
1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die
   meets the actual meaning of CPUState.nr_cores ("cores per die").
2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in CPUID.
   04H's caculation) considers number of dies, so it's correct.
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits
   15:00] needs "cpus per die" and it gets the correct result, and
   CPUID.1FH.02H:EBX[bits 15:00] gets correct "cpus per package".

When CPUState.nr_cores is correctly changed to "cores per package" again
, the above errors will be fixed without extra work, but the "currently"
correct cases will go wrong and need special handling to pass correct
"cpus/cores per die" they want.

Thus in this patch, we fix CPUState.nr_cores' caculation to fit the
original meaning "cores per package", as well as changing caculation of
topo_info.cores_per_die, vcpus_per_socket and CPUID.1FH.

In addition, in the nr_threads' comment, specify it represents the
number of threads in the "core" to avoid confusion.

Fixes: a94e142 (target/i386: Add CPUID.1F generation support for multi-dies 
PCMachine)
Fixes: 003f230 (machine: Tweak the order of topology members in struct 
CpuTopology)
Signed-off-by: Zhuocheng Ding 
Co-developed-by: Zhao Liu 
Signed-off-by: Zhao Liu 
---
 include/hw/core/cpu.h | 2 +-
 softmmu/cpus.c| 2 +-
 target/i386/cpu.c | 9 -
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 2417597236bc..5253e4e839bb 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -274,7 +274,7 @@ struct qemu_work_item;
  *   QOM parent.
  * @tcg_cflags: Pre-computed cflags for this cpu.
  * @nr_cores: Number of cores within this CPU package.
- * @nr_threads: Number of threads within this CPU.
+ * @nr_threads: Number of threads within this CPU core.
  * @running: #true if CPU is currently running (lockless).
  * @has_

[PATCH 06/18] i386: Introduce module-level cpu topology to CPUX86State

From: Zhuocheng Ding 

smp command has the "clusters" parameter but x86 hasn't supported that
level. Though "clusters" was introduced to help define L2 cache topology
[1], using cluster to define x86's L2 cache topology will cause the
compatibility problem:

Currently, x86 defaults that the L2 cache is shared in one core, which
actually implies a default setting "cores per L2 cache is 1" and
therefore implicitly defaults to having as many L2 caches as cores.

For example (i386 PC machine):
-smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 (*)

Considering the topology of the L2 cache, this (*) implicitly means "1
core per L2 cache" and "2 L2 caches per die".

If we use cluster to configure L2 cache topology with the new default
setting "clusters per L2 cache is 1", the above semantics will change
to "2 cores per cluster" and "1 cluster per L2 cache", that is, "2
cores per L2 cache".

So the same command (*) will cause changes in the L2 cache topology,
further affecting the performance of the virtual machine.

Therefore, x86 should only treat cluster as a cpu topology level and
avoid using it to change L2 cache by default for compatibility.

"cluster" in smp is the CPU topology level which is between "core" and
die.

For x86, the "cluster" in smp is corresponding to the module level [2],
which is above the core level. So use the "module" other than "cluster"
in i386 code.

And please note that x86 already has a cpu topology level also named
"cluster" [2], this level is at the upper level of the package. Here,
the cluster in x86 cpu topology is completely different from the
"clusters" as the smp parameter. After the module level is introduced,
the cluster as the smp parameter will actually refer to the module level
of x86.

[1]: 0d87178 (hw/core/machine: Introduce CPU cluster topology support)
[2]: SDM, vol.3, ch.9, 9.9.1 Hierarchical Mapping of Shared Resources.

Signed-off-by: Zhuocheng Ding 
Co-developed-by: Zhao Liu 
Signed-off-by: Zhao Liu 
---
 hw/i386/x86.c | 1 +
 target/i386/cpu.c | 1 +
 target/i386/cpu.h | 6 ++
 3 files changed, 8 insertions(+)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 78cc131926c8..66902d1c0923 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -305,6 +305,7 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 init_topo_info(&topo_info, x86ms);
 
 env->nr_dies = ms->smp.dies;
+env->nr_modules = ms->smp.clusters;
 
 /*
  * If APIC ID is not set,
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 4cda84eb96f1..61ec9a7499b8 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6781,6 +6781,7 @@ static void x86_cpu_initfn(Object *obj)
 CPUX86State *env = &cpu->env;
 
 env->nr_dies = 1;
+env->nr_modules = 1;
 cpu_set_cpustate_pointers(cpu);
 
 object_property_add(obj, "feature-words", "X86CPUFeatureWordInfo",
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index d4bc19577a21..f3afea765982 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1810,7 +1810,13 @@ typedef struct CPUArchState {
 
 TPRAccess tpr_access_type;
 
+/* Number of dies per package. */
 unsigned nr_dies;
+/*
+ * Number of modules per die. Module level in x86 cpu topology is
+ * corresponding to smp.clusters.
+ */
+unsigned nr_modules;
 } CPUX86State;
 
 struct kvm_msrs;
-- 
2.34.1

[PATCH 11/18] i386/cpu: Introduce cluster-id to X86CPU

From: Zhuocheng Ding 

We introduce cluster-id other than module-id to be consistent with
CpuInstanceProperties.cluster-id, and this avoids the confusion
of parameter names when hotplugging.

Following the legacy smp check rules, also add the cluster_id validity
into x86_cpu_pre_plug().

Signed-off-by: Zhuocheng Ding 
Co-developed-by: Zhao Liu 
Signed-off-by: Zhao Liu 
---
 hw/i386/x86.c | 33 +
 target/i386/cpu.c |  2 ++
 target/i386/cpu.h |  1 +
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 68fce87d18ac..57d646bf360f 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -324,6 +324,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 cpu->die_id = 0;
 }
 
+/*
+ * cluster-id was optional in QEMU 8.0 and older, so keep it optional
+ * if there's only one cluster per die.
+ */
+if (cpu->cluster_id < 0 && ms->smp.clusters == 1) {
+cpu->cluster_id = 0;
+}
+
 if (cpu->socket_id < 0) {
 error_setg(errp, "CPU socket-id is not set");
 return;
@@ -340,6 +348,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
cpu->die_id, ms->smp.dies - 1);
 return;
 }
+if (cpu->cluster_id < 0) {
+error_setg(errp, "CPU cluster-id is not set");
+return;
+} else if (cpu->cluster_id > ms->smp.clusters - 1) {
+error_setg(errp, "Invalid CPU cluster-id: %u must be in range 
0:%u",
+   cpu->cluster_id, ms->smp.clusters - 1);
+return;
+}
 if (cpu->core_id < 0) {
 error_setg(errp, "CPU core-id is not set");
 return;
@@ -359,16 +375,9 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 
 topo_ids.pkg_id = cpu->socket_id;
 topo_ids.die_id = cpu->die_id;
+topo_ids.module_id = cpu->cluster_id;
 topo_ids.core_id = cpu->core_id;
 topo_ids.smt_id = cpu->thread_id;
-
-/*
- * TODO: This is the temporary initialization for topo_ids.module_id to
- * avoid "maybe-uninitialized" compilation errors. Will remove when
- * X86CPU supports cluster_id.
- */
-topo_ids.module_id = 0;
-
 cpu->apic_id = x86_apicid_from_topo_ids(&topo_info, &topo_ids);
 }
 
@@ -415,6 +424,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 }
 cpu->die_id = topo_ids.die_id;
 
+if (cpu->cluster_id != -1 && cpu->cluster_id != topo_ids.module_id) {
+error_setg(errp, "property cluster-id: %u doesn't match set apic-id:"
+" 0x%x (cluster-id: %u)", cpu->cluster_id, cpu->apic_id,
+topo_ids.module_id);
+return;
+}
+cpu->cluster_id = topo_ids.module_id;
+
 if (cpu->core_id != -1 && cpu->core_id != topo_ids.core_id) {
 error_setg(errp, "property core-id: %u doesn't match set apic-id:"
 " 0x%x (core-id: %u)", cpu->core_id, cpu->apic_id,
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 6f3d114c7d12..27bbbc36b11c 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6980,12 +6980,14 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_UINT32("apic-id", X86CPU, apic_id, 0),
 DEFINE_PROP_INT32("thread-id", X86CPU, thread_id, 0),
 DEFINE_PROP_INT32("core-id", X86CPU, core_id, 0),
+DEFINE_PROP_INT32("cluster-id", X86CPU, cluster_id, 0),
 DEFINE_PROP_INT32("die-id", X86CPU, die_id, 0),
 DEFINE_PROP_INT32("socket-id", X86CPU, socket_id, 0),
 #else
 DEFINE_PROP_UINT32("apic-id", X86CPU, apic_id, UNASSIGNED_APIC_ID),
 DEFINE_PROP_INT32("thread-id", X86CPU, thread_id, -1),
 DEFINE_PROP_INT32("core-id", X86CPU, core_id, -1),
+DEFINE_PROP_INT32("cluster-id", X86CPU, cluster_id, -1),
 DEFINE_PROP_INT32("die-id", X86CPU, die_id, -1),
 DEFINE_PROP_INT32("socket-id", X86CPU, socket_id, -1),
 #endif
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index f3afea765982..8668e74e0c87 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1966,6 +1966,7 @@ struct ArchCPU {
 int32_t node_id; /* NUMA node this CPU belongs to */
 int32_t socket_id;
 int32_t die_id;
+int32_t cluster_id;
 int32_t core_id;
 int32_t thread_id;
 
-- 
2.34.1

[PATCH 17/18] i386: Use CPUCacheInfo.share_level to encode CPUID[0x8000001D].EAX[bits 25:14]

From: Zhao Liu 

CPUID[0x801D].EAX[bits 25:14] is used to represent the cache
topology for amd CPUs.

After cache models have topology information, we can use
CPUCacheInfo.share_level to decide which topology level to be encoded
into CPUID[0x801D].EAX[bits 25:14].

Signed-off-by: Zhao Liu 
---
 target/i386/cpu.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index d691c02e3c06..5816dc99b1d4 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -355,20 +355,12 @@ static void encode_cache_cpuid801d(CPUCacheInfo 
*cache,
uint32_t *eax, uint32_t *ebx,
uint32_t *ecx, uint32_t *edx)
 {
-uint32_t sharing_apic_ids;
 assert(cache->size == cache->line_size * cache->associativity *
   cache->partitions * cache->sets);
 
 *eax = CACHE_TYPE(cache->type) | CACHE_LEVEL(cache->level) |
(cache->self_init ? CACHE_SELF_INIT_LEVEL : 0);
-
-/* L3 is shared among multiple cores */
-if (cache->level == 3) {
-sharing_apic_ids = 1 << apicid_die_offset(topo_info);
-} else {
-sharing_apic_ids = 1 << apicid_core_offset(topo_info);
-}
-*eax |= (sharing_apic_ids - 1) << 14;
+*eax |= max_processor_ids_for_cache(cache, topo_info) << 14;
 
 assert(cache->line_size > 0);
 assert(cache->partitions > 0);
-- 
2.34.1

[PATCH 14/18] i386: Add cache topology info in CPUCacheInfo

From: Zhao Liu 

Currently, by default, the cache topology is encoded as:
1. i/d cache is shared in one core.
2. L2 cache is shared in one core.
3. L3 cache is shared in one die.

This default general setting has caused a misunderstanding, that is, the
cache topology is completely equated with a specific cpu topology, such
as the connection between L2 cache and core level, and the connection
between L3 cache and die level.

In fact, the settings of these topologies depend on the specific
platform and are not static. For example, on Alder Lake-P, every
four Atom cores share the same L2 cache.

Thus, we should explicitly define the corresponding cache topology for
different cache models to increase scalability.

Except legacy_l2_cache_cpuid2 (its default topo level is INVALID),
explicitly set the corresponding topology level for all other cache
models. In order to be compatible with the existing cache topology, set
the CORE level for the i/d cache, set the CORE level for L2 cache, and
set the DIE level for L3 cache.

The field for CPUID[4].EAX[bits 25:14] or CPUID[0x801D].EAX[bits
25:14] will be set based on CPUCacheInfo.share_level.

Signed-off-by: Zhao Liu 
---
 target/i386/cpu.c | 19 +++
 target/i386/cpu.h | 16 
 2 files changed, 35 insertions(+)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 27bbbc36b11c..364534e84b1b 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -433,6 +433,7 @@ static CPUCacheInfo legacy_l1d_cache = {
 .sets = 64,
 .partitions = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 };
 
 /*FIXME: CPUID leaf 0x8005 is inconsistent with leaves 2 & 4 */
@@ -447,6 +448,7 @@ static CPUCacheInfo legacy_l1d_cache_amd = {
 .partitions = 1,
 .lines_per_tag = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 };
 
 /* L1 instruction cache: */
@@ -460,6 +462,7 @@ static CPUCacheInfo legacy_l1i_cache = {
 .sets = 64,
 .partitions = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 };
 
 /*FIXME: CPUID leaf 0x8005 is inconsistent with leaves 2 & 4 */
@@ -474,6 +477,7 @@ static CPUCacheInfo legacy_l1i_cache_amd = {
 .partitions = 1,
 .lines_per_tag = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 };
 
 /* Level 2 unified cache: */
@@ -487,6 +491,7 @@ static CPUCacheInfo legacy_l2_cache = {
 .sets = 4096,
 .partitions = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 };
 
 /*FIXME: CPUID leaf 2 descriptor is inconsistent with CPUID leaf 4 */
@@ -509,6 +514,7 @@ static CPUCacheInfo legacy_l2_cache_amd = {
 .associativity = 16,
 .sets = 512,
 .partitions = 1,
+.share_level = CORE,
 };
 
 /* Level 3 unified cache: */
@@ -524,6 +530,7 @@ static CPUCacheInfo legacy_l3_cache = {
 .self_init = true,
 .inclusive = true,
 .complex_indexing = true,
+.share_level = DIE,
 };
 
 /* TLB definitions: */
@@ -1668,6 +1675,7 @@ static const CPUCaches epyc_cache_info = {
 .lines_per_tag = 1,
 .self_init = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 },
 .l1i_cache = &(CPUCacheInfo) {
 .type = INSTRUCTION_CACHE,
@@ -1680,6 +1688,7 @@ static const CPUCaches epyc_cache_info = {
 .lines_per_tag = 1,
 .self_init = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 },
 .l2_cache = &(CPUCacheInfo) {
 .type = UNIFIED_CACHE,
@@ -1690,6 +1699,7 @@ static const CPUCaches epyc_cache_info = {
 .partitions = 1,
 .sets = 1024,
 .lines_per_tag = 1,
+.share_level = CORE,
 },
 .l3_cache = &(CPUCacheInfo) {
 .type = UNIFIED_CACHE,
@@ -1703,6 +1713,7 @@ static const CPUCaches epyc_cache_info = {
 .self_init = true,
 .inclusive = true,
 .complex_indexing = true,
+.share_level = DIE,
 },
 };
 
@@ -1718,6 +1729,7 @@ static const CPUCaches epyc_rome_cache_info = {
 .lines_per_tag = 1,
 .self_init = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 },
 .l1i_cache = &(CPUCacheInfo) {
 .type = INSTRUCTION_CACHE,
@@ -1730,6 +1742,7 @@ static const CPUCaches epyc_rome_cache_info = {
 .lines_per_tag = 1,
 .self_init = 1,
 .no_invd_sharing = true,
+.share_level = CORE,
 },
 .l2_cache = &(CPUCacheInfo) {
 .type = UNIFIED_CACHE,
@@ -1740,6 +1753,7 @@ static const CPUCaches epyc_rome_cache_info = {
 .partitions = 1,
 .sets = 1024,
 .lines_per_tag = 1,
+.share_level = CORE,
 },
 .l3_cache = &(CPUCacheInfo) {
 .type = UNIFIED_CACHE,
@@ -1753,6 +1767,7 @@ static const CPUCaches epyc_rome_cache_info = {
 .self_init = true,
 .inclusive = true,
 .complex_indexing = true,
+.share_level = DIE,
 },
 };
 
@@ -1768,6 +1783,7 @@ static const CPUCaches epyc_milan_cache_info = {
 .lines_per_tag = 1,
 .

[PATCH 02/18] tests: Rename test-x86-cpuid.c to test-x86-apicid.c

From: Zhao Liu 

In fact, this unit tests APIC ID other than CPUID.
Rename to test-x86-apicid.c to make its name more in line with its
actual content.

Signed-off-by: Zhao Liu 
---
 MAINTAINERS| 2 +-
 tests/unit/meson.build | 4 ++--
 tests/unit/{test-x86-cpuid.c => test-x86-apicid.c} | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
 rename tests/unit/{test-x86-cpuid.c => test-x86-apicid.c} (99%)

diff --git a/MAINTAINERS b/MAINTAINERS
index c581c11a645a..a6a0c7fe5795 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1674,7 +1674,7 @@ F: include/hw/southbridge/piix.h
 F: hw/misc/sga.c
 F: hw/isa/apm.c
 F: include/hw/isa/apm.h
-F: tests/unit/test-x86-cpuid.c
+F: tests/unit/test-x86-apicid.c
 F: tests/qtest/test-x86-cpuid-compat.c
 
 PC Chipset
diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index ffa444f4323c..a9df2843e92e 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -20,8 +20,8 @@ tests = {
   'test-opts-visitor': [testqapi],
   'test-visitor-serialization': [testqapi],
   'test-bitmap': [],
-  # all code tested by test-x86-cpuid is inside topology.h
-  'test-x86-cpuid': [],
+  # all code tested by test-x86-apicid is inside topology.h
+  'test-x86-apicid': [],
   'test-cutils': [],
   'test-div128': [],
   'test-shift128': [],
diff --git a/tests/unit/test-x86-cpuid.c b/tests/unit/test-x86-apicid.c
similarity index 99%
rename from tests/unit/test-x86-cpuid.c
rename to tests/unit/test-x86-apicid.c
index bfabc0403a1a..2b104f86d7c2 100644
--- a/tests/unit/test-x86-cpuid.c
+++ b/tests/unit/test-x86-apicid.c
@@ -1,5 +1,5 @@
 /*
- *  Test code for x86 CPUID and Topology functions
+ *  Test code for x86 APIC ID and Topology functions
  *
  *  Copyright (c) 2012 Red Hat Inc.
  *
-- 
2.34.1

[PATCH 18/18] i386: Add new property to control L2 cache topo in CPUID.04H

From: Zhao Liu 

The property x-l2-cache-topo will be used to change the L2 cache
topology in CPUID.04H.

Now it allows user to set the L2 cache is shared in core level or
cluster level.

If user passes "-cpu x-l2-cache-topo=[core|cluster]" then older L2 cache
topology will be overrided by the new topology setting.

Here we expose to user "cluster" instead of "module", to be consistent
with "cluster-id" naming.

Since CPUID.04H is used by intel CPUs, this property is available on
intel CPUs as for now.

When necessary, it can be extended to CPUID.801DH for amd CPUs.

Signed-off-by: Zhao Liu 
---
 target/i386/cpu.c | 33 -
 target/i386/cpu.h |  2 ++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 5816dc99b1d4..cf84c720a431 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -240,12 +240,15 @@ static uint32_t max_processor_ids_for_cache(CPUCacheInfo 
*cache,
 case CORE:
 num_ids = 1 << apicid_core_offset(topo_info);
 break;
+case MODULE:
+num_ids = 1 << apicid_module_offset(topo_info);
+break;
 case DIE:
 num_ids = 1 << apicid_die_offset(topo_info);
 break;
 default:
 /*
- * Currently there is no use case for SMT, MODULE and PACKAGE, so use
+ * Currently there is no use case for SMT and PACKAGE, so use
  * assert directly to facilitate debugging.
  */
 g_assert_not_reached();
@@ -6633,6 +6636,33 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
 env->cache_info_amd.l3_cache = &legacy_l3_cache;
 }
 
+if (cpu->l2_cache_topo_level) {
+/*
+ * FIXME: Currently only supports changing CPUID[4] (for intel), and
+ * will support changing CPUID[0x801D] when necessary.
+ */
+if (!IS_INTEL_CPU(env)) {
+error_setg(errp, "only intel cpus supports x-l2-cache-topo");
+return;
+}
+
+if (!strcmp(cpu->l2_cache_topo_level, "core")) {
+env->cache_info_cpuid4.l2_cache->share_level = CORE;
+} else if (!strcmp(cpu->l2_cache_topo_level, "cluster")) {
+/*
+ * We expose to users "cluster" instead of "module", to be
+ * consistent with "cluster-id" naming.
+ */
+env->cache_info_cpuid4.l2_cache->share_level = MODULE;
+} else {
+error_setg(errp,
+   "x-l2-cache-topo doesn't support '%s', "
+   "and it only supports 'core' or 'cluster'",
+   cpu->l2_cache_topo_level);
+return;
+}
+}
+
 #ifndef CONFIG_USER_ONLY
 MachineState *ms = MACHINE(qdev_get_machine());
 qemu_register_reset(x86_cpu_machine_reset_cb, cpu);
@@ -7135,6 +7165,7 @@ static Property x86_cpu_properties[] = {
  false),
 DEFINE_PROP_BOOL("x-intel-pt-auto-level", X86CPU, intel_pt_auto_level,
  true),
+DEFINE_PROP_STRING("x-l2-cache-topo", X86CPU, l2_cache_topo_level),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 5a955431f759..aa7e96c586c7 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1987,6 +1987,8 @@ struct ArchCPU {
 int32_t thread_id;
 
 int32_t hv_max_vps;
+
+char *l2_cache_topo_level;
 };
 
 
-- 
2.34.1

[PATCH 10/18] i386: Update APIC ID parsing rule to support module level

From: Zhuocheng Ding 

Add the module level parsing support for APIC ID.

With this support, now the conversion between X86CPUTopoIDs,
X86CPUTopoInfo and APIC ID is completed.

Signed-off-by: Zhuocheng Ding 
Co-developed-by: Zhao Liu 
Signed-off-by: Zhao Liu 
---
 hw/i386/x86.c  | 19 ---
 include/hw/i386/topology.h | 36 ++--
 2 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index b66719230d57..68fce87d18ac 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -310,11 +310,11 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 
 /*
  * If APIC ID is not set,
- * set it based on socket/die/core/thread properties.
+ * set it based on socket/die/cluster/core/thread properties.
  */
 if (cpu->apic_id == UNASSIGNED_APIC_ID) {
-int max_socket = (ms->smp.max_cpus - 1) /
-smp_threads / smp_cores / ms->smp.dies;
+int max_socket = (ms->smp.max_cpus - 1) / smp_threads / smp_cores /
+ms->smp.clusters / ms->smp.dies;
 
 /*
  * die-id was optional in QEMU 4.0 and older, so keep it optional
@@ -378,15 +378,12 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 
 x86_topo_ids_from_apicid(cpu->apic_id, &topo_info, &topo_ids);
 
-/*
- * TODO: Before APIC ID supports module level parsing, there's no need
- * to expose module_id info.
- */
 error_setg(errp,
-"Invalid CPU [socket: %u, die: %u, core: %u, thread: %u] with"
-" APIC ID %" PRIu32 ", valid index range 0:%d",
-topo_ids.pkg_id, topo_ids.die_id, topo_ids.core_id, 
topo_ids.smt_id,
-cpu->apic_id, ms->possible_cpus->len - 1);
+"Invalid CPU [socket: %u, die: %u, module: %u, core: %u, thread: 
%u]"
+" with APIC ID %" PRIu32 ", valid index range 0:%d",
+topo_ids.pkg_id, topo_ids.die_id, topo_ids.module_id,
+topo_ids.core_id, topo_ids.smt_id, cpu->apic_id,
+ms->possible_cpus->len - 1);
 return;
 }
 
diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
index 5de905dc00d3..3cec97b377f2 100644
--- a/include/hw/i386/topology.h
+++ b/include/hw/i386/topology.h
@@ -79,12 +79,13 @@ static inline unsigned apicid_smt_width(X86CPUTopoInfo 
*topo_info)
 /* Bit width of the Core_ID field */
 static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info)
 {
-/*
- * TODO: Will separate module info from core_width when update
- * APIC ID with module level.
- */
-return apicid_bitwidth_for_count(topo_info->cores_per_module *
- topo_info->modules_per_die);
+return apicid_bitwidth_for_count(topo_info->cores_per_module);
+}
+
+/* Bit width of the Module_ID (cluster ID) field */
+static inline unsigned apicid_module_width(X86CPUTopoInfo *topo_info)
+{
+return apicid_bitwidth_for_count(topo_info->modules_per_die);
 }
 
 /* Bit width of the Die_ID field */
@@ -99,10 +100,16 @@ static inline unsigned apicid_core_offset(X86CPUTopoInfo 
*topo_info)
 return apicid_smt_width(topo_info);
 }
 
+/* Bit offset of the Module_ID (cluster ID) field */
+static inline unsigned apicid_module_offset(X86CPUTopoInfo *topo_info)
+{
+return apicid_core_offset(topo_info) + apicid_core_width(topo_info);
+}
+
 /* Bit offset of the Die_ID field */
 static inline unsigned apicid_die_offset(X86CPUTopoInfo *topo_info)
 {
-return apicid_core_offset(topo_info) + apicid_core_width(topo_info);
+return apicid_module_offset(topo_info) + apicid_module_width(topo_info);
 }
 
 /* Bit offset of the Pkg_ID (socket ID) field */
@@ -121,6 +128,7 @@ static inline apic_id_t 
x86_apicid_from_topo_ids(X86CPUTopoInfo *topo_info,
 {
 return (topo_ids->pkg_id  << apicid_pkg_offset(topo_info)) |
(topo_ids->die_id  << apicid_die_offset(topo_info)) |
+   (topo_ids->module_id << apicid_module_offset(topo_info)) |
(topo_ids->core_id << apicid_core_offset(topo_info)) |
topo_ids->smt_id;
 }
@@ -138,11 +146,6 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo 
*topo_info,
 unsigned nr_cores = topo_info->cores_per_module;
 unsigned nr_threads = topo_info->threads_per_core;
 
-/*
- * Currently smp for i386 doesn't support "clusters", modules_per_die is
- * only 1. Therefore, the module_id generated from the module topology will
- * not conflict with the module_id generated according to the apicid.
- */
 topo_ids->pkg_id = cpu_index / (nr_dies * nr_modules *
nr_cores * nr_threads);
 topo_ids->die_id = cpu_index / (nr_modules * nr_cores *
@@ -166,12 +169,9 @@ static inline void x86_topo_ids_from_apicid(apic_id_t 
apicid,
 topo_ids->core_id =
 (apicid >> apicid_core_offset(topo_info)) &
 ~(0xUL << apicid_core_width(topo

[PATCH 05/18] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()

From: Zhao Liu 

In cpu_x86_cpuid(), there are many variables in representing the cpu
topology, e.g., topo_info, cs->nr_cores/cs->nr_threads.

Since the names of cs->nr_cores/cs->nr_threads does not accurately
represent its meaning, the use of cs->nr_cores/cs->nr_threads is prone
to confusion and mistakes.

And the structure X86CPUTopoInfo names its memebers clearly, thus the
variable "topo_info" should be preferred.

Suggested-by: Robert Hoo 
Signed-off-by: Zhao Liu 
---
 target/i386/cpu.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 7833505092d8..4cda84eb96f1 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5215,11 +5215,15 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 uint32_t limit;
 uint32_t signature[3];
 X86CPUTopoInfo topo_info;
+uint32_t cpus_per_pkg;
 
 topo_info.dies_per_pkg = env->nr_dies;
 topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
 topo_info.threads_per_core = cs->nr_threads;
 
+cpus_per_pkg = topo_info.dies_per_pkg * topo_info.cores_per_die *
+   topo_info.threads_per_core;
+
 /* Calculate & apply limits for different index ranges */
 if (index >= 0xC000) {
 limit = env->cpuid_xlevel2;
@@ -5255,8 +5259,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *ecx |= CPUID_EXT_OSXSAVE;
 }
 *edx = env->features[FEAT_1_EDX];
-if (cs->nr_cores * cs->nr_threads > 1) {
-*ebx |= (cs->nr_cores * cs->nr_threads) << 16;
+if (cpus_per_pkg > 1) {
+*ebx |= cpus_per_pkg << 16;
 *edx |= CPUID_HT;
 }
 if (!cpu->enable_pmu) {
@@ -5293,10 +5297,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  */
 if (*eax & 31) {
 int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
-int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
-if (cs->nr_cores > 1) {
+int vcpus_per_socket = cpus_per_pkg;
+int cores_per_socket = topo_info.cores_per_die *
+   topo_info.dies_per_pkg;
+if (cores_per_socket > 1) {
 *eax &= ~0xFC00;
-*eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
+*eax |= (pow2ceil(cores_per_socket) - 1) << 26;
 }
 if (host_vcpus_per_cache > vcpus_per_socket) {
 *eax &= ~0x3FFC000;
@@ -5436,12 +5442,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 switch (count) {
 case 0:
 *eax = apicid_core_offset(&topo_info);
-*ebx = cs->nr_threads;
+*ebx = topo_info.threads_per_core;
 *ecx |= CPUID_TOPOLOGY_LEVEL_SMT;
 break;
 case 1:
 *eax = apicid_pkg_offset(&topo_info);
-*ebx = cs->nr_cores * cs->nr_threads;
+*ebx = cpus_per_pkg;
 *ecx |= CPUID_TOPOLOGY_LEVEL_CORE;
 break;
 default:
@@ -5472,7 +5478,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 switch (count) {
 case 0:
 *eax = apicid_core_offset(&topo_info);
-*ebx = cs->nr_threads;
+*ebx = topo_info.threads_per_core;
 *ecx |= CPUID_TOPOLOGY_LEVEL_SMT;
 break;
 case 1:
@@ -5482,7 +5488,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 break;
 case 2:
 *eax = apicid_pkg_offset(&topo_info);
-*ebx = cs->nr_cores * cs->nr_threads;
+*ebx = cpus_per_pkg;
 *ecx |= CPUID_TOPOLOGY_LEVEL_DIE;
 break;
 default:
@@ -5707,7 +5713,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  * discards multiple thread information if it is set.
  * So don't set it here for Intel to make Linux guests happy.
  */
-if (cs->nr_cores * cs->nr_threads > 1) {
+if (cpus_per_pkg > 1) {
 if (env->cpuid_vendor1 != CPUID_VENDOR_INTEL_1 ||
 env->cpuid_vendor2 != CPUID_VENDOR_INTEL_2 ||
 env->cpuid_vendor3 != CPUID_VENDOR_INTEL_3) {
@@ -5769,7 +5775,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  *eax |= (cpu_x86_virtual_addr_width(env) << 8);
 }
 *ebx = env->features[FEAT_8000_0008_EBX];
-if (cs->nr_cores * cs->nr_threads > 1) {
+if (cpus_per_pkg > 1) {
 /*
  * Bits 15:12 is "The number of bits in the initial
  * Core::X86::Apic::ApicId[ApicId] value that indicate
@@ -5777,7 +5783,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
  * Bits 7:0

[PATCH 13/18] hw/i386/pc: Support smp.clusters for x86 PC machine

From: Zhuocheng Ding 

As module-level topology support is added to X86CPU, now we can enable
the support for the cluster parameter on PC machines. With this support,
we can define a 5-level x86 CPU topology with "-smp":

-smp cpus=*,maxcpus=*,sockets=*,dies=*,clusters=*,cores=*,threads=*.

Additionally, add the 5-level topology example in description of "-smp".

Signed-off-by: Zhuocheng Ding 
Signed-off-by: Zhao Liu 
---
 hw/i386/pc.c|  1 +
 qemu-options.hx | 10 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6e592bd969aa..c329df56ebd2 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1929,6 +1929,7 @@ static void pc_machine_class_init(ObjectClass *oc, void 
*data)
 mc->default_cpu_type = TARGET_DEFAULT_CPU_TYPE;
 mc->nvdimm_supported = true;
 mc->smp_props.dies_supported = true;
+mc->smp_props.clusters_supported = true;
 mc->default_ram_id = "pc.ram";
 
 object_class_property_add(oc, PC_MACHINE_MAX_RAM_BELOW_4G, "size",
diff --git a/qemu-options.hx b/qemu-options.hx
index d59d19704bc5..3700e1aa97ea 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -312,14 +312,14 @@ SRST
 -smp 8,sockets=2,cores=2,threads=2,maxcpus=8
 
 The following sub-option defines a CPU topology hierarchy (2 sockets
-totally on the machine, 2 dies per socket, 2 cores per die, 2 threads
-per core) for PC machines which support sockets/dies/cores/threads.
-Some members of the option can be omitted but their values will be
-automatically computed:
+totally on the machine, 2 dies per socket, 2 clusters per die, 2 cores per
+cluster, 2 threads per core) for PC machines which support sockets/dies
+/clusters/cores/threads. Some members of the option can be omitted but
+their values will be automatically computed:
 
 ::
 
--smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16
+-smp 32,sockets=2,dies=2,clusters=2,cores=2,threads=2,maxcpus=32
 
 The following sub-option defines a CPU topology hierarchy (2 sockets
 totally on the machine, 2 clusters per socket, 2 cores per cluster,
-- 
2.34.1

[PATCH 01/18] machine: Fix comment of machine_parse_smp_config()

From: Zhao Liu 

Now smp supports dies and clusters, so add description about these 2
levels in the comment of machine_parse_smp_config().

Fixes: 864c3b5 (hw/core/machine: Introduce CPU cluster topology support)
Suggested-by: Robert Hoo 
Signed-off-by: Zhao Liu 
---
 hw/core/machine-smp.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/core/machine-smp.c b/hw/core/machine-smp.c
index c3dab007dadc..3fd9e641efde 100644
--- a/hw/core/machine-smp.c
+++ b/hw/core/machine-smp.c
@@ -51,8 +51,8 @@ static char *cpu_hierarchy_to_string(MachineState *ms)
  * machine_parse_smp_config: Generic function used to parse the given
  *   SMP configuration
  *
- * Any missing parameter in "cpus/maxcpus/sockets/cores/threads" will be
- * automatically computed based on the provided ones.
+ * Any missing parameter in "cpus/maxcpus/sockets/dies/clusters/cores/threads"
+ * will be automatically computed based on the provided ones.
  *
  * In the calculation of omitted sockets/cores/threads: we prefer sockets
  * over cores over threads before 6.2, while preferring cores over sockets
@@ -66,7 +66,8 @@ static char *cpu_hierarchy_to_string(MachineState *ms)
  *
  * For compatibility, apart from the parameters that will be computed, newly
  * introduced topology members which are likely to be target specific should
- * be directly set as 1 if they are omitted (e.g. dies for PC since 4.1).
+ * be directly set as 1 if they are omitted (e.g. dies for PC since v4.1 and
+ * clusters for arm since v7.0).
  */
 void machine_parse_smp_config(MachineState *ms,
   const SMPConfiguration *config, Error **errp)
-- 
2.34.1

[PATCH 16/18] i386: Fix NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]

From: Zhao Liu 

>From AMD's APM, NumSharingCache (CPUID[0x801D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

>From the description above, the caculation of this feild should be same
as CPUID[4].EAX[bits 25:14] for intel cpus. So also use the offsets of
APIC ID to caculate this field.

Note: I don't have the hardware available, hope someone can help me to
confirm whether this calculation is correct, thanks!

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
 Information

Signed-off-by: Zhao Liu 
---
 target/i386/cpu.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 96ef96860604..d691c02e3c06 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -355,7 +355,7 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache,
uint32_t *eax, uint32_t *ebx,
uint32_t *ecx, uint32_t *edx)
 {
-uint32_t l3_threads;
+uint32_t sharing_apic_ids;
 assert(cache->size == cache->line_size * cache->associativity *
   cache->partitions * cache->sets);
 
@@ -364,13 +364,11 @@ static void encode_cache_cpuid801d(CPUCacheInfo 
*cache,
 
 /* L3 is shared among multiple cores */
 if (cache->level == 3) {
-l3_threads = topo_info->modules_per_die *
- topo_info->cores_per_module *
- topo_info->threads_per_core;
-*eax |= (l3_threads - 1) << 14;
+sharing_apic_ids = 1 << apicid_die_offset(topo_info);
 } else {
-*eax |= ((topo_info->threads_per_core - 1) << 14);
+sharing_apic_ids = 1 << apicid_core_offset(topo_info);
 }
+*eax |= (sharing_apic_ids - 1) << 14;
 
 assert(cache->line_size > 0);
 assert(cache->partitions > 0);
-- 
2.34.1

[PATCH 08/18] i386: Support module_id in X86CPUTopoIDs

From: Zhuocheng Ding 

Add module_id member in X86CPUTopoIDs.

module_id can be parsed from APIC ID, so before updating the parsing
rule of APIC ID, temporarily set the module_id generated in this way to
0.

module_id can be also generated from cpu topology, and before i386
supports "clusters" in smp, the default "clusters per die" is only 1,
thus the module_id generated in this way is 0, so that it will not
conflict with the module_id generated by APIC ID.

Signed-off-by: Zhuocheng Ding 
Co-developed-by: Zhao Liu 
Signed-off-by: Zhao Liu 
---
 hw/i386/x86.c  | 17 +
 include/hw/i386/topology.h | 24 
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 3f6e0dd7b827..b66719230d57 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -361,6 +361,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 topo_ids.die_id = cpu->die_id;
 topo_ids.core_id = cpu->core_id;
 topo_ids.smt_id = cpu->thread_id;
+
+/*
+ * TODO: This is the temporary initialization for topo_ids.module_id to
+ * avoid "maybe-uninitialized" compilation errors. Will remove when
+ * X86CPU supports cluster_id.
+ */
+topo_ids.module_id = 0;
+
 cpu->apic_id = x86_apicid_from_topo_ids(&topo_info, &topo_ids);
 }
 
@@ -369,6 +377,11 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev,
 MachineState *ms = MACHINE(x86ms);
 
 x86_topo_ids_from_apicid(cpu->apic_id, &topo_info, &topo_ids);
+
+/*
+ * TODO: Before APIC ID supports module level parsing, there's no need
+ * to expose module_id info.
+ */
 error_setg(errp,
 "Invalid CPU [socket: %u, die: %u, core: %u, thread: %u] with"
 " APIC ID %" PRIu32 ", valid index range 0:%d",
@@ -494,6 +507,10 @@ const CPUArchIdList 
*x86_possible_cpu_arch_ids(MachineState *ms)
 ms->possible_cpus->cpus[i].props.has_die_id = true;
 ms->possible_cpus->cpus[i].props.die_id = topo_ids.die_id;
 }
+if (ms->smp.clusters > 1) {
+ms->possible_cpus->cpus[i].props.has_cluster_id = true;
+ms->possible_cpus->cpus[i].props.cluster_id = topo_ids.module_id;
+}
 ms->possible_cpus->cpus[i].props.has_core_id = true;
 ms->possible_cpus->cpus[i].props.core_id = topo_ids.core_id;
 ms->possible_cpus->cpus[i].props.has_thread_id = true;
diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
index bbb00dc4aad8..b0174c18b7bd 100644
--- a/include/hw/i386/topology.h
+++ b/include/hw/i386/topology.h
@@ -48,6 +48,7 @@ typedef uint32_t apic_id_t;
 typedef struct X86CPUTopoIDs {
 unsigned pkg_id;
 unsigned die_id;
+unsigned module_id;
 unsigned core_id;
 unsigned smt_id;
 } X86CPUTopoIDs;
@@ -134,12 +135,21 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo 
*topo_info,
  X86CPUTopoIDs *topo_ids)
 {
 unsigned nr_dies = topo_info->dies_per_pkg;
-unsigned nr_cores = topo_info->cores_per_module *
-topo_info->modules_per_die;
+unsigned nr_modules = topo_info->modules_per_die;
+unsigned nr_cores = topo_info->cores_per_module;
 unsigned nr_threads = topo_info->threads_per_core;
 
-topo_ids->pkg_id = cpu_index / (nr_dies * nr_cores * nr_threads);
-topo_ids->die_id = cpu_index / (nr_cores * nr_threads) % nr_dies;
+/*
+ * Currently smp for i386 doesn't support "clusters", modules_per_die is
+ * only 1. Therefore, the module_id generated from the module topology will
+ * not conflict with the module_id generated according to the apicid.
+ */
+topo_ids->pkg_id = cpu_index / (nr_dies * nr_modules *
+   nr_cores * nr_threads);
+topo_ids->die_id = cpu_index / (nr_modules * nr_cores *
+   nr_threads) % nr_dies;
+topo_ids->module_id = cpu_index / (nr_cores * nr_threads) %
+  nr_modules;
 topo_ids->core_id = cpu_index / nr_threads % nr_cores;
 topo_ids->smt_id = cpu_index % nr_threads;
 }
@@ -156,6 +166,12 @@ static inline void x86_topo_ids_from_apicid(apic_id_t 
apicid,
 topo_ids->core_id =
 (apicid >> apicid_core_offset(topo_info)) &
 ~(0xUL << apicid_core_width(topo_info));
+/*
+ * TODO: This is the temporary initialization for topo_ids.module_id to
+ * avoid "maybe-uninitialized" compilation errors. Will remove when APIC
+ * ID supports module level parsing.
+ */
+topo_ids->module_id = 0;
 topo_ids->die_id =
 (apicid >> apicid_die_offset(topo_info)) &
 ~(0xUL << apicid_die_width(topo_info));
-- 
2.34.1

[PATCH 12/18] tests: Add test case of APIC ID for module level parsing

From: Zhuocheng Ding 

After i386 supports module level, it's time to add the test for module
level's parsing.

Signed-off-by: Zhuocheng Ding 
Co-developed-by: Zhao Liu 
Signed-off-by: Zhao Liu 
---
 tests/unit/test-x86-apicid.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/tests/unit/test-x86-apicid.c b/tests/unit/test-x86-apicid.c
index f21b8a5d95c2..55b731ccae55 100644
--- a/tests/unit/test-x86-apicid.c
+++ b/tests/unit/test-x86-apicid.c
@@ -37,6 +37,7 @@ static void test_topo_bits(void)
 topo_info = (X86CPUTopoInfo) {1, 1, 1, 1};
 g_assert_cmpuint(apicid_smt_width(&topo_info), ==, 0);
 g_assert_cmpuint(apicid_core_width(&topo_info), ==, 0);
+g_assert_cmpuint(apicid_module_width(&topo_info), ==, 0);
 g_assert_cmpuint(apicid_die_width(&topo_info), ==, 0);
 
 topo_info = (X86CPUTopoInfo) {1, 1, 1, 1};
@@ -74,13 +75,22 @@ static void test_topo_bits(void)
 topo_info = (X86CPUTopoInfo) {1, 1, 33, 2};
 g_assert_cmpuint(apicid_core_width(&topo_info), ==, 6);
 
-topo_info = (X86CPUTopoInfo) {1, 1, 30, 2};
+topo_info = (X86CPUTopoInfo) {1, 6, 30, 2};
+g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3);
+topo_info = (X86CPUTopoInfo) {1, 7, 30, 2};
+g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3);
+topo_info = (X86CPUTopoInfo) {1, 8, 30, 2};
+g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3);
+topo_info = (X86CPUTopoInfo) {1, 9, 30, 2};
+g_assert_cmpuint(apicid_module_width(&topo_info), ==, 4);
+
+topo_info = (X86CPUTopoInfo) {1, 6, 30, 2};
 g_assert_cmpuint(apicid_die_width(&topo_info), ==, 0);
-topo_info = (X86CPUTopoInfo) {2, 1, 30, 2};
+topo_info = (X86CPUTopoInfo) {2, 6, 30, 2};
 g_assert_cmpuint(apicid_die_width(&topo_info), ==, 1);
-topo_info = (X86CPUTopoInfo) {3, 1, 30, 2};
+topo_info = (X86CPUTopoInfo) {3, 6, 30, 2};
 g_assert_cmpuint(apicid_die_width(&topo_info), ==, 2);
-topo_info = (X86CPUTopoInfo) {4, 1, 30, 2};
+topo_info = (X86CPUTopoInfo) {4, 6, 30, 2};
 g_assert_cmpuint(apicid_die_width(&topo_info), ==, 2);
 
 /* build a weird topology and see if IDs are calculated correctly
@@ -91,6 +101,7 @@ static void test_topo_bits(void)
 topo_info = (X86CPUTopoInfo) {1, 1, 6, 3};
 g_assert_cmpuint(apicid_smt_width(&topo_info), ==, 2);
 g_assert_cmpuint(apicid_core_offset(&topo_info), ==, 2);
+g_assert_cmpuint(apicid_module_offset(&topo_info), ==, 5);
 g_assert_cmpuint(apicid_die_offset(&topo_info), ==, 5);
 g_assert_cmpuint(apicid_pkg_offset(&topo_info), ==, 5);
 
-- 
2.34.1

[PATCH 04/18] i386/cpu: Fix number of addressable IDs in CPUID.04H

From: Zhao Liu 

For i-cache and d-cache, the maximum IDs for CPUs sharing cache (
CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are
both 0, and this means i-cache and d-cache are shared in the SMT level.
This is correct if there's single thread per core, but is wrong for the
hyper threading case (one core contains multiple threads) since the
i-cache and d-cache are shared in the core level other than SMT level.

Therefore, in order to be compatible with both multi-threaded and
single-threaded situations, we should set i-cache and d-cache be shared
at the core level by default.

Refer to the fixes of cache_info_passthrough ([1], [2]) and SDM, the
CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] should use the
nearest power-of-2 integer.

The nearest power-of-2 integer can be caculated by pow2ceil() or by
using APIC ID offset (like L3 topology using 1 << die_offset [3]).

But in fact, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26]
are associated with APIC ID. For example, in linux kernel, the field
"num_threads_sharing" (Bits 25 - 14) is parsed with APIC ID. And for
another example, on Alder Lake P, the CPUID.04H:EAX[bits 31:26] is not
matched with actual core numbers and it's caculated by:
"(1 << (pkg_offset - core_offset)) - 1".

Therefore the offset of APIC ID should be preferred to caculate nearest
power-of-2 integer for CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits
31:26]:
1. d/i cache is shared in a core, 1 << core_offset should be used
   instand of "1" in encode_cache_cpuid4() for CPUID.04H.00H:EAX[bits
   25:14] and CPUID.04H.01H:EAX[bits 25:14].
2. L2 cache is supposed to be shared in a core as for now, thereby
   1 << core_offset should also be used instand of "cs->nr_threads" in
   encode_cache_cpuid4() for CPUID.04H.02H:EAX[bits 25:14].
3. Similarly, the value for CPUID.04H:EAX[bits 31:26] should also be
   replaced by the offsets upper SMT level in APIC ID.

And since [1] and [2] are good enough to make cache_into_passthrough
work well, its "pow2ceil()" uses are enough so that they're no need to
be replaced by APIC ID offset way.

[1]: efb3934 (x86: cpu: make sure number of addressable IDs for processor cores 
meets the spec)
[2]: d7caf13 (x86: cpu: fixup number of addressable IDs for logical processors 
sharing cache)
[3]: d65af28 (i386: Update new x86_apicid parsing rules with die_offset support)

Fixes: 7e3482f (i386: Helpers to encode cache information consistently)
Suggested-by: Robert Hoo 
Signed-off-by: Zhao Liu 
---
 target/i386/cpu.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 29afec12c281..7833505092d8 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5212,7 +5212,6 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 {
 X86CPU *cpu = env_archcpu(env);
 CPUState *cs = env_cpu(env);
-uint32_t die_offset;
 uint32_t limit;
 uint32_t signature[3];
 X86CPUTopoInfo topo_info;
@@ -5308,27 +5307,38 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *eax = *ebx = *ecx = *edx = 0;
 } else {
 *eax = 0;
+int addressable_cores_offset = apicid_pkg_offset(&topo_info) -
+   apicid_core_offset(&topo_info);
+int core_offset, die_offset;
+
 switch (count) {
 case 0: /* L1 dcache info */
+core_offset = apicid_core_offset(&topo_info);
 encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache,
-1, cs->nr_cores,
+(1 << core_offset),
+(1 << addressable_cores_offset),
 eax, ebx, ecx, edx);
 break;
 case 1: /* L1 icache info */
+core_offset = apicid_core_offset(&topo_info);
 encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache,
-1, cs->nr_cores,
+(1 << core_offset),
+(1 << addressable_cores_offset),
 eax, ebx, ecx, edx);
 break;
 case 2: /* L2 cache info */
+core_offset = apicid_core_offset(&topo_info);
 encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache,
-cs->nr_threads, cs->nr_cores,
+(1 << core_offset),
+(1 << addressable_cores_offset),
 eax, ebx, ecx, edx);
 break;
 case 3: /* L3 cache info */
 die_offset = apicid_die_offset(&topo_info);
 if (cpu->enable_l3_cache) {
 encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache,
-

[PATCH 15/18] i386: Use CPUCacheInfo.share_level to encode CPUID[4].EAX[bits 25:14]

From: Zhao Liu 

CPUID[4].EAX[bits 25:14] is used to represent the cache topology for
intel CPUs.

After cache models have topology information, we can use
CPUCacheInfo.share_level to decide which topology level to be encoded
into CPUID[4].EAX[bits 25:14].

Additionally, since maximum_processor_id (original "num_apic_ids") is
parsed based on cpu topology levels, which are verified when parsing
smp, it's no need to check this value by "assert(num_apic_ids > 0)"
again, so remove this assert.

Signed-off-by: Zhao Liu 
---
 target/i386/cpu.c | 55 +++
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 364534e84b1b..96ef96860604 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -231,22 +231,50 @@ static uint8_t cpuid2_cache_descriptor(CPUCacheInfo 
*cache)
((t) == UNIFIED_CACHE) ? CACHE_TYPE_UNIFIED : \
0 /* Invalid value */)
 
+static uint32_t max_processor_ids_for_cache(CPUCacheInfo *cache,
+X86CPUTopoInfo *topo_info)
+{
+uint32_t num_ids = 0;
+
+switch (cache->share_level) {
+case CORE:
+num_ids = 1 << apicid_core_offset(topo_info);
+break;
+case DIE:
+num_ids = 1 << apicid_die_offset(topo_info);
+break;
+default:
+/*
+ * Currently there is no use case for SMT, MODULE and PACKAGE, so use
+ * assert directly to facilitate debugging.
+ */
+g_assert_not_reached();
+}
+
+return num_ids - 1;
+}
+
+static uint32_t max_core_ids_in_package(X86CPUTopoInfo *topo_info)
+{
+uint32_t num_cores = 1 << (apicid_pkg_offset(topo_info) -
+   apicid_core_offset(topo_info));
+return num_cores - 1;
+}
 
 /* Encode cache info for CPUID[4] */
 static void encode_cache_cpuid4(CPUCacheInfo *cache,
-int num_apic_ids, int num_cores,
+X86CPUTopoInfo *topo_info,
 uint32_t *eax, uint32_t *ebx,
 uint32_t *ecx, uint32_t *edx)
 {
 assert(cache->size == cache->line_size * cache->associativity *
   cache->partitions * cache->sets);
 
-assert(num_apic_ids > 0);
 *eax = CACHE_TYPE(cache->type) |
CACHE_LEVEL(cache->level) |
(cache->self_init ? CACHE_SELF_INIT_LEVEL : 0) |
-   ((num_cores - 1) << 26) |
-   ((num_apic_ids - 1) << 14);
+   (max_core_ids_in_package(topo_info) << 26) |
+   (max_processor_ids_for_cache(cache, topo_info) << 14);
 
 assert(cache->line_size > 0);
 assert(cache->partitions > 0);
@@ -5335,38 +5363,27 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *eax = *ebx = *ecx = *edx = 0;
 } else {
 *eax = 0;
-int addressable_cores_offset = apicid_pkg_offset(&topo_info) -
-   apicid_core_offset(&topo_info);
-int core_offset, die_offset;
 
 switch (count) {
 case 0: /* L1 dcache info */
-core_offset = apicid_core_offset(&topo_info);
 encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache,
-(1 << core_offset),
-(1 << addressable_cores_offset),
+&topo_info,
 eax, ebx, ecx, edx);
 break;
 case 1: /* L1 icache info */
-core_offset = apicid_core_offset(&topo_info);
 encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache,
-(1 << core_offset),
-(1 << addressable_cores_offset),
+&topo_info,
 eax, ebx, ecx, edx);
 break;
 case 2: /* L2 cache info */
-core_offset = apicid_core_offset(&topo_info);
 encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache,
-(1 << core_offset),
-(1 << addressable_cores_offset),
+&topo_info,
 eax, ebx, ecx, edx);
 break;
 case 3: /* L3 cache info */
-die_offset = apicid_die_offset(&topo_info);
 if (cpu->enable_l3_cache) {
 encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache,
-(1 << die_offset),
-(1 << addressable_cores_offset),
+&topo_info,
 eax, ebx, ecx, edx);
 break;
 }
-- 
2.34.1

Re: [PATCH] vhost-user-fs: add capability to allow migration

Anton Kuchin  wrote:
> On 01/02/2023 16:26, Juan Quintela wrote:
>> Anton Kuchin  wrote:
>>> On 19/01/2023 18:02, Stefan Hajnoczi wrote:
 On Thu, 19 Jan 2023 at 10:29, Anton Kuchin  
 wrote:
> On 19/01/2023 16:30, Stefan Hajnoczi wrote:
>> On Thu, 19 Jan 2023 at 07:43, Anton Kuchin  
>> wrote:
>>> On 18/01/2023 17:52, Stefan Hajnoczi wrote:
 On Sun, 15 Jan 2023 at 12:21, Anton Kuchin 
  wrote:
>> Once told that, I think that you are making your live harder in the
>> future when you add the other migratable devices.
>>
>> I am assuming here that your "underlying device" is:
>>
>> enum VhostUserFSType {
>>  VHOST_USER_NO_MIGRATABLE = 0,
>>  // The one we are doing here
>>  VHOST_USER_EXTERNAL_MIGRATABLE,
>>  // The one you describe for the future
>>  VHOST_USER_INTERNAL_MIGRATABLE,
>> };
>>
>> struct VHostUserFS {
>>  /*< private >*/
>>  VirtIODevice parent;
>>  VHostUserFSConf conf;
>>  struct vhost_virtqueue *vhost_vqs;
>>  struct vhost_dev vhost_dev;
>>  VhostUserState vhost_user;
>>  VirtQueue **req_vqs;
>>  VirtQueue *hiprio_vq;
>>  int32_t bootindex;
>>  enum migration_type;
>>  /*< public >*/
>> };
>>
>>
>> static int vhost_user_fs_pre_save(void *opaque)
>> {
>>  VHostUserFS *s = opaque;
>>
>>  if (s->migration_type == VHOST_USER_NO_MIGRATABLE)) {
>>  error_report("Migration of vhost-user-fs devices requires internal 
>> FUSE "
>>   "state of backend to be preserved. If orchestrator can 
>> "
>>   "guarantee this (e.g. dst connects to the same backend 
>> "
>>   "instance or backend state is migrated) set 
>> 'vhost-user-fs' "
>>   "migration capability to true to enable migration.");
>>  return -1;
>>  }
>>  if (s->migration_type == VHOST_USER_EXTERNAL_MIGRATABLE) {
>>  return 0;
>>  }
>>  if (s->migration_type == VHOST_USER_INTERNAL_MIGRATABLE) {
>>  error_report("still not implemented");
>>  return -1;
>>  }
>>  assert("we don't reach here");
>> }
>>
>> Your initial vmstateDescription
>>
>> static const VMStateDescription vuf_vmstate = {
>>  .name = "vhost-user-fs",
>>  .unmigratable = 1,
>>  .minimum_version_id = 0,
>>  .version_id = 0,
>>  .fields = (VMStateField[]) {
>>  VMSTATE_INT8(migration_type, struct VHostUserFS),
>>  VMSTATE_VIRTIO_DEVICE,
>>  VMSTATE_END_OF_LIST()
>>  },
>>  .pre_save = vhost_user_fs_pre_save,
>> };
>>
>> And later you change it to something like:
>>
>> static bool vhost_fs_user_internal_state_needed(void *opaque)
>> {
>>  VHostUserFS *s = opaque;
>>
>>  return s->migration_type == VMOST_USER_INTERNAL_MIGRATABLE;
>> }
>>
>> static const VMStateDescription vmstate_vhost_user_fs_internal_sub = {
>>  .name = "vhost-user-fs/internal",
>>  .version_id = 1,
>>  .minimum_version_id = 1,
>>  .needed = &vhost_fs_user_internal_state_needed,
>>  .fields = (VMStateField[]) {
>>   // Whatever
>>  VMSTATE_END_OF_LIST()
>>  }
>> };
>>
>> static const VMStateDescription vuf_vmstate = {
>>  .name = "vhost-user-fs",
>>  .minimum_version_id = 0,
>>  .version_id = 0,
>>  .fields = (VMStateField[]) {
>>  VMSTATE_INT8(migration_type, struct VHostUserFS),
>>  VMSTATE_VIRTIO_DEVICE,
>>  VMSTATE_END_OF_LIST()
>>  },
>>  .pre_save = vhost_user_fs_pre_save,
>>  .subsections = (const VMStateDescription*[]) {
>>  &vmstate_vhost_user_fs_internal_sub,
>>  NULL
>>  }
>> };
>>
>> And you are done.
>>
>> I will propose to use a property to set migration_type, but I didn't
>> want to write the code right now.
>>
>> I think that this proposal will make Stephan happy, and it is just
>> adding and extra uint8_t that is helpul to implement everything.
>
> That is exactly the approach I'm trying to implement it right now. Single
> flag can be convenient for orchestrator but makes it too hard to account in
> all cases for all devices on qemu side without breaking future
> compatibility.
> So I'm rewriting it with properties.

Nice.  That would be my proposal.  Just a bit complicated for a proof of 
concetp.

> BTW do you think each vhost-user device should have its own enum of
> migration
> types or maybe we could make them common for all device types?

I will put it for vhost-user, because as far as I know nobody else is
asking for this functionality.

The most similar device that I can think of right now is vfio devices.
But they are implemeting callbacks to save hardware device state, and
they go device by device, i.e. there is nothing general there.

Later, Juan.

Re: [PATCH v3 2/3] hvf: implement guest debugging on Apple Silicon hosts

2023-02-02 Thread Mads Ynddal



> Are you running the Linux guest on multiple cores? If yes, could you
> check if the issue persists also when using a single core?

Yes, I was running with 2 cores. I just tested with 1 and 2 several times. I
haven't reproduced the bug with 1 core, but happens around 70% of the time with
2 cores.

—
Mads Ynddal

Re: [PULL 0/7] Python patches

On Wed, 25 Jan 2023 at 02:34, John Snow  wrote:
>
> The following changes since commit 13356edb87506c148b163b8c7eb0695647d00c2a:
>
>   Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into 
> staging (2023-01-24 09:45:33 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/jsnow/qemu.git tags/python-pull-request
>
> for you to fetch changes up to bd4c0ef409140bd1be393407c04005ac077d4574:
>
>   python/qemu/machine: use socketpair() for QMP by default (2023-01-24 
> 13:37:13 -0500)
>
> 
> Python
>
> Bits and pieces, kibbles'n'bits
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/8.0
for any user-visible changes.

-- PMM

Re: [PATCH v3] migration: Remove res_compatible parameter

Vladimir Sementsov-Ogievskiy  wrote:
> On 1/30/23 11:06, Juan Quintela wrote:
>> It was only used for RAM, and in that case, it means that this amount
>> of data was sent for memory.  Just delete the field in all callers.
>
> Could you describe, why it's safe to change the behavior for RAM?

I will start for the change in migration.c

-uint64_t pend_pre, pend_compat, pend_post;
+uint64_t pend_pre, pend_post;
 bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
-qemu_savevm_state_pending_estimate(&pend_pre, &pend_compat, &pend_post);
-uint64_t pending_size = pend_pre + pend_compat + pend_post;

Here we add the three to pending_size, so it don't matter.

+qemu_savevm_state_pending_estimate(&pend_pre, &pend_post);
+uint64_t pending_size = pend_pre + pend_post;
 
-trace_migrate_pending_estimate(pending_size,
-   pend_pre, pend_compat, pend_post);
+trace_migrate_pending_estimate(pending_size, pend_pre, pend_post);

Trace, we don't care again.

-if (pend_pre + pend_compat <= s->threshold_size) {

We do the test on pend_pre + pend_compat.
So, the only place where we are using pend_compat alone is on trace
messages.

I guess we can agree that changing trace messages don't matter.


-qemu_savevm_state_pending_exact(&pend_pre, &pend_compat, &pend_post);
-pending_size = pend_pre + pend_compat + pend_post;
-trace_migrate_pending_exact(pending_size,
-pend_pre, pend_compat, pend_post);
+if (pend_pre <= s->threshold_size) {
+qemu_savevm_state_pending_exact(&pend_pre, &pend_post);
+pending_size = pend_pre + pend_post;
+trace_migrate_pending_exact(pending_size, pend_pre, pend_post);


Ok, Now go to ram.c changes:

The only change that we do is that if we are on postcopy, we add the

@@ -3448,9 +3444,9 @@ static void ram_state_pending_exact(void *opaque,
 
 if (migrate_postcopy_ram()) {
 /* We can do postcopy, and all the data is postcopiable */
-*res_compatible += remaining_size;
+*res_postcopy += remaining_size;
 } else {
-*res_precopy_only += remaining_size;
+*res_precopy += remaining_size;
 }
 }

The only use of res_compatible was when we were on postocpy, nothing
else sets it.

But when we are on postcopy, we can send that pages trough postcopy, so
we add them there and if we are on precopy, res_compatible was already zero.

So I think that is ok, i.e. no behaviour change.

What do you think?

> Also, I think it would be a lot better to split the change for RAM
> (s/res_compatible/res_postcopy) in one patch, and then drop the
> totally unused res_compatible file in the second patch

Ok, will do for next version.

Later, Juan.

Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode

2023-02-02 Thread Fiona Ebner

Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
> On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote:
>> + Den
>>
>> Den, I remember we thought about that, and probably had a solution?
>>
>> Another possible approach to get benefits from both modes is to switch
>> to blocking mode after first loop of copying. [*]
>>
>> Hmm. Thinking about proposed solution it seems, that [*] is better.
>> The main reason of "write-blocking" mode is to force convergence of
>> the mirror job. But you lose this thing if you postpone switching to
>> the moment when mirror becomes READY: we may never reach it.
>>
>>
>>
>> Or, may be we can selectively skip or block guest writes, to keep some
>> specified level of convergence?
>>
>> Or, we can start in "background" mode, track the speed of convergence
>> (like dirty-delta per minute), and switch to blocking if speed becomes
>> less than some threshold.
>>
> 
> That is absolutely correct. It would be very kind to start with
> asynchronous mirror and switch to the synchronous one after
> specific amount of loops. This should be somehow measured.
> 
> Originally we have thought about switching after the one loop
> (when basic background sending is done), but may be 2 iterations
> (initial + first one) would be better.
> 
> Talking about this specific approach and our past experience
> I should note that reaching completion point in the
> asynchronous mode was a real pain and slow down for the guest.
> We have intentionally switched at that time to the sync mode
> for that purpose. Thus this patch is a "worst half-way" for
> us.

While the asynchronous mode can take longer of course, the slowdown of
the guest is bigger with the synchronous mode, or what am I missing?

We have been using asynchronous mode for years (starting migration after
all drive-mirror jobs are READY) and AFAIK our users did not complain
about them not reaching READY or heavy slowdowns in the guest. We had
two reports about an assertion failure recently when drive-mirror still
got work left at the time the blockdrives are inactivated by migration.

> Frankly speaking I would say that this switch could be considered
> NOT QEMU job and we should just send a notification (event) for the
> completion of the each iteration and management software should
> take a decision to switch from async mode to the sync one.

Unfortunately, our management software is a bit limited in that regard
currently and making listening for events available in the necessary
place would take a bit of work. Having the switch command would nearly
be enough for us (we'd just switch after READY). But we'd also need that
when the switch happens after READY, that all remaining asynchronous
operations are finished by the command. Otherwise, the original issue
with inactivating block drives while mirror still has work remains. Do
those semantics for the switch sound acceptable?

> Under such a condition we should have a command to perform mode
> switch. This seems more QEMU/QAPI way :)
> 

I feel like a switch at an arbitrary time might be hard to do correctly,
i.e. to not miss some (implicit) assertion.

But I can try and go for this approach if it's more likely to be
accepted upstream. So a 'drive-mirror-change' command which takes the
'job-id' and a 'copy-mode' argument? And only allow the switch from
'background' to 'active' (initially)?

Or a more generic 'block-job-change'? But that would need a way to take
'JobType'-specific arguments. Is that doable?

(...)

>>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job,
>>> Error **errp)
>>>   if (s->in_flight == 0 && cnt == 0) {
>>>   trace_mirror_before_flush(s);
>>>   if (!job_is_ready(&s->common.job)) {
>>> +    if (s->copy_mode ==
>>> + MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
>>> +    /*
>>> + * Pause guest I/O to check if we can switch to
>>> active mode.
>>> + * To set actively_synced to true below, it is
>>> necessary to
>>> + * have seen and synced all guest I/O.
>>> + */
>>> +    s->in_drain = true;
>>> +    bdrv_drained_begin(bs);
>>> +    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
>>> +    bdrv_drained_end(bs);
>>> +    s->in_drain = false;
>>> +    continue;
>>> +    }
>>> +    s->in_active_mode = true;
>>> + bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>>> +    bdrv_drained_end(bs);
>>> +    s->in_drain = false;
>>> +    }
>>
>> I doubt, do we really need the drained section?
> 
> Frankly speaking I do not like this. I'd better would not
> rely on the enable/disable of the whole bitmap but encode
> mode into MirrorOp and mark blocks dirty only for those
> operations for which in_active_mode was set at the
> operation start.
>

Do you mean

Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode

2023-02-02 Thread Fiona Ebner

Am 31.01.23 um 18:44 schrieb Vladimir Sementsov-Ogievskiy:
>> @@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job,
>> Error **errp)
>>   if (s->in_flight == 0 && cnt == 0) {
>>   trace_mirror_before_flush(s);
>>   if (!job_is_ready(&s->common.job)) {
>> +    if (s->copy_mode ==
>> +    MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
>> +    /*
>> + * Pause guest I/O to check if we can switch to
>> active mode.
>> + * To set actively_synced to true below, it is
>> necessary to
>> + * have seen and synced all guest I/O.
>> + */
>> +    s->in_drain = true;
>> +    bdrv_drained_begin(bs);
>> +    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {
>> +    bdrv_drained_end(bs);
>> +    s->in_drain = false;
>> +    continue;
>> +    }
>> +    s->in_active_mode = true;
>> +    bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>> +    bdrv_drained_end(bs);
>> +    s->in_drain = false;
>> +    }
> 
> I doubt, do we really need the drained section?
> 
> Why can't we simply set s->in_active_mode here and don't care?
> 
> I think bdrv_get_dirty_count(s->dirty_bitmap) can't become > 0 here, as
> cnt is zero, we are in context of bs and we don't have yield point. (ok,
> we have one in drained_begin(), but what I think we can simply drop
> drained section here).
> 

Thank you for the explanation! I wasn't sure if it's really needed and
wanted to take it safe (should've mentioned that in the original mail).

>> +
>>   if (mirror_flush(s) < 0) {
>>   /* Go check s->ret.  */
>>   continue;
>>   }
>> +
>>   /* We're out of the streaming phase.  From now on,
>> if the job
>>    * is cancelled we will actually complete all
>> pending I/O and
>>    * report completion.  This way, block-job-cancel
>> will leave
>> @@ -1443,7 +1465,7 @@ static int coroutine_fn
>> bdrv_mirror_top_do_write(BlockDriverState *bs,
>>   if (s->job) {
>>   copy_to_target = s->job->ret >= 0 &&
>>    !job_is_cancelled(&s->job->common.job) &&
>> - s->job->copy_mode ==
>> MIRROR_COPY_MODE_WRITE_BLOCKING;
>> + s->job->in_active_mode;
>>   }
>>     if (copy_to_target) {
>> @@ -1494,7 +1516,7 @@ static int coroutine_fn
>> bdrv_mirror_top_pwritev(BlockDriverState *bs,
>>   if (s->job) {
>>   copy_to_target = s->job->ret >= 0 &&
>>    !job_is_cancelled(&s->job->common.job) &&
>> - s->job->copy_mode ==
>> MIRROR_COPY_MODE_WRITE_BLOCKING;
>> + s->job->in_active_mode;
>>   }
>>     if (copy_to_target) {
>> @@ -1792,7 +1814,10 @@ static BlockJob *mirror_start_job(
>>   goto fail;
>>   }
>>   if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
>> +    s->in_active_mode = true;
>>   bdrv_disable_dirty_bitmap(s->dirty_bitmap);
>> +    } else {
>> +    s->in_active_mode = false;
>>   }
>>     ret = block_job_add_bdrv(&s->common, "source", bs, 0,
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 95ac4fa634..2a983ed78d 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1200,10 +1200,13 @@
>>   #  addition, data is copied in background just like in
>>   #  @background mode.
>>   #
>> +# @write-blocking-after-ready: starts out in @background mode and
>> switches to
>> +#  @write-blocking when transitioning to
>> ready.
> 
> You should add also (Since: 8.0) label
> 

I'll try to remember it next time.

>> +#
>>   # Since: 3.0
>>   ##
>>   { 'enum': 'MirrorCopyMode',
>> -  'data': ['background', 'write-blocking'] }
>> +  'data': ['background', 'write-blocking',
>> 'write-blocking-after-ready'] }
>>     ##
>>   # @BlockJobInfo:
>

Re: [PULL 5/5] migration: simplify migration_iteration_run()

Vladimir Sementsov-Ogievskiy  wrote:
> On 30.01.23 11:03, Juan Quintela wrote:
>> Signed-off-by: Juan Quintela 
>> Reviewed-by: Dr. David Alan Gilbert 
>> ---
>>   migration/migration.c | 24 
>>   1 file changed, 12 insertions(+), 12 deletions(-)
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 594a42f085..644c61e91d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3764,23 +3764,23 @@ static MigIterateState 
>> migration_iteration_run(MigrationState *s)
>>   pend_pre, pend_compat, pend_post);
>>   }
>>   -if (pending_size && pending_size >= s->threshold_size) {
>> -/* Still a significant amount to transfer */
>> -if (!in_postcopy && pend_pre <= s->threshold_size &&
>> -qatomic_read(&s->start_postcopy)) {
>> -if (postcopy_start(s)) {
>> -error_report("%s: postcopy failed to start", __func__);
>> -}
>> -return MIG_ITERATE_SKIP;
>> -}
>> -/* Just another iteration step */
>> -qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
>> -} else {
>> +if (pending_size < s->threshold_size) {
>
> to keep the logic, formally it should be "if (!pending_size || pending_size < 
> s->threshold_size)"...

> Actually, could s->threshold_size be 0 here? Or, worth an assertion 
> assert(s->threshold_size) ?

It is weird, but it could:

bandwidth = (double)transferred / time_spent;
s->threshold_size = bandwidth * s->parameters.downtime_limit;

That is the (real) only place when we set it, and if the network was
completely down, bandwidth could be zero.

But I think that it is better to explain in the other way.  What does
the current code do:

if (too much dirty memory to converge)
do another iteration
else
do final iteration

What the new code do is

if (low enough memory to converge)
do final iteration.

do another iteration.

So, how we decide if we converge?
if pending_size < s->threshold_size

we "could" change it to pending_size <= s->threshold_size for the zero case

But I think that we would be shooting ourselves in the foot, because we
are having network trouble (only reason that s->theshold_size could be
zero) and we still have all the devices state to migrate.

And once that we enter that state, there is no way back, guest is
stopped in source and we are not able to send anything else.

So I think it is better this way.

What do you think?

Later, Juan.

Re: [PATCH v2 2/3] util/userfaultfd: Add uffd_open()

Peter Xu  wrote:
> Add a helper to create the uffd handle.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 

I can get this one through migration tree.

Re: [PATCH v3 6/6] i386: Add new CPU model SapphireRapids

2023-02-02 Thread Igor Mammedov

On Fri,  6 Jan 2023 00:38:26 -0800
Lei Wang  wrote:

> The new CPU model mostly inherits features from Icelake-Server, while
> adding new features:
>  - AMX (Advance Matrix eXtensions)
>  - Bus Lock Debug Exception
> and new instructions:
>  - AVX VNNI (Vector Neural Network Instruction):
> - VPDPBUS: Multiply and Add Unsigned and Signed Bytes
> - VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation
> - VPDPWSSD: Multiply and Add Signed Word Integers
> - VPDPWSSDS: Multiply and Add Signed Integers with Saturation
>  - FP16: Replicates existing AVX512 computational SP (FP32) instructions
>using FP16 instead of FP32 for ~2X performance gain
>  - SERIALIZE: Provide software with a simple way to force the processor to
>complete all modifications, faster, allowed in all privilege levels and
>not causing an unconditional VM exit
>  - TSX Suspend Load Address Tracking: Allows programmers to choose which
>memory accesses do not need to be tracked in the TSX read set
>  - AVX512_BF16: Vector Neural Network Instructions supporting BFLOAT16
>inputs and conversion instructions from IEEE single precision

you should mention all new features here, preferably in format:
 feature-flag:  expected CPUID feature bits , so reviewer would be able to find 
it in spec
 
also you mention that it inherits most of the features from Icelake cpu,
it would be nice to provide cpuid diff between real Icelake and new cpu
somewhere in cover letter to simplify comparison.

> 
> Features may be added in future versions:
>  - CET (virtualization support hasn't been merged)
> Instructions may be added in future versions:
>  - fast zero-length MOVSB (KVM doesn't support yet)
>  - fast short STOSB (KVM doesn't support yet)
>  - fast short CMPSB, SCASB (KVM doesn't support yet)
> 
> Signed-off-by: Lei Wang 
> Reviewed-by: Robert Hoo 
> ---
>  target/i386/cpu.c | 135 ++
>  target/i386/cpu.h |   4 ++
>  2 files changed, 139 insertions(+)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 946df29a3d..5e1ecd50b3 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -3576,6 +3576,141 @@ static const X86CPUDefinition builtin_x86_defs[] = {
>  { /* end of list */ }
>  }
>  },
> +{
> +.name = "SapphireRapids",
> +.level = 0x20,
> +.vendor = CPUID_VENDOR_INTEL,
> +.family = 6,
> +.model = 143,
> +.stepping = 4,
> +/*
> + * please keep the ascending order so that we can have a clear view 
> of
> + * bit position of each feature.
> + */

unless you have a way to enforce this comment. I wouldn't expect that would
work in practice.

Also If you wish to bring more order here, you should start with a prerequisite
patch that would do the same to all existing CPU models.
That way one can easily see a difference between Icelake and new cpu model.

As it is in this patch (with all unnecessary reordering) it's hard for reviewer
to see differences.
I'd suggest to just copy Icelake definition and modify it to suit new cpu model
(without reordering all feature bits)  

> +.features[FEAT_1_EDX] =
> +CPUID_FP87 | CPUID_VME | CPUID_DE | CPUID_PSE | CPUID_TSC |
> +CPUID_MSR | CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC |
> +CPUID_SEP | CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV |
> +CPUID_PAT | CPUID_PSE36 | CPUID_CLFLUSH | CPUID_MMX | CPUID_FXSR 
> |
> +CPUID_SSE | CPUID_SSE2,
> +.features[FEAT_1_ECX] =
> +CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | CPUID_EXT_SSSE3 |
> +CPUID_EXT_FMA | CPUID_EXT_CX16 | CPUID_EXT_PCID | 
> CPUID_EXT_SSE41 |
> +CPUID_EXT_SSE42 | CPUID_EXT_X2APIC | CPUID_EXT_MOVBE |
> +CPUID_EXT_POPCNT | CPUID_EXT_TSC_DEADLINE_TIMER | CPUID_EXT_AES |
> +CPUID_EXT_XSAVE | CPUID_EXT_AVX | CPUID_EXT_F16C | 
> CPUID_EXT_RDRAND,
> +.features[FEAT_8000_0001_EDX] =
> +CPUID_EXT2_SYSCALL | CPUID_EXT2_NX | CPUID_EXT2_PDPE1GB |
> +CPUID_EXT2_RDTSCP | CPUID_EXT2_LM,
> +.features[FEAT_8000_0001_ECX] =
> +CPUID_EXT3_LAHF_LM | CPUID_EXT3_ABM | CPUID_EXT3_3DNOWPREFETCH,
> +.features[FEAT_8000_0008_EBX] =
> +CPUID_8000_0008_EBX_WBNOINVD,
> +.features[FEAT_7_0_EBX] =
> +CPUID_7_0_EBX_FSGSBASE | CPUID_7_0_EBX_BMI1 | CPUID_7_0_EBX_HLE |
> +CPUID_7_0_EBX_AVX2 | CPUID_7_0_EBX_SMEP | CPUID_7_0_EBX_BMI2 |
> +CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID | CPUID_7_0_EBX_RTM |
> +CPUID_7_0_EBX_AVX512F | CPUID_7_0_EBX_AVX512DQ |
> +CPUID_7_0_EBX_RDSEED | CPUID_7_0_EBX_ADX | CPUID_7_0_EBX_SMAP |
> +CPUID_7_0_EBX_AVX512IFMA | CPUID_7_0_EBX_CLFLUSHOPT |
> +CPUID_7_0_EBX_CLWB | CPUID_7_0_EBX_AVX512CD | 
> CPUID_7_0_EBX_SHA_NI |
> +CPUID_7_0_EBX_AVX512BW | C

Re: [PATCH v15 01/11] s390x/cpu topology: adding s390 specificities to CPU topology

2023-02-02 Thread Thomas Huth


On 01/02/2023 14.20, Pierre Morel wrote:

S390 adds two new SMP levels, drawers and books to the CPU
topology.
The S390 CPU have specific toplogy features like dedication


Nit: s/toplogy/topology/


and polarity to give to the guest indications on the host
vCPUs scheduling and help the guest take the best decisions
on the scheduling of threads on the vCPUs.

Let us provide the SMP properties with books and drawers levels
and S390 CPU with dedication and polarity,

Signed-off-by: Pierre Morel 
---


Apart from the nit:
Reviewed-by: Thomas Huth

Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs


Hi Eric,

On 3/7/20 17:53, Peter Maydell wrote:

From: Eric Auger 

At the moment the virtio-iommu translates MSI transactions.
This behavior is inherited from ARM SMMU. The virt machine
code knows where the guest MSI doorbells are so we can easily
declare those regions as VIRTIO_IOMMU_RESV_MEM_T_MSI. With that
setting the guest will not map MSIs through the IOMMU and those
transactions will be simply bypassed.

Depending on which MSI controller is in use (ITS or GICV2M),
we declare either:
- the ITS interrupt translation space (ITS_base + 0x1),
   containing the GITS_TRANSLATOR or
- The GICV2M single frame, containing the MSI_SETSP_NS register.

Signed-off-by: Eric Auger 
Message-id: 20200629070404.10969-6-eric.au...@redhat.com
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
  include/hw/arm/virt.h |  7 +++
  hw/arm/virt.c | 30 ++
  2 files changed, 37 insertions(+)




  static void create_gic(VirtMachineState *vms)
@@ -2198,8 +2200,36 @@ out:
  static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
  DeviceState *dev, Error **errp)
  {
+VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
+
  if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
  virt_memory_pre_plug(hotplug_dev, dev, errp);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
+hwaddr db_start = 0, db_end = 0;
+char *resv_prop_str;
+
+switch (vms->msi_controller) {
+case VIRT_MSI_CTRL_NONE:
+return;
+case VIRT_MSI_CTRL_ITS:
+/* GITS_TRANSLATER page */
+db_start = base_memmap[VIRT_GIC_ITS].base + 0x1;
+db_end = base_memmap[VIRT_GIC_ITS].base +
+ base_memmap[VIRT_GIC_ITS].size - 1;
+break;
+case VIRT_MSI_CTRL_GICV2M:
+/* MSI_SETSPI_NS page */
+db_start = base_memmap[VIRT_GIC_V2M].base;
+db_end = db_start + base_memmap[VIRT_GIC_V2M].size - 1;
+break;
+}
+resv_prop_str = g_strdup_printf("0x%"PRIx64":0x%"PRIx64":%u",
+db_start, db_end,
+VIRTIO_IOMMU_RESV_MEM_T_MSI);
+
+qdev_prop_set_uint32(dev, "len-reserved-regions", 1);


Where is "len-reserved-regions" declared?

Since qdev_prop_set_uint32() uses &error_abort, isn't this call
aborting the process? I am confused how this code path is exercised,
what am I missing?


+qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str);
+g_free(resv_prop_str);
  }
  }

Re: [PULL 00/35] Testing, docs, semihosting and plugin updates

On Wed, 1 Feb 2023 at 18:07, Alex Bennée  wrote:
> Peter Maydell  writes:
> > I think this is "you can't put labels in qemu-options.hx,
> > because it gets included in two .rst files (invocation.rst
> > and qemu-manpage.rst), and Sphinx complains about the
> > duplicate labels, even though one of the two files is
> > only used in the HTML and one is only used in the manpages".
>
> Oh boo - anyway to work around that because they are helpful links?

Nothing easy. The problem is that Sphinx looks at every .rst
file in the source directory, regardless of whether it's
reachable from the document you specify as the root of the
manual or not. So both lots of .rst files get processed
for both the HTML manual set and the manpages, even though
they don't need to be[*]. This is a long-standing design
deficiency in Sphinx. The only thing I could think of was
splitting the manpages and html docs entirely into separate
subdirectories, and having meson symlink the files which are
actually shared between them. But that seems like quite a lot
of extra machinery.

[*] This shows up for instance in the HTML docs getting a
not-linked-to-from-anywhere HTML version of the qemu(1) manpage:
https://www.qemu.org/docs/master/system/qemu-manpage.html

-- PMM

Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs


On 2/2/23 11:47, Philippe Mathieu-Daudé wrote:

Hi Eric,

On 3/7/20 17:53, Peter Maydell wrote:

From: Eric Auger 

At the moment the virtio-iommu translates MSI transactions.
This behavior is inherited from ARM SMMU. The virt machine
code knows where the guest MSI doorbells are so we can easily
declare those regions as VIRTIO_IOMMU_RESV_MEM_T_MSI. With that
setting the guest will not map MSIs through the IOMMU and those
transactions will be simply bypassed.

Depending on which MSI controller is in use (ITS or GICV2M),
we declare either:
- the ITS interrupt translation space (ITS_base + 0x1),
   containing the GITS_TRANSLATOR or
- The GICV2M single frame, containing the MSI_SETSP_NS register.

Signed-off-by: Eric Auger 
Message-id: 20200629070404.10969-6-eric.au...@redhat.com
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
  include/hw/arm/virt.h |  7 +++
  hw/arm/virt.c | 30 ++
  2 files changed, 37 insertions(+)




  static void create_gic(VirtMachineState *vms)
@@ -2198,8 +2200,36 @@ out:
  static void virt_machine_device_pre_plug_cb(HotplugHandler 
*hotplug_dev,
  DeviceState *dev, Error 
**errp)

  {
+    VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
+
  if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
  virt_memory_pre_plug(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), 
TYPE_VIRTIO_IOMMU_PCI)) {

+    hwaddr db_start = 0, db_end = 0;
+    char *resv_prop_str;
+
+    switch (vms->msi_controller) {
+    case VIRT_MSI_CTRL_NONE:
+    return;
+    case VIRT_MSI_CTRL_ITS:
+    /* GITS_TRANSLATER page */
+    db_start = base_memmap[VIRT_GIC_ITS].base + 0x1;
+    db_end = base_memmap[VIRT_GIC_ITS].base +
+ base_memmap[VIRT_GIC_ITS].size - 1;
+    break;
+    case VIRT_MSI_CTRL_GICV2M:
+    /* MSI_SETSPI_NS page */
+    db_start = base_memmap[VIRT_GIC_V2M].base;
+    db_end = db_start + base_memmap[VIRT_GIC_V2M].size - 1;
+    break;
+    }
+    resv_prop_str = g_strdup_printf("0x%"PRIx64":0x%"PRIx64":%u",
+    db_start, db_end,
+    VIRTIO_IOMMU_RESV_MEM_T_MSI);
+
+    qdev_prop_set_uint32(dev, "len-reserved-regions", 1);


Where is "len-reserved-regions" declared?

Since qdev_prop_set_uint32() uses &error_abort, isn't this call
aborting the process? I am confused how this code path is exercised,
what am I missing?


The call path is:

  qdev_prop_set_uint32 ->
object_property_set_int ->
  object_property_set_qobject ->
object_property_set ->
  object_property_find_err

So QEMU should abort displaying:

"Property 'virtio-iommu-pci.len-reserved-regions' not found".


+    qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str);
+    g_free(resv_prop_str);
  }
  }

Re: [PATCH v2 3/3] util/userfaultfd: Support /dev/userfaultfd

Peter Xu  wrote:
> Teach QEMU to use /dev/userfaultfd when it existed and fallback to the
> system call if either it's not there or doesn't have enough permission.
>
> Firstly, as long as the app has permission to access /dev/userfaultfd, it
> always have the ability to trap kernel faults which QEMU mostly wants.
> Meanwhile, in some context (e.g. containers) the userfaultfd syscall can be
> forbidden, so it can be the major way to use postcopy in a restricted
> environment with strict seccomp setup.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Peter Xu 


Hi

Can we change this code to not use the global variable.

> ---
>  util/trace-events  |  1 +
>  util/userfaultfd.c | 37 +
>  2 files changed, 38 insertions(+)
>
> diff --git a/util/trace-events b/util/trace-events
> index c8f53d7d9f..16f78d8fe5 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -93,6 +93,7 @@ qemu_vfio_region_info(const char *desc, uint64_t 
> region_ofs, uint64_t region_siz
>  qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, 
> int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 
> 0x%x host %p"
>  
>  #userfaultfd.c
> +uffd_detect_open_mode(int mode) "%d"
>  uffd_query_features_nosys(int err) "errno: %i"
>  uffd_query_features_api_failed(int err) "errno: %i"
>  uffd_create_fd_nosys(int err) "errno: %i"
> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> index 9845a2ec81..7dceab51d6 100644
> --- a/util/userfaultfd.c
> +++ b/util/userfaultfd.c
> @@ -18,10 +18,47 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +
> +typedef enum {
> +UFFD_UNINITIALIZED = 0,
> +UFFD_USE_DEV_PATH,
> +UFFD_USE_SYSCALL,
> +} uffd_open_mode;
> +
> +static int uffd_dev;
> +
> +static uffd_open_mode uffd_detect_open_mode(void)
> +{
> +static uffd_open_mode open_mode;
> +
> +if (open_mode == UFFD_UNINITIALIZED) {
> +/*
> + * Make /dev/userfaultfd the default approach because it has better
> + * permission controls, meanwhile allows kernel faults without any
> + * privilege requirement (e.g. SYS_CAP_PTRACE).
> + */
> +uffd_dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> +if (uffd_dev >= 0) {
> +open_mode = UFFD_USE_DEV_PATH;
> +} else {
> +/* Fallback to the system call */
> +open_mode = UFFD_USE_SYSCALL;
> +}
> +trace_uffd_detect_open_mode(open_mode);
> +}
> +
> +return open_mode;
> +}
>  
>  int uffd_open(int flags)
>  {
>  #if defined(__linux__) && defined(__NR_userfaultfd)
> +if (uffd_detect_open_mode() == UFFD_USE_DEV_PATH) {
> +assert(uffd_dev >= 0);
> +return ioctl(uffd_dev, USERFAULTFD_IOC_NEW, flags);
> +}
> +
>  return syscall(__NR_userfaultfd, flags);
>  #else
>  return -EINVAL;

static int open_userfaultd(void)
{
/*
 * Make /dev/userfaultfd the default approach because it has better
 * permission controls, meanwhile allows kernel faults without any
 * privilege requirement (e.g. SYS_CAP_PTRACE).
 */
 int uffd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
 if (uffd >= 0) {
return uffd;
 }
 return -1;
}

int uffd_open(int flags)
{
#if defined(__linux__) && defined(__NR_userfaultfd)
static int uffd = -2;
if (uffd == -2) {
uffd = open_userfaultd();
}
if (uffd >= 0) {
return ioctl(uffd, USERFAULTFD_IOC_NEW, flags);
}
return syscall(__NR_userfaultfd, flags);
#else
 return -EINVAL;

27 lines vs 42

No need for enum type
No need for global variable

What do you think?

Later, Juan.

Re: [PATCH v2 1/3] linux-headers: Update to v6.1

Peter Xu  wrote:
> Signed-off-by: Peter Xu 

How does this change gets into the tree?

I know that it is "automagically" generated, but who decides when that
goes into the tree?

As we need that for the following patch?

Later, Juan.

Re: [PATCH] accel/tcg: Complete cpu initialization before registration

On Wed, 1 Feb 2023 at 20:37, Richard Henderson
 wrote:
>
> On 2/1/23 04:20, Eric Auger wrote:
> > What I fail to understand is why this code is called with a kvm
> > accelerated qemu (the test runs by default with kvm).
> ...
> > #2  0x02aaab1500f0 in vmsa_ttbr_write
> > (env=0x2aaac393850, ri=0x2aaac3c90e0, value=2154950976315703518) at
> > ../target/arm/helper.c:3784
> > #3  0x02aaab14e5a8 in write_raw_cp_reg
> > (env=env@entry=0x2aaac393850, ri=ri@entry=0x2aaac3c90e0,
> > v=v@entry=2154950976315703518)
>
> This is indeed very curious -- vmsa_ttbr_write is supposed to be the "cooked" 
> .writefn,
> not the .raw_writefn.  We're not supposed to arrive here at all.

If you only provide a cooked .writefn and no .raw_writefn,
the default is to assume that the cooked function will also
work as the raw one. None of the ARMCPRegInfo structs that
use vmsa_ttbr_write specify a raw_writefn...

thanks
-- PMM

Re: Display update issue on M1 Macs

2023-02-02 Thread BALATON Zoltan

On Tue, 31 Jan 2023, BALATON Zoltan wrote:

On Tue, 31 Jan 2023, Akihiko Odaki wrote:

[...]
To summarise previous discussion:

- There's a problem on Apple M1 Macs with sm501 and ati-vga 2d accel
functions drawing from device model into the video memory of the emulated
card which is not shown on screen when the display update callback is
called from another thread. This works on x86_64 host so I suspect it may
be related to missing memory synchronisation that ARM may need.

- This can be reproduced running AmigaOS4 on sam460ex or MorphOS (demo iso
downliadable from their web site) on sam460ex, pegasos2 or mac99,via=pmu
with -device ati-vga,romfile="" as described here:
http://zero.eik.bme.hu/~balaton/qemu/amiga/

- I can't test it myself lacking hardware so I have to rely on reports
from people who have this hardware so there may be some uncertainity in
the info I get.

- We have confirmed it's not related to a known race condition as
disabling dirty tracking and always doing full updates of whole screen
did not fix it:

But there is an exception: memory_region_snapshot_and_clear_dirty()
releases iothread lock, and that broke raspi3b display device:

https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/

It is unexpected that gfx_update() callback releases iothread lock so
it may break things in peculiar ways.

Peter, is there any change in the situation regarding the race
introduced by memory_region_snapshot_and_clear_dirty()?

For now, to workaround the issue, I think you can create another
mutex and make the entire sm501_2d_engine_write() and
sm501_update_display() critical sections.

Interesting thread but not sure it's the same problem so this
workaround may not be enough to fix my issue. Here's a video posted by
one of the people who reported it showing the problem on M1 Mac:

https://www.youtube.com/watch?v=FDqoNbp6PQs

and here's how it looks like on other machines:

https://www.youtube.com/watch?v=ML7-F4HNFKQ

There are also videos showing it running on RPi 4 and G5 Mac without
this issue so it seems to only happen on Apple Silicon M1 Macs. What's
strange is that graphics elements are not just delayed which I think
should happen with missing thread synchronisation where the update
callback would miss some pixels rendered during it's running but
subsequent update callbacks would eventually draw those, woudn't they?
Also setting full_update to 1 in sm501_update_display() callback to
disable dirty tracking does not fix the problem. So it looks like as
if sm501_2d_operation() running on one CPU core only writes data to
the local cache of that core which sm501_update_display() running on
other core can't see, so maybe some cache synchronisation is needed in
memory_region_set_dirty() or if that's already there maybe I should
call it for all changes not only those in the visible display area?
I'm still not sure I understand the problem and don't know what could
be a fix for it so anything to test to identify the issue better might
also bring us closer to a solution.

If you set full_update to 1, you may also comment out
memory_region_snapshot_and_clear_dirty() and
memory_region_snapshot_get_dirty() to avoid the iothread mutex being
unlocked. The iothread mutex should ensure cache coherency as well.

But as you say, it's weird that the rendered result is not just delayed
but missed. That may imply other possibilities (e.g., the results are
overwritten by someone else). If the problem persists after commenting
out memory_region_snapshot_and_clear_dirty() and
memory_region_snapshot_get_dirty(), I think you can assume the
inter-thread coherency between sm501_2d_operation() and
sm501_update_display() is not causing the problem.

I've asked people who reported and can reproduce it to test this but it
did not change anything so confirmed it's not that race condition but
looks more like some cache inconsistency maybe. Any other ideas?

I can come up with two important differences between x86 and Arm which
can affect the execution of QEMU:
1. Memory model. Arm uses a memory model more relaxed than x86 so it is
more sensitive for synchronization failures among threads.
2. Different instructions. TCG uses JIT so differences in instructions
matter.

We should be able to exclude 1) as a potential cause of the problem.
iothread mutex should take care of race condition and even cache
coherency problem; mutex includes memory barrier functionality.

[...]
For difference 2), you may try to use TCI. You can find details of TCI in
tcg/tci/README.

This was tested and also with TCI got the same results just much slower.

The common sense tells, however, the memory model is usually the cause of
the problem when you see behavioral differences between x86 and Arm, and
TCG should work fine with both of x86 and Arm as they should have been
tested well.

[...]
Fortunately macOS provides Rosetta 2 for x86 emulatio

Re: [PATCH v5 04/20] scripts/clean-includes: Improve --git commit message

Markus Armbruster  wrote:

> It's less terse.  Fine with me.  The mix of passive and active voice
> feels a bit awkward, though.  Another try:
>
>   All .c should include qemu/osdep.h first.  This script performs three
>   related cleanups:
>
>   * Ensure .c files include qemu/osdep.h first.
>   * Including it in a .h is redundant, since the .c  already includes
> it.  Drop such inclusions.
>   * Likewise, including headers qemu/osdep.h includes is redundant.
> Drop these, too.

Perfect, thanks.

Reviewed-by: Juan Quintela 

or whatever you want it O:-)

Re: An issue with x86 tcg and MMIO

2023-02-02 Thread Richard Henderson


On 2/1/23 23:39, Jonathan Cameron wrote:

Not sure - if we can do the handling above then sure we could make that change.
I can see there is a path to register the callbacks but I'd kind of assumed
ROM meant read only...


I think "romd" means "read mostly".

In the case of flash, I believe that a write changes modes (block erase something 
something) and the region changes state into MMIO.  But normal state is read mode where 
read+execute go through unchallenged.


It has been a long time since I studied how all that works, so I may well have forgotten 
something.



r~

Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs

On Thu, 2 Feb 2023 at 10:47, Philippe Mathieu-Daudé  wrote:
> Where is "len-reserved-regions" declared?

  DEFINE_PROP_ARRAY("reserved-regions", ...)

does this. For an array property "foo" the machinery creates an integer
property "foo-len", which must be set first. Setting that
then creates properties "foo[0]", "foo[1]", ... which can be set.

thanks
-- PMM

Re: [RFC v5 1/3] rcu: introduce rcu_read_is_locked()

Chuang Xu  wrote:
> add rcu_read_is_locked() to detect holding of rcu lock.
>
> Signed-off-by: Chuang Xu 

Reviewed-by: Juan Quintela 

Althought I think that petting a review from Paolo or anyone that knows
more than RCU could be a good idea.

> ---
>  include/qemu/rcu.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
> index b063c6fde8..719916d9d3 100644
> --- a/include/qemu/rcu.h
> +++ b/include/qemu/rcu.h
> @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
>  }
>  }
>  
> +static inline bool rcu_read_is_locked(void)
> +{
> +struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
> +
> +return p_rcu_reader->depth > 0;
> +}
> +
>  extern void synchronize_rcu(void);
>  
>  /*

Re: [RFC v5 3/3] migration: reduce time of loading non-iterable vmstate

Chuang Xu  wrote:
> The duration of loading non-iterable vmstate accounts for a significant
> portion of downtime (starting with the timestamp of source qemu stop and
> ending with the timestamp of target qemu start). Most of the time is spent
> committing memory region changes repeatedly.
>
> This patch packs all the changes to memory region during the period of
> loading non-iterable vmstate in a single memory transaction. With the
> increase of devices, this patch will greatly improve the performance.
>
> Here are the test1 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 16-queue vhost-net device
>   - 16 4-queue vhost-user-blk device.
>
>   time of loading non-iterable vmstate downtime
> beforeabout 150 ms  740+ ms
> after about 30 ms   630+ ms
>
> In test2, we keep the number of the device the same as test1, reduce the
> number of queues per device:
>
> Here are the test2 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 1-queue vhost-net device
>   - 16 1-queue vhost-user-blk device.
>
>   time of loading non-iterable vmstate downtime
> beforeabout 90 ms  about 250 ms
>
> after about 25 ms  about 160 ms
>
> In test3, we keep the number of queues per device the same as test1, reduce
> the number of devices:
>
> Here are the test3 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 1 16-queue vhost-net device
>   - 1 4-queue vhost-user-blk device.
>
>   time of loading non-iterable vmstate downtime
> beforeabout 20 ms  about 70 ms
> after about 11 ms  about 60 ms
>
> As we can see from the test results above, both the number of queues and
> the number of devices have a great impact on the time of loading non-iterable
> vmstate. The growth of the number of devices and queues will lead to more
> mr commits, and the time consumption caused by the flatview reconstruction
> will also increase.
>
> Signed-off-by: Chuang Xu 

Reviewed-by: Juan Quintela 

And BTW, nice load times improvements.

Re: [PATCH v5 0/8] virtio-mem: Handle preallocation with migration

David Hildenbrand  wrote:
> On 17.01.23 12:22, David Hildenbrand wrote:
>> While playing with migration of virtio-mem with an ordinary file backing,
>> I realized that migration and prealloc doesn't currently work as expected
>> for virtio-mem. Further, Jing Qi reported that setup issues (insufficient
>> huge pages on the destination) result in QEMU getting killed with SIGBUS
>> instead of failing gracefully.
>> In contrast to ordinary memory backend preallocation, virtio-mem
>> preallocates memory before plugging blocks to the guest. Consequently,
>> when migrating we are not actually preallocating on the destination but
>> "only" migrate pages. Fix that be migrating the bitmap early, before any
>> RAM content, and use that information to preallocate memory early, before
>> migrating any RAM.
>> Postcopy needs some extra care, and I realized that
>> prealloc+postcopy is
>> shaky in general. Let's at least try to mimic what ordinary
>> prealloc+postcopy does: temporarily allocate the memory, discard it, and
>> cross fingers that we'll still have sufficient memory when postcopy
>> actually tries placing pages.
>> Cc: Dr. David Alan Gilbert 
>> Cc: Juan Quintela 
>> Cc: Peter Xu 
>> Cc: Michael S. Tsirkin 
>> Cc: Michal Privoznik 
>
> @Juan, David: I can similarly take this via my tree as well.

Reviewing it.

I can get it through migration tree, thanks.

Later, Juan.

Re: [PATCH v3 0/6] Support for new CPU model SapphireRapids

2023-02-02 Thread Igor Mammedov

On Fri,  6 Jan 2023 00:38:20 -0800
Lei Wang  wrote:

> This series aims to add a new CPU model SapphireRapids, and tries to
> address the problem stated in
> https://lore.kernel.org/all/20220812055751.14553-1-lei4.w...@intel.com/T/#mcf67dbd1ad37c65d7988c36a2b267be9afd2fb30,
> so that named CPU model can define its own AMX values, and QEMU won't
> pass the wrong AMX values to KVM in future platforms if they have
> different values supported.
> 
> The original patch is
> https://lore.kernel.org/all/20220812055751.14553-1-lei4.w...@intel.com/T/#u.

MultiBitFeatureInfo looks like an interesting
idea but among fixing whatever issues this has atm,
you'd probably need to introduce a new  qdev_bitfield property
infrastructure so that such features could be treated like any
other qdev properties.
Also when MultiBitFeatureInfo is added, one should convert all
other usecases where it's applicable (not only for new code)
so that we would end up with consolidated approach instead of
zoo mess.

I'm not sure all MultiBitFeatureInfo complexity is necessary
just for adding a new CPU model, I'd rather use ad hoc approach
that we were using before for non boolean features.

And then try to develop MultiBitFeatureInfo & co as a separate
series to demonstrate how much it will improve current
cpu models definitions.

PS:
 'make check-acceptance' are broken with this

> ---
> 
> Changelog:
> 
> v3:
>  - Rebase on the latest QEMU (d1852caab131ea898134fdcea8c14bc2ee75fbe9).
>  - v2: 
> https://lore.kernel.org/qemu-devel/20221102085256.81139-1-lei4.w...@intel.com/
> 
> v2:
>  - Fix when passing all zeros of AMX-related CPUID, QEMU will warn
>unsupported.
>  - Remove unnecessary function definition and make code cleaner.
>  - Fix some typos.
>  - v1: 
> https://lore.kernel.org/qemu-devel/20221027020036.373140-1-lei4.w...@intel.com/T/#t
> 
> 
> Lei Wang (6):
>   i386: Introduce FeatureWordInfo for AMX CPUID leaf 0x1D and 0x1E
>   i386: Remove unused parameter "uint32_t bit" in
> feature_word_description()
>   i386: Introduce new struct "MultiBitFeatureInfo" for multi-bit
> features
>   i386: Mask and report unavailable multi-bit feature values
>   i386: Initialize AMX CPUID leaves with corresponding env->features[]
> leaves
>   i386: Add new CPU model SapphireRapids
> 
>  target/i386/cpu-internal.h |  11 ++
>  target/i386/cpu.c  | 311 +++--
>  target/i386/cpu.h  |  16 ++
>  3 files changed, 322 insertions(+), 16 deletions(-)
> 
> 
> base-commit: d1852caab131ea898134fdcea8c14bc2ee75fbe9

Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate

Chuang Xu  wrote:
> In this version:
>
> - rename rcu_read_locked() to rcu_read_is_locked().
> - adjust the sanity check in address_space_to_flatview().
> - improve some comments.
>
> The duration of loading non-iterable vmstate accounts for a significant
> portion of downtime (starting with the timestamp of source qemu stop and
> ending with the timestamp of target qemu start). Most of the time is spent
> committing memory region changes repeatedly.
>
> This patch packs all the changes to memory region during the period of
> loading non-iterable vmstate in a single memory transaction. With the
> increase of devices, this patch will greatly improve the performance.
>
> Here are the test1 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 16-queue vhost-net device
>   - 16 4-queue vhost-user-blk device.
>
>   time of loading non-iterable vmstate downtime
> beforeabout 150 ms  740+ ms
> after about 30 ms   630+ ms
>
> (This result is different from that of v1. It may be that someone has 
> changed something on my host.., but it does not affect the display of 
> the optimization effect.)
>
>
> In test2, we keep the number of the device the same as test1, reduce the 
> number of queues per device:
>
> Here are the test2 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 8 1-queue vhost-net device
>   - 16 1-queue vhost-user-blk device.
>
>   time of loading non-iterable vmstate downtime
> beforeabout 90 ms  about 250 ms
>
> after about 25 ms  about 160 ms
>
>
>
> In test3, we keep the number of queues per device the same as test1, reduce 
> the number of devices:
>
> Here are the test3 results:
> test info:
> - Host
>   - Intel(R) Xeon(R) Platinum 8260 CPU
>   - NVIDIA Mellanox ConnectX-5
> - VM
>   - 32 CPUs 128GB RAM VM
>   - 1 16-queue vhost-net device
>   - 1 4-queue vhost-user-blk device.
>
>   time of loading non-iterable vmstate downtime
> beforeabout 20 ms  about 70 ms
> after about 11 ms  about 60 ms
>
>
> As we can see from the test results above, both the number of queues and 
> the number of devices have a great impact on the time of loading non-iterable 
> vmstate. The growth of the number of devices and queues will lead to more 
> mr commits, and the time consumption caused by the flatview reconstruction 
> will also increase.
>
> Please review, Chuang.

Hi

As on the review, I aggree with the patches, but I am waiting for Paolo
to review the rcu change (that I think that it is trivial, but I am not
the rcu maintainer).

If it happens that you need to send another version, I think that you
can change the RFC for PATCH.

Again, very good job.

Later, Juan.


> [v4]
>
> - attach more information in the cover letter.
> - remove changes on virtio_load.
> - add rcu_read_locked() to detect holding of rcu lock.
>
> [v3]
>
> - move virtio_load_check_delay() from virtio_memory_listener_commit() to 
>   virtio_vmstate_change().
> - add delay_check flag to VirtIODevice to make sure virtio_load_check_delay() 
>   will be called when delay_check is true.
>
> [v2]
>
> - rebase to latest upstream.
> - add sanity check to address_space_to_flatview().
> - postpone the init of the vring cache until migration's loading completes. 
>
> [v1]
>
> The duration of loading non-iterable vmstate accounts for a significant
> portion of downtime (starting with the timestamp of source qemu stop and
> ending with the timestamp of target qemu start). Most of the time is spent
> committing memory region changes repeatedly.
>
> This patch packs all the changes to memory region during the period of
> loading non-iterable vmstate in a single memory transaction. With the
> increase of devices, this patch will greatly improve the performance.
>
> Here are the test results:
> test vm info:
> - 32 CPUs 128GB RAM
> - 8 16-queue vhost-net device
> - 16 4-queue vhost-user-blk device.
>
>   time of loading non-iterable vmstate
> beforeabout 210 ms
> after about 40 ms

Re: [PULL 08/34] hw/arm/virt: Let the virtio-iommu bypass MSIs


On 2/2/23 11:58, Peter Maydell wrote:

On Thu, 2 Feb 2023 at 10:47, Philippe Mathieu-Daudé  wrote:

Where is "len-reserved-regions" declared?


   DEFINE_PROP_ARRAY("reserved-regions", ...)

does this. For an array property "foo" the machinery creates an integer
property "foo-len", which must be set first. Setting that
then creates properties "foo[0]", "foo[1]", ... which can be set.


Oh indeed now I see the array prefix:

  #define PROP_ARRAY_LEN_PREFIX "len-"

Not obvious to realize while grepping.

Thanks for the quick answer!

Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode

2023-02-02 Thread Denis V. Lunev


On 2/2/23 11:19, Fiona Ebner wrote:

Am 31.01.23 um 19:18 schrieb Denis V. Lunev:

On 1/31/23 18:44, Vladimir Sementsov-Ogievskiy wrote:

+ Den

Den, I remember we thought about that, and probably had a solution?

Another possible approach to get benefits from both modes is to switch
to blocking mode after first loop of copying. [*]

Hmm. Thinking about proposed solution it seems, that [*] is better.
The main reason of "write-blocking" mode is to force convergence of
the mirror job. But you lose this thing if you postpone switching to
the moment when mirror becomes READY: we may never reach it.



Or, may be we can selectively skip or block guest writes, to keep some
specified level of convergence?

Or, we can start in "background" mode, track the speed of convergence
(like dirty-delta per minute), and switch to blocking if speed becomes
less than some threshold.


That is absolutely correct. It would be very kind to start with
asynchronous mirror and switch to the synchronous one after
specific amount of loops. This should be somehow measured.

Originally we have thought about switching after the one loop
(when basic background sending is done), but may be 2 iterations
(initial + first one) would be better.

Talking about this specific approach and our past experience
I should note that reaching completion point in the
asynchronous mode was a real pain and slow down for the guest.
We have intentionally switched at that time to the sync mode
for that purpose. Thus this patch is a "worst half-way" for
us.

While the asynchronous mode can take longer of course, the slowdown of
the guest is bigger with the synchronous mode, or what am I missing?

We have been using asynchronous mode for years (starting migration after
all drive-mirror jobs are READY) and AFAIK our users did not complain
about them not reaching READY or heavy slowdowns in the guest. We had
two reports about an assertion failure recently when drive-mirror still
got work left at the time the blockdrives are inactivated by migration.

We have tests internally which performs migration with different
IO write rates during the process. We use such kind of tests
more than 10 years and before we have switched to QEMU
from previous proprietary hypervisor.

Once the switch happened, we have started to experience
failures here due to asynchronous nature of the mirror. That
is why we have raised initial discussion and this code have
appeared. I may be incorrect or too self oriented here,
but this could be my understanding of the situation :)


Frankly speaking I would say that this switch could be considered
NOT QEMU job and we should just send a notification (event) for the
completion of the each iteration and management software should
take a decision to switch from async mode to the sync one.

Unfortunately, our management software is a bit limited in that regard
currently and making listening for events available in the necessary
place would take a bit of work. Having the switch command would nearly
be enough for us (we'd just switch after READY). But we'd also need that
when the switch happens after READY, that all remaining asynchronous
operations are finished by the command.
That could be a matter of the other event I believe. We switch mode and 
reset
the state. New READY event will be sent once the bitmap is cleared. That 
seems

fair.

  Otherwise, the original issue
with inactivating block drives while mirror still has work remains. Do
those semantics for the switch sound acceptable?


Under such a condition we should have a command to perform mode
switch. This seems more QEMU/QAPI way :)


I feel like a switch at an arbitrary time might be hard to do correctly,
i.e. to not miss some (implicit) assertion.

But I can try and go for this approach if it's more likely to be
accepted upstream. So a 'drive-mirror-change' command which takes the
'job-id' and a 'copy-mode' argument? And only allow the switch from
'background' to 'active' (initially)?

That is what I can not guarantee as we have Kevin and Stefan here. Though
in general this opens interesting possibilities for other use-cases :)


Or a more generic 'block-job-change'? But that would need a way to take
'JobType'-specific arguments. Is that doable?

(...)


@@ -1035,10 +1036,31 @@ static int coroutine_fn mirror_run(Job *job,
Error **errp)
   if (s->in_flight == 0 && cnt == 0) {
   trace_mirror_before_flush(s);
   if (!job_is_ready(&s->common.job)) {
+    if (s->copy_mode ==
+ MIRROR_COPY_MODE_WRITE_BLOCKING_AFTER_READY) {
+    /*
+ * Pause guest I/O to check if we can switch to
active mode.
+ * To set actively_synced to true below, it is
necessary to
+ * have seen and synced all guest I/O.
+ */
+    s->in_drain = true;
+    bdrv_drained_begin(bs);
+    if (bdrv_get_dirty_count(s->dirty_bitmap) > 0) {

Re: [PATCH v5 0/8] virtio-mem: Handle preallocation with migration

2023-02-02 Thread Michael S. Tsirkin

On Tue, Jan 17, 2023 at 12:22:41PM +0100, David Hildenbrand wrote:
> While playing with migration of virtio-mem with an ordinary file backing,
> I realized that migration and prealloc doesn't currently work as expected
> for virtio-mem. Further, Jing Qi reported that setup issues (insufficient
> huge pages on the destination) result in QEMU getting killed with SIGBUS
> instead of failing gracefully.
> 
> In contrast to ordinary memory backend preallocation, virtio-mem
> preallocates memory before plugging blocks to the guest. Consequently,
> when migrating we are not actually preallocating on the destination but
> "only" migrate pages. Fix that be migrating the bitmap early, before any
> RAM content, and use that information to preallocate memory early, before
> migrating any RAM.
> 
> Postcopy needs some extra care, and I realized that prealloc+postcopy is
> shaky in general. Let's at least try to mimic what ordinary
> prealloc+postcopy does: temporarily allocate the memory, discard it, and
> cross fingers that we'll still have sufficient memory when postcopy
> actually tries placing pages.
> 
> Cc: Dr. David Alan Gilbert 
> Cc: Juan Quintela 
> Cc: Peter Xu 
> Cc: Michael S. Tsirkin 
> Cc: Michal Privoznik 


Reviewed-by: Michael S. Tsirkin 

> v4 -> v5:
> - "migration/savevm: Move more savevm handling into vmstate_save()"
> -- Extended patch description regarding tracing
> - "migration/savevm: Prepare vmdesc json writer in
>qemu_savevm_state_setup()"
> -- Move freeing to migrate_fd_cleanup()
> - "migration/savevm: Allow immutable device state to be migrated early (i.e.,
> before RAM)"
> -- "immutable" -> "early_setup"
> -- Extend comment
> - Added some RBs (thanks!)
> 
> v3 -> v4:
> - First 3 patches:
> -- Minimze code changes and simplify
> -- Save immutable device state during qemu_savevm_state_setup()
> -- Don't use vmsd priorities, use a new flag
> -- Split it logically up
> - "migration/ram: Factor out check for advised postcopy"
> -- Don't factor out postcopy_is_running()
> - "virtio-mem: Migrate immutable properties early"
> -- Adjust to changed vmsd interface
> - "virtio-mem: Proper support for preallocation with migration"
> -- Drop sanity check in virtio_mem_post_load_early()
> 
> v2 -> v3:
> - New approach/rewrite, drop RB and TB of last patch
> 
> v1 -> v2:
> - Added RBs and Tested-bys
> - "virtio-mem: Fail if a memory backend with "prealloc=on" is specified"
> -- Fail instead of warn
> -- Adjust subject/description
> 
> 
> David Hildenbrand (8):
>   migration/savevm: Move more savevm handling into vmstate_save()
>   migration/savevm: Prepare vmdesc json writer in
> qemu_savevm_state_setup()
>   migration/savevm: Allow immutable device state to be migrated early
> (i.e., before RAM)
>   migration/vmstate: Introduce VMSTATE_WITH_TMP_TEST() and
> VMSTATE_BITMAP_TEST()
>   migration/ram: Factor out check for advised postcopy
>   virtio-mem: Fail if a memory backend with "prealloc=on" is specified
>   virtio-mem: Migrate immutable properties early
>   virtio-mem: Proper support for preallocation with migration
> 
>  hw/core/machine.c  |   4 +-
>  hw/virtio/virtio-mem.c | 144 -
>  include/hw/virtio/virtio-mem.h |   8 ++
>  include/migration/misc.h   |   4 +-
>  include/migration/vmstate.h|  28 ++-
>  migration/migration.c  |   9 +++
>  migration/migration.h  |   4 +
>  migration/ram.c|   8 +-
>  migration/savevm.c | 105 +---
>  9 files changed, 255 insertions(+), 59 deletions(-)
> 
> -- 
> 2.39.0

Re: [PATCH v1 1/5] migration/ram: Fix populate_read_range()

David Hildenbrand  wrote:
> Unfortunately, commit f7b9dcfbcf44 broke populate_read_range(): the loop
> end condition is very wrong, resulting in that function not populating the
> full range. Lets' fix that.
>
> Fixes: f7b9dcfbcf44 ("migration/ram: Factor out populating pages readable in 
> ram_block_populate_pages()")
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v1 2/5] migration/ram: Fix error handling in ram_write_tracking_start()

David Hildenbrand  wrote:
> If something goes wrong during uffd_change_protection(), we would miss
> to unregister uffd-wp and not release our reference. Fix it by
> performing the uffd_change_protection(true) last.
>
> Note that a uffd_change_protection(false) on the recovery path without a
> prior uffd_change_protection(false) is fine.
>
> Fixes: 278e2f551a09 ("migration: support UFFD write fault processing in 
> ram_save_iterate()")
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v1 3/5] migration/ram: Don't explicitly unprotect when unregistering uffd-wp

David Hildenbrand  wrote:
> When unregistering uffd-wp, older kernels before commit f369b07c86143
> ("mm/uffd:reset write protection when unregister with wp-mode") won't
> clear the uffd-wp PTE bit. When re-registering uffd-wp, the previous
> uffd-wp PTE bits would trigger again. With above commit, the kernel will
> clear the uffd-wp PTE bits when unregistering itself.
>
> Consequently, we'll clear the uffd-wp PTE bits now twice -- whereby we
> don't care about clearing them at all: a new background snapshot will
> re-register uffd-wp and re-protect all memory either way.
>
> So let's skip the manual clearing of uffd-wp. If ever relevant, we
> could clear conditionally in uffd_unregister_memory() -- we just need a
> way to figure out more recent kernels.
>
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela 

Fixing a bug by removing code O:-)

Re: [PATCH v1 4/5] migration/ram: Rely on used_length for uffd_change_protection()

David Hildenbrand  wrote:
> ram_mig_ram_block_resized() will abort migration (including background
> snapshots) when resizing a RAMBlock. ram_block_populate_read() will only
> populate RAM up to used_length, so at least for anonymous memory
> protecting everything between used_length and max_length won't
> actually be protected and is just a NOP.
>
> So let's only protect everything up to used_length.
>
> Note: it still makes sense to register uffd-wp for max_length, such
> that RAM_UF_WRITEPROTECT is independent of a changing used_length.
>
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v1 0/5] migration/ram: background snapshot fixes and optimiations

David Hildenbrand  wrote:
> On 05.01.23 13:45, David Hildenbrand wrote:
>> Playing with background snapshots in combination with hugetlb and
>> virtio-mem, I found two issues and some reasonable optimizations (skip
>> unprotecting when unregistering).
>> With virtio-mem (RamDiscardManager), we now won't be allocating
>> unnecessary
>> page tables for unplugged ranges when using uffd-wp with shmem/hugetlb.
>> Cc: Juan Quintela  (maintainer:Migration)
>> Cc: "Dr. David Alan Gilbert"  (maintainer:Migration)
>> Cc: Peter Xu 
>> Cc: Andrey Gruzdev 
>
> @Juan, David: I can take this via my tree (especially the last patch
> only optimized virtio-mem for now). Just let me know.

I reviewed everything now.

Queued on my tree.

Later, Juan.

Re: [PATCH v1 5/5] migration/ram: Optimize ram_write_tracking_start() for RamDiscardManager

David Hildenbrand  wrote:
> ram_block_populate_read() already optimizes for RamDiscardManager.
> However, ram_write_tracking_start() will still try protecting discarded
> memory ranges.
>
> Let's optimize, because discarded ranges don't map any pages and
>
> (1) For anonymous memory, trying to protect using uffd-wp without a mapped
> page is ignored by the kernel and consequently a NOP.
>
> (2) For shared/file-backed memory, we will fill present page tables in the
> range with PTE markers. However, we will even allocate page tables
> just to fill them with unnecessary PTE markers and effectively
> waste memory.
>
> So let's exclude these ranges, just like ram_block_populate_read()
> already does.
>
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

[PATCH v6 3/8] igb: add ICR_RXDW

IGB uses RXDW ICR bit to indicate that rx descriptor has been written
back. This is the same as RXT0 bit in older HW.

Signed-off-by: Sriram Yagnaraman 
---
 hw/net/e1000x_regs.h | 4 
 hw/net/igb_core.c| 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/net/e1000x_regs.h b/hw/net/e1000x_regs.h
index fb5b861135..f509db73a7 100644
--- a/hw/net/e1000x_regs.h
+++ b/hw/net/e1000x_regs.h
@@ -335,6 +335,7 @@
 #define E1000_ICR_RXDMT00x0010 /* rx desc min. threshold (0) */
 #define E1000_ICR_RXO   0x0040 /* rx overrun */
 #define E1000_ICR_RXT0  0x0080 /* rx timer intr (ring 0) */
+#define E1000_ICR_RXDW  0x0080 /* rx desc written back */
 #define E1000_ICR_MDAC  0x0200 /* MDIO access complete */
 #define E1000_ICR_RXCFG 0x0400 /* RX /c/ ordered set */
 #define E1000_ICR_GPI_EN0   0x0800 /* GP Int 0 */
@@ -378,6 +379,7 @@
 #define E1000_ICS_RXDMT0E1000_ICR_RXDMT0/* rx desc min. threshold */
 #define E1000_ICS_RXO   E1000_ICR_RXO   /* rx overrun */
 #define E1000_ICS_RXT0  E1000_ICR_RXT0  /* rx timer intr */
+#define E1000_ICS_RXDW  E1000_ICR_RXDW  /* rx desc written back */
 #define E1000_ICS_MDAC  E1000_ICR_MDAC  /* MDIO access complete */
 #define E1000_ICS_RXCFG E1000_ICR_RXCFG /* RX /c/ ordered set */
 #define E1000_ICS_GPI_EN0   E1000_ICR_GPI_EN0   /* GP Int 0 */
@@ -407,6 +409,7 @@
 #define E1000_IMS_RXDMT0E1000_ICR_RXDMT0/* rx desc min. threshold */
 #define E1000_IMS_RXO   E1000_ICR_RXO   /* rx overrun */
 #define E1000_IMS_RXT0  E1000_ICR_RXT0  /* rx timer intr */
+#define E1000_IMS_RXDW  E1000_ICR_RXDW  /* rx desc written back */
 #define E1000_IMS_MDAC  E1000_ICR_MDAC  /* MDIO access complete */
 #define E1000_IMS_RXCFG E1000_ICR_RXCFG /* RX /c/ ordered set */
 #define E1000_IMS_GPI_EN0   E1000_ICR_GPI_EN0   /* GP Int 0 */
@@ -441,6 +444,7 @@
 #define E1000_IMC_RXDMT0E1000_ICR_RXDMT0/* rx desc min. threshold */
 #define E1000_IMC_RXO   E1000_ICR_RXO   /* rx overrun */
 #define E1000_IMC_RXT0  E1000_ICR_RXT0  /* rx timer intr */
+#define E1000_IMC_RXDW  E1000_ICR_RXDW  /* rx desc written back */
 #define E1000_IMC_MDAC  E1000_ICR_MDAC  /* MDIO access complete */
 #define E1000_IMC_RXCFG E1000_ICR_RXCFG /* RX /c/ ordered set */
 #define E1000_IMC_GPI_EN0   E1000_ICR_GPI_EN0   /* GP Int 0 */
diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index b484e6ac30..1ddf54f630 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -1582,7 +1582,7 @@ igb_receive_internal(IGBCore *core, const struct iovec 
*iov, int iovcnt,
 n |= E1000_ICS_RXDMT0;
 }
 
-n |= E1000_ICR_RXT0;
+n |= E1000_ICR_RXDW;
 
 trace_e1000e_rx_written_to_guest(rxr.i->idx);
 }
-- 
2.34.1

Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration

2023-02-02 Thread Eugenio Perez Martin

On Thu, Feb 2, 2023 at 2:00 AM Si-Wei Liu  wrote:
>
>
>
> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
> > It's possible to migrate vdpa net devices if they are shadowed from the
> >
> > start.  But to always shadow the dataplane is effectively break its host
> >
> > passthrough, so its not convenient in vDPA scenarios.
> >
> >
> >
> > This series enables dynamically switching to shadow mode only at
> >
> > migration time.  This allow full data virtqueues passthrough all the
> >
> > time qemu is not migrating.
> >
> >
> >
> > Successfully tested with vdpa_sim_net (but it needs some patches, I
> >
> > will send them soon) and qemu emulated device with vp_vdpa with
> >
> > some restrictions:
> >
> > * No CVQ.
> >
> > * VIRTIO_RING_F_STATE patches.
> What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE, is it
> a new feature that other vdpa driver would need for live migration)?
>

Not really,

Since vp_vdpa wraps a virtio-net-pci driver to give it vdpa
capabilities it needs a virtio in-band method to set and fetch the
virtqueue state. Jason sent a proposal some time ago [1], and I
implemented it in qemu's virtio emulated device.

I can send them as a RFC but I didn't worry about making it pretty,
nor I think they should be merged at the moment. vdpa parent drivers
should follow vdpa_sim changes.

Thanks!

[1] https://lists.oasis-open.org/archives/virtio-comment/202103/msg00036.html

> -Siwei
>
> >
> > * Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like
> >
> >DPDK.
> >
> >
> >
> > Comments are welcome, especially in the patcheswith RFC in the message.
> >
> >
> >
> > v2:
> >
> > - Use a migration listener instead of a memory listener to know when
> >
> >the migration starts.
> >
> > - Add stuff not picked with ASID patches, like enable rings after
> >
> >driver_ok
> >
> > - Add rewinding on the migration src, not in dst
> >
> > - v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html
> >
> >
> >
> > Eugenio Pérez (13):
> >
> >vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
> >
> >vdpa net: move iova tree creation from init to start
> >
> >vdpa: copy cvq shadow_data from data vqs, not from x-svq
> >
> >vdpa: rewind at get_base, not set_base
> >
> >vdpa net: add migration blocker if cannot migrate cvq
> >
> >vhost: delay set_vring_ready after DRIVER_OK
> >
> >vdpa: delay set_vring_ready after DRIVER_OK
> >
> >vdpa: Negotiate _F_SUSPEND feature
> >
> >vdpa: add feature_log parameter to vhost_vdpa
> >
> >vdpa net: allow VHOST_F_LOG_ALL
> >
> >vdpa: add vdpa net migration state notifier
> >
> >vdpa: preemptive kick at enable
> >
> >vdpa: Conditionally expose _F_LOG in vhost_net devices
> >
> >
> >
> >   include/hw/virtio/vhost-backend.h |   4 +
> >
> >   include/hw/virtio/vhost-vdpa.h|   1 +
> >
> >   hw/net/vhost_net.c|  25 ++-
> >
> >   hw/virtio/vhost-vdpa.c|  64 +---
> >
> >   hw/virtio/vhost.c |   3 +
> >
> >   net/vhost-vdpa.c  | 247 +-
> >
> >   6 files changed, 278 insertions(+), 66 deletions(-)
> >
> >
> >
>

Re: [PATCH 0/6] Shorten the runtime of some gitlab-CI shared runner jobs


On 30/1/23 11:44, Thomas Huth wrote:

We're currently facing the problem that the gitlab-CI jobs for the
shared runners take too much of the limited CI minutes on gitlab.com.
Here are now some patches that optimize some of the jobs a little bit
to take less runtime. We slightly lose some test coverage by some of
these changes (e.g. by dropping ppc-softmmu from a Clang-based test
and only continue testing ppc64-softmmu with Clang in another job),
but that should still be much better than running out of CI minutes
after 3/4 of a month.


FWIW the last time I wanted to add some tests Alex suggested me to
run a before/after gcov report and justify the new path tested.

Maybe we should enforce something similar, either cover new paths
or fix a bug.

Re: [PATCH] block/mirror: add 'write-blocking-after-ready' copy mode

2023-02-02 Thread Kevin Wolf

Am 02.02.2023 um 11:19 hat Fiona Ebner geschrieben:
> Am 31.01.23 um 19:18 schrieb Denis V. Lunev:
> > Frankly speaking I would say that this switch could be considered
> > NOT QEMU job and we should just send a notification (event) for the
> > completion of the each iteration and management software should
> > take a decision to switch from async mode to the sync one.

My first thought was very similar. We should provide a building block
that just switches between the two modes and then the management tool
can decide what the right policy is.

Adding a new event when the first iteration is done (I'm not sure if
there is much value in having it for later iterations) makes sense to
me if someone wants to use it. If we add it, let's not forget that
events can be lost and clients must be able to query the same
information, so we'd have to add it to query-jobs, too - which in turn
requires adding a job type specific struct to JobInfo first.

Once we have this generic infrastructure with low-level building block,
I wouldn't necessarily be opposed to having an option build on top where
QEMU automatically does what we consider most useful for most users.
auto-finalize/dismiss already do something similar.

> Unfortunately, our management software is a bit limited in that regard
> currently and making listening for events available in the necessary
> place would take a bit of work. Having the switch command would nearly
> be enough for us (we'd just switch after READY). But we'd also need
> that when the switch happens after READY, that all remaining
> asynchronous operations are finished by the command. Otherwise, the
> original issue with inactivating block drives while mirror still has
> work remains. Do those semantics for the switch sound acceptable?

Completing the remaining asynchronous operations can take a while, so I
don't think it's something to be done in a synchronous QMP command.
Do we need an event that tells you that the switch has completed?

But having to switch the mirror job to sync mode just to avoid doing I/O
on an inactive device sounds wrong to me. It doesn't fix the root cause
of that problem, but just papers over it.

Why does your management tool not complete the mirror job before it
does the migration switchover that inactivates images?

Kevin

Re: An issue with x86 tcg and MMIO

On Thu, 2 Feb 2023 at 10:56, Richard Henderson
 wrote:
>
> On 2/1/23 23:39, Jonathan Cameron wrote:
> > Not sure - if we can do the handling above then sure we could make that 
> > change.
> > I can see there is a path to register the callbacks but I'd kind of assumed
> > ROM meant read only...
>
> I think "romd" means "read mostly".
>
> In the case of flash, I believe that a write changes modes (block erase 
> something
> something) and the region changes state into MMIO.  But normal state is read 
> mode where
> read+execute go through unchallenged.

In QEMU a ROMD MemoryRegion (created by memory_region_init_rom_device())
is a memory region backed by RAM for reads and by callbacks for writes.
(I think ROMD stands for "ROM device".)

You can then use memory_region_device_set_romd() to put the ROMD into
either ROMD mode (the default, reads backed by RAM) or MMIO mode
(reads backed by MMIO callbacks). Writes are always callbacks regardless.
This is mainly meant for flash devices, which are usually reads-as-data
but have a programming mode where you write a command to it and then
read back command results. It's possible to use it for other tricks too.

When a ROMD is in ROMD mode then execution from it is as fast as execution
from any RAM; when it is in MMIO mode then execution from it is as slow
as execution from any other MMIO-backed MemoryRegion.

Note that AFAIK you can't execute from MMIO at all with KVM (either
ROMD-in-MMIO mode or a plain old MMIO device).

You might want to look at whether QEMU's iommu functionality is helpful
to you -- I'm assuming CXL doesn't do weird stuff on a less-than-page
granularity, and the iommu APIs will let you do "programmatically decide
where this address should actually go". The other option involves
mapping and unmapping MemoryRegions inside a container MR.

thanks
-- PMM

Re: [PATCH] hw/arm: Use TYPE_ARM_SMMUV3

On Tue, 24 Jan 2023 at 23:21, Richard Henderson
 wrote:
>
> Use the macro instead of two explicit string literals.
>
> Signed-off-by: Richard Henderson 
> ---
>  hw/arm/sbsa-ref.c | 3 ++-
>  hw/arm/virt.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)



Applied to target-arm.next, thanks.

-- PMM

Re: [PATCH] target/arm: Fix physical address resolution for Stage2

On Thu, 26 Jan 2023 at 23:32, Richard Henderson
 wrote:
>
> Conversion to probe_access_full missed applying the page offset.
>
> Cc: qemu-sta...@nongnu.org
> Reported-by: Sid Manning 
> Fixes: f3639a64f602 ("target/arm: Use softmmu tlbs for page table walking")
> Signed-off-by: Richard Henderson 
> ---
>  target/arm/ptw.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)



Applied to target-arm.next, thanks.

-- PMM

Re: [PATCH v5 1/8] migration/savevm: Move more savevm handling into vmstate_save()

David Hildenbrand  wrote:
> Let's move more code into vmstate_save(), reducing code duplication and
> preparing for reuse of vmstate_save() in qemu_savevm_state_setup(). We
> have to move vmstate_save() to make the compiler happy.
>
> We'll now also trace from qemu_save_device_state(), triggering the same
> tracepoints as previously called from
> qemu_savevm_state_complete_precopy_non_iterable() only. Note that
> qemu_save_device_state() ignores iterable device state, such as RAM,
> and consequently doesn't trigger some other trace points (e.g.,
> trace_savevm_state_setup()).
>
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v5 2/8] migration/savevm: Prepare vmdesc json writer in qemu_savevm_state_setup()

David Hildenbrand  wrote:
> ... and store it in the migration state. This is a preparation for
> storing selected vmds's already in qemu_savevm_state_setup().
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v5 3/8] migration/savevm: Allow immutable device state to be migrated early (i.e., before RAM)

David Hildenbrand  wrote:
> For virtio-mem, we want to have the plugged/unplugged state of memory
> blocks available before migrating any actual RAM content, and perform
> sanity checks before touching anything on the destination. This
> information is immutable on the migration source while migration is active,
>
> We want to use this information for proper preallocation support with
> migration: currently, we don't preallocate memory on the migration target,
> and especially with hugetlb, we can easily run out of hugetlb pages during
> RAM migration and will crash (SIGBUS) instead of catching this gracefully
> via preallocation.
>
> Migrating device state via a VMSD before we start iterating is currently
> impossible: the only approach that would be possible is avoiding a VMSD
> and migrating state manually during save_setup(), to be restored during
> load_state().
>
> Let's allow for migrating device state via a VMSD early, during the
> setup phase in qemu_savevm_state_setup(). To keep it simple, we
> indicate applicable VMSD's using an "early_setup" flag.
>
> Note that only very selected devices (i.e., ones seriously messing with
> RAM setup) are supposed to make use of such early state migration.
>
> While at it, also use a bool for the "unmigratable" member.
>
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v5 4/8] migration/vmstate: Introduce VMSTATE_WITH_TMP_TEST() and VMSTATE_BITMAP_TEST()

David Hildenbrand  wrote:
> We'll make use of both next in the context of virtio-mem.
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v5 5/8] migration/ram: Factor out check for advised postcopy

David Hildenbrand  wrote:
> Let's factor out this check, to be used in virtio-mem context next.
>
> While at it, fix a spelling error in a related comment.
>
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v5 6/8] virtio-mem: Fail if a memory backend with "prealloc=on" is specified

David Hildenbrand  wrote:
> "prealloc=on" for the memory backend does not work as expected, as
> virtio-mem will simply discard all preallocated memory immediately again.
> In the best case, it's an expensive NOP. In the worst case, it's an
> unexpected allocation error.
>
> Instead, "prealloc=on" should be specified for the virtio-mem device only,
> such that virtio-mem will try preallocating memory before plugging
> memory dynamically to the guest. Fail if such a memory backend is
> provided.
>
> Tested-by: Michal Privoznik 
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH v5 7/8] virtio-mem: Migrate immutable properties early

David Hildenbrand  wrote:
> The bitmap and the size are immutable while migration is active: see
> virtio_mem_is_busy(). We can migrate this information early, before
> migrating any actual RAM content. Further, all information we need for
> sanity checks is immutable as well.
>
> Having this information in place early will, for example, allow for
> properly preallocating memory before touching these memory locations
> during RAM migration: this way, we can make sure that all memory was
> actually preallocated and that any user errors (e.g., insufficient
> hugetlb pages) can be handled gracefully.
>
> In contrast, usable_region_size and requested_size can theoretically
> still be modified on the source while the VM is running. Keep migrating
> these properties the usual, late, way.
>
> Use a new device property to keep behavior of compat machines
> unmodified.
>
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela 


> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 616f3a207c..29b57f6448 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,7 +41,9 @@
>  #include "hw/virtio/virtio-pci.h"
>  #include "qom/object_interfaces.h"
>  
> -GlobalProperty hw_compat_7_2[] = {};
> +GlobalProperty hw_compat_7_2[] = {
> +{ "virtio-mem", "x-early-migration", "false" },
> +};

We can stop prettending that we don't support this kind of properties O:-)

Re: [PATCH v5 8/8] virtio-mem: Proper support for preallocation with migration

David Hildenbrand  wrote:
> Ordinary memory preallocation runs when QEMU starts up and creates the
> memory backends, before processing the incoming migration stream. With
> virtio-mem, we don't know which memory blocks to preallocate before
> migration started. Now that we migrate the virtio-mem bitmap early, before
> migrating any RAM content, we can safely preallocate memory for all plugged
> memory blocks before migrating any RAM content.
>
> This is especially relevant for the following cases:
>
> (1) User errors
>
> With hugetlb/files, if we don't have sufficient backend memory available on
> the migration destination, we'll crash QEMU (SIGBUS) during RAM migration
> when running out of backend memory. Preallocating memory before actual
> RAM migration allows for failing gracefully and informing the user about
> the setup problem.
>
> (2) Excluded memory ranges during migration
>
> For example, virtio-balloon free page hinting will exclude some pages
> from getting migrated. In that case, we won't crash during RAM
> migration, but later, when running the VM on the destination, which is
> bad.
>
> To fix this for new QEMU machines that migrate the bitmap early,
> preallocate the memory early, before any RAM migration. Warn with old
> QEMU machines.
>
> Getting postcopy right is a bit tricky, but we essentially now implement
> the same (problematic) preallocation logic as ordinary preallocation:
> preallocate memory early and discard it again before precopy starts. During
> ordinary preallocation, discarding of RAM happens when postcopy is advised.
> As the state (bitmap) is loaded after postcopy was advised but before
> postcopy starts listening, we have to discard memory we preallocated
> immediately again ourselves.
>
> Note that nothing (not even hugetlb reservations) guarantees for postcopy
> that backend memory (especially, hugetlb pages) are still free after they
> were freed ones while discarding RAM. Still, allocating that memory at
> least once helps catching some basic setup problems.
>
> Before this change, trying to restore a VM when insufficient hugetlb
> pages are around results in the process crashing to to a "Bus error"
> (SIGBUS). With this change, QEMU fails gracefully:
>
>   qemu-system-x86_64: qemu_prealloc_mem: preallocating memory failed: Bad 
> address
>   qemu-system-x86_64: error while loading state for instance 0x0 of device 
> ':00:03.0/virtio-mem-device-early'
>   qemu-system-x86_64: load of migration failed: Cannot allocate memory
>
> And we can even introspect the early migration data, including the
> bitmap:
>   $ ./scripts/analyze-migration.py -f STATEFILE
>   {
>   "ram (2)": {
>   "section sizes": {
>   ":00:03.0/mem0": "0x00078000",
>   ":00:04.0/mem1": "0x00078000",
>   "pc.ram": "0x0001",
>   "/rom@etc/acpi/tables": "0x0002",
>   "pc.bios": "0x0004",
>   ":00:02.0/e1000.rom": "0x0004",
>   "pc.rom": "0x0002",
>   "/rom@etc/table-loader": "0x1000",
>   "/rom@etc/acpi/rsdp": "0x1000"
>   }
>   },
>   ":00:03.0/virtio-mem-device-early (51)": {
>   "tmp": "00 00 00 01 40 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 
> 20 00 00 00 00 00 00",
>   "size": "0x4000",
>   "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]
>   },
>   ":00:04.0/virtio-mem-device-early (53)": {
>   "tmp": "00 00 00 08 c0 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 
> 20 00 00 00 00 00 00",
>   "size": "0x0001fa40",
>   "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]
>   },
>   [...]
>
> Reported-by: Jing Qi 
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Juan Quintela

Re: [PATCH 01/14] linux-user/sparc: Raise SIGILL for all unhandled software traps

2023-02-02 Thread Ilya Leoshkevich

On Wed, 2023-02-01 at 14:51 -1000, Richard Henderson wrote:
> The linux kernel's trap tables vector all unassigned trap
> numbers to BAD_TRAP, which then raises SIGILL.
> 
> Reported-by: Ilya Leoshkevich 
> Signed-off-by: Richard Henderson 
> ---
>  linux-user/sparc/cpu_loop.c | 8 
>  1 file changed, 8 insertions(+)

Thanks!

Tested-by: Ilya Leoshkevich

[PATCH v2 00/10] TriCore instruction bugfixes

Hi,

while resolving [1], I noticed a few more bugs in DEXTR and LD_BU_PREINC which
this patch series fixes.

I also included the solo patch [2] into this series.

For v2 I added testcases in tests/tcg/tricore that exercise the errors
that these patches fix.

Cheers,
Bastian

[1] https://gitlab.com/qemu-project/qemu/-/issues/653
[2] 
https://lore.kernel.org/qemu-devel/20230113123759.677960-1-kbast...@mail.uni-paderborn.de/

Bastian Koppelmann (10):
  target/tricore: Fix OPC2_32_RCRW_IMASK translation
  tests/tcg/tricore: Add test for OPC2_32_RCRW_IMASK
  target/tricore: Fix OPC2_32_RCRW_INSERT translation
  tests/tcg/tricore: Add test for OPC2_32_RCRW_INSERT
  target/tricore: Fix RRPW_DEXTR
  tests/tcg/tricore: Add tests for RRPW_DEXTR
  target/tricore: Fix OPC2_32__DEXTR
  tests/tcg/tricore: Add OPC2_32__DEXTR tests
  target/tricore: Fix OPC2_32_BO_LD_BU_PREINC
  tests/tcg/tricore: Add LD.BU tests

 target/tricore/translate.c| 39 ++--
 tests/tcg/tricore/Makefile.softmmu-target |  4 ++
 tests/tcg/tricore/macros.h| 63 +--
 tests/tcg/tricore/test_dextr.S| 75 +++
 tests/tcg/tricore/test_imask.S| 10 +++
 tests/tcg/tricore/test_insert.S   |  9 +++
 tests/tcg/tricore/test_ld_bu.S| 15 +
 7 files changed, 193 insertions(+), 22 deletions(-)
 create mode 100644 tests/tcg/tricore/test_dextr.S
 create mode 100644 tests/tcg/tricore/test_imask.S
 create mode 100644 tests/tcg/tricore/test_insert.S
 create mode 100644 tests/tcg/tricore/test_ld_bu.S

-- 
2.39.1

[PATCH v2 04/10] tests/tcg/tricore: Add test for OPC2_32_RCRW_INSERT

DREG_RS2 and DREG_CALC_RESULT were mapped to the same register which
would not trigger https://gitlab.com/qemu-project/qemu/-/issues/653. So
let's make each register unique.

Signed-off-by: Bastian Koppelmann 
---
 tests/tcg/tricore/Makefile.softmmu-target |  1 +
 tests/tcg/tricore/macros.h| 16 
 tests/tcg/tricore/test_insert.S   |  9 +
 3 files changed, 22 insertions(+), 4 deletions(-)
 create mode 100644 tests/tcg/tricore/test_insert.S

diff --git a/tests/tcg/tricore/Makefile.softmmu-target 
b/tests/tcg/tricore/Makefile.softmmu-target
index bc0cfae8d0..afabba8631 100644
--- a/tests/tcg/tricore/Makefile.softmmu-target
+++ b/tests/tcg/tricore/Makefile.softmmu-target
@@ -11,6 +11,7 @@ TESTS += test_fadd.tst
 TESTS += test_fmul.tst
 TESTS += test_ftoi.tst
 TESTS += test_imask.tst
+TESTS += test_insert.tst
 TESTS += test_madd.tst
 TESTS += test_msub.tst
 TESTS += test_muls.tst
diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h
index ceb7e9c0b7..4f2bc3cb0f 100644
--- a/tests/tcg/tricore/macros.h
+++ b/tests/tcg/tricore/macros.h
@@ -9,10 +9,10 @@
 /* Register definitions */
 #define DREG_RS1 %d0
 #define DREG_RS2 %d1
-#define DREG_RS3 %d4
-#define DREG_CALC_RESULT %d1
-#define DREG_CALC_PSW %d2
-#define DREG_CORRECT_PSW %d3
+#define DREG_RS3 %d2
+#define DREG_CALC_RESULT %d3
+#define DREG_CALC_PSW %d4
+#define DREG_CORRECT_PSW %d5
 #define DREG_TEMP_LI %d10
 #define DREG_TEMP %d11
 #define DREG_TEST_NUM %d14
@@ -103,6 +103,14 @@ test_ ## num:  
\
 insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, imm;   \
 )
 
+#define TEST_D_DIDI(insn, num, result, rs1, imm1, rs2, imm2) \
+TEST_CASE(num, DREG_CALC_RESULT, result, \
+LI(DREG_RS1, rs1);   \
+LI(DREG_RS2, rs1);   \
+rstv;\
+insn DREG_CALC_RESULT, DREG_RS1, imm1, DREG_RS2, imm2;   \
+)
+
 #define TEST_E_ED(insn, num, res_hi, res_lo, rs1_hi, rs1_lo, rs2) \
 TEST_CASE_E(num, res_lo, res_hi,  \
 LI(EREG_RS1_LO, rs1_lo);  \
diff --git a/tests/tcg/tricore/test_insert.S b/tests/tcg/tricore/test_insert.S
new file mode 100644
index 00..d5fd2237e1
--- /dev/null
+++ b/tests/tcg/tricore/test_insert.S
@@ -0,0 +1,9 @@
+#include "macros.h"
+.text
+.global _start
+_start:
+#insn numresultrs1imm1   rs2  imm2
+# | |  ||   | ||
+TEST_D_DIDI(insert, 1, 0x7fff, 0x, 0xa, 0x10, 0x8)
+
+TEST_PASSFAIL
-- 
2.39.1

[PATCH v2 03/10] target/tricore: Fix OPC2_32_RCRW_INSERT translation

we were mixing up the "c" and "d" registers. We used "d" as a
destination register und "c" as the source. According to the TriCore ISA
manual 1.6 vol 2 it is the other way round.

Reviewed-by: Richard Henderson 
Signed-off-by: Bastian Koppelmann 
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/653
---
 target/tricore/translate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 8de4e56b1f..6149d4f5c0 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -5805,8 +5805,8 @@ static void decode_rcrw_insert(DisasContext *ctx)
 
 tcg_gen_movi_tl(temp, width);
 tcg_gen_movi_tl(temp2, const4);
-tcg_gen_andi_tl(temp3, cpu_gpr_d[r4], 0x1f);
-gen_insert(cpu_gpr_d[r3], cpu_gpr_d[r1], temp2, temp, temp3);
+tcg_gen_andi_tl(temp3, cpu_gpr_d[r3], 0x1f);
+gen_insert(cpu_gpr_d[r4], cpu_gpr_d[r1], temp2, temp, temp3);
 
 tcg_temp_free(temp3);
 break;
-- 
2.39.1

[PATCH v2 07/10] target/tricore: Fix OPC2_32_RRRR_DEXTR

if cpu_gpr_d[r3] == 0 then we were shifting the lower register to the
right by 32 which is undefined behaviour. In this case the TriCore would
do nothing an just return the higher register cpu_reg_d[r1]. We fixed
that by detecting whether cpu_gpr_d[r3] was zero and cleared the lower
register.

Reviewed-by: Richard Henderson 
Signed-off-by: Bastian Koppelmann 
---
v1 -> v2:
  - use tcg_constant_tl(0) for TCGv zero
  - add extra line on top of comment

 target/tricore/translate.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 3b4ec530b1..8bf78b46d0 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -8245,10 +8245,19 @@ static void decode__extract_insert(DisasContext 
*ctx)
 if (r1 == r2) {
 tcg_gen_rotl_tl(cpu_gpr_d[r4], cpu_gpr_d[r1], tmp_pos);
 } else {
+TCGv msw = tcg_temp_new();
+TCGv zero = tcg_constant_tl(0);
 tcg_gen_shl_tl(tmp_width, cpu_gpr_d[r1], tmp_pos);
-tcg_gen_subfi_tl(tmp_pos, 32, tmp_pos);
-tcg_gen_shr_tl(tmp_pos, cpu_gpr_d[r2], tmp_pos);
-tcg_gen_or_tl(cpu_gpr_d[r4], tmp_width, tmp_pos);
+tcg_gen_subfi_tl(msw, 32, tmp_pos);
+tcg_gen_shr_tl(msw, cpu_gpr_d[r2], msw);
+/*
+ * if pos == 0, then we do cpu_gpr_d[r2] << 32, which is undefined
+ * behaviour. So check that case here and set the low bits to zero
+ * which effectivly returns cpu_gpr_d[r1]
+ */
+tcg_gen_movcond_tl(TCG_COND_EQ, msw, tmp_pos, zero, zero, msw);
+tcg_gen_or_tl(cpu_gpr_d[r4], tmp_width, msw);
+tcg_temp_free(msw);
 }
 break;
 case OPC2_32__EXTR:
-- 
2.39.1

[PATCH v2 09/10] target/tricore: Fix OPC2_32_BO_LD_BU_PREINC

we were sign extending the result of the load, while the instruction
clearly states that the result should be unsigned.

Reviewed-by: Richard Henderson 
Signed-off-by: Bastian Koppelmann 
---
 target/tricore/translate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 8bf78b46d0..ab386cef50 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -4964,7 +4964,7 @@ static void 
decode_bo_addrmode_ld_post_pre_base(DisasContext *ctx)
 tcg_gen_addi_tl(cpu_gpr_a[r2], cpu_gpr_a[r2], off10);
 break;
 case OPC2_32_BO_LD_BU_PREINC:
-gen_ld_preincr(ctx, cpu_gpr_d[r1], cpu_gpr_a[r2], off10, MO_SB);
+gen_ld_preincr(ctx, cpu_gpr_d[r1], cpu_gpr_a[r2], off10, MO_UB);
 break;
 case OPC2_32_BO_LD_D_SHORTOFF:
 CHECK_REG_PAIR(r1);
-- 
2.39.1

Re: [PATCH 1/2] tests/avocado: Invert parameter vs. tag precedence during setUp

2023-02-02 Thread Daniel Henrique Barboza





On 1/20/23 19:14, Fabiano Rosas wrote:

We currently never pass parameters to the avocado process via
Makefile. To start doing so we need to invert the precedence between
command line parameters and tags, otherwise a command line parameter
would override values for all the tests, which is unlikely to be a
common use-case.

A more likely use-case is to force certain values for the tests that
have no tags. For instance, if a test has no 'arch' tags and therefore
can run for all targets, one could possibly force it to run on a
certain target with an arch=foo parameter.

This applies to the variables set during setUp(): arch, machine, cpu,
distro_name, distro_version. Parameters used directly in tests or read
via self.params.get are left unchanged.

Signed-off-by: Fabiano Rosas 
---


Reviewed-by: Daniel Henrique Barboza 


  tests/avocado/avocado_qemu/__init__.py | 32 +++---
  1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/tests/avocado/avocado_qemu/__init__.py 
b/tests/avocado/avocado_qemu/__init__.py
index 910f3ba1ea..a181cac383 100644
--- a/tests/avocado/avocado_qemu/__init__.py
+++ b/tests/avocado/avocado_qemu/__init__.py
@@ -240,12 +240,23 @@ def _get_unique_tag_val(self, tag_name):
  return vals.pop()
  return None
  
+def _get_prop(self, name):

+"""
+Infer test properties based on tags. If no tag is present,
+look for a command line parameter of the same name.
+"""
+val = self._get_unique_tag_val(name)
+if not val:
+# If there's no tag, look for a command line
+# parameter. This allows the user to override any defaults
+# the caller of this function would choose if we were to
+# return None.
+val = self.params.get(name)
+return val
+
  def setUp(self, bin_prefix):
-self.arch = self.params.get('arch',
-default=self._get_unique_tag_val('arch'))
-
-self.cpu = self.params.get('cpu',
-   default=self._get_unique_tag_val('cpu'))
+self.arch = self._get_prop('arch')
+self.cpu = self._get_prop('cpu')
  
  default_qemu_bin = pick_default_qemu_bin(bin_prefix, arch=self.arch)

  self.qemu_bin = self.params.get('qemu_bin',
@@ -274,8 +285,7 @@ def setUp(self):
  
  super().setUp('qemu-system-')
  
-self.machine = self.params.get('machine',

-   
default=self._get_unique_tag_val('machine'))
+self.machine = self._get_prop('machine')
  
  def require_accelerator(self, accelerator):

  """
@@ -529,15 +539,11 @@ class LinuxTest(LinuxSSHMixIn, QemuSystemTest):
  memory = '1024'
  
  def _set_distro(self):

-distro_name = self.params.get(
-'distro',
-default=self._get_unique_tag_val('distro'))
+distro_name = self._get_prop('distro')
  if not distro_name:
  distro_name = 'fedora'
  
-distro_version = self.params.get(

-'distro_version',
-default=self._get_unique_tag_val('distro_version'))
+distro_version = self._get_prop('distro_version')
  if not distro_version:
  distro_version = '31'

[PATCH v2 08/10] tests/tcg/tricore: Add OPC2_32_RRRR_DEXTR tests

Signed-off-by: Bastian Koppelmann 
---
 tests/tcg/tricore/macros.h |  9 +
 tests/tcg/tricore/test_dextr.S | 35 ++
 2 files changed, 44 insertions(+)

diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h
index 8bc0faf1e4..06bdbf83cb 100644
--- a/tests/tcg/tricore/macros.h
+++ b/tests/tcg/tricore/macros.h
@@ -78,6 +78,15 @@ test_ ## num:
  \
 insn DREG_CORRECT_RESULT, DREG_RS1;   \
 )
 
+#define TEST_D_DDD(insn, num, result, rs1, rs2, rs3)\
+TEST_CASE(num, DREG_CALC_RESULT, result,\
+LI(DREG_RS1, rs1);  \
+LI(DREG_RS2, rs2);  \
+LI(DREG_RS3, rs3);  \
+rstv;   \
+insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, DREG_RS3; \
+)
+
 #define TEST_D_DD_PSW(insn, num, result, psw, rs1, rs2) \
 TEST_CASE_PSW(num, DREG_CALC_RESULT, result, psw,   \
 LI(DREG_RS1, rs1);  \
diff --git a/tests/tcg/tricore/test_dextr.S b/tests/tcg/tricore/test_dextr.S
index c8a9fc453a..82c8fe5185 100644
--- a/tests/tcg/tricore/test_dextr.S
+++ b/tests/tcg/tricore/test_dextr.S
@@ -37,4 +37,39 @@ _start:
 TEST_D_DDI(dextr, 31, 0x48d159e2, 0xabcdef01, 0x23456789, 30)
 TEST_D_DDI(dextr, 32, 0x91a2b3c4, 0xabcdef01, 0x23456789, 31)
 
+#   insn  numresult  rs1  rs2rs3
+#| |   |  ||  |
+TEST_D_DDD(dextr, 33, 0xabcdef01, 0xabcdef01, 0x23456789, 0)
+TEST_D_DDD(dextr, 34, 0x579bde02, 0xabcdef01, 0x23456789, 1)
+TEST_D_DDD(dextr, 35, 0xaf37bc04, 0xabcdef01, 0x23456789, 2)
+TEST_D_DDD(dextr, 36, 0x5e6f7809, 0xabcdef01, 0x23456789, 3)
+TEST_D_DDD(dextr, 37, 0xbcdef012, 0xabcdef01, 0x23456789, 4)
+TEST_D_DDD(dextr, 38, 0x79bde024, 0xabcdef01, 0x23456789, 5)
+TEST_D_DDD(dextr, 39, 0xf37bc048, 0xabcdef01, 0x23456789, 6)
+TEST_D_DDD(dextr, 40, 0xe6f78091, 0xabcdef01, 0x23456789, 7)
+TEST_D_DDD(dextr, 41, 0xcdef0123, 0xabcdef01, 0x23456789, 8)
+TEST_D_DDD(dextr, 42, 0x9bde0246, 0xabcdef01, 0x23456789, 9)
+TEST_D_DDD(dextr, 43, 0x37bc048d, 0xabcdef01, 0x23456789, 10)
+TEST_D_DDD(dextr, 44, 0x6f78091a, 0xabcdef01, 0x23456789, 11)
+TEST_D_DDD(dextr, 45, 0xdef01234, 0xabcdef01, 0x23456789, 12)
+TEST_D_DDD(dextr, 46, 0xbde02468, 0xabcdef01, 0x23456789, 13)
+TEST_D_DDD(dextr, 47, 0x7bc048d1, 0xabcdef01, 0x23456789, 14)
+TEST_D_DDD(dextr, 48, 0xf78091a2, 0xabcdef01, 0x23456789, 15)
+TEST_D_DDD(dextr, 49, 0xef012345, 0xabcdef01, 0x23456789, 16)
+TEST_D_DDD(dextr, 51, 0xde02468a, 0xabcdef01, 0x23456789, 17)
+TEST_D_DDD(dextr, 52, 0xbc048d15, 0xabcdef01, 0x23456789, 18)
+TEST_D_DDD(dextr, 53, 0x78091a2b, 0xabcdef01, 0x23456789, 19)
+TEST_D_DDD(dextr, 54, 0xf0123456, 0xabcdef01, 0x23456789, 20)
+TEST_D_DDD(dextr, 55, 0xe02468ac, 0xabcdef01, 0x23456789, 21)
+TEST_D_DDD(dextr, 56, 0xc048d159, 0xabcdef01, 0x23456789, 22)
+TEST_D_DDD(dextr, 57, 0x8091a2b3, 0xabcdef01, 0x23456789, 23)
+TEST_D_DDD(dextr, 58, 0x01234567, 0xabcdef01, 0x23456789, 24)
+TEST_D_DDD(dextr, 59, 0x02468acf, 0xabcdef01, 0x23456789, 25)
+TEST_D_DDD(dextr, 60, 0x048d159e, 0xabcdef01, 0x23456789, 26)
+TEST_D_DDD(dextr, 61, 0x091a2b3c, 0xabcdef01, 0x23456789, 27)
+TEST_D_DDD(dextr, 62, 0x12345678, 0xabcdef01, 0x23456789, 28)
+TEST_D_DDD(dextr, 63, 0x2468acf1, 0xabcdef01, 0x23456789, 29)
+TEST_D_DDD(dextr, 64, 0x48d159e2, 0xabcdef01, 0x23456789, 30)
+TEST_D_DDD(dextr, 65, 0x91a2b3c4, 0xabcdef01, 0x23456789, 31)
+
 TEST_PASSFAIL
-- 
2.39.1

[PATCH v2 06/10] tests/tcg/tricore: Add tests for RRPW_DEXTR

Signed-off-by: Bastian Koppelmann 
---
 tests/tcg/tricore/Makefile.softmmu-target |  1 +
 tests/tcg/tricore/macros.h|  8 +
 tests/tcg/tricore/test_dextr.S| 40 +++
 3 files changed, 49 insertions(+)
 create mode 100644 tests/tcg/tricore/test_dextr.S

diff --git a/tests/tcg/tricore/Makefile.softmmu-target 
b/tests/tcg/tricore/Makefile.softmmu-target
index afabba8631..e83cc4b7cd 100644
--- a/tests/tcg/tricore/Makefile.softmmu-target
+++ b/tests/tcg/tricore/Makefile.softmmu-target
@@ -6,6 +6,7 @@ ASFLAGS =
 TESTS += test_abs.tst
 TESTS += test_bmerge.tst
 TESTS += test_clz.tst
+TESTS += test_dextr.tst
 TESTS += test_dvstep.tst
 TESTS += test_fadd.tst
 TESTS += test_fmul.tst
diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h
index 4f2bc3cb0f..8bc0faf1e4 100644
--- a/tests/tcg/tricore/macros.h
+++ b/tests/tcg/tricore/macros.h
@@ -95,6 +95,14 @@ test_ ## num:
  \
 insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, DREG_RS3;  \
 )
 
+#define TEST_D_DDI(insn, num, result, rs1, rs2, imm) \
+TEST_CASE(num, DREG_CALC_RESULT, result, \
+LI(DREG_RS1, rs1);   \
+LI(DREG_RS2, rs2);   \
+rstv;\
+insn DREG_CALC_RESULT, DREG_RS1, DREG_RS2, imm;  \
+)
+
 #define TEST_D_DDI_PSW(insn, num, result, psw, rs1, rs2, imm) \
 TEST_CASE_PSW(num, DREG_CALC_RESULT, result, psw, \
 LI(DREG_RS1, rs1);\
diff --git a/tests/tcg/tricore/test_dextr.S b/tests/tcg/tricore/test_dextr.S
new file mode 100644
index 00..c8a9fc453a
--- /dev/null
+++ b/tests/tcg/tricore/test_dextr.S
@@ -0,0 +1,40 @@
+#include "macros.h"
+.text
+.global _start
+_start:
+#   insn  numresult  rs1  rs2imm
+#| |   |  ||  |
+TEST_D_DDI(dextr,  1, 0xabcdef01, 0xabcdef01, 0x23456789, 0)
+TEST_D_DDI(dextr,  2, 0x579bde02, 0xabcdef01, 0x23456789, 1)
+TEST_D_DDI(dextr,  3, 0xaf37bc04, 0xabcdef01, 0x23456789, 2)
+TEST_D_DDI(dextr,  4, 0x5e6f7809, 0xabcdef01, 0x23456789, 3)
+TEST_D_DDI(dextr,  5, 0xbcdef012, 0xabcdef01, 0x23456789, 4)
+TEST_D_DDI(dextr,  6, 0x79bde024, 0xabcdef01, 0x23456789, 5)
+TEST_D_DDI(dextr,  7, 0xf37bc048, 0xabcdef01, 0x23456789, 6)
+TEST_D_DDI(dextr,  8, 0xe6f78091, 0xabcdef01, 0x23456789, 7)
+TEST_D_DDI(dextr,  9, 0xcdef0123, 0xabcdef01, 0x23456789, 8)
+TEST_D_DDI(dextr, 10, 0x9bde0246, 0xabcdef01, 0x23456789, 9)
+TEST_D_DDI(dextr, 11, 0x37bc048d, 0xabcdef01, 0x23456789, 10)
+TEST_D_DDI(dextr, 12, 0x6f78091a, 0xabcdef01, 0x23456789, 11)
+TEST_D_DDI(dextr, 13, 0xdef01234, 0xabcdef01, 0x23456789, 12)
+TEST_D_DDI(dextr, 14, 0xbde02468, 0xabcdef01, 0x23456789, 13)
+TEST_D_DDI(dextr, 15, 0x7bc048d1, 0xabcdef01, 0x23456789, 14)
+TEST_D_DDI(dextr, 16, 0xf78091a2, 0xabcdef01, 0x23456789, 15)
+TEST_D_DDI(dextr, 17, 0xef012345, 0xabcdef01, 0x23456789, 16)
+TEST_D_DDI(dextr, 18, 0xde02468a, 0xabcdef01, 0x23456789, 17)
+TEST_D_DDI(dextr, 19, 0xbc048d15, 0xabcdef01, 0x23456789, 18)
+TEST_D_DDI(dextr, 20, 0x78091a2b, 0xabcdef01, 0x23456789, 19)
+TEST_D_DDI(dextr, 21, 0xf0123456, 0xabcdef01, 0x23456789, 20)
+TEST_D_DDI(dextr, 22, 0xe02468ac, 0xabcdef01, 0x23456789, 21)
+TEST_D_DDI(dextr, 23, 0xc048d159, 0xabcdef01, 0x23456789, 22)
+TEST_D_DDI(dextr, 24, 0x8091a2b3, 0xabcdef01, 0x23456789, 23)
+TEST_D_DDI(dextr, 25, 0x01234567, 0xabcdef01, 0x23456789, 24)
+TEST_D_DDI(dextr, 26, 0x02468acf, 0xabcdef01, 0x23456789, 25)
+TEST_D_DDI(dextr, 27, 0x048d159e, 0xabcdef01, 0x23456789, 26)
+TEST_D_DDI(dextr, 28, 0x091a2b3c, 0xabcdef01, 0x23456789, 27)
+TEST_D_DDI(dextr, 29, 0x12345678, 0xabcdef01, 0x23456789, 28)
+TEST_D_DDI(dextr, 30, 0x2468acf1, 0xabcdef01, 0x23456789, 29)
+TEST_D_DDI(dextr, 31, 0x48d159e2, 0xabcdef01, 0x23456789, 30)
+TEST_D_DDI(dextr, 32, 0x91a2b3c4, 0xabcdef01, 0x23456789, 31)
+
+TEST_PASSFAIL
-- 
2.39.1

[PATCH v2 01/10] target/tricore: Fix OPC2_32_RCRW_IMASK translation

we were mixing up the "c" and "d" registers. We used "d" as a
destination register und "c" as the source. According to the TriCore ISA
manual 1.6 vol 2 it is the other way round.

Reviewed-by: Richard Henderson 
Signed-off-by: Bastian Koppelmann 
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/653
---
 target/tricore/translate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index df9e46c649..8de4e56b1f 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -5794,11 +5794,11 @@ static void decode_rcrw_insert(DisasContext *ctx)
 
 switch (op2) {
 case OPC2_32_RCRW_IMASK:
-tcg_gen_andi_tl(temp, cpu_gpr_d[r4], 0x1f);
+tcg_gen_andi_tl(temp, cpu_gpr_d[r3], 0x1f);
 tcg_gen_movi_tl(temp2, (1 << width) - 1);
-tcg_gen_shl_tl(cpu_gpr_d[r3 + 1], temp2, temp);
+tcg_gen_shl_tl(cpu_gpr_d[r4 + 1], temp2, temp);
 tcg_gen_movi_tl(temp2, const4);
-tcg_gen_shl_tl(cpu_gpr_d[r3], temp2, temp);
+tcg_gen_shl_tl(cpu_gpr_d[r4], temp2, temp);
 break;
 case OPC2_32_RCRW_INSERT:
 temp3 = tcg_temp_new();
-- 
2.39.1

[PATCH v2 05/10] target/tricore: Fix RRPW_DEXTR

if we used const16 == 0 we would crash qemu with the error:
../tcg/tcg-op.c:196: tcg_gen_shri_i32: Assertion `arg2 >= 0 && arg2 < 32' failed

This whole instruction can be handled by 'tcg_gen_extract2_tl' which
takes care of this special case as well.

Signed-off-by: Bastian Koppelmann 
---
v1 -> v2:
   Use tcg_gen_extract2_tl and remove all special cases.

 target/tricore/translate.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index 6149d4f5c0..3b4ec530b1 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -8706,15 +8706,9 @@ static void decode_32Bit_opc(DisasContext *ctx)
 r2 = MASK_OP_RRPW_S2(ctx->opcode);
 r3 = MASK_OP_RRPW_D(ctx->opcode);
 const16 = MASK_OP_RRPW_POS(ctx->opcode);
-if (r1 == r2) {
-tcg_gen_rotli_tl(cpu_gpr_d[r3], cpu_gpr_d[r1], const16);
-} else {
-temp = tcg_temp_new();
-tcg_gen_shli_tl(temp, cpu_gpr_d[r1], const16);
-tcg_gen_shri_tl(cpu_gpr_d[r3], cpu_gpr_d[r2], 32 - const16);
-tcg_gen_or_tl(cpu_gpr_d[r3], cpu_gpr_d[r3], temp);
-tcg_temp_free(temp);
-}
+
+tcg_gen_extract2_tl(cpu_gpr_d[r3], cpu_gpr_d[r2], cpu_gpr_d[r1],
+32 - const16);
 break;
 /* RRR Format */
 case OPCM_32_RRR_COND_SELECT:
-- 
2.39.1

[PATCH v2 10/10] tests/tcg/tricore: Add LD.BU tests

Signed-off-by: Bastian Koppelmann 
---
 tests/tcg/tricore/Makefile.softmmu-target |  1 +
 tests/tcg/tricore/macros.h| 23 +++
 tests/tcg/tricore/test_ld_bu.S| 15 +++
 3 files changed, 39 insertions(+)
 create mode 100644 tests/tcg/tricore/test_ld_bu.S

diff --git a/tests/tcg/tricore/Makefile.softmmu-target 
b/tests/tcg/tricore/Makefile.softmmu-target
index e83cc4b7cd..b6c19dbecd 100644
--- a/tests/tcg/tricore/Makefile.softmmu-target
+++ b/tests/tcg/tricore/Makefile.softmmu-target
@@ -13,6 +13,7 @@ TESTS += test_fmul.tst
 TESTS += test_ftoi.tst
 TESTS += test_imask.tst
 TESTS += test_insert.tst
+TESTS += test_ld_bu.tst
 TESTS += test_madd.tst
 TESTS += test_msub.tst
 TESTS += test_muls.tst
diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h
index 06bdbf83cb..109ef62a4d 100644
--- a/tests/tcg/tricore/macros.h
+++ b/tests/tcg/tricore/macros.h
@@ -4,6 +4,10 @@
 movh DREG_TEMP_LI, up:val; \
 or reg, reg, DREG_TEMP_LI; \
 
+#define LIA(reg, val)\
+LI(DREG_TEMP, val)   \
+mov.a reg, DREG_TEMP;
+
 /* Address definitions */
 #define TESTDEV_ADDR 0xf000
 /* Register definitions */
@@ -18,6 +22,10 @@
 #define DREG_TEST_NUM %d14
 #define DREG_CORRECT_RESULT %d15
 
+#define AREG_ADDR %a0
+#define AREG_CORRECT_RESULT %a3
+#define MEM_BASE_ADDR 0xd000
+
 #define DREG_DEV_ADDR %a15
 
 #define EREG_RS1 %e6
@@ -60,11 +68,24 @@ test_ ## num:   
   \
 mov DREG_TEST_NUM, num;\
 jne DREG_CALC_PSW, DREG_CORRECT_PSW, fail;
 
+#define TEST_LD(insn, num, result, addr_result, ld_pattern) \
+test_ ## num:   \
+LIA(AREG_ADDR, test_data)   \
+insn DREG_CALC_RESULT, ld_pattern;  \
+LI(DREG_CORRECT_RESULT, result) \
+mov DREG_TEST_NUM, num; \
+jne DREG_CALC_RESULT, DREG_CORRECT_RESULT, fail;\
+mov.d DREG_CALC_RESULT, AREG_ADDR;  \
+LI(DREG_CORRECT_RESULT, addr_result)\
+jne DREG_CALC_RESULT, DREG_CORRECT_RESULT, fail;
+
 /* Actual test case type
  * e.g inst %dX, %dY  -> TEST_D_D
  * inst %dX, %dY, %dZ -> TEST_D_DD
  * inst %eX, %dY, %dZ -> TEST_E_DD
  */
+
+
 #define TEST_D_D(insn, num, result, rs1)  \
 TEST_CASE(num, DREG_CALC_RESULT, result,  \
 LI(DREG_RS1, rs1);\
@@ -143,6 +164,8 @@ test_ ## num:   
   \
 insn EREG_CALC_RESULT, imm1, DREG_RS1, imm2);  \
 )
 
+
+
 /* Pass/Fail handling part */
 #define TEST_PASSFAIL   \
 j pass; \
diff --git a/tests/tcg/tricore/test_ld_bu.S b/tests/tcg/tricore/test_ld_bu.S
new file mode 100644
index 00..ff9dac128b
--- /dev/null
+++ b/tests/tcg/tricore/test_ld_bu.S
@@ -0,0 +1,15 @@
+#include "macros.h"
+.data
+test_data:
+.word 0xaffedead
+.word 0x001122ff
+.text
+.global _start
+_start:
+#expect. addr reg val after load
+#   insn  num  expect. load value |  pattern for loading
+# || ||  |
+TEST_LD(ld.bu, 1, 0xff, MEM_BASE_ADDR + 4, [+AREG_ADDR]4) # pre_inc
+TEST_LD(ld.bu, 2, 0xad, MEM_BASE_ADDR + 4, [AREG_ADDR+]4) # post_inc
+
+TEST_PASSFAIL
-- 
2.39.1

Re: [PATCH 2/2] tests/avocado: Allow passing command line parameters via Makefile

2023-02-02 Thread Daniel Henrique Barboza





On 1/20/23 19:15, Fabiano Rosas wrote:

Add support for the 'avocado run' "-p" option, which allows us to pass
parameters in the form key=value to be applied to all tests selected
for a given run. This is useful to force generic tests to use a
specific machine, cpu or qemu-binary where otherwise the defaults
would be used.

E.g.:
  $ make check-avocado AVOCADO_PARAMS="machine=virt arch=riscv64"

  

Signed-off-by: Fabiano Rosas 
---


Reviewed-by: Daniel Henrique Barboza 


  tests/Makefile.include | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 9422ddaece..f92e730aa0 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -107,6 +107,10 @@ else
AVOCADO_CMDLINE_TAGS=$(addprefix -t , $(AVOCADO_TAGS))
  endif
  
+ifdef AVOCADO_PARAMS

+   AVOCADO_CMDLINE_PARAMS=$(addprefix -p , $(AVOCADO_PARAMS))
+endif
+
  quiet-venv-pip = $(quiet-@)$(call quiet-command-run, \
  $(TESTS_PYTHON) -m pip -q --disable-pip-version-check $1, \
  "VENVPIP","$1")
@@ -144,7 +148,7 @@ check-avocado: check-venv $(TESTS_RESULTS_DIR) get-vm-images
  --show=$(AVOCADO_SHOW) run --job-results-dir=$(TESTS_RESULTS_DIR) 
\
  $(if $(AVOCADO_TAGS),, --filter-by-tags-include-empty \
--filter-by-tags-include-empty-key) \
-$(AVOCADO_CMDLINE_TAGS) \
+$(AVOCADO_CMDLINE_TAGS) $(AVOCADO_CMDLINE_PARAMS) \
  $(if $(GITLAB_CI),,--failfast) $(AVOCADO_TESTS), \
  "AVOCADO", "tests/avocado")

[PATCH v2 02/10] tests/tcg/tricore: Add test for OPC2_32_RCRW_IMASK

Signed-off-by: Bastian Koppelmann 
---
 tests/tcg/tricore/Makefile.softmmu-target |  1 +
 tests/tcg/tricore/macros.h|  7 +++
 tests/tcg/tricore/test_imask.S| 10 ++
 3 files changed, 18 insertions(+)
 create mode 100644 tests/tcg/tricore/test_imask.S

diff --git a/tests/tcg/tricore/Makefile.softmmu-target 
b/tests/tcg/tricore/Makefile.softmmu-target
index 5007c60ce8..bc0cfae8d0 100644
--- a/tests/tcg/tricore/Makefile.softmmu-target
+++ b/tests/tcg/tricore/Makefile.softmmu-target
@@ -10,6 +10,7 @@ TESTS += test_dvstep.tst
 TESTS += test_fadd.tst
 TESTS += test_fmul.tst
 TESTS += test_ftoi.tst
+TESTS += test_imask.tst
 TESTS += test_madd.tst
 TESTS += test_msub.tst
 TESTS += test_muls.tst
diff --git a/tests/tcg/tricore/macros.h b/tests/tcg/tricore/macros.h
index 0d76fc403a..ceb7e9c0b7 100644
--- a/tests/tcg/tricore/macros.h
+++ b/tests/tcg/tricore/macros.h
@@ -111,6 +111,13 @@ test_ ## num:  
\
 insn EREG_CALC_RESULT, EREG_RS1, DREG_RS2;\
 )
 
+#define TEST_E_IDI(insn, num, res_hi, res_lo, imm1, rs1, imm2) \
+TEST_CASE_E(num, res_lo, res_hi,   \
+LI(DREG_RS1, rs1); \
+rstv;  \
+insn EREG_CALC_RESULT, imm1, DREG_RS1, imm2);  \
+)
+
 /* Pass/Fail handling part */
 #define TEST_PASSFAIL   \
 j pass; \
diff --git a/tests/tcg/tricore/test_imask.S b/tests/tcg/tricore/test_imask.S
new file mode 100644
index 00..356cf398b8
--- /dev/null
+++ b/tests/tcg/tricore/test_imask.S
@@ -0,0 +1,10 @@
+#include "macros.h"
+.text
+.global _start
+_start:
+#   res[31:0]
+#  insn  num  res[63:32]   |imm1  rs1   imm2
+#||   ||  || |
+TEST_E_IDI(imask, 1, 0x000f, 0x0005, 0x5, 0x10, 0x4)
+
+TEST_PASSFAIL
-- 
2.39.1

Lost partition tables on ide-hd + ahci drive

2023-02-02 Thread Fiona Ebner

Hi,
over the years we've got 1-2 dozen reports[0] about suddenly
missing/corrupted MBR/partition tables. The issue seems to be very rare
and there was no success in trying to reproduce it yet. I'm asking here
in the hope that somebody has seen something similar.

The only commonality seems to be the use of an ide-hd drive with ahci bus.

It does seem to happen with both Linux and Windows guests (one of the
reports even mentions FreeBSD) and backing storages for the VMs include
ZFS, RBD, LVM-Thin as well as file-based storages.

Relevant part of an example configuration:

>   -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
>   -drive 
> 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on'
>  \
>   -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \

The first reports are from before io_uring was used and there are also
reports with writeback cache mode and discard=on,detect-zeroes=unmap.

Some reports say that the issue occurred under high IO load.

Many reports suspect backups causing the issue. Our backup mechanism
uses backup_job_create() for each drive and runs the jobs sequentially.
It uses a custom block driver as the backup target which just forwards
the writes to the actual target which can be a file or our backup server.
(If you really want to see the details, apply the patches in [1] and see
pve-backup.c and block/backup-dump.c).

Of course, the backup job will read sector 0 of the source disk, but I
really can't see where a stray write would happen, why the issue would
trigger so rarely or why seemingly only ide-hd+ahci would be affected.

So again, just asking if somebody has seen something similar or has a
hunch of what the cause might be.

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
[1]: https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD

Re: [PATCH v6 1/2] io: Add support for MSG_PEEK for socket channel