Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.
On Sun, 14 Jun 2009 04:15:28 pm Herbert Xu wrote: > On Sat, Jun 13, 2009 at 10:00:37PM +0930, Rusty Russell wrote: > > But re your comment that the 67 drivers using TX_BUSY are doing it > > because of driver bugs, that's hard to believe. It either hardly ever > > happens (in which case just drop the packet), or it happens (in which > > case we should handle it correctly). > > Most of them just do this: > > start_xmit: > > if (unlikely(queue is full)) { > /* This should never happen. */ > return TX_BUSY; > } OK, so I did a rough audit, trying to figure out the "never happens" ones (N, could be kfree_skb(skb); return NETDEV_TX_OK) from the "will actually happen" ones (Y). One question: can netif_queue_stopped() ever be true inside start_xmit? Some drivers test that, like sun3lance.c. Some have multiple returns, and some are unclear, but my best guess on a quickish reading is below. Summary: we still have about 54 in-tree drivers which actually use NETDEV_TX_BUSY for normal paths. Can I fix it now? Thanks, Rusty. sungem.c: Y, N fs_enet: N mace.c: N sh_eth.c: Y de620.c: N acenic.c: N (but goes through stupid hoops to avoid it) e1000_main.c: N, Y korina.c: N (Buggy: frees skb and returns TX_BUSY.) sky2.c: N cassini.c: N ixgbe: N b44.c: N, Y, Y (last two buggy: OOM, does not stop q) macb.c: N niu.c: N smctr.c: N olympic.c: Y tms380tr.c: N 3c359.c: Y lanstreamer.c: Y lib8390.c: N xirc2ps_cs.c: Y smc91c92_cs.c: N fmvj18x_cs.c: N (and buggy: can't happen AFAICT, and return 0 above?) axnet_cs.c: N smsc9420.c: N (and buggy: doesn't stop q) mkiss.c: N (In xmit, can netif_running be false? Can netif_queue_stopped?) skge.c: N qlge: N, N, Y, N (buggy, OOM, does not stop q) chelsio: N s2io.c: Y, Y? macmace.c: N 3c505.c: Y defxx.c: Y myri10ge: N sbni.c: Y wanxl.c: N cycx_x25.c: N, N, Y? dlci.c: Y qla3xxx.c: N, N (buggy, OOM, does not stop q), Y, N, tlan.c: Y skfp.c: N cs89x0.c: N smc9194.c: N fec_mpc52xx.c: N mv643xx_eth.c: N (buggy, OOM, does not stop q) ll_temac_main.c: Y, Y netxen: Y tsi108_eth.c: N, N ni65.c: N sunhme.c: N atl1c.c: Y ps3_gelic_net.c: Y igbvf: N csgb3.c: N ks8695net.c: N, N (buggy, neither stops q, latter OOM) ether3.c: N at91_ether.c: N bnx2x_main.c: N, N dm9000.c: N jme.c: N 3c537.c: Y (plus, leak on skb_padto fail) arcnet.c: N? 3c59x.c: N au1000_eth.c: Y ixgb: N de600.c: N, N, N myri_sbus.c: Y bnx2.c: N atl1e: Y sonic.c: who cares, that won't even compile... (missing semicolon) sun3_82586.c: N 3c515.c: N ibm_newemac.c: Y donaubae.c:Y?, Y?, Y?, Y (but never stops q) sir_dev.c: Y au1k_ir.c: Y, Y cpmac.c: N (no stop q, and leak on skb_padto fail), Y davinci_emac.c: N (no stop q), Y de2104x.c: N uli526x.c: N dmfe.c: N xircom_cb.c: N iwmc3200wifi: Y orinoco: N, N, N, N (no stop q) atmel.c: Y p54common.c: N, Y? arlan-main.c: Y? libipw_tx.c: Y (no stop q), N (alloc failure) hostap_80211_tx.c: Y strip.c: N wavelan.c: N, N, N at76c50x-usb.c: N libertas/tx.c: Y ray_cs.c: N, N airo.c: Y, Y, Y plip.c: N, N, N (starts q, BUSY on too big pkt?) ns83820.c: N, N ehea: Y, Y (no stop q) rionet.c: N enic: N sis900.c: N starfire.c: Y r6040.c: N sun3lance.c: N, N sfc: Y, N, Y mac89x0.c: N sb1250-mac.c: Y pasemi_mac.c: Y 8139cp.c: N e1000e: N r8169.c: N? sis190.c: N e100.c: N tg3.c: N, Y?, N fec.c: N (no stop q), N hamachi.c: N forcedeth.c: Y, Y vxge: Y?, Y? ks8842.c: Y spider_net.c: Y igb: N ewrk3.c: N gianfar.c: Y sunvnet.c: N mlx4: Y atlx: Y, Y ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.
On Thu, Jun 18, 2009 at 04:47:50PM +0930, Rusty Russell wrote: > > One question: can netif_queue_stopped() ever be true inside start_xmit? Some > drivers test that, like sun3lance.c. The driver should never test that because even if it is true due to driver shutdown the xmit function should ignore it as the upper layer will wait for it anyway. > Summary: we still have about 54 in-tree drivers which actually use > NETDEV_TX_BUSY for normal paths. Can I fix it now? You can fix it but I don't quite understand your results below :) > sungem.c: Y, N This driver does the bug check in addition to a race check that should simply drop the packet instead of queueing. In fact chances are the race check is unnecessary anyway. > fs_enet: N This is either just a bug check or the driver is broken in that it should stop the queue when the said condition can be true. > mace.c: N Just a bug check. > sh_eth.c: Y This driver should check the queue after transmitting, just like virtio-net :) So from a totally non-representative sample of 4, my conclusion is that none of them need TX_BUSY. Do you have an example that really needs it? Anyway, I don't think we should reshape our APIs based on how broken the existing users are. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 02/13] qemu: capability bits in pci save/restore
Add support for capability bits in save/restore for pci. These will be used for MSI, where the capability might be present or not as requested by user, which does not map well into a single version number. Signed-off-by: Michael S. Tsirkin --- hw/pci.c | 17 ++--- hw/pci.h |4 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 840986f..d42fcd9 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -126,9 +126,13 @@ int pci_bus_num(PCIBus *s) void pci_device_save(PCIDevice *s, QEMUFile *f) { +int version = s->cap_present ? 3 : 2; int i; -qemu_put_be32(f, 2); /* PCI device version */ +/* PCI device version and capabilities */ +qemu_put_be32(f, version); +if (version >= 3) +qemu_put_be32(f, s->cap_present); qemu_put_buffer(f, s->config, 256); for (i = 0; i < 4; i++) qemu_put_be32(f, s->irq_state[i]); @@ -140,15 +144,22 @@ int pci_device_load(PCIDevice *s, QEMUFile *f) int i; version_id = qemu_get_be32(f); -if (version_id > 2) +if (version_id > 3) return -EINVAL; +if (version_id >= 3) +s->cap_present = qemu_get_be32(f); +else +s->cap_present = 0; + +if (s->cap_present & ~s->cap_supported) +return -EINVAL; + qemu_get_buffer(f, s->config, 256); pci_update_mappings(s); if (version_id >= 2) for (i = 0; i < 4; i ++) s->irq_state[i] = qemu_get_be32(f); - return 0; } diff --git a/hw/pci.h b/hw/pci.h index 44aa61b..1226846 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -174,6 +174,10 @@ struct PCIDevice { /* Current IRQ levels. Used internally by the generic PCI code. */ int irq_state[4]; + +/* Capability bits for save/load */ +uint32_t cap_supported; +uint32_t cap_present; }; PCIDevice *pci_register_device(PCIBus *bus, const char *name, -- 1.6.2.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 03/13] qemu: add routines to manage PCI capabilities
Add routines to manage PCI capability list. First user will be MSI-X. Signed-off-by: Michael S. Tsirkin --- hw/pci.c | 79 ++ hw/pci.h | 18 +- 2 files changed, 96 insertions(+), 1 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index d42fcd9..e3f80d0 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -160,6 +160,12 @@ int pci_device_load(PCIDevice *s, QEMUFile *f) if (version_id >= 2) for (i = 0; i < 4; i ++) s->irq_state[i] = qemu_get_be32(f); +/* Clear wmask and used bits for capabilities. + Must be restored separately, since capabilities can + be placed anywhere in config space. */ +memset(s->used, 0, PCI_CONFIG_SPACE_SIZE); +for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i) +s->wmask[i] = 0xff; return 0; } @@ -867,3 +873,76 @@ PCIDevice *pci_create_simple(PCIBus *bus, int devfn, const char *name) return (PCIDevice *)dev; } + +static int pci_find_space(PCIDevice *pdev, uint8_t size) +{ +int offset = PCI_CONFIG_HEADER_SIZE; +int i; +for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i) +if (pdev->used[i]) +offset = i + 1; +else if (i - offset + 1 == size) +return offset; +return 0; +} + +static uint8_t pci_find_capability_list(PCIDevice *pdev, uint8_t cap_id, +uint8_t *prev_p) +{ +uint8_t next, prev; + +if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST)) +return 0; + +for (prev = PCI_CAPABILITY_LIST; (next = pdev->config[prev]); + prev = next + PCI_CAP_LIST_NEXT) +if (pdev->config[next + PCI_CAP_LIST_ID] == cap_id) +break; + +if (prev_p) +*prev_p = prev; +return next; +} + +/* Reserve space and add capability to the linked list in pci config space */ +int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) +{ +uint8_t offset = pci_find_space(pdev, size); +uint8_t *config = pdev->config + offset; +if (!offset) +return -ENOSPC; +config[PCI_CAP_LIST_ID] = cap_id; +config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; +pdev->config[PCI_CAPABILITY_LIST] = offset; +pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; +memset(pdev->used + offset, 0xFF, size); +/* Make capability read-only by default */ +memset(pdev->wmask + offset, 0, size); +return offset; +} + +/* Unlink capability from the pci config space. */ +void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) +{ +uint8_t prev, offset = pci_find_capability_list(pdev, cap_id, &prev); +if (!offset) +return; +pdev->config[prev] = pdev->config[offset + PCI_CAP_LIST_NEXT]; +/* Make capability writeable again */ +memset(pdev->wmask + offset, 0xff, size); +memset(pdev->used + offset, 0, size); + +if (!pdev->config[PCI_CAPABILITY_LIST]) +pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST; +} + +/* Reserve space for capability at a known offset (to call after load). */ +void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size) +{ +memset(pdev->used + offset, 0xff, size); +} + +uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id) +{ +return pci_find_capability_list(pdev, cap_id, NULL); +} diff --git a/hw/pci.h b/hw/pci.h index 1226846..4f90fdc 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -121,6 +121,10 @@ typedef struct PCIIORegion { #define PCI_MIN_GNT0x3e/* 8 bits */ #define PCI_MAX_LAT0x3f/* 8 bits */ +/* Capability lists */ +#define PCI_CAP_LIST_ID0 /* Capability ID */ +#define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ + #define PCI_REVISION0x08/* obsolete, use PCI_REVISION_ID */ #define PCI_SUBVENDOR_ID0x2c/* obsolete, use PCI_SUBSYSTEM_VENDOR_ID */ #define PCI_SUBDEVICE_ID0x2e/* obsolete, use PCI_SUBSYSTEM_ID */ @@ -128,7 +132,7 @@ typedef struct PCIIORegion { /* Bits in the PCI Status Register (PCI 2.3 spec) */ #define PCI_STATUS_RESERVED1 0x007 #define PCI_STATUS_INT_STATUS 0x008 -#define PCI_STATUS_CAPABILITIES0x010 +#define PCI_STATUS_CAP_LIST0x010 #define PCI_STATUS_66MHZ 0x020 #define PCI_STATUS_RESERVED2 0x040 #define PCI_STATUS_FAST_BACK 0x080 @@ -158,6 +162,9 @@ struct PCIDevice { /* Used to implement R/W bytes */ uint8_t wmask[PCI_CONFIG_SPACE_SIZE]; +/* Used to allocate config space for capabilities. */ +uint8_t used[PCI_CONFIG_SPACE_SIZE]; + /* the following fields are read only */ PCIBus *bus; int devfn; @@ -190,6 +197,15 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, uint32_t size, int type, PCIMapIORegionFunc *map_func); +int pci_add_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size); + +
[PATCHv5 01/13] qemu: make default_write_config use mask table
Change much of hw/pci to use symbolic constants and a table-driven design: add a mask table with writable bits set and readonly bits unset. Detect change by comparing original and new registers. This makes it easy to support capabilities where read-only/writeable bit layout differs between devices, depending on capabilities present. As a result, writing a single byte in BAR registers now works as it should. Writing to upper limit registers in the bridge also works as it should. Code is also shorter. Signed-off-by: Michael S. Tsirkin --- hw/pci.c | 145 - hw/pci.h | 18 +++- 2 files changed, 46 insertions(+), 117 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 0a738db..840986f 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -238,6 +238,17 @@ int pci_assign_devaddr(const char *addr, int *domp, int *busp, unsigned *slotp) return pci_parse_devaddr(devaddr, domp, busp, slotp); } +static void pci_init_mask(PCIDevice *dev) +{ +int i; +dev->wmask[PCI_CACHE_LINE_SIZE] = 0xff; +dev->wmask[PCI_INTERRUPT_LINE] = 0xff; +dev->wmask[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY + | PCI_COMMAND_MASTER; +for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i) +dev->wmask[i] = 0xff; +} + /* -1 for devfn means auto assign */ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, const char *name, int devfn, @@ -257,6 +268,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, pstrcpy(pci_dev->name, sizeof(pci_dev->name), name); memset(pci_dev->irq_state, 0, sizeof(pci_dev->irq_state)); pci_set_default_subsystem_id(pci_dev); +pci_init_mask(pci_dev); if (!config_read) config_read = pci_default_read_config; @@ -328,6 +340,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, { PCIIORegion *r; uint32_t addr; +uint32_t wmask; if ((unsigned int)region_num >= PCI_NUM_REGIONS) return; @@ -343,12 +356,17 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, r->size = size; r->type = type; r->map_func = map_func; + +wmask = ~(size - 1); if (region_num == PCI_ROM_SLOT) { addr = 0x30; +/* ROM enable bit is writeable */ +wmask |= 1; } else { addr = 0x10 + region_num * 4; } *(uint32_t *)(pci_dev->config + addr) = cpu_to_le32(type); +*(uint32_t *)(pci_dev->wmask + addr) = cpu_to_le32(wmask); } static void pci_update_mappings(PCIDevice *d) @@ -457,118 +475,21 @@ uint32_t pci_default_read_config(PCIDevice *d, return val; } -void pci_default_write_config(PCIDevice *d, - uint32_t address, uint32_t val, int len) +void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { -int can_write, i; -uint32_t end, addr; - -if (len == 4 && ((address >= 0x10 && address < 0x10 + 4 * 6) || - (address >= 0x30 && address < 0x34))) { -PCIIORegion *r; -int reg; +uint8_t orig[PCI_CONFIG_SPACE_SIZE]; +int i; -if ( address >= 0x30 ) { -reg = PCI_ROM_SLOT; -}else{ -reg = (address - 0x10) >> 2; -} -r = &d->io_regions[reg]; -if (r->size == 0) -goto default_config; -/* compute the stored value */ -if (reg == PCI_ROM_SLOT) { -/* keep ROM enable bit */ -val &= (~(r->size - 1)) | 1; -} else { -val &= ~(r->size - 1); -val |= r->type; -} -*(uint32_t *)(d->config + address) = cpu_to_le32(val); -pci_update_mappings(d); -return; -} - default_config: /* not efficient, but simple */ -addr = address; -for(i = 0; i < len; i++) { -/* default read/write accesses */ -switch(d->config[0x0e]) { -case 0x00: -case 0x80: -switch(addr) { -case 0x00: -case 0x01: -case 0x02: -case 0x03: -case 0x06: -case 0x07: -case 0x08: -case 0x09: -case 0x0a: -case 0x0b: -case 0x0e: -case 0x10 ... 0x27: /* base */ -case 0x2c ... 0x2f: /* read-only subsystem ID & vendor ID */ -case 0x30 ... 0x33: /* rom */ -case 0x3d: -can_write = 0; -break; -default: -can_write = 1; -break; -} -break; -default: -case 0x01: -switch(addr) { -case 0x00: -case 0x01: -case 0x02: -case 0x03: -case 0x06: -case 0x07: -case 0x08: -case 0x09: -case 0x0a: -case 0x0b:
[PREFIXv5 00/13] qemu: MSI-X support
Here is the port of MSI-X support patches to upstream qemu. Please comment or commit. This patchset adds generic support for MSI-X, adds implementation in APIC, and uses MSI-X in virtio-net. Changelog: - since v4 rebased to latest bits. rename resize_region -> resize_bar - since v3 call to resize_region on load split patches a bit differently to address style comments by Glauber update commit message to clarify what msix_support flag does - since v2 rename mask -> wmask to avoid conflict with work by Yamahata - since v1 At Paul's suggestion, use stl_phy to decouple APIC and MSI-X implementation This uses the mask table patch that I posted previously, and which is included in the series. -- MST Michael S. Tsirkin (13): qemu: make default_write_config use mask table qemu: capability bits in pci save/restore qemu: add routines to manage PCI capabilities qemu: helper routines for pci access qemu: MSI-X support functions qemu: add flag to disable MSI-X by default qemu: minimal MSI/MSI-X implementation for PC qemu: add support for resizing regions qemu: virtio support for many interrupt vectors qemu: MSI-X support in virtio PCI qemu: request 3 vectors in virtio-net qemu: virtio save/load bindings qemu: add pci_get/set_byte Makefile.target|2 +- hw/apic.c | 43 +- hw/msix.c | 420 hw/msix.h | 35 + hw/pci.c | 294 +++- hw/pci.h | 104 - hw/syborg_virtio.c | 13 ++- hw/virtio-net.c|1 + hw/virtio-pci.c| 216 ++- hw/virtio.c| 70 ++--- hw/virtio.h| 14 ++- qemu-options.hx|2 + rules.mak |2 +- vl.c |3 + 14 files changed, 1004 insertions(+), 215 deletions(-) create mode 100644 hw/msix.c create mode 100644 hw/msix.h ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 05/13] qemu: MSI-X support functions
Add functions implementing MSI-X support. First user will be virtio-pci. Note that platform must set a flag to declare MSI supported: this is a safety measure to avoid breaking platforms which should support MSI-X but currently lack this in the interrupt controller emulation. For PC this will be set by APIC. Signed-off-by: Michael S. Tsirkin --- Makefile.target |2 +- hw/msix.c | 417 +++ hw/msix.h | 35 + hw/pci.h| 20 +++ 4 files changed, 473 insertions(+), 1 deletions(-) create mode 100644 hw/msix.c create mode 100644 hw/msix.h diff --git a/Makefile.target b/Makefile.target index 0159bf7..145b3a9 100644 --- a/Makefile.target +++ b/Makefile.target @@ -494,7 +494,7 @@ endif #CONFIG_BSD_USER ifndef CONFIG_USER_ONLY OBJS=vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \ - gdbstub.o gdbstub-xml.o + gdbstub.o gdbstub-xml.o msix.o # virtio has to be here due to weird dependency between PCI and virtio-net. # need to fix this properly OBJS+=virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o diff --git a/hw/msix.c b/hw/msix.c new file mode 100644 index 000..a448eab --- /dev/null +++ b/hw/msix.c @@ -0,0 +1,417 @@ +/* + * MSI-X device support + * + * This module includes support for MSI-X in pci devices. + * + * Author: Michael S. Tsirkin + * + * Copyright (c) 2009, Red Hat Inc, Michael S. Tsirkin (m...@redhat.com) + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include "hw.h" +#include "msix.h" +#include "pci.h" + +/* Declaration from linux/pci_regs.h */ +#define PCI_CAP_ID_MSIX 0x11 /* MSI-X */ +#define PCI_MSIX_FLAGS 2 /* Table at lower 11 bits */ +#define PCI_MSIX_FLAGS_QSIZE 0x7FF +#define PCI_MSIX_FLAGS_ENABLE (1 << 15) +#define PCI_MSIX_FLAGS_BIRMASK(7 << 0) + +/* MSI-X capability structure */ +#define MSIX_TABLE_OFFSET 4 +#define MSIX_PBA_OFFSET 8 +#define MSIX_CAP_LENGTH 12 + +/* MSI enable bit is in byte 1 in FLAGS register */ +#define MSIX_ENABLE_OFFSET (PCI_MSIX_FLAGS + 1) +#define MSIX_ENABLE_MASK (PCI_MSIX_FLAGS_ENABLE >> 8) + +/* MSI-X table format */ +#define MSIX_MSG_ADDR 0 +#define MSIX_MSG_UPPER_ADDR 4 +#define MSIX_MSG_DATA 8 +#define MSIX_VECTOR_CTRL 12 +#define MSIX_ENTRY_SIZE 16 +#define MSIX_VECTOR_MASK 0x1 + +/* How much space does an MSIX table need. */ +/* The spec requires giving the table structure + * a 4K aligned region all by itself. Align it to + * target pages so that drivers can do passthrough + * on the rest of the region. */ +#define MSIX_PAGE_SIZE TARGET_PAGE_ALIGN(0x1000) +/* Reserve second half of the page for pending bits */ +#define MSIX_PAGE_PENDING (MSIX_PAGE_SIZE / 2) +#define MSIX_MAX_ENTRIES 32 + + +#ifdef MSIX_DEBUG +#define DEBUG(fmt, ...) \ +do { \ + fprintf(stderr, "%s: " fmt, __func__ , __VA_ARGS__);\ +} while (0) +#else +#define DEBUG(fmt, ...) do { } while(0) +#endif + +/* Flag to globally disable MSI-X support */ +int msix_disable; + +/* Flag for interrupt controller to declare MSI-X support */ +int msix_supported; + +/* Add MSI-X capability to the config space for the device. */ +/* Given a bar and its size, add MSI-X table on top of it + * and fill MSI-X capability in the config space. + * Original bar size must be a power of 2 or 0. + * New bar size is returned. */ +static int msix_add_config(struct PCIDevice *pdev, unsigned short nentries, + unsigned bar_nr, unsigned bar_size) +{ +int config_offset; +uint8_t *config; +uint32_t new_size; + +if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1) +return -EINVAL; +if (bar_size > 0x8000) +return -ENOSPC; + +/* Add space for MSI-X structures */ +if (!bar_size) +new_size = MSIX_PAGE_SIZE; +else if (bar_size < MSIX_PAGE_SIZE) { +bar_size = MSIX_PAGE_SIZE; +new_size = MSIX_PAGE_SIZE * 2; +} else +new_size = bar_size * 2; + +pdev->msix_bar_size = new_size; +config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, MSIX_CAP_LENGTH); +if (config_offset < 0) +return config_offset; +config = pdev->config + config_offset; + +pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1); +/* Table on top of BAR */ +pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr); +/* Pending bits on top of that */ +pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + MSIX_PAGE_PENDING) | + bar_nr); +pdev->msix_cap = config_offset; +/* Make flags bit writeable. */ +pdev->wmask[config_offset + MSIX_ENABLE_OFFSET] |= MSIX_ENABLE_MASK; +return 0; +} + +static void msix_free_irq_entries(PCIDevice *dev) +{ +int vector; + +for (vector = 0; vector < dev->msix_entries_nr; ++vector) +dev->msix_entry_used[
[PATCHv5 06/13] qemu: add flag to disable MSI-X by default
Add global flag to disable MSI-X by default. This is useful primarily to make images loadable by older qemu (without msix). Even when MSI-X is disabled by flag, you can still load images that have MSI-X enabled. Signed-off-by: Michael S. Tsirkin --- hw/msix.c |5 - qemu-options.hx |2 ++ vl.c|3 +++ 3 files changed, 9 insertions(+), 1 deletions(-) diff --git a/hw/msix.c b/hw/msix.c index a448eab..cc62a83 100644 --- a/hw/msix.c +++ b/hw/msix.c @@ -122,6 +122,9 @@ void msix_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, int len) { unsigned enable_pos = dev->msix_cap + MSIX_ENABLE_OFFSET; +if (!(dev->cap_present & QEMU_PCI_CAP_MSIX)) +return; + if (addr + len <= enable_pos || addr > enable_pos) return; @@ -242,7 +245,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries, dev->msix_table_page = qemu_mallocz(MSIX_PAGE_SIZE); -dev->msix_mmio_index = cpu_register_io_memory(0, msix_mmio_read, +dev->msix_mmio_index = cpu_register_io_memory(msix_mmio_read, msix_mmio_write, dev); if (dev->msix_mmio_index == -1) { ret = -EBUSY; diff --git a/qemu-options.hx b/qemu-options.hx index 9d5e05a..aa02615 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -1584,3 +1584,5 @@ DEF("semihosting", 0, QEMU_OPTION_semihosting, DEF("old-param", 0, QEMU_OPTION_old_param, "-old-param old param mode\n") #endif +DEF("disable-msix", 0, QEMU_OPTION_disable_msix, +"-disable-msix disable msix support for PCI devices (enabled by default)\n") diff --git a/vl.c b/vl.c index 3242c23..f7c67bb 100644 --- a/vl.c +++ b/vl.c @@ -135,6 +135,7 @@ int main(int argc, char **argv) #include "hw/usb.h" #include "hw/pcmcia.h" #include "hw/pc.h" +#include "hw/msix.h" #include "hw/audiodev.h" #include "hw/isa.h" #include "hw/baum.h" @@ -5699,6 +5700,8 @@ int main(int argc, char **argv, char **envp) xen_mode = XEN_ATTACH; break; #endif +case QEMU_OPTION_disable_msix: +msix_disable = 1; } } } -- 1.6.2.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 04/13] qemu: helper routines for pci access
Add inline routines for convenient access to pci devices with correct (little) endianness. Will be used by MSI-X support. Signed-off-by: Michael S. Tsirkin --- hw/pci.h | 30 +++--- 1 files changed, 27 insertions(+), 3 deletions(-) diff --git a/hw/pci.h b/hw/pci.h index 4f90fdc..b0a8acf 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -236,21 +236,45 @@ PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint16_t vid, uint16_t did, pci_map_irq_fn map_irq, const char *name); static inline void +pci_set_word(uint8_t *config, uint16_t val) +{ +cpu_to_le16wu((uint16_t *)config, val); +} + +static inline uint16_t +pci_get_word(uint8_t *config) +{ +return le16_to_cpupu((uint16_t *)config); +} + +static inline void +pci_set_long(uint8_t *config, uint32_t val) +{ +cpu_to_le32wu((uint32_t *)config, val); +} + +static inline uint32_t +pci_get_long(uint8_t *config) +{ +return le32_to_cpupu((uint32_t *)config); +} + +static inline void pci_config_set_vendor_id(uint8_t *pci_config, uint16_t val) { -cpu_to_le16wu((uint16_t *)&pci_config[PCI_VENDOR_ID], val); +pci_set_word(&pci_config[PCI_VENDOR_ID], val); } static inline void pci_config_set_device_id(uint8_t *pci_config, uint16_t val) { -cpu_to_le16wu((uint16_t *)&pci_config[PCI_DEVICE_ID], val); +pci_set_word(&pci_config[PCI_DEVICE_ID], val); } static inline void pci_config_set_class(uint8_t *pci_config, uint16_t val) { -cpu_to_le16wu((uint16_t *)&pci_config[PCI_CLASS_DEVICE], val); +pci_set_word(&pci_config[PCI_CLASS_DEVICE], val); } typedef void (*pci_qdev_initfn)(PCIDevice *dev); -- 1.6.2.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 07/13] qemu: minimal MSI/MSI-X implementation for PC
Implement MSI support in APIC. Note that MSI and MMIO APIC registers are at the same memory location, but actually not on the global bus: MSI is on PCI bus, APIC is connected directly to the CPU. We map them on the global bus at the same address which happens to work because MSI registers are reserved in APIC MMIO and vice versa. Signed-off-by: Michael S. Tsirkin --- hw/apic.c | 43 +++ 1 files changed, 39 insertions(+), 4 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index 3e04132..3bcab46 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -19,6 +19,8 @@ */ #include "hw.h" #include "pc.h" +#include "pci.h" +#include "msix.h" #include "qemu-timer.h" #include "host-utils.h" @@ -63,6 +65,19 @@ #define MAX_APICS 255 #define MAX_APIC_WORDS 8 +/* Intel APIC constants: from include/asm/msidef.h */ +#define MSI_DATA_VECTOR_SHIFT 0 +#define MSI_DATA_VECTOR_MASK 0x00ff +#define MSI_DATA_DELIVERY_MODE_SHIFT 8 +#define MSI_DATA_TRIGGER_SHIFT 15 +#define MSI_DATA_LEVEL_SHIFT 14 +#define MSI_ADDR_DEST_MODE_SHIFT 2 +#define MSI_ADDR_DEST_ID_SHIFT 12 +#defineMSI_ADDR_DEST_ID_MASK 0x000 + +#define MSI_ADDR_BASE 0xfee0 +#define MSI_ADDR_SIZE 0x10 + typedef struct APICState { CPUState *cpu_env; uint32_t apicbase; @@ -731,11 +746,31 @@ static uint32_t apic_mem_readl(void *opaque, target_phys_addr_t addr) return val; } +static void apic_send_msi(target_phys_addr_t addr, uint32 data) +{ +uint8_t dest = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT; +uint8_t vector = (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT; +uint8_t dest_mode = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1; +uint8_t trigger_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1; +uint8_t delivery = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7; +/* XXX: Ignore redirection hint. */ +apic_deliver_irq(dest, dest_mode, delivery, vector, 0, trigger_mode); +} + static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val) { CPUState *env; APICState *s; -int index; +int index = (addr >> 4) & 0xff; +if (addr > 0xfff || !index) { +/* MSI and MMIO APIC are at the same memory location, + * but actually not on the global bus: MSI is on PCI bus + * APIC is connected directly to the CPU. + * Mapping them on the global bus happens to work because + * MSI registers are reserved in APIC MMIO and vice versa. */ +apic_send_msi(addr, val); +return; +} env = cpu_single_env; if (!env) @@ -746,7 +781,6 @@ static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val) printf("APIC write: %08x = %08x\n", (uint32_t)addr, val); #endif -index = (addr >> 4) & 0xff; switch(index) { case 0x02: s->id = (val >> 24); @@ -931,6 +965,7 @@ int apic_init(CPUState *env) s->cpu_env = env; apic_reset(s); +msix_supported = 1; /* XXX: mapping more APICs at the same memory location */ if (apic_io_memory == 0) { @@ -938,7 +973,8 @@ int apic_init(CPUState *env) on the global memory bus. */ apic_io_memory = cpu_register_io_memory(apic_mem_read, apic_mem_write, NULL); -cpu_register_physical_memory(s->apicbase & ~0xfff, 0x1000, +/* XXX: what if the base changes? */ +cpu_register_physical_memory(MSI_ADDR_BASE, MSI_ADDR_SIZE, apic_io_memory); } s->timer = qemu_new_timer(vm_clock, apic_timer, s); @@ -949,4 +985,3 @@ int apic_init(CPUState *env) local_apics[s->idx] = s; return 0; } - -- 1.6.2.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 12/13] qemu: virtio save/load bindings
Implement bindings for virtio save/load. Use them in virtio pci. Signed-off-by: Michael S. Tsirkin --- hw/virtio-pci.c | 50 +- hw/virtio.c | 33 - hw/virtio.h |4 3 files changed, 69 insertions(+), 18 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index d3f4884..1057ae1 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -105,6 +105,50 @@ static void virtio_pci_notify(void *opaque, uint16_t vector) qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1); } +static void virtio_pci_save_config(void * opaque, QEMUFile *f) +{ +VirtIOPCIProxy *proxy = opaque; +pci_device_save(&proxy->pci_dev, f); +msix_save(&proxy->pci_dev, f); +if (msix_present(&proxy->pci_dev)) +qemu_put_be16(f, proxy->vdev->config_vector); +} + +static void virtio_pci_save_queue(void * opaque, int n, QEMUFile *f) +{ +VirtIOPCIProxy *proxy = opaque; +if (msix_present(&proxy->pci_dev)) +qemu_put_be16(f, virtio_queue_vector(proxy->vdev, n)); +} + +static int virtio_pci_load_config(void * opaque, QEMUFile *f) +{ +VirtIOPCIProxy *proxy = opaque; +int ret; +ret = pci_device_load(&proxy->pci_dev, f); +if (ret) +return ret; +ret = msix_load(&proxy->pci_dev, f); +if (ret) +return ret; +if (msix_present(&proxy->pci_dev)) +qemu_get_be16s(f, &proxy->vdev->config_vector); + +pci_resize_bar(&proxy->pci_dev, 1, msix_bar_size(&proxy->pci_dev)); +return 0; +} + +static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f) +{ +VirtIOPCIProxy *proxy = opaque; +uint16_t vector; +if (!msix_present(&proxy->pci_dev)) +return 0; +qemu_get_be16s(f, &vector); +virtio_queue_set_vector(proxy->vdev, n, vector); +return 0; +} + static void virtio_pci_reset(void *opaque) { VirtIOPCIProxy *proxy = opaque; @@ -317,7 +361,11 @@ static void virtio_write_config(PCIDevice *pci_dev, uint32_t address, } static const VirtIOBindings virtio_pci_bindings = { -.notify = virtio_pci_notify +.notify = virtio_pci_notify, +.save_config = virtio_pci_save_config, +.load_config = virtio_pci_load_config, +.save_queue = virtio_pci_save_queue, +.load_queue = virtio_pci_load_queue, }; static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev, diff --git a/hw/virtio.c b/hw/virtio.c index fe9f793..b773dff 100644 --- a/hw/virtio.c +++ b/hw/virtio.c @@ -568,9 +568,8 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f) { int i; -/* FIXME: load/save binding. */ -//pci_device_save(&vdev->pci_dev, f); -//msix_save(&vdev->pci_dev, f); +if (vdev->binding->save_config) +vdev->binding->save_config(vdev->binding_opaque, f); qemu_put_8s(f, &vdev->status); qemu_put_8s(f, &vdev->isr); @@ -596,18 +595,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f) qemu_put_be32(f, vdev->vq[i].vring.num); qemu_put_be64(f, vdev->vq[i].pa); qemu_put_be16s(f, &vdev->vq[i].last_avail_idx); -if (vdev->nvectors) -qemu_put_be16s(f, &vdev->vq[i].vector); +if (vdev->binding->save_queue) +vdev->binding->save_queue(vdev->binding_opaque, i, f); } } -void virtio_load(VirtIODevice *vdev, QEMUFile *f) +int virtio_load(VirtIODevice *vdev, QEMUFile *f) { -int num, i; +int num, i, ret; -/* FIXME: load/save binding. */ -//pci_device_load(&vdev->pci_dev, f); -//r = msix_load(&vdev->pci_dev, f); +if (vdev->binding->load_config) { +ret = vdev->binding->load_config(vdev->binding_opaque, f); +if (ret) +return ret; +} qemu_get_8s(f, &vdev->status); qemu_get_8s(f, &vdev->isr); @@ -616,10 +617,6 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f) vdev->config_len = qemu_get_be32(f); qemu_get_buffer(f, vdev->config, vdev->config_len); -if (vdev->nvectors) { -qemu_get_be16s(f, &vdev->config_vector); -//msix_vector_use(&vdev->pci_dev, vdev->config_vector); -} num = qemu_get_be32(f); for (i = 0; i < num; i++) { @@ -630,13 +627,15 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f) if (vdev->vq[i].pa) { virtqueue_init(&vdev->vq[i]); } -if (vdev->nvectors) { -qemu_get_be16s(f, &vdev->vq[i].vector); -//msix_vector_use(&vdev->pci_dev, vdev->config_vector); +if (vdev->binding->load_queue) { +ret = vdev->binding->load_queue(vdev->binding_opaque, i, f); +if (ret) +return ret; } } virtio_notify_vector(vdev, VIRTIO_NO_VECTOR); +return 0; } void virtio_cleanup(VirtIODevice *vdev) diff --git a/hw/virtio.h b/hw/virtio.h index 04a3c3d..ce05517 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -72,6 +72,10 @@ typedef struct VirtQueueElement typedef struct { void (*no
[PATCHv5 08/13] qemu: add support for resizing regions
Make it possible to resize PCI regions. This will be used by virtio with MSI-X, where the region size depends on whether MSI-X is enabled, and can change across load/save. Signed-off-by: Michael S. Tsirkin --- hw/pci.c | 53 +++-- hw/pci.h |2 ++ 2 files changed, 37 insertions(+), 18 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index e3f80d0..aa88a0b 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -386,6 +386,40 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, *(uint32_t *)(pci_dev->wmask + addr) = cpu_to_le32(wmask); } +static void pci_unmap_region(PCIDevice *d, PCIIORegion *r) +{ +if (r->addr == -1) +return; +if (r->type & PCI_ADDRESS_SPACE_IO) { +int class; +/* NOTE: specific hack for IDE in PC case: + only one byte must be mapped. */ +class = pci_get_word(d->config + PCI_CLASS_DEVICE); +if (class == 0x0101 && r->size == 4) { +isa_unassign_ioport(r->addr + 2, 1); +} else { +isa_unassign_ioport(r->addr, r->size); +} +} else { +cpu_register_physical_memory(pci_to_cpu_addr(r->addr), + r->size, + IO_MEM_UNASSIGNED); +qemu_unregister_coalesced_mmio(r->addr, r->size); +} +} + +void pci_resize_bar(PCIDevice *pci_dev, int region_num, uint32_t size) +{ + +PCIIORegion *r = &pci_dev->io_regions[region_num]; +if (r->size == size) +return; +r->size = size; +pci_unmap_region(pci_dev, r); +r->addr = -1; +pci_update_mappings(pci_dev); +} + static void pci_update_mappings(PCIDevice *d) { PCIIORegion *r; @@ -439,24 +473,7 @@ static void pci_update_mappings(PCIDevice *d) } /* now do the real mapping */ if (new_addr != r->addr) { -if (r->addr != -1) { -if (r->type & PCI_ADDRESS_SPACE_IO) { -int class; -/* NOTE: specific hack for IDE in PC case: - only one byte must be mapped. */ -class = d->config[0x0a] | (d->config[0x0b] << 8); -if (class == 0x0101 && r->size == 4) { -isa_unassign_ioport(r->addr + 2, 1); -} else { -isa_unassign_ioport(r->addr, r->size); -} -} else { -cpu_register_physical_memory(pci_to_cpu_addr(r->addr), - r->size, - IO_MEM_UNASSIGNED); -qemu_unregister_coalesced_mmio(r->addr, r->size); -} -} +pci_unmap_region(d, r); r->addr = new_addr; if (r->addr != -1) { r->map_func(d, i, r->addr, r->size, r->type); diff --git a/hw/pci.h b/hw/pci.h index 25865d7..67e4b4d 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -217,6 +217,8 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, uint32_t size, int type, PCIMapIORegionFunc *map_func); +void pci_resize_bar(PCIDevice *pci_dev, int region_num, uint32_t size); + int pci_add_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size); void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size); -- 1.6.2.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 13/13] qemu: add pci_get/set_byte
Add pci_get/set_byte to keep *_word and *_long access functions company. They are unused for now. Signed-off-by: Michael S. Tsirkin --- hw/pci.h | 12 1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/hw/pci.h b/hw/pci.h index 67e4b4d..303179d 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -258,6 +258,18 @@ PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint16_t vid, uint16_t did, pci_map_irq_fn map_irq, const char *name); static inline void +pci_set_byte(uint8_t *config, uint8_t val) +{ +*config = val; +} + +static inline uint8_t +pci_get_byte(uint8_t *config) +{ +return *config; +} + +static inline void pci_set_word(uint8_t *config, uint16_t val) { cpu_to_le16wu((uint16_t *)config, val); -- 1.6.2.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 11/13] qemu: request 3 vectors in virtio-net
Request up to 3 vectors in virtio-net. Actual bindings might supply less. Signed-off-by: Michael S. Tsirkin --- hw/virtio-net.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/hw/virtio-net.c b/hw/virtio-net.c index d584287..c8dabbe 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -709,6 +709,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev) n->mac_table.macs = qemu_mallocz(MAC_TABLE_ENTRIES * ETH_ALEN); n->vlans = qemu_mallocz(MAX_VLAN >> 3); +n->vdev.nvectors = 3; register_savevm("virtio-net", virtio_net_id++, VIRTIO_NET_VM_VERSION, virtio_net_save, virtio_net_load, n); -- 1.6.2.2 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv5 09/13] qemu: virtio support for many interrupt vectors
Extend virtio to support many interrupt vectors, and rearrange code in preparation for multi-vector support (mostly move reset out to bindings, because we will have to reset the vectors in transport-specific code). Actual bindings in pci, and use in net, to follow. Load and save are not connected to bindings yet, so they are left stubbed out for now. Signed-off-by: Michael S. Tsirkin --- hw/syborg_virtio.c | 13 -- hw/virtio-pci.c| 24 hw/virtio.c| 59 ++- hw/virtio.h| 10 +++- 4 files changed, 81 insertions(+), 25 deletions(-) diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c index 8e665c6..108af06 100644 --- a/hw/syborg_virtio.c +++ b/hw/syborg_virtio.c @@ -134,7 +134,10 @@ static void syborg_virtio_writel(void *opaque, target_phys_addr_t offset, vdev->features = value; break; case SYBORG_VIRTIO_QUEUE_BASE: -virtio_queue_set_addr(vdev, vdev->queue_sel, value); +if (value == 0) +virtio_reset(vdev); +else +virtio_queue_set_addr(vdev, vdev->queue_sel, value); break; case SYBORG_VIRTIO_QUEUE_SEL: if (value < VIRTIO_PCI_QUEUE_MAX) @@ -228,7 +231,7 @@ static CPUWriteMemoryFunc *syborg_virtio_writefn[] = { syborg_virtio_writel }; -static void syborg_virtio_update_irq(void *opaque) +static void syborg_virtio_update_irq(void *opaque, uint16_t vector) { SyborgVirtIOProxy *proxy = opaque; int level; @@ -239,7 +242,7 @@ static void syborg_virtio_update_irq(void *opaque) } static VirtIOBindings syborg_virtio_bindings = { -.update_irq = syborg_virtio_update_irq +.notify = syborg_virtio_update_irq }; static void syborg_virtio_init(SyborgVirtIOProxy *proxy, VirtIODevice *vdev) @@ -248,6 +251,8 @@ static void syborg_virtio_init(SyborgVirtIOProxy *proxy, VirtIODevice *vdev) proxy->vdev = vdev; +/* Don't support multiple vectors */ +proxy->vdev->nvectors = 0; sysbus_init_irq(&proxy->busdev, &proxy->irq); iomemtype = cpu_register_io_memory(syborg_virtio_readfn, syborg_virtio_writefn, proxy); @@ -255,6 +260,8 @@ static void syborg_virtio_init(SyborgVirtIOProxy *proxy, VirtIODevice *vdev) proxy->id = ((uint32_t)0x1af4 << 16) | vdev->device_id; +qemu_register_reset(virtio_reset, 0, vdev); + virtio_bind_device(vdev, &syborg_virtio_bindings, proxy); } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index 24fe837..d4a134d 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -78,13 +78,19 @@ typedef struct { /* virtio device */ -static void virtio_pci_update_irq(void *opaque) +static void virtio_pci_notify(void *opaque, uint16_t vector) { VirtIOPCIProxy *proxy = opaque; qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1); } +static void virtio_pci_reset(void *opaque) +{ +VirtIOPCIProxy *proxy = opaque; +virtio_reset(proxy->vdev); +} + static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) { VirtIOPCIProxy *proxy = opaque; @@ -108,7 +114,10 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) break; case VIRTIO_PCI_QUEUE_PFN: pa = (target_phys_addr_t)val << VIRTIO_PCI_QUEUE_ADDR_SHIFT; -virtio_queue_set_addr(vdev, vdev->queue_sel, pa); +if (pa == 0) +virtio_pci_reset(proxy); +else +virtio_queue_set_addr(vdev, vdev->queue_sel, pa); break; case VIRTIO_PCI_QUEUE_SEL: if (val < VIRTIO_PCI_QUEUE_MAX) @@ -120,7 +129,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) case VIRTIO_PCI_STATUS: vdev->status = val & 0xFF; if (vdev->status == 0) -virtio_reset(vdev); +virtio_pci_reset(proxy); break; } } @@ -158,7 +167,7 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t addr) /* reading from the ISR also clears it. */ ret = vdev->isr; vdev->isr = 0; -virtio_update_irq(vdev); +qemu_set_irq(proxy->pci_dev.irq[0], 0); break; default: break; @@ -243,7 +252,7 @@ static void virtio_map(PCIDevice *pci_dev, int region_num, } static const VirtIOBindings virtio_pci_bindings = { -.update_irq = virtio_pci_update_irq +.notify = virtio_pci_notify }; static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev, @@ -255,6 +264,9 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev, proxy->vdev = vdev; +/* No support for multiple vectors yet. */ +proxy->vdev->nvectors = 0; + config = proxy->pci_dev.config; pci_config_set_vendor_id(config, vendor); pci_config_set_device_id(config, device); @@ -279,6 +291,8 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev, pci_register_bar(&proxy->pc
[PATCHv5 10/13] qemu: MSI-X support in virtio PCI
This enables actual support for MSI-X in virtio PCI. First user will be virtio-net. Signed-off-by: Michael S. Tsirkin --- hw/virtio-pci.c | 152 -- rules.mak |2 +- 2 files changed, 113 insertions(+), 41 deletions(-) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index d4a134d..d3f4884 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -18,6 +18,7 @@ #include "virtio.h" #include "pci.h" //#include "sysemu.h" +#include "msix.h" /* from Linux's linux/virtio_pci.h */ @@ -47,7 +48,24 @@ * a read-and-acknowledge. */ #define VIRTIO_PCI_ISR 19 -#define VIRTIO_PCI_CONFIG 20 +/* MSI-X registers: only enabled if MSI-X is enabled. */ +/* A 16-bit vector for configuration changes. */ +#define VIRTIO_MSI_CONFIG_VECTOR20 +/* A 16-bit vector for selected queue notifications. */ +#define VIRTIO_MSI_QUEUE_VECTOR 22 + +/* Config space size */ +#define VIRTIO_PCI_CONFIG_NOMSI 20 +#define VIRTIO_PCI_CONFIG_MSI 24 +#define VIRTIO_PCI_REGION_SIZE(dev) (msix_present(dev) ? \ + VIRTIO_PCI_CONFIG_MSI : \ + VIRTIO_PCI_CONFIG_NOMSI) + +/* The remaining space is defined by each driver as the per-driver + * configuration space */ +#define VIRTIO_PCI_CONFIG(dev) (msix_enabled(dev) ? \ + VIRTIO_PCI_CONFIG_MSI : \ + VIRTIO_PCI_CONFIG_NOMSI) /* Virtio ABI version, if we increment this, we break the guest driver. */ #define VIRTIO_PCI_ABI_VERSION 0 @@ -81,14 +99,17 @@ typedef struct { static void virtio_pci_notify(void *opaque, uint16_t vector) { VirtIOPCIProxy *proxy = opaque; - -qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1); +if (msix_enabled(&proxy->pci_dev)) +msix_notify(&proxy->pci_dev, vector); +else +qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1); } static void virtio_pci_reset(void *opaque) { VirtIOPCIProxy *proxy = opaque; virtio_reset(proxy->vdev); +msix_reset(&proxy->pci_dev); } static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) @@ -97,8 +118,6 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) VirtIODevice *vdev = proxy->vdev; target_phys_addr_t pa; -addr -= proxy->addr; - switch (addr) { case VIRTIO_PCI_GUEST_FEATURES: /* Guest does not negotiate properly? We have to assume nothing. */ @@ -131,17 +150,33 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) if (vdev->status == 0) virtio_pci_reset(proxy); break; +case VIRTIO_MSI_CONFIG_VECTOR: +msix_vector_unuse(&proxy->pci_dev, vdev->config_vector); +/* Make it possible for guest to discover an error took place. */ +if (msix_vector_use(&proxy->pci_dev, val) < 0) +val = VIRTIO_NO_VECTOR; +vdev->config_vector = val; +break; +case VIRTIO_MSI_QUEUE_VECTOR: +msix_vector_unuse(&proxy->pci_dev, + virtio_queue_vector(vdev, vdev->queue_sel)); +/* Make it possible for guest to discover an error took place. */ +if (msix_vector_use(&proxy->pci_dev, val) < 0) +val = VIRTIO_NO_VECTOR; +virtio_queue_set_vector(vdev, vdev->queue_sel, val); +break; +default: +fprintf(stderr, "%s: unexpected address 0x%x value 0x%x\n", +__func__, addr, val); +break; } } -static uint32_t virtio_ioport_read(void *opaque, uint32_t addr) +static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr) { -VirtIOPCIProxy *proxy = opaque; VirtIODevice *vdev = proxy->vdev; uint32_t ret = 0x; -addr -= proxy->addr; - switch (addr) { case VIRTIO_PCI_HOST_FEATURES: ret = vdev->get_features(vdev); @@ -169,6 +204,12 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t addr) vdev->isr = 0; qemu_set_irq(proxy->pci_dev.irq[0], 0); break; +case VIRTIO_MSI_CONFIG_VECTOR: +ret = vdev->config_vector; +break; +case VIRTIO_MSI_QUEUE_VECTOR: +ret = virtio_queue_vector(vdev, vdev->queue_sel); +break; default: break; } @@ -179,42 +220,72 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t addr) static uint32_t virtio_pci_config_readb(void *opaque, uint32_t addr) { VirtIOPCIProxy *proxy = opaque; -addr -= proxy->addr + VIRTIO_PCI_CONFIG; +uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev); +addr -= proxy->addr; +if (addr < config) +return virtio_ioport_read(proxy, addr); +addr -= config; return virtio_config_readb(proxy->vdev, addr); } static uint32_t virtio_pci_config_readw(void *opaque, uint32_t addr) {
Re: [Qemu-devel] [PATCHv5 08/13] qemu: add support for resizing regions
On Thursday 18 June 2009, Michael S. Tsirkin wrote: > Make it possible to resize PCI regions. This will be used by virtio > with MSI-X, where the region size depends on whether MSI-X is enabled, > and can change across load/save. I thought we'd agreed we shouldn't be doing this. i.e. if the user tries to load the wrong device, just bail out. Paul ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Qemu-devel] [PATCHv5 08/13] qemu: add support for resizing regions
On Thu, Jun 18, 2009 at 05:32:11PM +0100, Paul Brook wrote: > On Thursday 18 June 2009, Michael S. Tsirkin wrote: > > Make it possible to resize PCI regions. This will be used by virtio > > with MSI-X, where the region size depends on whether MSI-X is enabled, > > and can change across load/save. > > I thought we'd agreed we shouldn't be doing this. > i.e. if the user tries to load the wrong device, just bail out. > > Paul We need a way to generally prevent load from changing registers in a way that can not be done by a guest. I haven't implemented this feature yet though. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.
On Thu, 18 Jun 2009 05:04:22 pm Herbert Xu wrote: > On Thu, Jun 18, 2009 at 04:47:50PM +0930, Rusty Russell wrote: > > Summary: we still have about 54 in-tree drivers which actually use > > NETDEV_TX_BUSY for normal paths. Can I fix it now? > > You can fix it but I don't quite understand your results below :) You didn't comment on my patch which tried to fix NETDEV_TX_BUSY tho? > > sungem.c: Y, N > > This driver does the bug check in addition to a race check that > should simply drop the packet instead of queueing. In fact chances > are the race check is unnecessary anyway. OK, "N" means "can be simply replaced with kfree_skb(skb); return NETDEV_TX_OK;". "Y" means "driver will break if we do that, needs rewriting". I didn't grade how hard or easy the rewrite would be, but later on I got more picky (I would have said this is N, N: the race can be replaced with a drop). > > fs_enet: N > > This is either just a bug check or the driver is broken in that > it should stop the queue when the said condition can be true. > > > mace.c: N > > Just a bug check. Err, that's why they're N (ie. does not need TX_BUSY). > > sh_eth.c: Y > > This driver should check the queue after transmitting, just like > virtio-net :) > > So from a totally non-representative sample of 4, my conclusion > is that none of them need TX_BUSY. Do you have an example that > really needs it? First you asserted "Most of them just do this:... /* Never happens */". Now I've found ~50 drivers which don't do that, it's "Do any of them really need it?". So, now I'll look at that. Some are just buggy (I'll send patches for those). Most I just have no idea what they're doing; they're pretty ugly. These ones are interesting: e1000/e1000_main.c: fifo bug workaround? ehea/ehea_main.c: ? starfire.c: "we may not have enough slots even when it seems we do."? tg3.c: tg3_gso_bug ISTR at least one driver claimed practice showed it was better to return TX_BUSY, and one insisted it wouldn't wasn't going to waste MAX_FRAGS on the stop-early scheme. > Anyway, I don't think we should reshape our APIs based on how > broken the existing users are. We provided an API, people used it. Constantly trying to disclaim our responsibility for the resulting mess makes me fucking ANGRY. We either remove the API, or fix it. I think fixing it is better, because my driver will be simpler and it's obvious noone wants to rewrite 50 drivers and break several of them. I don't know how many times I can say the same thing... Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.
On Fri, Jun 19, 2009 at 01:07:19PM +0930, Rusty Russell wrote: > > You didn't comment on my patch which tried to fix NETDEV_TX_BUSY tho? I think fixing it would only encourage more drivers to use and abuse TX_BUSY. The fundamental problem with TX_BUSY is that you're doing the check before transmitting a packet instead of after transmitting it. Let me explain why this is wrong, beyond the fact that tcpdump may see the packet twice which you've tried to fix. The problem is that requeueing is fundamentally hard. We use to have this horrible logic in the schedulers to handle this. Thankfully that seems to have been replaced with a single device-level packet holder shared with GSO. However, that is still wrong for many packet schedulers. For example, if the requeued packet is of a lower priority, and a higher priority packet comes along, we want the higher priority packet to preempt the requeued packet. Right now it just doesn't happen. This is not as trivial as it seems because on a busy host this can happen many times a second. With TX_BUSY the QoS guarantees are simply not workable. BTW you pointed out that GSO also uses TX_BUSY, but that is different because the packet schedulers treat a GSO packet as a single entity so there is no issue of preemption. Also tcpdump will never see it twice by design. > e1000/e1000_main.c: fifo bug workaround? The workaround should work just as well as a stop-queue check after packet transmission. > ehea/ehea_main.c: ? Ahh! The bastard LLTX drivers are still around. LLTX was the worst abuse associated with TX_BUSY. Thankfully not many of them are left. The fix is to not use LLTX and use the xmit_lock like normal drivers. > starfire.c: "we may not have enough slots even when it seems we do."? Just replace skb_num_frags with SKB_MAX_FRAGS and move the check after the transmit. > tg3.c: tg3_gso_bug A better solution would in fact be to disable hardware TSO when we encounter such a packet (and drop the first one). Because once you get one you're likely to get a lot more. The difference between hardware TSO and GSO on a card like tg3 is negligible anyway. Alternatively just disable TSO completely on those chips. Ccing the tg3 maintainer. > We provided an API, people used it. Constantly trying to disclaim our > responsibility for the resulting mess makes me fucking ANGRY. Where have I disclaimed responsibility? If we were doing that then we wouldn't be having this discussion. > We either remove the API, or fix it. I think fixing it is better, because my > driver will be simpler and it's obvious noone wants to rewrite 50 drivers and > break several of them. My preference is obviously in the long term removal of TX_BUSY. Due to resource constraints that cannot be done immediately. So at least we should try to stop its proliferation. BTW removing TX_BUSY does not mean that your driver has to stay complicated. As I have said repeatedly your driver should be checking the stop-queue condition after transmission, not before. In fact queueing it in the driver is just as bad as return TX_BUSY! Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization