Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-18 Thread Rusty Russell
On Sun, 14 Jun 2009 04:15:28 pm Herbert Xu wrote:
> On Sat, Jun 13, 2009 at 10:00:37PM +0930, Rusty Russell wrote:
> > But re your comment that the 67 drivers using TX_BUSY are doing it
> > because of driver bugs, that's hard to believe.  It either hardly ever
> > happens (in which case just drop the packet), or it happens (in which
> > case we should handle it correctly).
>
> Most of them just do this:
>
> start_xmit:
>
> if (unlikely(queue is full)) {
>   /* This should never happen. */
>   return TX_BUSY;
> }

OK, so I did a rough audit, trying to figure out the "never happens" ones (N, 
could be kfree_skb(skb); return NETDEV_TX_OK) from the "will actually happen" 
ones (Y).

One question: can netif_queue_stopped() ever be true inside start_xmit?  Some 
drivers test that, like sun3lance.c.

Some have multiple returns, and some are unclear, but my best guess on a 
quickish reading is below.

Summary: we still have about 54 in-tree drivers which actually use 
NETDEV_TX_BUSY for normal paths.  Can I fix it now?

Thanks,
Rusty.

sungem.c: Y, N
fs_enet: N
mace.c: N
sh_eth.c: Y
de620.c: N
acenic.c: N (but goes through stupid hoops to avoid it)
e1000_main.c: N, Y
korina.c: N (Buggy: frees skb and returns TX_BUSY.)
sky2.c: N
cassini.c: N
ixgbe: N
b44.c: N, Y, Y (last two buggy: OOM, does not stop q)
macb.c: N
niu.c: N
smctr.c: N
olympic.c: Y
tms380tr.c: N
3c359.c: Y
lanstreamer.c: Y
lib8390.c: N
xirc2ps_cs.c: Y
smc91c92_cs.c: N
fmvj18x_cs.c: N (and buggy: can't happen AFAICT, and return 0 above?)
axnet_cs.c: N
smsc9420.c: N (and buggy: doesn't stop q)
mkiss.c: N (In xmit, can netif_running be false? Can netif_queue_stopped?)
skge.c: N
qlge: N, N, Y, N (buggy, OOM, does not stop q) 
chelsio: N
s2io.c: Y, Y?
macmace.c: N
3c505.c: Y
defxx.c: Y
myri10ge: N
sbni.c: Y
wanxl.c: N
cycx_x25.c: N, N, Y?
dlci.c: Y
qla3xxx.c: N, N (buggy, OOM, does not stop q), Y, N, 
tlan.c: Y
skfp.c: N
cs89x0.c: N
smc9194.c: N
fec_mpc52xx.c: N
mv643xx_eth.c: N (buggy, OOM, does not stop q)
ll_temac_main.c: Y, Y
netxen: Y
tsi108_eth.c: N, N
ni65.c: N
sunhme.c: N
atl1c.c: Y
ps3_gelic_net.c: Y
igbvf: N
csgb3.c: N
ks8695net.c: N, N (buggy, neither stops q, latter OOM)
ether3.c: N
at91_ether.c: N
bnx2x_main.c: N, N
dm9000.c: N
jme.c: N
3c537.c: Y (plus, leak on skb_padto fail)
arcnet.c: N?
3c59x.c: N
au1000_eth.c: Y
ixgb: N
de600.c: N, N, N
myri_sbus.c: Y
bnx2.c: N
atl1e: Y
sonic.c: who cares, that won't even compile... (missing semicolon)
sun3_82586.c: N
3c515.c: N
ibm_newemac.c: Y
donaubae.c:Y?, Y?, Y?, Y (but never stops q)
sir_dev.c: Y
au1k_ir.c: Y, Y
cpmac.c: N (no stop q, and leak on skb_padto fail), Y
davinci_emac.c: N (no stop q), Y
de2104x.c: N
uli526x.c: N
dmfe.c: N
xircom_cb.c: N
iwmc3200wifi: Y
orinoco: N, N, N, N (no stop q)
atmel.c: Y
p54common.c: N, Y?
arlan-main.c: Y?
libipw_tx.c: Y (no stop q), N (alloc failure)
hostap_80211_tx.c: Y
strip.c: N
wavelan.c: N, N, N
at76c50x-usb.c: N
libertas/tx.c: Y
ray_cs.c: N, N
airo.c: Y, Y, Y
plip.c: N, N, N (starts q, BUSY on too big pkt?)
ns83820.c: N, N
ehea: Y, Y (no stop q)
rionet.c: N
enic: N
sis900.c: N
starfire.c: Y
r6040.c: N
sun3lance.c: N, N
sfc: Y, N, Y
mac89x0.c: N
sb1250-mac.c: Y
pasemi_mac.c: Y
8139cp.c: N
e1000e: N
r8169.c: N?
sis190.c: N
e100.c: N
tg3.c: N, Y?, N
fec.c: N (no stop q), N
hamachi.c: N
forcedeth.c: Y, Y
vxge: Y?, Y?
ks8842.c: Y
spider_net.c: Y
igb: N
ewrk3.c: N
gianfar.c: Y
sunvnet.c: N
mlx4: Y
atlx: Y, Y
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-18 Thread Herbert Xu
On Thu, Jun 18, 2009 at 04:47:50PM +0930, Rusty Russell wrote:
> 
> One question: can netif_queue_stopped() ever be true inside start_xmit?  Some 
> drivers test that, like sun3lance.c.

The driver should never test that because even if it is true
due to driver shutdown the xmit function should ignore it as
the upper layer will wait for it anyway.

> Summary: we still have about 54 in-tree drivers which actually use 
> NETDEV_TX_BUSY for normal paths.  Can I fix it now?

You can fix it but I don't quite understand your results below :)

> sungem.c: Y, N

This driver does the bug check in addition to a race check that
should simply drop the packet instead of queueing.  In fact chances
are the race check is unnecessary anyway.

> fs_enet: N

This is either just a bug check or the driver is broken in that
it should stop the queue when the said condition can be true.

> mace.c: N

Just a bug check.

> sh_eth.c: Y

This driver should check the queue after transmitting, just like
virtio-net :)

So from a totally non-representative sample of 4, my conclusion
is that none of them need TX_BUSY.  Do you have an example that
really needs it?

Anyway, I don't think we should reshape our APIs based on how
broken the existing users are.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 02/13] qemu: capability bits in pci save/restore

2009-06-18 Thread Michael S. Tsirkin
Add support for capability bits in save/restore for pci.
These will be used for MSI, where the capability might
be present or not as requested by user, which does not
map well into a single version number.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |   17 ++---
 hw/pci.h |4 
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 840986f..d42fcd9 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -126,9 +126,13 @@ int pci_bus_num(PCIBus *s)
 
 void pci_device_save(PCIDevice *s, QEMUFile *f)
 {
+int version = s->cap_present ? 3 : 2;
 int i;
 
-qemu_put_be32(f, 2); /* PCI device version */
+/* PCI device version and capabilities */
+qemu_put_be32(f, version);
+if (version >= 3)
+qemu_put_be32(f, s->cap_present);
 qemu_put_buffer(f, s->config, 256);
 for (i = 0; i < 4; i++)
 qemu_put_be32(f, s->irq_state[i]);
@@ -140,15 +144,22 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 int i;
 
 version_id = qemu_get_be32(f);
-if (version_id > 2)
+if (version_id > 3)
 return -EINVAL;
+if (version_id >= 3)
+s->cap_present = qemu_get_be32(f);
+else
+s->cap_present = 0;
+
+if (s->cap_present & ~s->cap_supported)
+return -EINVAL;
+
 qemu_get_buffer(f, s->config, 256);
 pci_update_mappings(s);
 
 if (version_id >= 2)
 for (i = 0; i < 4; i ++)
 s->irq_state[i] = qemu_get_be32(f);
-
 return 0;
 }
 
diff --git a/hw/pci.h b/hw/pci.h
index 44aa61b..1226846 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -174,6 +174,10 @@ struct PCIDevice {
 
 /* Current IRQ levels.  Used internally by the generic PCI code.  */
 int irq_state[4];
+
+/* Capability bits for save/load */
+uint32_t cap_supported;
+uint32_t cap_present;
 };
 
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
-- 
1.6.2.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 03/13] qemu: add routines to manage PCI capabilities

2009-06-18 Thread Michael S. Tsirkin
Add routines to manage PCI capability list. First user will be MSI-X.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |   79 ++
 hw/pci.h |   18 +-
 2 files changed, 96 insertions(+), 1 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index d42fcd9..e3f80d0 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -160,6 +160,12 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 if (version_id >= 2)
 for (i = 0; i < 4; i ++)
 s->irq_state[i] = qemu_get_be32(f);
+/* Clear wmask and used bits for capabilities.
+   Must be restored separately, since capabilities can
+   be placed anywhere in config space. */
+memset(s->used, 0, PCI_CONFIG_SPACE_SIZE);
+for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i)
+s->wmask[i] = 0xff;
 return 0;
 }
 
@@ -867,3 +873,76 @@ PCIDevice *pci_create_simple(PCIBus *bus, int devfn, const 
char *name)
 
 return (PCIDevice *)dev;
 }
+
+static int pci_find_space(PCIDevice *pdev, uint8_t size)
+{
+int offset = PCI_CONFIG_HEADER_SIZE;
+int i;
+for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i)
+if (pdev->used[i])
+offset = i + 1;
+else if (i - offset + 1 == size)
+return offset;
+return 0;
+}
+
+static uint8_t pci_find_capability_list(PCIDevice *pdev, uint8_t cap_id,
+uint8_t *prev_p)
+{
+uint8_t next, prev;
+
+if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST))
+return 0;
+
+for (prev = PCI_CAPABILITY_LIST; (next = pdev->config[prev]);
+ prev = next + PCI_CAP_LIST_NEXT)
+if (pdev->config[next + PCI_CAP_LIST_ID] == cap_id)
+break;
+
+if (prev_p)
+*prev_p = prev;
+return next;
+}
+
+/* Reserve space and add capability to the linked list in pci config space */
+int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
+{
+uint8_t offset = pci_find_space(pdev, size);
+uint8_t *config = pdev->config + offset;
+if (!offset)
+return -ENOSPC;
+config[PCI_CAP_LIST_ID] = cap_id;
+config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST];
+pdev->config[PCI_CAPABILITY_LIST] = offset;
+pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
+memset(pdev->used + offset, 0xFF, size);
+/* Make capability read-only by default */
+memset(pdev->wmask + offset, 0, size);
+return offset;
+}
+
+/* Unlink capability from the pci config space. */
+void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
+{
+uint8_t prev, offset = pci_find_capability_list(pdev, cap_id, &prev);
+if (!offset)
+return;
+pdev->config[prev] = pdev->config[offset + PCI_CAP_LIST_NEXT];
+/* Make capability writeable again */
+memset(pdev->wmask + offset, 0xff, size);
+memset(pdev->used + offset, 0, size);
+
+if (!pdev->config[PCI_CAPABILITY_LIST])
+pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST;
+}
+
+/* Reserve space for capability at a known offset (to call after load). */
+void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size)
+{
+memset(pdev->used + offset, 0xff, size);
+}
+
+uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id)
+{
+return pci_find_capability_list(pdev, cap_id, NULL);
+}
diff --git a/hw/pci.h b/hw/pci.h
index 1226846..4f90fdc 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -121,6 +121,10 @@ typedef struct PCIIORegion {
 #define PCI_MIN_GNT0x3e/* 8 bits */
 #define PCI_MAX_LAT0x3f/* 8 bits */
 
+/* Capability lists */
+#define PCI_CAP_LIST_ID0   /* Capability ID */
+#define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
+
 #define PCI_REVISION0x08/* obsolete, use PCI_REVISION_ID */
 #define PCI_SUBVENDOR_ID0x2c/* obsolete, use 
PCI_SUBSYSTEM_VENDOR_ID */
 #define PCI_SUBDEVICE_ID0x2e/* obsolete, use PCI_SUBSYSTEM_ID */
@@ -128,7 +132,7 @@ typedef struct PCIIORegion {
 /* Bits in the PCI Status Register (PCI 2.3 spec) */
 #define PCI_STATUS_RESERVED1   0x007
 #define PCI_STATUS_INT_STATUS  0x008
-#define PCI_STATUS_CAPABILITIES0x010
+#define PCI_STATUS_CAP_LIST0x010
 #define PCI_STATUS_66MHZ   0x020
 #define PCI_STATUS_RESERVED2   0x040
 #define PCI_STATUS_FAST_BACK   0x080
@@ -158,6 +162,9 @@ struct PCIDevice {
 /* Used to implement R/W bytes */
 uint8_t wmask[PCI_CONFIG_SPACE_SIZE];
 
+/* Used to allocate config space for capabilities. */
+uint8_t used[PCI_CONFIG_SPACE_SIZE];
+
 /* the following fields are read only */
 PCIBus *bus;
 int devfn;
@@ -190,6 +197,15 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 uint32_t size, int type,
 PCIMapIORegionFunc *map_func);
 
+int pci_add_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
+
+

[PATCHv5 01/13] qemu: make default_write_config use mask table

2009-06-18 Thread Michael S. Tsirkin
Change much of hw/pci to use symbolic constants and a table-driven
design: add a mask table with writable bits set and readonly bits unset.
Detect change by comparing original and new registers.

This makes it easy to support capabilities where read-only/writeable
bit layout differs between devices, depending on capabilities present.

As a result, writing a single byte in BAR registers now works as
it should. Writing to upper limit registers in the bridge
also works as it should. Code is also shorter.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |  145 -
 hw/pci.h |   18 +++-
 2 files changed, 46 insertions(+), 117 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 0a738db..840986f 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -238,6 +238,17 @@ int pci_assign_devaddr(const char *addr, int *domp, int 
*busp, unsigned *slotp)
 return pci_parse_devaddr(devaddr, domp, busp, slotp);
 }
 
+static void pci_init_mask(PCIDevice *dev)
+{
+int i;
+dev->wmask[PCI_CACHE_LINE_SIZE] = 0xff;
+dev->wmask[PCI_INTERRUPT_LINE] = 0xff;
+dev->wmask[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY
+  | PCI_COMMAND_MASTER;
+for (i = PCI_CONFIG_HEADER_SIZE; i < PCI_CONFIG_SPACE_SIZE; ++i)
+dev->wmask[i] = 0xff;
+}
+
 /* -1 for devfn means auto assign */
 static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
  const char *name, int devfn,
@@ -257,6 +268,7 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev, PCIBus *bus,
 pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
 memset(pci_dev->irq_state, 0, sizeof(pci_dev->irq_state));
 pci_set_default_subsystem_id(pci_dev);
+pci_init_mask(pci_dev);
 
 if (!config_read)
 config_read = pci_default_read_config;
@@ -328,6 +340,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 {
 PCIIORegion *r;
 uint32_t addr;
+uint32_t wmask;
 
 if ((unsigned int)region_num >= PCI_NUM_REGIONS)
 return;
@@ -343,12 +356,17 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 r->size = size;
 r->type = type;
 r->map_func = map_func;
+
+wmask = ~(size - 1);
 if (region_num == PCI_ROM_SLOT) {
 addr = 0x30;
+/* ROM enable bit is writeable */
+wmask |= 1;
 } else {
 addr = 0x10 + region_num * 4;
 }
 *(uint32_t *)(pci_dev->config + addr) = cpu_to_le32(type);
+*(uint32_t *)(pci_dev->wmask + addr) = cpu_to_le32(wmask);
 }
 
 static void pci_update_mappings(PCIDevice *d)
@@ -457,118 +475,21 @@ uint32_t pci_default_read_config(PCIDevice *d,
 return val;
 }
 
-void pci_default_write_config(PCIDevice *d,
-  uint32_t address, uint32_t val, int len)
+void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
 {
-int can_write, i;
-uint32_t end, addr;
-
-if (len == 4 && ((address >= 0x10 && address < 0x10 + 4 * 6) ||
- (address >= 0x30 && address < 0x34))) {
-PCIIORegion *r;
-int reg;
+uint8_t orig[PCI_CONFIG_SPACE_SIZE];
+int i;
 
-if ( address >= 0x30 ) {
-reg = PCI_ROM_SLOT;
-}else{
-reg = (address - 0x10) >> 2;
-}
-r = &d->io_regions[reg];
-if (r->size == 0)
-goto default_config;
-/* compute the stored value */
-if (reg == PCI_ROM_SLOT) {
-/* keep ROM enable bit */
-val &= (~(r->size - 1)) | 1;
-} else {
-val &= ~(r->size - 1);
-val |= r->type;
-}
-*(uint32_t *)(d->config + address) = cpu_to_le32(val);
-pci_update_mappings(d);
-return;
-}
- default_config:
 /* not efficient, but simple */
-addr = address;
-for(i = 0; i < len; i++) {
-/* default read/write accesses */
-switch(d->config[0x0e]) {
-case 0x00:
-case 0x80:
-switch(addr) {
-case 0x00:
-case 0x01:
-case 0x02:
-case 0x03:
-case 0x06:
-case 0x07:
-case 0x08:
-case 0x09:
-case 0x0a:
-case 0x0b:
-case 0x0e:
-case 0x10 ... 0x27: /* base */
-case 0x2c ... 0x2f: /* read-only subsystem ID & vendor ID */
-case 0x30 ... 0x33: /* rom */
-case 0x3d:
-can_write = 0;
-break;
-default:
-can_write = 1;
-break;
-}
-break;
-default:
-case 0x01:
-switch(addr) {
-case 0x00:
-case 0x01:
-case 0x02:
-case 0x03:
-case 0x06:
-case 0x07:
-case 0x08:
-case 0x09:
-case 0x0a:
-case 0x0b:

[PREFIXv5 00/13] qemu: MSI-X support

2009-06-18 Thread Michael S. Tsirkin
Here is the port of MSI-X support patches to upstream qemu.
Please comment or commit.

This patchset adds generic support for MSI-X, adds implementation in
APIC, and uses MSI-X in virtio-net.

Changelog:
- since v4
  rebased to latest bits. rename resize_region -> resize_bar
- since v3
  call to resize_region on load
  split patches a bit differently to address style comments by Glauber
  update commit message to clarify what msix_support flag does
- since v2
  rename mask -> wmask to avoid conflict with work by Yamahata
- since v1
  At Paul's suggestion, use stl_phy to decouple APIC and MSI-X implementation

This uses the mask table patch that I posted previously, and which is
included in the series.

-- 
MST

Michael S. Tsirkin (13):
  qemu: make default_write_config use mask table
  qemu: capability bits in pci save/restore
  qemu: add routines to manage PCI capabilities
  qemu: helper routines for pci access
  qemu: MSI-X support functions
  qemu: add flag to disable MSI-X by default
  qemu: minimal MSI/MSI-X implementation for PC
  qemu: add support for resizing regions
  qemu: virtio support for many interrupt vectors
  qemu: MSI-X support in virtio PCI
  qemu: request 3 vectors in virtio-net
  qemu: virtio save/load bindings
  qemu: add pci_get/set_byte

 Makefile.target|2 +-
 hw/apic.c  |   43 +-
 hw/msix.c  |  420 
 hw/msix.h  |   35 +
 hw/pci.c   |  294 +++-
 hw/pci.h   |  104 -
 hw/syborg_virtio.c |   13 ++-
 hw/virtio-net.c|1 +
 hw/virtio-pci.c|  216 ++-
 hw/virtio.c|   70 ++---
 hw/virtio.h|   14 ++-
 qemu-options.hx|2 +
 rules.mak  |2 +-
 vl.c   |3 +
 14 files changed, 1004 insertions(+), 215 deletions(-)
 create mode 100644 hw/msix.c
 create mode 100644 hw/msix.h
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 05/13] qemu: MSI-X support functions

2009-06-18 Thread Michael S. Tsirkin
Add functions implementing MSI-X support. First user will be virtio-pci.
Note that platform must set a flag to declare MSI supported: this
is a safety measure to avoid breaking platforms which should support
MSI-X but currently lack this in the interrupt controller emulation.
For PC this will be set by APIC.

Signed-off-by: Michael S. Tsirkin 
---
 Makefile.target |2 +-
 hw/msix.c   |  417 +++
 hw/msix.h   |   35 +
 hw/pci.h|   20 +++
 4 files changed, 473 insertions(+), 1 deletions(-)
 create mode 100644 hw/msix.c
 create mode 100644 hw/msix.h

diff --git a/Makefile.target b/Makefile.target
index 0159bf7..145b3a9 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -494,7 +494,7 @@ endif #CONFIG_BSD_USER
 ifndef CONFIG_USER_ONLY
 
 OBJS=vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
- gdbstub.o gdbstub-xml.o
+ gdbstub.o gdbstub-xml.o msix.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
 OBJS+=virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o
diff --git a/hw/msix.c b/hw/msix.c
new file mode 100644
index 000..a448eab
--- /dev/null
+++ b/hw/msix.c
@@ -0,0 +1,417 @@
+/*
+ * MSI-X device support
+ *
+ * This module includes support for MSI-X in pci devices.
+ *
+ * Author: Michael S. Tsirkin 
+ *
+ *  Copyright (c) 2009, Red Hat Inc, Michael S. Tsirkin (m...@redhat.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "hw.h"
+#include "msix.h"
+#include "pci.h"
+
+/* Declaration from linux/pci_regs.h */
+#define  PCI_CAP_ID_MSIX 0x11 /* MSI-X */
+#define  PCI_MSIX_FLAGS 2 /* Table at lower 11 bits */
+#define  PCI_MSIX_FLAGS_QSIZE  0x7FF
+#define  PCI_MSIX_FLAGS_ENABLE (1 << 15)
+#define  PCI_MSIX_FLAGS_BIRMASK(7 << 0)
+
+/* MSI-X capability structure */
+#define MSIX_TABLE_OFFSET 4
+#define MSIX_PBA_OFFSET 8
+#define MSIX_CAP_LENGTH 12
+
+/* MSI enable bit is in byte 1 in FLAGS register */
+#define MSIX_ENABLE_OFFSET (PCI_MSIX_FLAGS + 1)
+#define MSIX_ENABLE_MASK (PCI_MSIX_FLAGS_ENABLE >> 8)
+
+/* MSI-X table format */
+#define MSIX_MSG_ADDR 0
+#define MSIX_MSG_UPPER_ADDR 4
+#define MSIX_MSG_DATA 8
+#define MSIX_VECTOR_CTRL 12
+#define MSIX_ENTRY_SIZE 16
+#define MSIX_VECTOR_MASK 0x1
+
+/* How much space does an MSIX table need. */
+/* The spec requires giving the table structure
+ * a 4K aligned region all by itself. Align it to
+ * target pages so that drivers can do passthrough
+ * on the rest of the region. */
+#define MSIX_PAGE_SIZE TARGET_PAGE_ALIGN(0x1000)
+/* Reserve second half of the page for pending bits */
+#define MSIX_PAGE_PENDING (MSIX_PAGE_SIZE / 2)
+#define MSIX_MAX_ENTRIES 32
+
+
+#ifdef MSIX_DEBUG
+#define DEBUG(fmt, ...)   \
+do {  \
+  fprintf(stderr, "%s: " fmt, __func__ , __VA_ARGS__);\
+} while (0)
+#else
+#define DEBUG(fmt, ...) do { } while(0)
+#endif
+
+/* Flag to globally disable MSI-X support */
+int msix_disable;
+
+/* Flag for interrupt controller to declare MSI-X support */
+int msix_supported;
+
+/* Add MSI-X capability to the config space for the device. */
+/* Given a bar and its size, add MSI-X table on top of it
+ * and fill MSI-X capability in the config space.
+ * Original bar size must be a power of 2 or 0.
+ * New bar size is returned. */
+static int msix_add_config(struct PCIDevice *pdev, unsigned short nentries,
+   unsigned bar_nr, unsigned bar_size)
+{
+int config_offset;
+uint8_t *config;
+uint32_t new_size;
+
+if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1)
+return -EINVAL;
+if (bar_size > 0x8000)
+return -ENOSPC;
+
+/* Add space for MSI-X structures */
+if (!bar_size)
+new_size = MSIX_PAGE_SIZE;
+else if (bar_size < MSIX_PAGE_SIZE) {
+bar_size = MSIX_PAGE_SIZE;
+new_size = MSIX_PAGE_SIZE * 2;
+} else
+new_size = bar_size * 2;
+
+pdev->msix_bar_size = new_size;
+config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, MSIX_CAP_LENGTH);
+if (config_offset < 0)
+return config_offset;
+config = pdev->config + config_offset;
+
+pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
+/* Table on top of BAR */
+pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
+/* Pending bits on top of that */
+pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + MSIX_PAGE_PENDING) |
+ bar_nr);
+pdev->msix_cap = config_offset;
+/* Make flags bit writeable. */
+pdev->wmask[config_offset + MSIX_ENABLE_OFFSET] |= MSIX_ENABLE_MASK;
+return 0;
+}
+
+static void msix_free_irq_entries(PCIDevice *dev)
+{
+int vector;
+
+for (vector = 0; vector < dev->msix_entries_nr; ++vector)
+dev->msix_entry_used[

[PATCHv5 06/13] qemu: add flag to disable MSI-X by default

2009-06-18 Thread Michael S. Tsirkin
Add global flag to disable MSI-X by default.  This is useful primarily
to make images loadable by older qemu (without msix).  Even when MSI-X
is disabled by flag, you can still load images that have MSI-X enabled.

Signed-off-by: Michael S. Tsirkin 
---
 hw/msix.c   |5 -
 qemu-options.hx |2 ++
 vl.c|3 +++
 3 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index a448eab..cc62a83 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -122,6 +122,9 @@ void msix_write_config(PCIDevice *dev, uint32_t addr,
uint32_t val, int len)
 {
 unsigned enable_pos = dev->msix_cap + MSIX_ENABLE_OFFSET;
+if (!(dev->cap_present & QEMU_PCI_CAP_MSIX))
+return;
+
 if (addr + len <= enable_pos || addr > enable_pos)
 return;
 
@@ -242,7 +245,7 @@ int msix_init(struct PCIDevice *dev, unsigned short 
nentries,
 
 dev->msix_table_page = qemu_mallocz(MSIX_PAGE_SIZE);
 
-dev->msix_mmio_index = cpu_register_io_memory(0, msix_mmio_read,
+dev->msix_mmio_index = cpu_register_io_memory(msix_mmio_read,
   msix_mmio_write, dev);
 if (dev->msix_mmio_index == -1) {
 ret = -EBUSY;
diff --git a/qemu-options.hx b/qemu-options.hx
index 9d5e05a..aa02615 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1584,3 +1584,5 @@ DEF("semihosting", 0, QEMU_OPTION_semihosting,
 DEF("old-param", 0, QEMU_OPTION_old_param,
 "-old-param  old param mode\n")
 #endif
+DEF("disable-msix", 0, QEMU_OPTION_disable_msix,
+"-disable-msix disable msix support for PCI devices (enabled by 
default)\n")
diff --git a/vl.c b/vl.c
index 3242c23..f7c67bb 100644
--- a/vl.c
+++ b/vl.c
@@ -135,6 +135,7 @@ int main(int argc, char **argv)
 #include "hw/usb.h"
 #include "hw/pcmcia.h"
 #include "hw/pc.h"
+#include "hw/msix.h"
 #include "hw/audiodev.h"
 #include "hw/isa.h"
 #include "hw/baum.h"
@@ -5699,6 +5700,8 @@ int main(int argc, char **argv, char **envp)
 xen_mode = XEN_ATTACH;
 break;
 #endif
+case QEMU_OPTION_disable_msix:
+msix_disable = 1;
 }
 }
 }
-- 
1.6.2.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 04/13] qemu: helper routines for pci access

2009-06-18 Thread Michael S. Tsirkin
Add inline routines for convenient access to pci devices
with correct (little) endianness. Will be used by MSI-X support.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.h |   30 +++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/hw/pci.h b/hw/pci.h
index 4f90fdc..b0a8acf 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -236,21 +236,45 @@ PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint16_t 
vid, uint16_t did,
 pci_map_irq_fn map_irq, const char *name);
 
 static inline void
+pci_set_word(uint8_t *config, uint16_t val)
+{
+cpu_to_le16wu((uint16_t *)config, val);
+}
+
+static inline uint16_t
+pci_get_word(uint8_t *config)
+{
+return le16_to_cpupu((uint16_t *)config);
+}
+
+static inline void
+pci_set_long(uint8_t *config, uint32_t val)
+{
+cpu_to_le32wu((uint32_t *)config, val);
+}
+
+static inline uint32_t
+pci_get_long(uint8_t *config)
+{
+return le32_to_cpupu((uint32_t *)config);
+}
+
+static inline void
 pci_config_set_vendor_id(uint8_t *pci_config, uint16_t val)
 {
-cpu_to_le16wu((uint16_t *)&pci_config[PCI_VENDOR_ID], val);
+pci_set_word(&pci_config[PCI_VENDOR_ID], val);
 }
 
 static inline void
 pci_config_set_device_id(uint8_t *pci_config, uint16_t val)
 {
-cpu_to_le16wu((uint16_t *)&pci_config[PCI_DEVICE_ID], val);
+pci_set_word(&pci_config[PCI_DEVICE_ID], val);
 }
 
 static inline void
 pci_config_set_class(uint8_t *pci_config, uint16_t val)
 {
-cpu_to_le16wu((uint16_t *)&pci_config[PCI_CLASS_DEVICE], val);
+pci_set_word(&pci_config[PCI_CLASS_DEVICE], val);
 }
 
 typedef void (*pci_qdev_initfn)(PCIDevice *dev);
-- 
1.6.2.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 07/13] qemu: minimal MSI/MSI-X implementation for PC

2009-06-18 Thread Michael S. Tsirkin
Implement MSI support in APIC. Note that MSI and MMIO APIC registers
are at the same memory location, but actually not on the global bus: MSI
is on PCI bus, APIC is connected directly to the CPU. We map them on the
global bus at the same address which happens to work because MSI
registers are reserved in APIC MMIO and vice versa.

Signed-off-by: Michael S. Tsirkin 
---
 hw/apic.c |   43 +++
 1 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 3e04132..3bcab46 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -19,6 +19,8 @@
  */
 #include "hw.h"
 #include "pc.h"
+#include "pci.h"
+#include "msix.h"
 #include "qemu-timer.h"
 #include "host-utils.h"
 
@@ -63,6 +65,19 @@
 #define MAX_APICS 255
 #define MAX_APIC_WORDS 8
 
+/* Intel APIC constants: from include/asm/msidef.h */
+#define MSI_DATA_VECTOR_SHIFT  0
+#define MSI_DATA_VECTOR_MASK   0x00ff
+#define MSI_DATA_DELIVERY_MODE_SHIFT   8
+#define MSI_DATA_TRIGGER_SHIFT 15
+#define MSI_DATA_LEVEL_SHIFT   14
+#define MSI_ADDR_DEST_MODE_SHIFT   2
+#define MSI_ADDR_DEST_ID_SHIFT 12
+#defineMSI_ADDR_DEST_ID_MASK   0x000
+
+#define MSI_ADDR_BASE   0xfee0
+#define MSI_ADDR_SIZE   0x10
+
 typedef struct APICState {
 CPUState *cpu_env;
 uint32_t apicbase;
@@ -731,11 +746,31 @@ static uint32_t apic_mem_readl(void *opaque, 
target_phys_addr_t addr)
 return val;
 }
 
+static void apic_send_msi(target_phys_addr_t addr, uint32 data)
+{
+uint8_t dest = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
+uint8_t vector = (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
+uint8_t dest_mode = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
+uint8_t trigger_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+uint8_t delivery = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
+/* XXX: Ignore redirection hint. */
+apic_deliver_irq(dest, dest_mode, delivery, vector, 0, trigger_mode);
+}
+
 static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t 
val)
 {
 CPUState *env;
 APICState *s;
-int index;
+int index = (addr >> 4) & 0xff;
+if (addr > 0xfff || !index) {
+/* MSI and MMIO APIC are at the same memory location,
+ * but actually not on the global bus: MSI is on PCI bus
+ * APIC is connected directly to the CPU.
+ * Mapping them on the global bus happens to work because
+ * MSI registers are reserved in APIC MMIO and vice versa. */
+apic_send_msi(addr, val);
+return;
+}
 
 env = cpu_single_env;
 if (!env)
@@ -746,7 +781,6 @@ static void apic_mem_writel(void *opaque, 
target_phys_addr_t addr, uint32_t val)
 printf("APIC write: %08x = %08x\n", (uint32_t)addr, val);
 #endif
 
-index = (addr >> 4) & 0xff;
 switch(index) {
 case 0x02:
 s->id = (val >> 24);
@@ -931,6 +965,7 @@ int apic_init(CPUState *env)
 s->cpu_env = env;
 
 apic_reset(s);
+msix_supported = 1;
 
 /* XXX: mapping more APICs at the same memory location */
 if (apic_io_memory == 0) {
@@ -938,7 +973,8 @@ int apic_init(CPUState *env)
on the global memory bus. */
 apic_io_memory = cpu_register_io_memory(apic_mem_read,
 apic_mem_write, NULL);
-cpu_register_physical_memory(s->apicbase & ~0xfff, 0x1000,
+/* XXX: what if the base changes? */
+cpu_register_physical_memory(MSI_ADDR_BASE, MSI_ADDR_SIZE,
  apic_io_memory);
 }
 s->timer = qemu_new_timer(vm_clock, apic_timer, s);
@@ -949,4 +985,3 @@ int apic_init(CPUState *env)
 local_apics[s->idx] = s;
 return 0;
 }
-
-- 
1.6.2.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 12/13] qemu: virtio save/load bindings

2009-06-18 Thread Michael S. Tsirkin
Implement bindings for virtio save/load. Use them in virtio pci.

Signed-off-by: Michael S. Tsirkin 
---
 hw/virtio-pci.c |   50 +-
 hw/virtio.c |   33 -
 hw/virtio.h |4 
 3 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index d3f4884..1057ae1 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -105,6 +105,50 @@ static void virtio_pci_notify(void *opaque, uint16_t 
vector)
 qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
 }
 
+static void virtio_pci_save_config(void * opaque, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+pci_device_save(&proxy->pci_dev, f);
+msix_save(&proxy->pci_dev, f);
+if (msix_present(&proxy->pci_dev))
+qemu_put_be16(f, proxy->vdev->config_vector);
+}
+
+static void virtio_pci_save_queue(void * opaque, int n, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+if (msix_present(&proxy->pci_dev))
+qemu_put_be16(f, virtio_queue_vector(proxy->vdev, n));
+}
+
+static int virtio_pci_load_config(void * opaque, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+int ret;
+ret = pci_device_load(&proxy->pci_dev, f);
+if (ret)
+return ret;
+ret = msix_load(&proxy->pci_dev, f);
+if (ret)
+return ret;
+if (msix_present(&proxy->pci_dev))
+qemu_get_be16s(f, &proxy->vdev->config_vector);
+
+pci_resize_bar(&proxy->pci_dev, 1, msix_bar_size(&proxy->pci_dev));
+return 0;
+}
+
+static int virtio_pci_load_queue(void * opaque, int n, QEMUFile *f)
+{
+VirtIOPCIProxy *proxy = opaque;
+uint16_t vector;
+if (!msix_present(&proxy->pci_dev))
+return 0;
+qemu_get_be16s(f, &vector);
+virtio_queue_set_vector(proxy->vdev, n, vector);
+return 0;
+}
+
 static void virtio_pci_reset(void *opaque)
 {
 VirtIOPCIProxy *proxy = opaque;
@@ -317,7 +361,11 @@ static void virtio_write_config(PCIDevice *pci_dev, 
uint32_t address,
 }
 
 static const VirtIOBindings virtio_pci_bindings = {
-.notify = virtio_pci_notify
+.notify = virtio_pci_notify,
+.save_config = virtio_pci_save_config,
+.load_config = virtio_pci_load_config,
+.save_queue = virtio_pci_save_queue,
+.load_queue = virtio_pci_load_queue,
 };
 
 static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
diff --git a/hw/virtio.c b/hw/virtio.c
index fe9f793..b773dff 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -568,9 +568,8 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 {
 int i;
 
-/* FIXME: load/save binding.  */
-//pci_device_save(&vdev->pci_dev, f);
-//msix_save(&vdev->pci_dev, f);
+if (vdev->binding->save_config)
+vdev->binding->save_config(vdev->binding_opaque, f);
 
 qemu_put_8s(f, &vdev->status);
 qemu_put_8s(f, &vdev->isr);
@@ -596,18 +595,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 qemu_put_be32(f, vdev->vq[i].vring.num);
 qemu_put_be64(f, vdev->vq[i].pa);
 qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
-if (vdev->nvectors)
-qemu_put_be16s(f, &vdev->vq[i].vector);
+if (vdev->binding->save_queue)
+vdev->binding->save_queue(vdev->binding_opaque, i, f);
 }
 }
 
-void virtio_load(VirtIODevice *vdev, QEMUFile *f)
+int virtio_load(VirtIODevice *vdev, QEMUFile *f)
 {
-int num, i;
+int num, i, ret;
 
-/* FIXME: load/save binding.  */
-//pci_device_load(&vdev->pci_dev, f);
-//r = msix_load(&vdev->pci_dev, f);
+if (vdev->binding->load_config) {
+ret = vdev->binding->load_config(vdev->binding_opaque, f);
+if (ret)
+return ret;
+}
 
 qemu_get_8s(f, &vdev->status);
 qemu_get_8s(f, &vdev->isr);
@@ -616,10 +617,6 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f)
 vdev->config_len = qemu_get_be32(f);
 qemu_get_buffer(f, vdev->config, vdev->config_len);
 
-if (vdev->nvectors) {
-qemu_get_be16s(f, &vdev->config_vector);
-//msix_vector_use(&vdev->pci_dev, vdev->config_vector);
-}
 num = qemu_get_be32(f);
 
 for (i = 0; i < num; i++) {
@@ -630,13 +627,15 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f)
 if (vdev->vq[i].pa) {
 virtqueue_init(&vdev->vq[i]);
 }
-if (vdev->nvectors) {
-qemu_get_be16s(f, &vdev->vq[i].vector);
-//msix_vector_use(&vdev->pci_dev, vdev->config_vector);
+if (vdev->binding->load_queue) {
+ret = vdev->binding->load_queue(vdev->binding_opaque, i, f);
+if (ret)
+return ret;
 }
 }
 
 virtio_notify_vector(vdev, VIRTIO_NO_VECTOR);
+return 0;
 }
 
 void virtio_cleanup(VirtIODevice *vdev)
diff --git a/hw/virtio.h b/hw/virtio.h
index 04a3c3d..ce05517 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -72,6 +72,10 @@ typedef struct VirtQueueElement
 
 typedef struct {
 void (*no

[PATCHv5 08/13] qemu: add support for resizing regions

2009-06-18 Thread Michael S. Tsirkin
Make it possible to resize PCI regions.  This will be used by virtio
with MSI-X, where the region size depends on whether MSI-X is enabled,
and can change across load/save.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.c |   53 +++--
 hw/pci.h |2 ++
 2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index e3f80d0..aa88a0b 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -386,6 +386,40 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 *(uint32_t *)(pci_dev->wmask + addr) = cpu_to_le32(wmask);
 }
 
+static void pci_unmap_region(PCIDevice *d, PCIIORegion *r)
+{
+if (r->addr == -1)
+return;
+if (r->type & PCI_ADDRESS_SPACE_IO) {
+int class;
+/* NOTE: specific hack for IDE in PC case:
+   only one byte must be mapped. */
+class = pci_get_word(d->config + PCI_CLASS_DEVICE);
+if (class == 0x0101 && r->size == 4) {
+isa_unassign_ioport(r->addr + 2, 1);
+} else {
+isa_unassign_ioport(r->addr, r->size);
+}
+} else {
+cpu_register_physical_memory(pci_to_cpu_addr(r->addr),
+ r->size,
+ IO_MEM_UNASSIGNED);
+qemu_unregister_coalesced_mmio(r->addr, r->size);
+}
+}
+
+void pci_resize_bar(PCIDevice *pci_dev, int region_num, uint32_t size)
+{
+
+PCIIORegion *r = &pci_dev->io_regions[region_num];
+if (r->size == size)
+return;
+r->size = size;
+pci_unmap_region(pci_dev, r);
+r->addr = -1;
+pci_update_mappings(pci_dev);
+}
+
 static void pci_update_mappings(PCIDevice *d)
 {
 PCIIORegion *r;
@@ -439,24 +473,7 @@ static void pci_update_mappings(PCIDevice *d)
 }
 /* now do the real mapping */
 if (new_addr != r->addr) {
-if (r->addr != -1) {
-if (r->type & PCI_ADDRESS_SPACE_IO) {
-int class;
-/* NOTE: specific hack for IDE in PC case:
-   only one byte must be mapped. */
-class = d->config[0x0a] | (d->config[0x0b] << 8);
-if (class == 0x0101 && r->size == 4) {
-isa_unassign_ioport(r->addr + 2, 1);
-} else {
-isa_unassign_ioport(r->addr, r->size);
-}
-} else {
-cpu_register_physical_memory(pci_to_cpu_addr(r->addr),
- r->size,
- IO_MEM_UNASSIGNED);
-qemu_unregister_coalesced_mmio(r->addr, r->size);
-}
-}
+pci_unmap_region(d, r);
 r->addr = new_addr;
 if (r->addr != -1) {
 r->map_func(d, i, r->addr, r->size, r->type);
diff --git a/hw/pci.h b/hw/pci.h
index 25865d7..67e4b4d 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -217,6 +217,8 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 uint32_t size, int type,
 PCIMapIORegionFunc *map_func);
 
+void pci_resize_bar(PCIDevice *pci_dev, int region_num, uint32_t size);
+
 int pci_add_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
 
 void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
-- 
1.6.2.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 13/13] qemu: add pci_get/set_byte

2009-06-18 Thread Michael S. Tsirkin
Add pci_get/set_byte to keep *_word and *_long access functions company.
They are unused for now.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci.h |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/hw/pci.h b/hw/pci.h
index 67e4b4d..303179d 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -258,6 +258,18 @@ PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint16_t 
vid, uint16_t did,
 pci_map_irq_fn map_irq, const char *name);
 
 static inline void
+pci_set_byte(uint8_t *config, uint8_t val)
+{
+*config = val;
+}
+
+static inline uint8_t
+pci_get_byte(uint8_t *config)
+{
+return *config;
+}
+
+static inline void
 pci_set_word(uint8_t *config, uint16_t val)
 {
 cpu_to_le16wu((uint16_t *)config, val);
-- 
1.6.2.2
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 11/13] qemu: request 3 vectors in virtio-net

2009-06-18 Thread Michael S. Tsirkin
Request up to 3 vectors in virtio-net. Actual bindings might supply
less.

Signed-off-by: Michael S. Tsirkin 
---
 hw/virtio-net.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index d584287..c8dabbe 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -709,6 +709,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev)
 n->mac_table.macs = qemu_mallocz(MAC_TABLE_ENTRIES * ETH_ALEN);
 
 n->vlans = qemu_mallocz(MAX_VLAN >> 3);
+n->vdev.nvectors = 3;
 
 register_savevm("virtio-net", virtio_net_id++, VIRTIO_NET_VM_VERSION,
 virtio_net_save, virtio_net_load, n);
-- 
1.6.2.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCHv5 09/13] qemu: virtio support for many interrupt vectors

2009-06-18 Thread Michael S. Tsirkin
Extend virtio to support many interrupt vectors, and rearrange code in
preparation for multi-vector support (mostly move reset out to bindings,
because we will have to reset the vectors in transport-specific code).
Actual bindings in pci, and use in net, to follow.
Load and save are not connected to bindings yet, so they are left
stubbed out for now.

Signed-off-by: Michael S. Tsirkin 
---
 hw/syborg_virtio.c |   13 --
 hw/virtio-pci.c|   24 
 hw/virtio.c|   59 ++-
 hw/virtio.h|   10 +++-
 4 files changed, 81 insertions(+), 25 deletions(-)

diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
index 8e665c6..108af06 100644
--- a/hw/syborg_virtio.c
+++ b/hw/syborg_virtio.c
@@ -134,7 +134,10 @@ static void syborg_virtio_writel(void *opaque, 
target_phys_addr_t offset,
 vdev->features = value;
 break;
 case SYBORG_VIRTIO_QUEUE_BASE:
-virtio_queue_set_addr(vdev, vdev->queue_sel, value);
+if (value == 0)
+virtio_reset(vdev);
+else
+virtio_queue_set_addr(vdev, vdev->queue_sel, value);
 break;
 case SYBORG_VIRTIO_QUEUE_SEL:
 if (value < VIRTIO_PCI_QUEUE_MAX)
@@ -228,7 +231,7 @@ static CPUWriteMemoryFunc *syborg_virtio_writefn[] = {
  syborg_virtio_writel
 };
 
-static void syborg_virtio_update_irq(void *opaque)
+static void syborg_virtio_update_irq(void *opaque, uint16_t vector)
 {
 SyborgVirtIOProxy *proxy = opaque;
 int level;
@@ -239,7 +242,7 @@ static void syborg_virtio_update_irq(void *opaque)
 }
 
 static VirtIOBindings syborg_virtio_bindings = {
-.update_irq = syborg_virtio_update_irq
+.notify = syborg_virtio_update_irq
 };
 
 static void syborg_virtio_init(SyborgVirtIOProxy *proxy, VirtIODevice *vdev)
@@ -248,6 +251,8 @@ static void syborg_virtio_init(SyborgVirtIOProxy *proxy, 
VirtIODevice *vdev)
 
 proxy->vdev = vdev;
 
+/* Don't support multiple vectors */
+proxy->vdev->nvectors = 0;
 sysbus_init_irq(&proxy->busdev, &proxy->irq);
 iomemtype = cpu_register_io_memory(syborg_virtio_readfn,
syborg_virtio_writefn, proxy);
@@ -255,6 +260,8 @@ static void syborg_virtio_init(SyborgVirtIOProxy *proxy, 
VirtIODevice *vdev)
 
 proxy->id = ((uint32_t)0x1af4 << 16) | vdev->device_id;
 
+qemu_register_reset(virtio_reset, 0, vdev);
+
 virtio_bind_device(vdev, &syborg_virtio_bindings, proxy);
 }
 
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 24fe837..d4a134d 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -78,13 +78,19 @@ typedef struct {
 
 /* virtio device */
 
-static void virtio_pci_update_irq(void *opaque)
+static void virtio_pci_notify(void *opaque, uint16_t vector)
 {
 VirtIOPCIProxy *proxy = opaque;
 
 qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
 }
 
+static void virtio_pci_reset(void *opaque)
+{
+VirtIOPCIProxy *proxy = opaque;
+virtio_reset(proxy->vdev);
+}
+
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
 VirtIOPCIProxy *proxy = opaque;
@@ -108,7 +114,10 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
 break;
 case VIRTIO_PCI_QUEUE_PFN:
 pa = (target_phys_addr_t)val << VIRTIO_PCI_QUEUE_ADDR_SHIFT;
-virtio_queue_set_addr(vdev, vdev->queue_sel, pa);
+if (pa == 0)
+virtio_pci_reset(proxy);
+else
+virtio_queue_set_addr(vdev, vdev->queue_sel, pa);
 break;
 case VIRTIO_PCI_QUEUE_SEL:
 if (val < VIRTIO_PCI_QUEUE_MAX)
@@ -120,7 +129,7 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
 case VIRTIO_PCI_STATUS:
 vdev->status = val & 0xFF;
 if (vdev->status == 0)
-virtio_reset(vdev);
+virtio_pci_reset(proxy);
 break;
 }
 }
@@ -158,7 +167,7 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t 
addr)
 /* reading from the ISR also clears it. */
 ret = vdev->isr;
 vdev->isr = 0;
-virtio_update_irq(vdev);
+qemu_set_irq(proxy->pci_dev.irq[0], 0);
 break;
 default:
 break;
@@ -243,7 +252,7 @@ static void virtio_map(PCIDevice *pci_dev, int region_num,
 }
 
 static const VirtIOBindings virtio_pci_bindings = {
-.update_irq = virtio_pci_update_irq
+.notify = virtio_pci_notify
 };
 
 static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
@@ -255,6 +264,9 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, 
VirtIODevice *vdev,
 
 proxy->vdev = vdev;
 
+/* No support for multiple vectors yet. */
+proxy->vdev->nvectors = 0;
+
 config = proxy->pci_dev.config;
 pci_config_set_vendor_id(config, vendor);
 pci_config_set_device_id(config, device);
@@ -279,6 +291,8 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, 
VirtIODevice *vdev,
 pci_register_bar(&proxy->pc

[PATCHv5 10/13] qemu: MSI-X support in virtio PCI

2009-06-18 Thread Michael S. Tsirkin
This enables actual support for MSI-X in virtio PCI.
First user will be virtio-net.

Signed-off-by: Michael S. Tsirkin 
---
 hw/virtio-pci.c |  152 --
 rules.mak   |2 +-
 2 files changed, 113 insertions(+), 41 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index d4a134d..d3f4884 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -18,6 +18,7 @@
 #include "virtio.h"
 #include "pci.h"
 //#include "sysemu.h"
+#include "msix.h"
 
 /* from Linux's linux/virtio_pci.h */
 
@@ -47,7 +48,24 @@
  * a read-and-acknowledge. */
 #define VIRTIO_PCI_ISR  19
 
-#define VIRTIO_PCI_CONFIG   20
+/* MSI-X registers: only enabled if MSI-X is enabled. */
+/* A 16-bit vector for configuration changes. */
+#define VIRTIO_MSI_CONFIG_VECTOR20
+/* A 16-bit vector for selected queue notifications. */
+#define VIRTIO_MSI_QUEUE_VECTOR 22
+
+/* Config space size */
+#define VIRTIO_PCI_CONFIG_NOMSI 20
+#define VIRTIO_PCI_CONFIG_MSI   24
+#define VIRTIO_PCI_REGION_SIZE(dev) (msix_present(dev) ? \
+ VIRTIO_PCI_CONFIG_MSI : \
+ VIRTIO_PCI_CONFIG_NOMSI)
+
+/* The remaining space is defined by each driver as the per-driver
+ * configuration space */
+#define VIRTIO_PCI_CONFIG(dev)  (msix_enabled(dev) ? \
+ VIRTIO_PCI_CONFIG_MSI : \
+ VIRTIO_PCI_CONFIG_NOMSI)
 
 /* Virtio ABI version, if we increment this, we break the guest driver. */
 #define VIRTIO_PCI_ABI_VERSION  0
@@ -81,14 +99,17 @@ typedef struct {
 static void virtio_pci_notify(void *opaque, uint16_t vector)
 {
 VirtIOPCIProxy *proxy = opaque;
-
-qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
+if (msix_enabled(&proxy->pci_dev))
+msix_notify(&proxy->pci_dev, vector);
+else
+qemu_set_irq(proxy->pci_dev.irq[0], proxy->vdev->isr & 1);
 }
 
 static void virtio_pci_reset(void *opaque)
 {
 VirtIOPCIProxy *proxy = opaque;
 virtio_reset(proxy->vdev);
+msix_reset(&proxy->pci_dev);
 }
 
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
@@ -97,8 +118,6 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, 
uint32_t val)
 VirtIODevice *vdev = proxy->vdev;
 target_phys_addr_t pa;
 
-addr -= proxy->addr;
-
 switch (addr) {
 case VIRTIO_PCI_GUEST_FEATURES:
/* Guest does not negotiate properly?  We have to assume nothing. */
@@ -131,17 +150,33 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
 if (vdev->status == 0)
 virtio_pci_reset(proxy);
 break;
+case VIRTIO_MSI_CONFIG_VECTOR:
+msix_vector_unuse(&proxy->pci_dev, vdev->config_vector);
+/* Make it possible for guest to discover an error took place. */
+if (msix_vector_use(&proxy->pci_dev, val) < 0)
+val = VIRTIO_NO_VECTOR;
+vdev->config_vector = val;
+break;
+case VIRTIO_MSI_QUEUE_VECTOR:
+msix_vector_unuse(&proxy->pci_dev,
+  virtio_queue_vector(vdev, vdev->queue_sel));
+/* Make it possible for guest to discover an error took place. */
+if (msix_vector_use(&proxy->pci_dev, val) < 0)
+val = VIRTIO_NO_VECTOR;
+virtio_queue_set_vector(vdev, vdev->queue_sel, val);
+break;
+default:
+fprintf(stderr, "%s: unexpected address 0x%x value 0x%x\n",
+__func__, addr, val);
+break;
 }
 }
 
-static uint32_t virtio_ioport_read(void *opaque, uint32_t addr)
+static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr)
 {
-VirtIOPCIProxy *proxy = opaque;
 VirtIODevice *vdev = proxy->vdev;
 uint32_t ret = 0x;
 
-addr -= proxy->addr;
-
 switch (addr) {
 case VIRTIO_PCI_HOST_FEATURES:
 ret = vdev->get_features(vdev);
@@ -169,6 +204,12 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t 
addr)
 vdev->isr = 0;
 qemu_set_irq(proxy->pci_dev.irq[0], 0);
 break;
+case VIRTIO_MSI_CONFIG_VECTOR:
+ret = vdev->config_vector;
+break;
+case VIRTIO_MSI_QUEUE_VECTOR:
+ret = virtio_queue_vector(vdev, vdev->queue_sel);
+break;
 default:
 break;
 }
@@ -179,42 +220,72 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t 
addr)
 static uint32_t virtio_pci_config_readb(void *opaque, uint32_t addr)
 {
 VirtIOPCIProxy *proxy = opaque;
-addr -= proxy->addr + VIRTIO_PCI_CONFIG;
+uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
+addr -= proxy->addr;
+if (addr < config)
+return virtio_ioport_read(proxy, addr);
+addr -= config;
 return virtio_config_readb(proxy->vdev, addr);
 }
 
 static uint32_t virtio_pci_config_readw(void *opaque, uint32_t addr)
 {
  

Re: [Qemu-devel] [PATCHv5 08/13] qemu: add support for resizing regions

2009-06-18 Thread Paul Brook
On Thursday 18 June 2009, Michael S. Tsirkin wrote:
> Make it possible to resize PCI regions.  This will be used by virtio
> with MSI-X, where the region size depends on whether MSI-X is enabled,
> and can change across load/save.

I thought we'd agreed we shouldn't be doing this.
i.e. if the user tries to load the wrong device, just bail out.

Paul
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] [PATCHv5 08/13] qemu: add support for resizing regions

2009-06-18 Thread Michael S. Tsirkin
On Thu, Jun 18, 2009 at 05:32:11PM +0100, Paul Brook wrote:
> On Thursday 18 June 2009, Michael S. Tsirkin wrote:
> > Make it possible to resize PCI regions.  This will be used by virtio
> > with MSI-X, where the region size depends on whether MSI-X is enabled,
> > and can change across load/save.
> 
> I thought we'd agreed we shouldn't be doing this.
> i.e. if the user tries to load the wrong device, just bail out.
> 
> Paul

We need a way to generally prevent load from changing registers in a way
that can not be done by a guest.  I haven't implemented this feature yet
though.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-18 Thread Rusty Russell
On Thu, 18 Jun 2009 05:04:22 pm Herbert Xu wrote:
> On Thu, Jun 18, 2009 at 04:47:50PM +0930, Rusty Russell wrote:
> > Summary: we still have about 54 in-tree drivers which actually use
> > NETDEV_TX_BUSY for normal paths.  Can I fix it now?
>
> You can fix it but I don't quite understand your results below :)

You didn't comment on my patch which tried to fix NETDEV_TX_BUSY tho?

> > sungem.c: Y, N
>
> This driver does the bug check in addition to a race check that
> should simply drop the packet instead of queueing.  In fact chances
> are the race check is unnecessary anyway.

OK, "N" means "can be simply replaced with kfree_skb(skb); return 
NETDEV_TX_OK;".  "Y" means "driver will break if we do that, needs rewriting".
I didn't grade how hard or easy the rewrite would be, but later on I got more 
picky (I would have said this is N, N: the race can be replaced with a drop).

> > fs_enet: N
>
> This is either just a bug check or the driver is broken in that
> it should stop the queue when the said condition can be true.
>
> > mace.c: N
>
> Just a bug check.

Err, that's why they're N (ie. does not need TX_BUSY).

> > sh_eth.c: Y
>
> This driver should check the queue after transmitting, just like
> virtio-net :)
>
> So from a totally non-representative sample of 4, my conclusion
> is that none of them need TX_BUSY.  Do you have an example that
> really needs it?

First you asserted "Most of them just do this:... /* Never happens */".  Now 
I've found ~50 drivers which don't do that, it's "Do any of them really need 
it?".

So, now I'll look at that.  Some are just buggy (I'll send patches for those).  
Most I just have no idea what they're doing; they're pretty ugly.  These ones 
are interesting:

e1000/e1000_main.c: fifo bug workaround?
ehea/ehea_main.c: ?
starfire.c: "we may not have enough slots even when it seems we do."?
tg3.c: tg3_gso_bug

ISTR at least one driver claimed practice showed it was better to return 
TX_BUSY, and one insisted it wouldn't wasn't going to waste MAX_FRAGS on the 
stop-early scheme.

> Anyway, I don't think we should reshape our APIs based on how
> broken the existing users are.

We provided an API, people used it.  Constantly trying to disclaim our 
responsibility for the resulting mess makes me fucking ANGRY.

We either remove the API, or fix it.  I think fixing it is better, because my 
driver will be simpler and it's obvious noone wants to rewrite 50 drivers and 
break several of them.

I don't know how many times I can say the same thing...
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-18 Thread Herbert Xu
On Fri, Jun 19, 2009 at 01:07:19PM +0930, Rusty Russell wrote:
>
> You didn't comment on my patch which tried to fix NETDEV_TX_BUSY tho?

I think fixing it would only encourage more drivers to use and
abuse TX_BUSY.  The fundamental problem with TX_BUSY is that
you're doing the check before transmitting a packet instead of
after transmitting it.

Let me explain why this is wrong, beyond the fact that tcpdump
may see the packet twice which you've tried to fix.  The problem
is that requeueing is fundamentally hard.  We use to have this
horrible logic in the schedulers to handle this.  Thankfully that
seems to have been replaced with a single device-level packet
holder shared with GSO.

However, that is still wrong for many packet schedulers.  For
example, if the requeued packet is of a lower priority, and a
higher priority packet comes along, we want the higher priority
packet to preempt the requeued packet.  Right now it just doesn't
happen.

This is not as trivial as it seems because on a busy host this can
happen many times a second.  With TX_BUSY the QoS guarantees are
simply not workable.

BTW you pointed out that GSO also uses TX_BUSY, but that is
different because the packet schedulers treat a GSO packet as
a single entity so there is no issue of preemption.  Also tcpdump
will never see it twice by design.

> e1000/e1000_main.c: fifo bug workaround?

The workaround should work just as well as a stop-queue check
after packet transmission.

> ehea/ehea_main.c: ?

Ahh! The bastard LLTX drivers are still around.  LLTX was the
worst abuse associated with TX_BUSY.  Thankfully not many of them
are left.

The fix is to not use LLTX and use the xmit_lock like normal
drivers.

> starfire.c: "we may not have enough slots even when it seems we do."?

Just replace skb_num_frags with SKB_MAX_FRAGS and move the check
after the transmit.

> tg3.c: tg3_gso_bug

A better solution would in fact be to disable hardware TSO when
we encounter such a packet (and drop the first one).

Because once you get one you're likely to get a lot more.  The
difference between hardware TSO and GSO on a card like tg3 is
negligible anyway.

Alternatively just disable TSO completely on those chips.

Ccing the tg3 maintainer.
 
> We provided an API, people used it.  Constantly trying to disclaim our 
> responsibility for the resulting mess makes me fucking ANGRY.

Where have I disclaimed responsibility? If we were doing that
then we wouldn't be having this discussion.

> We either remove the API, or fix it.  I think fixing it is better, because my 
> driver will be simpler and it's obvious noone wants to rewrite 50 drivers and 
> break several of them.

My preference is obviously in the long term removal of TX_BUSY.
Due to resource constraints that cannot be done immediately.  So
at least we should try to stop its proliferation.

BTW removing TX_BUSY does not mean that your driver has to stay
complicated.  As I have said repeatedly your driver should be
checking the stop-queue condition after transmission, not before.

In fact queueing it in the driver is just as bad as return TX_BUSY!

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization